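If the data you show is the only data you have, then the Markov chain is really boring: it is a linear chain going from round A to round B to round C, with every one of those states also connected to a single base state (death, or whatever you want to call it).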
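You can compute the transition probabilities directly from the data you have, because the companies that reached round N are exactly the companies that could reach round N (there are no alternative paths). The probability of dying at the previous stage is then (1 - P_N), where P_N is the probability of reaching round N.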
In [1]: raw_data = """
...: Company 1: Seed Round, Series A Round
...: Company 2: Seed Round, Series A Round, Series B Round
...: Company 3: Seed Round, Series A Round, Series B Round
...: Company 4: Seed Round, Series A Round, Series B Round, Series C Round
...: Company 5: Seed Round
...: Company 6: Series A Round, Series B Round
...: Company 6: Series A Round
...: """
In [2]: data_lines = raw_data.splitlines()[1:]
In [6]: key_vals = {}
In [12]: for line in data_lines:
...:     key, val = line.split(':')
...:     key = key.strip()
...:     vals = [v.strip() for v in val.split(',')]
...:     key_vals[key] = vals
In [13]: key_vals
Out[13]:
{'Company 1': ['Seed Round', 'Series A Round'],
 'Company 2': ['Seed Round', 'Series A Round', 'Series B Round'],
 'Company 3': ['Seed Round', 'Series A Round', 'Series B Round'],
 'Company 4': ['Seed Round',
               'Series A Round',
               'Series B Round',
               'Series C Round'],
 'Company 5': ['Seed Round'],
 'Company 6': ['Series A Round']}
In [14]: transitions = ['Seed Round', 'Series A Round', 'Series B Round', 'Series C Round']
In [19]: for transition in transitions:
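...:     # count the companies whose history contains this round
...:     summed = 0
...:     for company, rounds in key_vals.iteritems():
...:         if transition in rounds:
...:             summed += 1
...:     # fraction of all companies that reached this round
...:     prob = float(summed) / float(len(key_vals.keys()))
...:     death_prob = 1 - prob
...:     print "From previous to %s: probability %s" % (transition, prob)
...:     print "Death rate at %s: probability %s" % (transition, death_prob)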
From previous to Seed Round: probability 0.833333333333
Death rate at Seed Round: probability 0.166666666667
From previous to Series A Round: probability 0.833333333333
Death rate at Series A Round: probability 0.166666666667
From previous to Series B Round: probability 0.5
Death rate at Series B Round: probability 0.5
From previous to Series C Round: probability 0.166666666667
Death rate at Series C Round: probability 0.833333333333
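Note that the numbers printed above are each round's share of all six companies. If you want the stage-to-stage probabilities of the chain itself, you can condition on the previous round instead. Here is a minimal Python 3 sketch of that; the dictionary is copied from the `key_vals` output above, and the names `stages` and `stage_transitions` are my own, purely for illustration.

```python
# Python 3 sketch; the dictionary mirrors key_vals from the session above.
key_vals = {
    'Company 1': ['Seed Round', 'Series A Round'],
    'Company 2': ['Seed Round', 'Series A Round', 'Series B Round'],
    'Company 3': ['Seed Round', 'Series A Round', 'Series B Round'],
    'Company 4': ['Seed Round', 'Series A Round', 'Series B Round',
                  'Series C Round'],
    'Company 5': ['Seed Round'],
    'Company 6': ['Series A Round'],
}
stages = ['Seed Round', 'Series A Round', 'Series B Round', 'Series C Round']


def stage_transitions(histories, stages):
    """Return {(prev, nxt): P(reached nxt | reached prev)} for consecutive stages."""
    probs = {}
    for prev, nxt in zip(stages, stages[1:]):
        # Companies that were observed in the earlier round.
        at_prev = [rounds for rounds in histories.values() if prev in rounds]
        # Of those, the ones that also show up in the later round.
        advanced = [rounds for rounds in at_prev if nxt in rounds]
        # Skip stages nobody reached, to avoid dividing by zero.
        if at_prev:
            probs[(prev, nxt)] = len(advanced) / len(at_prev)
    return probs


transition_probs = stage_transitions(key_vals, stages)
for (prev, nxt), p in transition_probs.items():
    print("%s -> %s: %.3f (death at %s: %.3f)" % (prev, nxt, p, prev, 1 - p))
```

For the six companies above this gives 4/5 for Seed to Series A, 3/5 for Series A to Series B and 1/3 for Series B to Series C, with the death probability at each stage being one minus that.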
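However, if you have more features for each company, such as the amount they received at each stage or the profit they made, you can train a decision tree, for example with the implementation in sklearn, that tells you, in simple words: "if a company raised at least Y dollars in round X and made at least Z dollars in profit, it moves on to the next round with probability 0.XX". I think that is what you are aiming for.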
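To make the decision-tree idea a bit more concrete, here is a rough sketch using `DecisionTreeClassifier` from sklearn. The training rows are entirely made up for illustration (they are not from your data), and the feature names `raised_musd` and `profit_musd` are hypothetical. You would replace them with whatever you actually have per company and round.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical, made-up rows: one row per (company, round).
# Columns: amount raised in that round and profit, both in millions of dollars.
X = [
    [0.5, 0.00],
    [0.8, 0.05],
    [1.0, 0.10],
    [1.2, 0.02],
    [1.5, 0.00],
    [2.0, 0.30],
    [2.5, 0.10],
    [3.0, 0.50],
    [4.0, 0.20],
    [5.0, 1.00],
]
# Label: 1 if the company went on to raise the next round, 0 otherwise.
y = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

# A shallow tree keeps the learned rules readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Print the learned thresholds as nested if/else rules.
print(export_text(clf, feature_names=['raised_musd', 'profit_musd']))

# Estimated probability of reaching the next round for a new company.
candidate = [[3.5, 0.4]]
advance_index = list(clf.classes_).index(1)
p_advance = clf.predict_proba(candidate)[0][advance_index]
print("P(next round) = %.2f" % p_advance)
```

The probability returned by `predict_proba` is just the fraction of training rows with label 1 in the leaf the candidate falls into, so with this little data it is only a toy illustration.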