As an exercise to sharpen my skills with PyMC (a Markov chain Monte Carlo library for Python), I am trying to implement latent Dirichlet allocation, as described here: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation.
The model can be described compactly in plate notation (see the linked article).
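Since the plate-diagram figure is not reproduced here, the generative process from the linked article is, in standard notation:

```latex
\begin{aligned}
\theta_d &\sim \operatorname{Dirichlet}(\alpha), & d &= 1,\dots,D \\
\varphi_k &\sim \operatorname{Dirichlet}(\beta), & k &= 1,\dots,K \\
z_{d,n} &\sim \operatorname{Categorical}(\theta_d) \\
w_{d,n} &\sim \operatorname{Categorical}(\varphi_{z_{d,n}})
\end{aligned}
```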
I came up with the following toy code:
import numpy as np
import pymc as pm
K = 2 # number of topics
V = 4 # number of words
D = 3 # number of documents
data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])  # word-token ids, one row per document
alpha = np.ones(K)
beta = np.ones(V+1)
theta = pm.Container([pm.Dirichlet("theta_%s" % i, theta=alpha) for i in range(D)])
phi = pm.Container([pm.Dirichlet("phi_%s" % k, theta=beta) for k in range(K)])
Wd = [len(doc) for doc in data]
z = pm.Container([pm.Categorical('z_%i' % d,
                                 p=theta[d],
                                 size=Wd[d],
                                 value=np.random.randint(K, size=Wd[d]),
                                 verbose=1)
                  for d in range(D)])
w = pm.Container([pm.Categorical("w_%i" % d,
                                 p=pm.Lambda('phi_z_%i' % d,
                                             lambda z=z, phi=phi: [phi[z[d][i]] for i in range(Wd[d])]),
                                 value=data[d],
                                 observed=True,
                                 verbose=1)
                  for d in range(D)])
model = pm.Model([theta, phi, z, w])
mcmc = pm.MCMC(model)
mcmc.sample(100, burn=10)
The tricky part is the `pm.Lambda` expression that builds `p` for `w`. Given the sampler's output, I must be doing something wrong: the model does not converge, and I get many warnings from `categorical_like` about probabilities that do not sum to 1.
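I suspect the warnings are related to how PyMC represents Dirichlet variables. As far as I can tell (this is my assumption, not something the docs spell out in the LDA context), PyMC 2's `pm.Dirichlet` stores only the first K−1 components of the simplex vector, with the last one implied by the sum-to-one constraint, and `pm.CompletedDirichlet` appends it back. A NumPy-only sketch of that bookkeeping:

```python
import numpy as np

# A point on the 4-simplex, written out explicitly for determinism.
full = np.array([0.1, 0.2, 0.3, 0.4])

# pm.Dirichlet (PyMC 2) stores only the first K-1 components...
truncated = full[:-1]

# ...so using it directly as Categorical probabilities yields a vector
# that sums to less than 1 (hence the categorical_like warnings).
assert truncated.sum() < 1.0

# pm.CompletedDirichlet appends the implied last component,
# restoring a proper probability vector:
completed = np.append(truncated, 1.0 - truncated.sum())
assert np.isclose(completed.sum(), 1.0)
```

If this is right, `theta` and `phi` would need to be wrapped in `pm.CompletedDirichlet` before being used as `p` arguments, but I am not sure this is the whole story.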
Are there any PyMC experts around who can shed some light on this?