Latent Dirichlet allocation in PyMC

Cross Validated  python  pymc
2022-03-29 16:46:50

As an exercise to improve my skills with PyMC (a Markov chain Monte Carlo library for Python), I am trying to implement latent Dirichlet allocation as described here: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation.

The model can be described compactly as

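$$
\begin{aligned}
\phi_{k=1 \dots K} &\sim \operatorname{Dirichlet}_V(\beta) \\
\theta_{d=1 \dots M} &\sim \operatorname{Dirichlet}_K(\alpha) \\
z_{d=1 \dots M,\, w=1 \dots N_d} &\sim \operatorname{Categorical}_K(\theta_d) \\
w_{d=1 \dots M,\, w=1 \dots N_d} &\sim \operatorname{Categorical}_V(\phi_{z_{dw}})
\end{aligned}
$$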
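For reference, the generative process in these four lines can be sketched by forward sampling in plain NumPy (not PyMC). This is only a minimal illustration under assumptions: the function name `sample_lda_corpus`, its parameters, and the symmetric scalar `alpha` and `beta` are hypothetical choices, not part of the question.

```python
import numpy as np

def sample_lda_corpus(K, V, doc_lengths, alpha=1.0, beta=1.0, seed=0):
    """Forward-sample topics, topic mixtures, assignments and words from LDA."""
    rng = np.random.default_rng(seed)
    M = len(doc_lengths)
    # phi[k] ~ Dirichlet_V(beta): word distribution of topic k
    phi = rng.dirichlet(np.full(V, beta), size=K)
    # theta[d] ~ Dirichlet_K(alpha): topic mixture of document d
    theta = rng.dirichlet(np.full(K, alpha), size=M)
    z, w = [], []
    for d, n_d in enumerate(doc_lengths):
        # z[d][i] ~ Categorical_K(theta[d]): topic of the i-th word
        z_d = rng.choice(K, size=n_d, p=theta[d])
        # w[d][i] ~ Categorical_V(phi[z[d][i]]): the word itself
        w_d = np.array([rng.choice(V, p=phi[k]) for k in z_d], dtype=int)
        z.append(z_d)
        w.append(w_d)
    return phi, theta, z, w

phi, theta, z, w = sample_lda_corpus(K=2, V=4, doc_lengths=[4, 4, 4])
```

Note that each word draws from a single row `phi[z[d][i]]`, so the distribution passed to the word-level categorical is always one length-V probability vector.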

I came up with the following toy code:

import numpy as np
import pymc as pm

K = 2 # number of topics
V = 4 # number of words
D = 3 # number of documents

data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])

alpha = np.ones(K)
beta = np.ones(V+1)

theta = pm.Container([pm.Dirichlet("theta_%s" % i, theta=alpha) for i in range(D)])
phi = pm.Container([pm.Dirichlet("phi_%s" % k, theta=beta) for k in range(K)])
Wd = [len(doc) for doc in data]

z = pm.Container([pm.Categorical('z_%i' % d, 
                             p = theta[d], 
                             size=Wd[d],
                             value=np.random.randint(K, size=Wd[d]),
                             verbose=1)
              for d in range(D)])


w = pm.Container([pm.Categorical("w_%i" % d,
                             p = pm.Lambda('phi_z_%i' % d, lambda z=z, phi=phi: [phi[z[d][i]] for i in range(Wd[d])]),
                             value=data[d], 
                             observed=True, 
                             verbose=1)
              for d in range(D)])

model = pm.Model([theta, phi, z, w])
mcmc = pm.MCMC(model)
mcmc.sample(100, burn=10)

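The tricky part is $w_{d=1 \dots M,\, w=1 \dots N_d} \sim \operatorname{Categorical}_V(\phi_{z_{dw}})$. Judging from the sampling output, I must be doing something wrong: the model does not converge, and I get many warnings from categorical_like about probabilities that do not sum to 1.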

Are there any PyMC experts around who can explain all this?

1 Answer

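When defining w, the p argument must be a list of doubles, not a list of lists of doubles. This means you have to define one w variable for each word in each document. It also helps to "complete" the Dirichlet variables using the CompletedDirichlet function. Here is the working code: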

import numpy as np
import pymc as pm

K = 2 # number of topics
V = 4 # number of words
D = 3 # number of documents

data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])

alpha = np.ones(K)
beta = np.ones(V)

theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
Wd = [len(doc) for doc in data]

z = pm.Container([pm.Categorical('z_%i' % d, 
                     p = theta[d], 
                     size=Wd[d],
                     value=np.random.randint(K, size=Wd[d]))
                  for d in range(D)])

# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i),
                    p = pm.Lambda('phi_z_%i_%i' % (d,i), 
                              lambda z=z[d][i], phi=phi: phi[z]),
                    value=data[d][i], 
                    observed=True)
                  for d in range(D) for i in range(Wd[d])])

model = pm.Model([theta, phi, z, w])
mcmc = pm.MCMC(model)
mcmc.sample(100)
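The two points in this answer can be illustrated with plain NumPy, independent of PyMC. The sketch below assumes, as I understand the PyMC 2 API used above (`pm.MCMC`, `pm.Container`, `pm.Lambda`), that a Dirichlet variable holds only the first k-1 components and that CompletedDirichlet appends the implied last one. The helper `complete_simplex` is a hypothetical name for illustration, not PyMC's implementation.

```python
import numpy as np

def complete_simplex(partial):
    """Append the implied last component so the vector sums to 1."""
    partial = np.asarray(partial, dtype=float)
    return np.append(partial, 1.0 - partial.sum())

rng = np.random.default_rng(0)

# A full probability vector over V = 4 words, and its k-1 representation.
full = rng.dirichlet(np.ones(4))
partial = full[:-1]                 # sums to less than 1 on its own
completed = complete_simplex(partial)

# Two topics over four words, and topic assignments for one document.
phi = rng.dirichlet(np.ones(4), size=2)
z_d = np.array([0, 1, 1, 0])

p_per_document = phi[z_d]           # shape (4, 4): a list of lists of doubles
p_per_word = phi[z_d[0]]            # shape (4,): a single list of doubles
```

Here `p_per_document` corresponds to what the question passed as `p` (one probability vector per word, stacked), while `p_per_word` is the single vector that each per-word `w` variable receives in the working code.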