Topic prediction using latent Dirichlet allocation

Tags: machine learning, text mining, topic models

2022-02-09 16:40:14

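I ran LDA on a corpus of documents and found some topics. My code outputs two matrices of probabilities: one of document-topic probabilities and one of word-topic probabilities. But I don't actually know how to use these results to predict the topics of a new document. I am using Gibbs sampling. Does anyone know how to do this? Thanks.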

1 Answer

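I would try "folding in". This means taking a new document, adding it to the corpus, and then running Gibbs sampling only on the words in that new document, keeping the topic assignments of the old documents fixed. This usually converges fast (maybe 5, 10, or 20 iterations), and because you don't need to resample the old corpus, each iteration is cheap as well. At the end you will have a topic assignment for every word in the new document, which gives you the distribution of topics in that document.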

In your Gibbs sampler, you probably have something similar to the following code:

// Initialize the count matrices N_tw (topic-word) and N_dt (document-topic)
for doc = 1 to N_Documents
    for token = 1 to N_Tokens_In_Document
       Assign current token to a random topic, updating the count matrices
    end
end

// Gibbs sampling: repeat the sweep over all tokens for several iterations
for iteration = 1 to N_Iterations
    for doc = 1 to N_Documents
        for token = 1 to N_Tokens_In_Document
           Remove current token's topic assignment from the count matrices
           Compute probability of current token being assigned to each topic
           Sample a topic from this distribution
           Assign the token to the new topic, updating the count matrices
        end
    end
end
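As a rough illustration, the initialization and sampling loops could be written in Python as follows. This is only a sketch of a collapsed Gibbs sampler under assumptions that the original post does not state: documents are lists of integer word ids, the priors are symmetric with hypothetical hyperparameters `alpha` and `beta`, and the function name `train_lda` is made up for this example. It removes each token from the counts before resampling it, so the token's own assignment does not influence its new topic.

```python
import random


def train_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_tw = [[0] * vocab_size for _ in range(n_topics)]  # N_tw: topic-word counts
    n_dt = [[0] * n_topics for _ in docs]               # N_dt: document-topic counts
    n_t = [0] * n_topics                                # total tokens per topic
    z = [[0] * len(doc) for doc in docs]                # current topic of each token

    def move(d, i, w, t, delta):
        """Add (delta=+1) or remove (delta=-1) token i of doc d from topic t."""
        n_tw[t][w] += delta
        n_dt[d][t] += delta
        n_t[t] += delta
        z[d][i] = t

    # Initialization: assign every token to a random topic.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            move(d, i, w, rng.randrange(n_topics), +1)

    # Gibbs sweeps over every token of every document.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                move(d, i, w, z[d][i], -1)  # take the token out of the counts
                # Unnormalized probability of each topic given all other assignments.
                weights = [
                    (n_dt[d][k] + alpha) * (n_tw[k][w] + beta) / (n_t[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                new_t = rng.choices(range(n_topics), weights=weights)[0]
                move(d, i, w, new_t, +1)
    return n_tw, n_dt
```

The returned `n_tw` and `n_dt` are raw counts; normalizing their rows (with the same smoothing) would give the word-topic and document-topic probability matrices mentioned in the question.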

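Folding in is the same, except that you start from the existing matrices, add the new document's tokens to them, and sample only the new tokens. That is: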

Start with the N_tw and N_dt matrices from the previous step

// Add the new document's tokens to the count matrices
for token = 1 to N_Tokens_In_New_Document
   Assign current token to a random topic, updating the count matrices
end

// Fold in by Gibbs sampling over the new document's tokens only
for iteration = 1 to N_Iterations
    for token = 1 to N_Tokens_In_New_Document
       Remove current token's topic assignment from the count matrices
       Compute probability of current token being assigned to each topic
       Sample a topic from this distribution
       Assign the token to the new topic, updating the count matrices
    end
end
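The folding-in steps might look like this in Python. Treat it as a sketch under assumptions, not a reference implementation: the function name `fold_in`, the symmetric hyperparameters `alpha` and `beta`, and the toy counts are all invented for illustration, and the trained topic-word counts are assumed to be available as a list of lists of raw counts. The code works on a copy of the counts so that the trained model is left unchanged, and only the new document's row of the document-topic matrix is tracked because the old documents are never resampled.

```python
import random


def fold_in(new_doc, n_tw, alpha=0.1, beta=0.01, n_iters=20, seed=0):
    """Gibbs-sample topics for new_doc only; return its topic distribution."""
    rng = random.Random(seed)
    n_topics, vocab_size = len(n_tw), len(n_tw[0])
    n_tw = [row[:] for row in n_tw]   # work on a copy of the trained N_tw
    n_t = [sum(row) for row in n_tw]  # total tokens per topic
    n_dt_new = [0] * n_topics         # the new document's row of N_dt
    z = []                            # current topic of each new token

    # Add the new tokens to the counts with random topics.
    for w in new_doc:
        t = rng.randrange(n_topics)
        z.append(t)
        n_tw[t][w] += 1
        n_t[t] += 1
        n_dt_new[t] += 1

    # Resample only the new document's tokens.
    for _ in range(n_iters):
        for i, w in enumerate(new_doc):
            t = z[i]
            n_tw[t][w] -= 1
            n_t[t] -= 1
            n_dt_new[t] -= 1
            weights = [
                (n_dt_new[k] + alpha) * (n_tw[k][w] + beta) / (n_t[k] + vocab_size * beta)
                for k in range(n_topics)
            ]
            t = rng.choices(range(n_topics), weights=weights)[0]
            z[i] = t
            n_tw[t][w] += 1
            n_t[t] += 1
            n_dt_new[t] += 1

    # Smoothed topic proportions of the new document.
    total = len(new_doc) + n_topics * alpha
    return [(n_dt_new[k] + alpha) / total for k in range(n_topics)]


# Toy trained counts (made up): topic 0 uses words 0-1, topic 1 uses words 2-3.
trained_n_tw = [[500, 500, 0, 0], [0, 0, 500, 500]]
theta = fold_in([0, 1, 0, 1, 1], trained_n_tw)
print(theta)  # topic proportions for the new document
```

Because the toy document contains only words 0 and 1, most of the returned probability mass should end up on topic 0.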

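If you use standard LDA, it is unlikely that an entire document was generated by a single topic, so I don't know how useful it is to compute the probability of the document under one topic. But if you still want to do it, it's easy. From the two matrices you get, you can compute $p^i_w$, the probability of word $w$ in topic $i$. Take your new document; suppose the $j$-th word is $w_j$. The words are independent given the topic, so the probability is just

$$\prod_j p^i_{w_j}$$

(note that you will probably need to compute this in log space).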
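The log-space computation could be sketched as below: the product over words becomes a sum of logarithms, which avoids numerical underflow on long documents. The function name `doc_log_prob_per_topic` and the smoothing parameter `beta` are assumptions made for this example. Here $p^i_w$ is estimated from raw topic-word counts; if your word-topic matrix already holds probabilities, you would take their logarithms directly instead.

```python
import math


def doc_log_prob_per_topic(doc, n_tw, beta=0.01):
    """Return log prod_j p^i_{w_j} for every topic i, computed as a sum of logs."""
    vocab_size = len(n_tw[0])
    log_probs = []
    for counts in n_tw:  # one row of topic-word counts per topic i
        denom = sum(counts) + vocab_size * beta
        # p^i_w is estimated as (count + beta) / denom; summing logs replaces the product.
        log_probs.append(sum(math.log((counts[w] + beta) / denom) for w in doc))
    return log_probs


# Toy counts (made up) for two topics over a two-word vocabulary.
toy_n_tw = [[3, 1], [1, 3]]
print(doc_log_prob_per_topic([0, 0, 1], toy_n_tw))
```

The topic with the largest log probability is the single topic under which the document is most likely, with the caveat above that standard LDA does not assume one topic per document.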
