机器算法验证 - 在一个主题中选择单词，LDA 主题的哪个截止点？ - 吾爱随笔录

我的问题是 LDA 主题的良好截止阈值是多少？

我使用了这篇博文中的代码在 Python 中使用潜在 Dirichlet 分配进行主题建模

在该代码中，作者显示了每个主题中的前 8 个单词，但这是最佳选择吗？

对于每个主题分布，每个词都有一个概率，所有词的概率加起来为 1.0

我编写了这段代码来打印到一个 epsilon 阈值：

eps=0.01
for i, topic_dist in enumerate(topic_word):         
    wordindex=np.argsort(topic_dist)[::-1] #rev sort
    w=topic_dist[wordindex] ## this is the length of all the unique words 4258
    words=[np.array(vocab)[wordindex[j]] for j in range(min(n_top_words,len(wordindex))) if w[j]>eps ]
    weights=['{:.3f}'.format(w[j]) for j in range(min(n_top_words,len(wordindex))) if w[j]>eps ]
    print('Topic {}: {}; {}'.format(i, ', '.join(words),', '.join(weights)))

查看另一个库gensim LdaModel，看起来 LDA 最初很可能没有像这样的概率总和为 1.0，并且它们被归一化，见下文：

def show_topic(self, topicid, topn=10):
    topic = self.state.get_lambda()[topicid]
    topic = topic / topic.sum() # normalize to probability dist
    ...

使用 Python运行示例代码Latent Dirichlet Allocation (LDA)并调用 get_lambda，可以看到 lambda 值有时高于 1.0。

ldamodel.state.get_lambda()[1]

给出：

array([ 1.48214337,  1.48168697,  0.50442377,  0.50399559,  0.50400832,
    0.5047193 ,  0.50375875,  0.50376053,  1.50224118,  0.50376574,
    0.5037527 ,  0.50377459,  0.50376621,  1.49831418,  1.49832577,
    1.49831855,  1.49831883,  1.49831596,  1.51053093,  3.49684196,
    1.49832204,  1.49832512,  0.50316907,  0.50321838,  0.50328253,
    0.50319543,  0.50317986,  0.50318815,  0.50314213,  0.5031702 ,
    1.49635267,  1.49634655])

什么是最好的 eps 选择？不规范化概率分布并在截止值中使用原始值会更好吗？是否最好在每个主题中使用最大概率值并以此为基础？

在我的实际数据集中，有时 0.01 的 eps 实际上会创建一个无字主题！

更新

玩不同数量的主题时，我注意到如果我有 2 个带有load_reuters数据的主题，我会得到 eps=0.01

Topic 0: ; 
Topic 1: pope; 0.013

我相信主题 0 可以解释为其他所有内容，或者需要更多主题。

arr=[]
for n in (range(2,50)):
  model = lda.LDA(n_topics=n, n_iter=20, random_state=1)
  model.fit(X)
  topic_word = model.topic_word_  # model.components_ also works
  arr.append(
    (min([max(topic_word[i]) for i in range(model.n_topics)]),
     max([max(topic_word[i]) for i in range(model.n_topics)])))
plt.plot(arr)
...

所以看看这张图表，如果 n 低于 5 就太低了，并且在 20 之后的某个时间变平......