I am using R's topicmodels package to cluster a large set of short texts (between 10 and 75 words each) into topics. After manually reviewing several models, there appear to be 20 genuinely stable topics. What strikes me as really odd, though, is that they are all roughly the same size! Each topic captures about 5% of the tokens and 5% of the texts. In terms of tokens, the smallest topic holds 4.5% and the largest 5.5%.
Can anyone say whether this is "normal" behavior? Here is the code I am using:
ldafitted <- LDA(sentences.tm, k = K, method = "Gibbs",
                 control = list(alpha = 0.1,          # default is 50/k, here 2.5; a lower alpha puts more weight on each document being composed of only a few dominant topics
                                delta = 0.1,          # default; 0.1 is suggested in Griffiths and Steyvers (2004)
                                estimate.beta = TRUE,
                                verbose = 50,         # print every 50th draw to screen
                                seed = 5926696,
                                save = 0,             # can save the model every xth iteration; 0 disables saving
                                iter = 5000,
                                burnin = 500,
                                thin = 5000,          # every thin-th iteration is returned; standard is the same as iter
                                best = TRUE))         # only the best draw is returned
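
For reference, this is roughly how I compute shares like those above; a minimal sketch, assuming ldafitted is the fitted model and sentences.tm the document-term matrix from the call above (the names doc.share and token.share are mine):

    library(topicmodels)

    # Share of texts per topic, using each document's single most likely topic
    doc.topics <- topics(ldafitted, 1)
    doc.share  <- table(doc.topics) / length(doc.topics)

    # Approximate share of tokens per topic: weight each document's posterior
    # topic distribution by its token count, then normalize
    theta       <- posterior(ldafitted)$topics       # documents x topics matrix
    doc.lengths <- slam::row_sums(sentences.tm)      # tokens per document
    token.share <- colSums(theta * doc.lengths) / sum(doc.lengths)

    round(sort(token.share), 3)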
In short: my question is whether, under some circumstances, it is plausible for Latent Dirichlet Allocation to cluster texts into topics of roughly equal size, or whether I should be worried when this happens.
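
To illustrate why I am unsure: with a symmetric prior the topics are exchangeable, so I would expect aggregate sizes near 1/K even before seeing any data. A quick hypothetical simulation (using gtools::rdirichlet, not part of my actual pipeline) of document-topic proportions under alpha = 0.1:

    library(gtools)

    set.seed(1)
    K      <- 20
    n.docs <- 10000

    # Draw document-topic proportions from a symmetric Dirichlet(0.1) prior
    theta.sim <- rdirichlet(n.docs, rep(0.1, K))

    # Aggregate topic shares cluster near 1/K = 0.05, even though each
    # individual document is dominated by only a few topics
    round(sort(colMeans(theta.sim)), 3)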
