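I am new to text clustering, and I am using the Goodreads API to fetch book synopses. My goal is to group similar books together, for example:
- Politics
- Music
- Biography, etc.
Although Goodreads provides genres, I would like to derive the groups from the synopsis text itself. Suppose I fetch the synopses of N books, each of which looks like this: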
<description>
<![CDATA[
<b>Alternate cover edition can be found <a href="https://www.goodreads.com/book/show/10249685-dune" rel="nofollow">here</a>. </b> and <a href="https://www.goodreads.com/book/show/11273438-dune" rel="nofollow">here</a><br /><br />Here is the novel that will be forever considered a triumph of the imagination. Set on the desert planet Arrakis, <b>Dune</b> is the story of the boy Paul Atreides, who would become the mysterious man known as Muad'Dib. He would avenge the traitorous plot against his noble family--and would bring to fruition humankind's most ancient and unattainable dream.<br />A stunning blend of adventure and mysticism, environmentalism and politics, Dune won the first Nebula Award, shared the Hugo Award, and formed the basis of what it undoubtedly the grandest epic in science fiction.
]]>
</description>
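The description comes back with embedded HTML, so I assume the markup has to be stripped before any features are extracted. A minimal sketch using only Python's standard library (`strip_markup` and `_TextExtractor` are names I made up for illustration, not part of the Goodreads API):

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Treat a line break as whitespace so adjacent sentences do not fuse.
        if tag == "br":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(description_html):
    """Return the plain text of a description, with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(description_html)
    parser.close()
    return " ".join("".join(parser.parts).split())


plain = strip_markup("<b>Dune</b> is the story of the boy Paul Atreides.<br />A stunning blend")
```

This only removes tags. Boilerplate such as the "Alternate cover edition" sentence would still be in the text and would probably need its own filter.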
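I have read about cosine similarity and the new Google NLP. But I would like to start with this:
- Represent each book description as features (typically a bag of words with TF-IDF weights)
- Compute the similarity between two books (cosine similarity)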
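To make the two steps above concrete, here is a rough sketch of what I have in mind, written in plain Python so that it is self-contained. The tokenizer, the smoothed IDF formula, and the three sample descriptions are my own assumptions for illustration. In practice I would expect a library to do the vectorizing.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed IDF; this is one common variant, not the only definition.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    # The vectors are already unit length, so cosine similarity is a dot product.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_matrix(docs):
    """Naive O(N^2) pairwise matrix; only the upper triangle is computed."""
    vecs = tfidf_vectors(docs)
    n = len(vecs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = sim[j][i] = cosine(vecs[i], vecs[j])
    return sim


# Made-up sample descriptions, for illustration only.
docs = [
    "desert planet politics and adventure",
    "politics and power on a desert planet",
    "a guide to jazz music and guitar",
]
sim = similarity_matrix(docs)
```

Here `sim[0][1]` comes out larger than `sim[0][2]`, because the first two descriptions share most of their words. The double loop is the part I suspect will not scale to large N.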
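Questions:
- What is the most efficient algorithm for building a cosine similarity matrix across all N books?
- How can I cluster the books based on the above?
Any other ideas would be great.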