Data mining - What is a better input for Word2Vec?

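This is more of a general NLP question. What is the appropriate input for training word embeddings such as Word2Vec? Should every sentence of an article be a separate document in the corpus, or should each whole article be one document in the corpus? The following is just an example using Python and gensim.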

Corpus split by sentence:

SentenceCorpus = [["first", "sentence", "of", "the", "first", "article."],
                  ["second", "sentence", "of", "the", "first", "article."],
                  ["first", "sentence", "of", "the", "second", "article."],
                  ["second", "sentence", "of", "the", "second", "article."]]

Corpus split by article:

ArticleCorpus = [["first", "sentence", "of", "the", "first", "article.",
                  "second", "sentence", "of", "the", "first", "article."],
                 ["first", "sentence", "of", "the", "second", "article.",
                  "second", "sentence", "of", "the", "second", "article."]]
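
One concrete difference between the two layouts is which word pairs can end up inside the same context window. The sketch below is only an illustration under stated assumptions: it uses a fixed symmetric window of 2, and the helper `context_pairs` is hypothetical (it is not part of gensim and does not reproduce gensim's actual sampling). It lists the pairs that appear only when sentences are concatenated into articles.

```python
def context_pairs(corpus, window=2):
    """Collect every (center, context) pair that lies within `window` tokens.

    Hypothetical helper for illustration; not a gensim function.
    """
    pairs = set()
    for doc in corpus:
        for i, center in enumerate(doc):
            start = max(0, i - window)
            stop = min(len(doc), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.add((center, doc[j]))
    return pairs


# Same toy data as in the question, redefined so this block runs on its own.
sentence_corpus = [
    ["first", "sentence", "of", "the", "first", "article."],
    ["second", "sentence", "of", "the", "first", "article."],
    ["first", "sentence", "of", "the", "second", "article."],
    ["second", "sentence", "of", "the", "second", "article."],
]
article_corpus = [
    sentence_corpus[0] + sentence_corpus[1],
    sentence_corpus[2] + sentence_corpus[3],
]

sentence_pairs = context_pairs(sentence_corpus)
article_pairs = context_pairs(article_corpus)

# Pairs that exist only because a window crossed a sentence boundary.
cross_boundary = article_pairs - sentence_pairs
print(sorted(cross_boundary))
```

With the article layout, the end of one sentence becomes context for the start of the next, so the choice mainly affects how much cross-sentence co-occurrence the model sees.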

Training Word2Vec in Python:

from gensim.models import Word2Vec

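# min_count=1 keeps every token; the default (5) would drop all words in this toy corpus.
wikiWord2Vec = Word2Vec(ArticleCorpus, min_count=1)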