Data mining - What effect does replacing words with bigrams have on TfIDF? - 吾爱随笔录

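Suppose I have a corpus of text documents and have computed a TfIDF vector for each document. With this sparse-matrix representation of the corpus, I can measure how similar two documents are by computing the cosine similarity between their TfIDF vectors.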
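To make the setup concrete, here is a minimal pure-Python sketch of it. The toy corpus, the helper names `tfidf_vectors` and `cosine_similarity`, and the weighting (raw term count times log(N / df)) are all assumptions chosen for illustration; libraries such as scikit-learn use smoothed variants by default, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Assumed weighting: raw term count times log(N / df), no smoothing.
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized.
corpus = [
    ['the', 'car', 'drove', 'to', 'new', 'york'],
    ['the', 'train', 'went', 'to', 'new', 'york'],
    ['the', 'cat', 'sat', 'on', 'a', 'mat'],
]
vectors = tfidf_vectors(corpus)
sim_01 = cosine_similarity(vectors[0], vectors[1])  # share 'to', 'new', 'york'
sim_02 = cosine_similarity(vectors[0], vectors[2])  # share only 'the', whose idf is 0
```

In this toy corpus the first two documents get a positive similarity because they share `to`, `new` and `york`, while the first and third only share `the`, which appears in every document and therefore carries zero weight.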

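Suppose I now compile a set of bigrams by frequency and, in each document, replace every occurrence of those two adjacent words with the concatenated bigram. What effect would this have on

1) computing the TfIDF vectors?

2) computing document similarity?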

To make this concrete, here is an example in Python:

doc1 = ['the', 'car', 'drove', 'from', 'new', 'york', 'to', 'washington']  # a single text document
top_n = {('new', 'york'), ('drove', 'from'), ...}  # set of top n bigrams by frequency
new_doc = replace_ngrams(doc1)  # function that replaces words with concatenated bigrams
print(new_doc)
# prints: ['the', 'car', 'drovefrom', 'newyork', 'to', 'washington']
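For reference, one way the preprocessing step could look is sketched below. This is only an assumption about what `replace_ngrams` does: the version here takes the bigram set as an explicit second argument, scans left to right, and greedily merges the first matching pair, so overlapping candidates are resolved in favour of the leftmost one. `top_bigrams` is a hypothetical helper for compiling the set by raw frequency.

```python
from collections import Counter


def top_bigrams(docs, n):
    """Return the set of the n most frequent adjacent word pairs in the corpus."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return {bigram for bigram, _ in counts.most_common(n)}


def replace_ngrams(doc, bigrams):
    """Replace each adjacent pair found in `bigrams` with one concatenated token."""
    result = []
    i = 0
    while i < len(doc):
        pair = tuple(doc[i:i + 2])
        if len(pair) == 2 and pair in bigrams:
            result.append(pair[0] + pair[1])
            i += 2  # both words are consumed by the merged token
        else:
            result.append(doc[i])
            i += 1
    return result


doc1 = ['the', 'car', 'drove', 'from', 'new', 'york', 'to', 'washington']
top_n = {('new', 'york'), ('drove', 'from')}  # only the two bigrams named in the question
new_doc = replace_ngrams(doc1, top_n)
print(new_doc)
# prints: ['the', 'car', 'drovefrom', 'newyork', 'to', 'washington']
```

Note that after this replacement the tokens `new` and `york` no longer appear in `new_doc` at all; only `newyork` does, and that changed vocabulary is what the TfIDF step would then see.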