Weighting keywords across a set of documents using term frequency and inverse document frequency

Tags: machine learning, data mining, clustering, statistics

2022-02-26 20:29:33

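I have a set of 270 documents and a list of keywords that are distributed across them. I want to rank these keywords using the TF-IDF method. From what I have read, this method tells you how important a word is within a particular document, because a word's TF is specific to one document.

Can someone tell me how to extend this method to get the importance of a word across all 270 documents?

For example: the words "abc" and "xyz" both occur throughout the documents. I can therefore build a document matrix (word and TF-IDF) for "abc" and "xyz" over all documents. How do I now decide which word is important overall, and not just in a single document?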

3 Answers

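TF-IDF stands for term frequency–inverse document frequency.
TF is computed as frequency of a term / total #terms in the document. This value changes for every term in every document.
IDF is the logarithm of the ratio total #documents / #documents in which the term appears. For a given unique term this value is constant across the corpus. The larger a term's IDF, the rarer the term is in the corpus and the more weight each of its occurrences carries.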
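Written as formulas, the definitions above look roughly as follows. Here $f_{t,d}$ (the number of times term $t$ occurs in document $d$), $N$ (the number of documents) and $\mathrm{df}(t)$ (the number of documents containing $t$) are notation introduced for this sketch, and the base-10 logarithm is an assumption chosen because it matches log(2/1) = 0.301 in the example.

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t) = \log_{10}\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)
```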
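Example:

Document 1: this is a sample example

Document 2: this is another example

Let's compute the scores for term = "is":
TF(is, document 1) = 1/5
TF(is, document 2) = 1/4
IDF(is) = log(2/2) = 0
TFIDF = TF*IDF
TFIDF(is, document 1) = (1/5)*0 = 0
TFIDF(is, document 2) = (1/4)*0 = 0
This means the word "is" carries no weight in these documents (the corpus), because it appears in every one of them.
Now consider term = "another":
TF(another, document 1) = 0/5 = 0
TF(another, document 2) = 1/4
IDF(another) = log(2/1) = 0.301 (base-10 logarithm)
TFIDF(another, document 1) = 0*0.301 = 0
TFIDF(another, document 2) = (1/4)*0.301 ≈ 0.075
From these two examples you can see that TF varies from document to document, while IDF is constant for a given term.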
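The hand calculation can be checked with a few lines of plain Python. This is only a minimal sketch: the helper names `tf` and `idf` are made up for illustration, tokens are obtained by splitting on whitespace, and the base-10 logarithm is assumed because it reproduces the 0.301 above.

```python
import math

def tf(term, tokens):
    """Term frequency: occurrences of `term` divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Base-10 log of (number of documents / number of documents containing `term`).

    Assumes the term occurs in at least one document; otherwise this divides by zero.
    """
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / containing)

# The two example documents, tokenized on whitespace.
corpus = [
    "this is a sample example".split(),
    "this is another example".split(),
]

for term in ("is", "another"):
    for i, tokens in enumerate(corpus, start=1):
        score = tf(term, tokens) * idf(term, corpus)
        print(f"TFIDF({term}, document {i}) = {score:.4f}")
```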
You can convert all 270 documents into a term-document matrix.
A demo in Python:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

d = pd.Series(['this is a sample example', 'this is another example'])
df = pd.DataFrame(d)
# min_df=1 is the default; an integer min_df below 1 may be rejected by recent scikit-learn versions
tfidf_vectorizer = TfidfVectorizer(analyzer='word', min_df=1)
# if you want, say, only the top 2 features (terms):
# tfidf_vectorizer = TfidfVectorizer(analyzer='word', min_df=1, max_features=2, max_df=3)
# Terms are dropped from the vocabulary when they:
# occur in too many documents (more than max_df; here an integer count of 3)
# occur in too few documents (fewer than min_df)
# are cut off by feature selection (max_features keeps the 2 most frequent terms)
tfidf = tfidf_vectorizer.fit_transform(df[0])
print(tfidf_vectorizer.vocabulary_)
# output (key order may differ): {'this': 4, 'sample': 3, 'is': 2, 'example': 1, 'another': 0}
# the default token pattern drops one-character tokens, so 'a' is not in the vocabulary
print(tfidf_vectorizer.idf_)
# output (constant per term): [ 1.40546511  1.          1.          1.40546511  1.        ]
# by default scikit-learn uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and L2-normalizes
# each row, so these values differ from the hand calculation above
print(tfidf)
# output (exact layout depends on the SciPy version):
#(0, 1)        0.448320873199    Document 1, term = example
#(0, 3)        0.630099344518    Document 1, term = sample
#(0, 2)        0.448320873199    Document 1, term = is
#(0, 4)        0.448320873199    Document 1, term = this
#(1, 0)        0.630099344518    Document 2, term = another
#(1, 1)        0.448320873199    Document 2, term = example
#(1, 2)        0.448320873199    Document 2, term = is
#(1, 4)        0.448320873199    Document 2, term = this
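
To go from per-document scores to a single score per keyword, one option is to aggregate each keyword's TF-IDF over all documents. The sketch below is only an illustration of that idea. It uses the plain tf * log10(N / df) formula from the worked example rather than scikit-learn's implementation. Taking the mean as the aggregate, and the helper names `tfidf_per_document` and `rank_keywords`, are assumptions and not part of the original answer.

```python
import math
from collections import Counter

def tfidf_per_document(corpus):
    """Return one {term: tf-idf} dict per document, using tf * log10(N / df)."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in corpus for term in set(tokens))
    rows = []
    for tokens in corpus:
        counts = Counter(tokens)
        rows.append({
            term: (count / len(tokens)) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        })
    return rows

def rank_keywords(corpus, keywords):
    """Rank keywords by mean TF-IDF over all documents (an assumed aggregation)."""
    rows = tfidf_per_document(corpus)
    mean_score = {
        word: sum(row.get(word, 0.0) for row in rows) / len(rows)
        for word in keywords
    }
    return sorted(mean_score.items(), key=lambda item: item[1], reverse=True)

# Stand-in for the 270 documents: the two example sentences, split on whitespace.
corpus = [d.split() for d in ["this is a sample example", "this is another example"]]
for term, score in rank_keywords(corpus, ["is", "sample", "another"]):
    print(f"{term}: {score:.4f}")
```

With this formula a keyword that occurs in every document always gets a score of zero, so the ranking mostly separates keywords that are concentrated in a few documents. Other aggregates, such as the sum or the maximum, could be swapped in at the marked line.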

Sources:
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

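TF-IDF expresses relative importance: how important a term is in a particular text compared with a whole corpus. Estimating a word's importance across all 270 documents with TF-IDF therefore requires a comparison corpus. You can use one of the "common usage" corpora, or try to find a domain-specific one. Then treat your 270 documents as a single text and compute the TF-IDF scores of the words you are interested in against the comparison corpus.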
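
A rough sketch of that idea in plain Python follows. Everything concrete in it is an assumption made for illustration: the function name `tfidf_against_reference`, the tiny made-up texts, the base-10 logarithm, and the decision to count the merged text as one more text in the collection when computing document frequency.

```python
import math

def tfidf_against_reference(documents, reference_corpus, keywords):
    """Score keywords by merging `documents` into one text and taking IDF from a reference corpus."""
    merged = [token for doc in documents for token in doc]  # all documents as a single text
    collection = reference_corpus + [merged]                # reference texts plus the merged text
    n_texts = len(collection)
    scores = {}
    for word in keywords:
        tf = merged.count(word) / len(merged)
        df = sum(1 for text in collection if word in text)
        # A keyword that occurs nowhere gets a score of 0 instead of dividing by zero.
        scores[word] = tf * math.log10(n_texts / df) if df else 0.0
    return scores

# Made-up stand-ins for the 270 documents and for a comparison corpus.
my_docs = [d.split() for d in ["the tfidf score ranks terms", "the tfidf weight of the term"]]
reference = [d.split() for d in ["the cat sat on the mat", "the dog chased the cat", "a score of the match"]]

scores = tfidf_against_reference(my_docs, reference, ["the", "tfidf", "score"])
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {score:.4f}")
```

In this toy data "the" occurs in every text and scores zero, while "tfidf" is frequent in the merged documents and absent from the reference texts, so it ranks first.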

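First build an inverted index, also called a postings list. Then, from the term frequency $tf$ and the document frequency $df$, compute TF-IDF with the formula $tf \cdot \log\left({N \over df}\right)$, where $N$ is the total number of documents.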
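
As a minimal sketch of that approach, the inverted index can be a dictionary that maps each term to its postings, and both $tf$ and $df$ can be read directly from it. The function names `build_inverted_index` and `tfidf` are hypothetical, and using the raw count as $tf$ and a base-10 logarithm are assumptions, since the answer does not specify either.

```python
import math
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each term to its postings: {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(corpus):
        for token in tokens:
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

def tfidf(term, doc_id, index, n_docs):
    """tf * log10(N / df), where tf is the raw count and df the length of the postings list."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    return postings[doc_id] * math.log10(n_docs / len(postings))

# Document ids are 0-based positions in the list.
corpus = [d.split() for d in ["this is a sample example", "this is another example"]]
index = build_inverted_index(corpus)

print(dict(index["another"]))                    # postings list for "another"
print(tfidf("another", 1, index, len(corpus)))   # score of "another" in the second document
print(tfidf("is", 0, index, len(corpus)))        # "is" occurs in every document
```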
