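TF-IDF stands for term frequency–inverse document frequency.
TF is the frequency of a term in a given document / total #terms in that document. This value therefore changes for each term in each document.
IDF is the logarithm of the ratio total #documents / #documents in which the term appears (the example below uses base-10 logarithms). For a given unique term this value is constant across the corpus. The larger a term's idf value, the rarer and therefore the more important the term is.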
Example:
Document 1: this is a sample example
Document 2: this is another example
Let's calculate for term = "is":
TF(is, Document 1) = 1/5
TF(is, Document 2) = 1/4
IDF(is) = log(2/2) = 0
TFIDF = TF*IDF
TFIDF (is, document 1) = (1/5)*0 = 0
TFIDF(is, document 2) = (1/4)*0 = 0
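This means the word "is" carries no weight in the documents (corpus), since it appears in every one of them.
Now let's consider term = "another":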
TF(another, document 1) = 0/5 = 0
TF(another, document 2) = 1/4
IDF(another) = log(2/1) = 0.301
TFIDF(another, document 1) = 0*0.301 = 0
TFIDF(another, document 2) = (1/4)*0.301 ≈ 0.075
From these two examples you can see that TF varies from document to document, while IDF is constant for a given term.
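The hand calculation above can be reproduced with a few lines of plain Python. This is only a minimal sketch of the textbook formula used in the example (base-10 log, no smoothing); the helper names `tf`, `idf` and `tfidf` are made up for illustration and are not part of any library.

```python
import math

# The two example documents, split on whitespace.
docs = {
    "Document 1": "this is a sample example".split(),
    "Document 2": "this is another example".split(),
}
corpus = list(docs.values())

def tf(term, tokens):
    # Occurrences of the term divided by the number of terms in the document.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # log10(total documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / n_containing)

def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

for term in ("is", "another"):
    for name, tokens in docs.items():
        print(term, name, tfidf(term, tokens, corpus))
```

"is" scores 0 in both documents because its idf is log10(2/2) = 0, while "another" scores 0.25 * log10(2) ≈ 0.0753 in Document 2 only.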
You can convert all 270 of your documents into a term-document matrix in the same way.
Demo in Python:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
d = pd.Series(['this is a sample example','this is another example'])
df = pd.DataFrame(d)
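# min_df=1 (the default) keeps every term, just like the original min_df=0,
# and passes parameter validation in newer scikit-learn releases
tfidf_vectorizer = TfidfVectorizer(analyzer='word', min_df=1)
# if you want, say, only the top 2 features (terms):
# tfidf_vectorizer = TfidfVectorizer(analyzer='word', min_df=1, max_features=2, max_df=3)
# Terms are dropped from the vocabulary when they:
#   occur in too many documents (document frequency above max_df)
#   occur in too few documents (document frequency below min_df)
#   are cut off by feature selection (max_features keeps the most frequent terms)
# Note that max_df and min_df are document-frequency thresholds, not tf-idf scores.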
tfidf = tfidf_vectorizer.fit_transform(df[0])
print(tfidf_vectorizer.vocabulary_)
# output (key order may differ): {'this': 4, 'sample': 3, 'is': 2, 'example': 1, 'another': 0}
print(tfidf_vectorizer.idf_)
# output (constant per term): [ 1.40546511 1. 1. 1.40546511 1. ]
print(tfidf)
# output:
#(0, 1) 0.448320873199 Document 1, term = example
#(0, 3) 0.630099344518 Document 1, term = sample
#(0, 2) 0.448320873199 Document 1, term = is
#(0, 4) 0.448320873199 Document 1, term = this
#(1, 0) 0.630099344518 Document 2, term = another
#(1, 1) 0.448320873199 Document 2, term = example
#(1, 2) 0.448320873199 Document 2, term = is
#(1, 4) 0.448320873199 Document 2, term = this
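Note that the scikit-learn weights differ from the hand calculation (for instance, "is" gets 0.448 rather than 0). As far as I understand the TfidfVectorizer defaults, this is because it uses raw term counts for tf, a smoothed idf with the natural log, ln((1 + n) / (1 + df)) + 1, and then L2-normalizes each document row; it also ignores one-character tokens such as "a". The sketch below re-derives the numbers under those assumptions in plain Python; `smoothed_idf` and `weights` are illustrative names, not scikit-learn functions.

```python
import math

docs = ["this is a sample example", "this is another example"]
# Mimic the default token pattern, which keeps only tokens of 2+ characters.
tokenized = [[w for w in d.split() if len(w) >= 2] for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
n_docs = len(tokenized)

def smoothed_idf(term):
    # Assumed default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    df = sum(1 for toks in tokenized if term in toks)
    return math.log((1 + n_docs) / (1 + df)) + 1

def weights(toks):
    # Raw count * idf, then scale the row to unit Euclidean (L2) length.
    raw = {t: toks.count(t) * smoothed_idf(t) for t in set(toks)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

rows = [weights(toks) for toks in tokenized]
for i, row in enumerate(rows):
    print(i, {t: round(row[t], 4) for t in sorted(row)})
```

Under these assumptions the terms shared by both documents come out at about 0.4483 and the terms unique to one document ("sample", "another") at about 0.6301, matching the output above.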
Sources:
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html