I have a DataFrame:
import numpy as np
import pandas as pd
from itertools import chain
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
df = pd.DataFrame({'docs': ['gamma alfa beta beta epsilon', 'beta gamma eta'], 'labels': ['alfa alfa beta', 'gamma fi']})
I use a CountVectorizer:
vocab_docs = set(chain(*[i.split() for i in df['docs'].unique()]))
cv_docs = CountVectorizer(vocabulary=vocab_docs)
cv_docs_s = cv_docs.fit_transform(df['docs'])
Then I compute TF-IDF:
tfidf_docs = TfidfVectorizer(vocabulary=vocab_docs)
tfidf_docs_s = tfidf_docs.fit_transform(df['docs'])
# convert the sparse TF-IDF matrix to a dense one
tfidf_docs_s = tfidf_docs_s.todense()
But I see that the results differ:
test = np.multiply(cv_docs_s.todense(), tfidf_docs.idf_)
test != tfidf_docs_s
Why is the CountVectorizer output multiplied by TfidfVectorizer.idf_ different from the result of TfidfVectorizer.fit_transform()?
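A minimal sketch of one check that narrows this down, assuming the gap comes from the default `norm='l2'` row normalization in `TfidfVectorizer` (that is an assumption, not something established above). The names `docs`, `vocab`, `manual`, `out_l2` and `out_raw` are illustrative and not part of the original snippet; the vocabulary is passed as a sorted list so the column order is explicit.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

# Same documents as in the question; a sorted list fixes the column order.
docs = ['gamma alfa beta beta epsilon', 'beta gamma eta']
vocab = sorted({word for doc in docs for word in doc.split()})

# Raw term counts.
counts = CountVectorizer(vocabulary=vocab).fit_transform(docs).toarray()

# Default TfidfVectorizer (assumed cause: norm='l2' rescales every row).
tfidf_l2 = TfidfVectorizer(vocabulary=vocab)
out_l2 = tfidf_l2.fit_transform(docs).toarray()

# Same vectorizer with row normalization switched off.
tfidf_raw = TfidfVectorizer(vocabulary=vocab, norm=None)
out_raw = tfidf_raw.fit_transform(docs).toarray()

# Manual product of counts and the fitted IDF weights.
manual = counts * tfidf_raw.idf_

# Three comparisons: against the default output, against the unnormalized
# output, and against the default output after L2-normalizing the manual rows.
matches_default = np.allclose(manual, out_l2)
matches_unnormalized = np.allclose(manual, out_raw)
matches_after_l2 = np.allclose(normalize(manual, norm='l2'), out_l2)

print(matches_default, matches_unnormalized, matches_after_l2)
```

Comparing with `np.allclose` rather than `!=` avoids flagging tiny floating-point differences as mismatches. If only the first comparison fails, the remaining difference is the row normalization and not the IDF weights.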