Suppose I have a piece of text, and I want to assign probabilities to different genres (classes) based on its content. For example:

Text #1: comedy 10%, horror 50%, romance 1%
Text #2: comedy 40%, horror 3%, romance 30%

Each class has a list of keywords, and we compare the text against them. The code below illustrates the situation:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Keyword lists per genre
keywords_1 = ['funny', 'amusing', 'humorous', 'hilarious', 'jolly']  # Comedy
keywords_2 = ['horror', 'fear', 'shock', 'panic', 'scream']          # Horror
keywords_3 = ['romantic', 'intimate', 'passionate', 'love', 'fond']  # Romance

text = 'funny hilarious fear passionate'

for keywords in (keywords_1, keywords_2, keywords_3):
    cv = CountVectorizer(vocabulary=keywords)
    vec1 = cv.fit_transform([text]).toarray()  # keyword counts in the text
    vec2 = np.ones((1, len(keywords)))         # reference: every keyword present once
    print(cosine_similarity(vec1, vec2))
```
The problem with this approach is that the `vocabulary` in `CountVectorizer()` does not account for the different word forms that can appear in the text (nouns, verbs, adjectives, adverbs, plurals, etc.). For example, suppose we have a keyword list like this:
```python
keywords_1 = [(...), ('amusement', 'amusements', 'amuse', 'amuses', 'amused', 'amusing'), (...), ('hilarious', 'hilariously'), (...)]
```
and we want to compute the similarity as follows:
```python
cv1 = CountVectorizer(vocabulary=keywords_1)
data = cv1.fit_transform([text]).toarray()
vec1 = np.array(data)  # [[f1, f2, f3, f4, f5]], where fi is the number of keywords from sublist i matched in the text
vec2 = np.array([[n1, n2, n3, n4, n5]])  # ni is the size of sublist i
print(cosine_similarity(vec1, vec2))
```
How can we modify the code above to capture this scenario? Any suggestions are appreciated.
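One direction I have considered (a minimal sketch, not a definitive solution): since `CountVectorizer` expects a flat vocabulary, skip it for this step and count the per-sublist matches directly by tokenizing the text and checking membership in each tuple of word forms. The sublists below are illustrative; only the fragments shown in the question are real, and the single-word tuples are assumed fillers.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Each sublist groups the word forms of one keyword (illustrative example)
keywords_1 = [
    ('funny',),
    ('amusement', 'amusements', 'amuse', 'amuses', 'amused', 'amusing'),
    ('humorous', 'humorously'),
    ('hilarious', 'hilariously'),
    ('jolly',),
]

text = 'funny hilarious fear passionate'
tokens = text.lower().split()

# fi = number of tokens in the text matching any form in sublist i
vec1 = np.array([[sum(tok in forms for tok in tokens) for forms in keywords_1]])
# ni = size of sublist i
vec2 = np.array([[len(forms) for forms in keywords_1]])

print(cosine_similarity(vec1, vec2))
```

A stemmer or lemmatizer (e.g. from NLTK) could replace the hand-written form tuples, reducing each sublist back to a single stem and letting the original `CountVectorizer` approach work unchanged; whether that is acceptable depends on how precise the form lists need to be.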