I'm a beginner trying to cluster short texts of several sentences each, but my results are poor. Any suggestions for improving them?
import pprint

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the comments to cluster
dataset = pd.read_csv('text.csv', encoding='utf-8')
comments = dataset['comments']
comments_list = comments.values.tolist()

# Convert the comments into TF-IDF vectors
vetorize = TfidfVectorizer()
X = vetorize.fit_transform(comments_list)

# Fit k-means on the TF-IDF matrix
clusters_number = 6
model = KMeans(n_clusters=clusters_number, init='k-means++', max_iter=300, n_init=1)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
# Group the comments by their cluster label
clusters = {}
for verbatim, label in zip(comments_list, labels):
    clusters.setdefault(str(label), []).append(verbatim)
pprint.pprint(clusters)
# Top terms per cluster
print("Top terms per cluster:")
ordem_centroides = model.cluster_centers_.argsort()[:, ::-1]
# On scikit-learn older than 1.0, use get_feature_names() instead
termos = vetorize.get_feature_names_out()
for i in range(clusters_number):
    print("Cluster %d:" % i)
    for ind in ordem_centroides[i, :10]:
        print(' %s' % termos[ind])
    print()
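I end up with many different topics mixed across the clusters. I preprocessed my data (removed stop words, lowercased, stripped periods...), but I still get "like to cancel order" in one cluster and "like to cancel order" in another. Ideally, every "cancel order" comment would end up in a single cluster.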
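One direction that may help with this kind of split is sketched below. It is not a definitive fix. It adds word bigrams to the TF-IDF features, reduces the sparse matrix with `TruncatedSVD` (latent semantic analysis) and normalizes the rows before clustering, a pattern used in scikit-learn's own text-clustering example. It also raises `n_init`, so that k-means keeps the best of several random starts instead of depending on a single one. The helper name `cluster_comments`, the sample comments, and the parameter values (`n_components=3`, `n_init=10`, `ngram_range=(1, 2)`, English stop words) are assumptions made for illustration and are not taken from the original data.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def cluster_comments(comments, n_clusters, n_components=3, seed=0):
    """Hypothetical helper: TF-IDF -> LSA -> unit-length rows -> k-means."""
    vectorizer = TfidfVectorizer(
        stop_words="english",  # assumption: the comments are in English
        ngram_range=(1, 2),    # unigrams plus bigrams such as "cancel order"
        sublinear_tf=True,     # dampen the weight of repeated terms
    )
    tfidf = vectorizer.fit_transform(comments)

    # Project the sparse TF-IDF vectors onto a few latent dimensions, then
    # rescale each row to unit length before running k-means on them.
    lsa = make_pipeline(
        TruncatedSVD(n_components=n_components, random_state=seed),
        Normalizer(copy=False),
    )
    reduced = lsa.fit_transform(tfidf)

    # n_init=10 runs k-means from 10 different starts and keeps the run
    # with the lowest inertia.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(reduced)


# Made-up sample comments, only to show how the helper is called.
sample_comments = [
    "I would like to cancel my order",
    "please cancel the order",
    "cancel order",
    "how do I cancel an order",
    "my package arrived damaged",
    "the package was damaged on arrival",
    "I want a refund for the damaged item",
    "where is my refund",
]

labels = cluster_comments(sample_comments, n_clusters=3)
for label, comment in sorted(zip(labels.tolist(), sample_comments)):
    print(label, comment)
```

Whether this actually puts every "cancel order" comment in one cluster depends on the data, so the printed groups still need to be checked by hand, and `n_components` and `n_clusters` would have to be tuned for a real corpus.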