I am learning clustering with Python's scikit-learn library. I have a list of sentences (strings), and I would like to know whether the length of the strings affects the silhouette_score.
For example, my sentences range from 2 to 35 words. I tried cluster counts from 2 to 60 and computed the silhouette_score for each, and the highest value I got was around 7 (on the ×100 scale my code uses). Does sentence length affect the silhouette_score? Would it be better to filter my data so that I keep only sentences with similar word counts, for example 20-25 words or 5-10 words?
This is what my code looks like:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

list_of_comments = data

# Bag-of-words features; I also tried TF-IDF:
#cv = TfidfVectorizer(analyzer='word', max_features=6500, lowercase=True, preprocessor=None, tokenizer=None, stop_words='english')
cv = CountVectorizer(analyzer='word', max_features=8000, lowercase=True, preprocessor=None, tokenizer=None, stop_words='english')
x = cv.fit_transform(list_of_comments)

my_list = []            # inertia for each cluster count
list_of_clusters = []   # silhouette score for each cluster count
for i in range(2, 35):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=0)
    cluster_labels = kmeans.fit_predict(x)   # fit once and get the labels
    my_list.append(kmeans.inertia_)
    silhouette_avg = silhouette_score(x, cluster_labels) * 100
    print(round(silhouette_avg, 2))
    list_of_clusters.append(silhouette_avg)
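To make the filtering idea concrete, this is a minimal sketch of what I mean by keeping only sentences with similar word counts (assuming data is the raw list of strings; the 20-25 word band and the filtered_comments name are just placeholders for illustration):

# Hypothetical filter: keep only sentences whose whitespace-separated
# word count falls inside a chosen band, e.g. 20-25 words.
min_words, max_words = 20, 25
filtered_comments = [
    s for s in data
    if min_words <= len(s.split()) <= max_words
]
x = cv.fit_transform(filtered_comments)  # then cluster as above

The word count from split() is only a rough length proxy; the point is just to vectorize and cluster documents of comparable length, then compare the silhouette scores against the unfiltered run.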