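I have a dataset of keywords stored in several text files. Using append, I read each text file and add all of its keywords to token_dict, which ends up looking like this: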
token_dict="wrist. overlapping. direction. receptacles. comprising. portion. adjacent. side. hand. receive. adapted. finger. comprising. thumb. ..............................."
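I clustered this data with k-means, using k=3. Now I want to calculate the distance between each data point in a cluster and that cluster's centroid. I tried to compute the Euclidean distance between each data point and its centroid, but somehow I failed. My code is as follows: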
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
style.use('ggplot')
token_dict = []
import glob
path = 'E:\\Project\\*.txt'
files=glob.glob(path)
for file in files:
    # Read each file and make sure it is closed again afterwards
    with open(file, 'r') as f:
        token_dict.append(f.read())
vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000, min_df=2, use_idf=True)
#print(X)
km = KMeans(n_clusters=3)
#labels = km.fit_predict(vectorizer)
#print(labels)
X = vectorizer.fit_transform(token_dict).toarray()  # plain ndarray; recent scikit-learn versions reject np.matrix from .todense()
km.fit(X)
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
# =============================================================================
# cluster_0=np.where(X==0)
# print(cluster_0)
#
# X_cluster_0 = data2D[cluster_0]
# print (X_cluster_0)
# =============================================================================
# =============================================================================
# def euclidean(X1, X2):
#     return(X1-X2)
# =============================================================================
# =============================================================================
# distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
# print(distance)
# =============================================================================
# =============================================================================
#
# km.predict()
# =============================================================================
order_centroids = km.cluster_centers_
centers2D = pca.transform(order_centroids)
labels = km.labels_
colors = ["y.", "b.","g."]
for i in range(len(X)):
    plt.plot(data2D[i][0], data2D[i][1], colors[labels[i]], markersize=10)
plt.scatter(centers2D[:, 0], centers2D[:, 1], marker='x', s=200, linewidths=3, c='r')
plt.show()
Can anyone see where I went wrong?
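For reference, here is a minimal sketch of the distance computation I am after. It is not a definitive implementation. It assumes a small made-up corpus (`docs`) in place of the real files, and it works in the original TF-IDF space rather than on the PCA-projected `data2D`. Two things look suspect in the commented-out attempt above: `np.where(X==0)` selects zero entries of the feature matrix rather than the members of cluster 0, and `X1-X2` returns a difference vector rather than a distance. The sketch uses `km.labels_` for cluster membership and `np.linalg.norm` for the distance. `n_init=10` and `random_state=0` are my own choices for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus; the real token_dict is read from the .txt files.
docs = [
    "wrist hand finger thumb glove",
    "hand finger thumb wrist portion",
    "receptacles adapted receive portion side",
    "adjacent side portion receptacles receive",
    "overlapping direction adjacent comprising side",
    "direction overlapping comprising adapted wrist",
]

X = TfidfVectorizer().fit_transform(docs).toarray()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each row, then take the Euclidean
# norm of the difference, one value per document.
assigned_centers = km.cluster_centers_[km.labels_]
distances = np.linalg.norm(X - assigned_centers, axis=1)

# Cross-check: km.transform gives the distance from every row to every
# centroid, shape (n_samples, n_clusters); pick each row's own cluster.
all_distances = km.transform(X)
same = all_distances[np.arange(len(X)), km.labels_]

for i, (label, d) in enumerate(zip(km.labels_, distances)):
    print(f"doc {i}: cluster {label}, distance {d:.4f}")
```

The distances for a single cluster can then be selected with a boolean mask such as `distances[km.labels_ == 0]`.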