Compute the distance between each data point in a cluster and its respective cluster centroid

data-mining machine-learning python k-means
2021-09-22 13:55:41

I have a dataset of keywords spread across several text files. Using append, I read each text file and collect all of its keywords into token_dict, like this:

token_dict="wrist. overlapping. direction. receptacles. comprising. portion. adjacent. side. hand. receive. adapted. finger. comprising. thumb. ..............................."

Using k-means clustering, I clustered this data with k=3. Now I want to compute the distance between each data point in a cluster and its respective cluster centroid. I tried to compute the Euclidean distance between each data point and the centroid, but somehow I failed. My code is below:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
style.use('ggplot')
token_dict = []
import glob
path = 'E:\\Project\\*.txt'
files = glob.glob(path)
for file in files:
    with open(file, 'r') as f:  # close each file when done reading
        token_dict.append(f.read())
vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000, min_df=2, use_idf=True)


km = KMeans(n_clusters=3)
#labels = km.fit_predict(vectorizer)
#print(labels)
X = vectorizer.fit_transform(token_dict).toarray()  # toarray(): todense() returns np.matrix, which newer scikit-learn versions reject
km.fit(X)
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
# =============================================================================
# cluster_0=np.where(X==0)
# print(cluster_0)
# 
# X_cluster_0 = data2D[cluster_0]
# print (X_cluster_0)
# =============================================================================

# =============================================================================
# def euclidean(X1, X2):
#     return(X1-X2)
# =============================================================================

# =============================================================================
# distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
# print(distance)
# =============================================================================
# =============================================================================
# 
# km.predict()
# =============================================================================
order_centroids = km.cluster_centers_
centers2D = pca.transform(order_centroids)
labels = km.labels_
colors = ["y.", "b.","g."]
for i in range(len(X)):
    plt.plot(data2D[i][0], data2D[i][1], colors[labels[i]], markersize=10)

plt.scatter(centers2D[:, 0], centers2D[:, 1], marker='x', s=200, linewidths=3, c='r')
plt.show()

Can anyone see where I am going wrong?

2 Answers
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    # Euclidean distance from every point assigned to cluster i_centroid
    # to that cluster's centroid (cx, cy)
    distances = [np.sqrt((x - cx)**2 + (y - cy)**2)
                 for (x, y) in data[cluster_labels == i_centroid]]
    return distances

clusters = km.fit_predict(data2D)
centroids = km.cluster_centers_

distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(data2D, cx, cy, i, clusters)
    distances.append(mean_distance)

print(distances)
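As a side note, the per-point loop in `k_mean_distance` can also be written without an explicit loop: index the centroid array by each point's label and take row-wise norms. A minimal sketch on synthetic 2-D data (the blob data and variable names here are illustrative, not from the question):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative synthetic data: three well-separated 2-D blobs
rng = np.random.RandomState(0)
data2D = np.vstack([
    rng.randn(20, 2) + [0, 0],
    rng.randn(20, 2) + [10, 0],
    rng.randn(20, 2) + [0, 10],
])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(data2D)

# Distance from each point to its own centroid, vectorized:
# km.cluster_centers_[labels] lines up each point with its assigned centroid.
dists = np.linalg.norm(data2D - km.cluster_centers_[labels], axis=1)
print(dists.shape)  # (60,) -- one distance per point
```

This gives the same numbers as the loop-based `k_mean_distance`, just as a single flat array rather than one list per cluster.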

Using this function, I solved my problem.

I think this is a more elegant solution.

First, km.fit_transform() (or km.transform()) gives you the distances from every point to every cluster center. Then you can sum up just the row minimums, i.e. the distance from each point to its nearest cluster center.

km = KMeans(n_clusters=3)
alldistances = km.fit_transform(data2D)  # shape (n_points, n_clusters)
totalDistance = np.min(alldistances, axis=1).sum()
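To confirm this reading of fit_transform: each column of the returned matrix is the Euclidean distance to one centroid, so the row minimum equals the distance to each point's nearest centroid. A quick self-contained check (synthetic data; the variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs
rng = np.random.RandomState(1)
data2D = np.vstack([rng.randn(15, 2), rng.randn(15, 2) + 8])

km = KMeans(n_clusters=2, n_init=10, random_state=1)
alldistances = km.fit_transform(data2D)  # (n_points, n_clusters) distance matrix

# Row minimum = distance to the nearest centroid; verify against a direct
# Euclidean computation using the argmin (nearest) centroid of each point.
nearest = np.min(alldistances, axis=1)
assigned = np.argmin(alldistances, axis=1)
manual = np.linalg.norm(data2D - km.cluster_centers_[assigned], axis=1)
print(np.allclose(nearest, manual))  # True
```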