Stuck implementing k-means on big and small dogs - dodgy results

data-mining machine-learning python k-means unsupervised-learning
2022-03-13 23:30:59

My algorithm isn't working. The code seems to make sense, but I just don't trust the results. I feel the centroids should sit more in the middle of the data, sort of like centres, but no matter how many iterations I run the algorithm for, they never end up there. Could somebody run my code and give me some pointers? I've commented the code so you can follow it. A few key points:

  • k = 2; two centroids
  • centroid1 = dark red, so the data points in its cluster are light red
  • centroid2 = dark green, so the data points in its cluster are light green
  • Whether a dog is "big" or "small" depends on its height and weight
  • Euclidean distance is used to decide which cluster a point belongs to
  • The raw/master data is stored in the dict labelled "x"; points are moved into the dicts "cluster1" and "cluster2" according to their distance from the centroids

The code:

from matplotlib import pyplot as plt
import random, math

plt.title("Big Dogs and Small Dogs")
plt.ylabel("Height (cm)")
plt.xlabel("Weight (kg)")

def euclideandis(x, y, a, b):   # Euclidean distance between (x, y) and (a, b)
    return math.sqrt((x-a)**2 + (y-b)**2)

x = {4.4:31,
     3.2:19,
     4.6:32,        # Data- key = weight, value = height
     4:25,
     4.1:29,
     2.90:17,
     2.9:11}

centroid1 = [[random.uniform(min(x.keys()), max(x.keys())), random.uniform(max(x.values()), min(x.values()))]]   # create random centroids that are between the datapoints
centroid2 = [[random.uniform(min(x.keys()), max(x.keys())), random.uniform(max(x.values()), min(x.values()))]]

for j in xrange(1):   # keep updating colours, position of centroids
    cluster1 = {}       # Everything in cluster 1 is closest to the dark red spot, therefore gets included in this dict, and gets scattered in light red
    cluster2 = {}       # Everything in cluster 2 is closest to the dark green spot, therefore gets included in this dict, and gets scattered in light green
    for key in x:
        temp1 = 0       # works out euclidean distance for all data points in x, then compares them
        temp2 = 0
        temp1 = euclideandis(key, x[key], centroid1[0][0], centroid1[0][1]) # Dis betw data point, red centroid
        temp2 = euclideandis(key, x[key], centroid2[0][0], centroid2[0][1]) # Dis betw data point, green centroid
        if temp1 < temp2:
            cluster1[key] = x[key]      # if the euclidean distance between datapoint and red spot,
                                        # smaller than the euclidean distance between datapoint and green spot,
                                        # add the point to the red cluster, else add to green cluster
        else:
            cluster2[key] = x[key]

    centroid1 = [[0, 0]]        # Centroids reset as they will be changed  
    centroid2 = [[0, 0]]

    iterable = 0
    for key in cluster1:        # works out mean coordinates of each cluster and changes the centroids coordinates to this
        iterable = iterable + key
    centroid1[0][0] = iterable/len(cluster1)
    iterable = 0
    for key in cluster1:
        iterable = iterable + cluster1[key]
    centroid1[0][1] = iterable/len(cluster1)
    iterable = 0
    for key in cluster2:
        iterable = iterable + key
    centroid2[0][0] = iterable/len(cluster2)
    iterable = 0
    for key in cluster2:
        iterable = iterable + cluster2[key]
    centroid2[0][1] = iterable/len(cluster2)



plt.scatter(cluster1.keys(), cluster1.values(), color = "red")      # scatters everything
plt.scatter(cluster2.keys(), cluster2.values(), color = "lime")
plt.scatter(centroid1[0][0], centroid1[0][1], color = "maroon") 
plt.scatter(centroid2[0][0], centroid1[0][1], color = "green")
plt.show()

To sum up, there's no obvious error in my program; I just don't really trust the results.

Plot after 100k iterations; shouldn't the centroids be more in the middle of the data points?

(Above) Why aren't the centroids in the middle of the data points after 100k iterations? The red one is, but the green one isn't.

1 Answer

The centroids are probably correct; you have a display bug.

The line

plt.scatter(centroid2[0][0], centroid1[0][1], color = "green")

should be

plt.scatter(centroid2[0][0], centroid2[0][1], color = "green")

This is just one of those things that happens when you implement an algorithm from scratch to learn it... I bet you've spent hours staring at the top of the script :-)
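For what it's worth, here's a minimal Python 3 sketch of the same assign/update loop, with two assumptions for reproducibility: the data is kept as (weight, height) pairs rather than a dict (a dict keyed by weight silently merges the duplicate keys 2.90 and 2.9, since they are the same float), and the first and last points are used as deterministic starting centroids instead of random ones:

```python
import math

# The question's data as (weight, height) pairs. A dict keyed by
# weight would silently merge 2.90 and 2.9 (the same float key).
points = [(4.4, 31), (3.2, 19), (4.6, 32), (4.0, 25),
          (4.1, 29), (2.90, 17), (2.9, 11)]

def dist(p, q):  # Euclidean distance between two (weight, height) points
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Deterministic starting centroids (an assumption, for reproducibility)
c1, c2 = points[0], points[-1]

for _ in range(100):
    # Assignment step: each point joins the cluster of its nearest centroid
    cluster1 = [p for p in points if dist(p, c1) < dist(p, c2)]
    cluster2 = [p for p in points if dist(p, c1) >= dist(p, c2)]
    # Update step: move each centroid to the mean of its cluster
    if cluster1:
        c1 = (sum(w for w, _ in cluster1) / len(cluster1),
              sum(h for _, h in cluster1) / len(cluster1))
    if cluster2:
        c2 = (sum(w for w, _ in cluster2) / len(cluster2),
              sum(h for _, h in cluster2) / len(cluster2))

print(c1)  # mean of the big-dog cluster
print(c2)  # mean of the small-dog cluster
```

With this init the big dogs converge to roughly (4.275, 29.25) and the small dogs to roughly (3.0, 15.67), i.e. each centroid sits at the mean of its own cluster, which is exactly what a correct plot of your run should show.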