Stuck implementing k-means on big and small dogs - dodgy results

data-mining machine-learning python k-means unsupervised-learning
2022-03-13 23:30:59

My algorithm isn't working. The code seems to make sense, but I just don't trust the results. I feel the centroids should sit more in the middle of the data, sort of like centres, but no matter how many iterations I run the algorithm for, they never end up there. Could somebody run my code and give me some pointers? I've commented the code so you can follow it. A few key points:

  • k = 2; two centroids
  • centroid1 = dark red, so the data points in its cluster are light red
  • centroid2 = dark green, so the data points in its cluster are light green
  • Whether a dog is "big" or "small" depends on its height and weight
  • Euclidean distance is used to decide which cluster a point belongs to
  • The raw/master data is stored in the dict labelled "x"; points are moved into the dicts "cluster1" and "cluster2" according to their distance from the centroids

The code:

from matplotlib import pyplot as plt
import random, math

plt.title("Big Dogs and Small Dogs")
plt.ylabel("Height (cm)")
plt.xlabel("Weight (kg)")

def euclideandis(x, y, a, b):   # Euclidean distance between (x, y) and (a, b)
    return math.sqrt((x-a)**2 + (y-b)**2)

x = {4.4:31,
     3.2:19,
     4.6:32,        # Data- key = weight, value = height
     4:25,
     4.1:29,
     2.90:17,
     2.9:11}

centroid1 = [[random.uniform(min(x.keys()), max(x.keys())), random.uniform(max(x.values()), min(x.values()))]]   # create random centroids that are between the datapoints
centroid2 = [[random.uniform(min(x.keys()), max(x.keys())), random.uniform(max(x.values()), min(x.values()))]]

for j in xrange(1):   # keep updating colours, position of centroids
    cluster1 = {}       # Everything in cluster 1 is closest to the dark red spot, therefore gets included in this dict, and gets scattered in light red
    cluster2 = {}       # Everything in cluster 2 is closest to the dark green spot, therefore gets included in this dict, and gets scattered in light green
    for key in x:
        temp1 = 0       # works out euclidean distance for all data points in x, then compares them
        temp2 = 0
        temp1 = euclideandis(key, x[key], centroid1[0][0], centroid1[0][1]) # Dis betw data point, red centroid
        temp2 = euclideandis(key, x[key], centroid2[0][0], centroid2[0][1]) # Dis betw data point, green centroid
        if temp1 < temp2:
            cluster1[key] = x[key]      # if the euclidean distance between datapoint and red spot,
                                        # smaller than the euclidean distance between datapoint and green spot,
                                        # add the point to the red cluster, else add to green cluster
        else:
            cluster2[key] = x[key]

    centroid1 = [[0, 0]]        # Centroids reset as they will be changed  
    centroid2 = [[0, 0]]

    iterable = 0
    for key in cluster1:        # works out mean coordinates of each cluster and changes the centroids coordinates to this
        iterable = iterable + key
    centroid1[0][0] = iterable/len(cluster1)
    iterable = 0
    for key in cluster1:
        iterable = iterable + cluster1[key]
    centroid1[0][1] = iterable/len(cluster1)
    iterable = 0
    for key in cluster2:
        iterable = iterable + key
    centroid2[0][0] = iterable/len(cluster2)
    iterable = 0
    for key in cluster2:
        iterable = iterable + cluster2[key]
    centroid2[0][1] = iterable/len(cluster2)



plt.scatter(cluster1.keys(), cluster1.values(), color = "red")      # scatters everything
plt.scatter(cluster2.keys(), cluster2.values(), color = "lime")
plt.scatter(centroid1[0][0], centroid1[0][1], color = "maroon") 
plt.scatter(centroid2[0][0], centroid1[0][1], color = "green")
plt.show()

To sum up, there's no obvious error in my program; I just don't really trust the results.

Plot after 100k iterations; shouldn't the centroids be more in the middle of the data points?

(Above) Why aren't the centroids in the middle of the data points after 100k iterations? The red one is, but the green one isn't.

1 Answer

The centroids are probably correct; you have a display bug.

The line

plt.scatter(centroid2[0][0], centroid1[0][1], color = "green")

should be

plt.scatter(centroid2[0][0], centroid2[0][1], color = "green")

This is just one of those things that happens when you implement an algorithm from scratch to learn it... I bet you've spent hours staring at the top of the script :-)
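For what it's worth, here's a minimal Python 3 sketch of the same assign/update loop, with two assumptions for reproducibility: the data is kept as (weight, height) pairs rather than a dict (a dict keyed by weight silently merges the duplicate keys 2.90 and 2.9, since they are the same float), and the first and last points are used as deterministic starting centroids instead of random ones:

```python
import math

# The question's data as (weight, height) pairs. A dict keyed by
# weight would silently merge 2.90 and 2.9 (the same float key).
points = [(4.4, 31), (3.2, 19), (4.6, 32), (4.0, 25),
          (4.1, 29), (2.90, 17), (2.9, 11)]

def dist(p, q):  # Euclidean distance between two (weight, height) points
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Deterministic starting centroids (an assumption, for reproducibility)
c1, c2 = points[0], points[-1]

for _ in range(100):
    # Assignment step: each point joins the cluster of its nearest centroid
    cluster1 = [p for p in points if dist(p, c1) < dist(p, c2)]
    cluster2 = [p for p in points if dist(p, c1) >= dist(p, c2)]
    # Update step: move each centroid to the mean of its cluster
    if cluster1:
        c1 = (sum(w for w, _ in cluster1) / len(cluster1),
              sum(h for _, h in cluster1) / len(cluster1))
    if cluster2:
        c2 = (sum(w for w, _ in cluster2) / len(cluster2),
              sum(h for _, h in cluster2) / len(cluster2))

print(c1)  # mean of the big-dog cluster
print(c2)  # mean of the small-dog cluster
```

With this init the big dogs converge to roughly (4.275, 29.25) and the small dogs to roughly (3.0, 15.67), i.e. each centroid sits at the mean of its own cluster, which is exactly what a correct plot of your run should show.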