K-means:为什么最小化 WCSS 就是最大化集群之间的距离?
机器算法验证
聚类
k-均值
距离
平方和
欧几里得
2022-03-30 19:11:21
2个回答
K-means 是关于方差分析范式的。ANOVA - 单变量和多变量 - 是基于这样一个事实,即关于大质心的平方偏差总和由关于组质心的这种散布和关于大质心的散布组成:SStotal=SSwithin+SSbetween。因此,如果SSwithin被最小化,那么SSbetween被最大化。
已知某些点关于其质心(算术平均值)的偏差 SS 与点之间的总平方欧几里德距离直接相关:与质心的平方偏差之和等于两两平方欧几里德距离之和除以数量的点。(这是centroid的三角特性的直接扩展。并且这种关系也被用于距离矩阵的双中心化。)
因此,说“质心(作为点)的 SSbetween 被最大化”是说“质心之间的(加权)平方距离集被最大化”的别名。
Ni注意:在 SSbetween 每个质心都由该集群中的点数加权i。也就是说,每个质心都被计数Ni。例如,数据中有两个质心,1和2,SSbetween =N1*D1^2+N2*D2^2其中D1和D2是质心与总均值的偏差。前一段中“加权”一词的来源。
例子
Data (N=6: N1=3, N2=2, N3=1)
V1 V2 Group
2.06 7.73 1
.67 5.27 1
6.62 9.36 1
3.16 5.23 2
7.66 1.27 2
5.59 9.83 3
SSdeviations
V1 V2 Overall
SSt 37.82993333 + 51.24408333 = 89.07401666
SSw 29.50106667 + 16.31966667 = 45.82073333
SSb 8.328866667 + 34.92441667 = 43.25328333
SSt is directly related to the squared Euclidean distances between the data points:
Matrix of squared Euclidean distances
.00000000 7.98370000 23.45050000 7.46000000 73.09160000 16.87090000
7.98370000 .00000000 52.13060000 6.20170000 64.86010000 45.00000000
23.45050000 52.13060000 .00000000 29.02850000 66.52970000 1.28180000
7.46000000 6.20170000 29.02850000 .00000000 35.93160000 27.06490000
73.09160000 64.86010000 66.52970000 35.93160000 .00000000 77.55850000
16.87090000 45.00000000 1.28180000 27.06490000 77.55850000 .00000000
Its sum/2, the sum of the distances = 534.4441000
534.4441000 / N = 89.07401666 = SSt
The same reasoning holds for SSb.
Matrix of squared Euclidean distances between the 3 group centroids (see https://stats.stackexchange.com/q/148847/3277)
.00000000 22.92738889 11.76592222
22.92738889 .00000000 43.32880000
11.76592222 43.32880000 .00000000
3 centroids are 3 points, but SSb is based on N points (propagated centroids):
N1 points representing centroid1, N2 points representing centroid2 and N3 representing centroid3.
Therefore the sum of the distances must be weighted accordingly:
N1*N2*22.92738889 + N1*N3*11.76592222 + N2*N3*43.32880000 = 259.51969998
259.51969998 / N = 43.25328333 = SSb
Moral in words: maximizing SSb is equivalent to maximizing
the weighted sum of pairwise squared distances between the centroids.
(And maximizing SSb corresponds to minimizing SSw, since SSt is constant.)
您正在寻找的是König-Huygens 定理。