K-means:为什么最小化 WCSS 就是最大化集群之间的距离?

机器算法验证 聚类 k-均值 距离 平方和 欧几里得
2022-03-30 19:11:21

从概念和算法的角度来看,我了解 K-means 的工作原理。但是,从数学的角度来看,我不明白为什么最小化 WCSS(簇内平方和)必然会最大化簇之间的距离换句话说,有人能说明这个函数如何等于最大化集群之间的距离吗?查看显示所有步骤的推导会很有帮助,或者请向我指出适当的参考资料。

更新我找到了这个 Witten 和 Tibshirani的参考资料,但是如何从等式 7 到等式 8 并不明显。

2个回答

K-means 是关于方差分析范式的。ANOVA - 单变量和多变量 - 是基于这样一个事实,即关于大质心的平方偏差总和由关于组质心的这种散布和关于大质心的散布组成:SStotal=SSwithin+SSbetween因此,如果SSwithin被最小化,那么SSbetween被最大化。

已知某些点关于其质心(算术平均值)的偏差 SS 与点之间的总平方欧几里德距离直接相关:与质心的平方偏差之和等于两两平方欧几里德距离之和除以数量的点(这是centroid的三角特性的直接扩展。并且这种关系也被用于距离矩阵的双中心化。)

因此,说“质心(作为点)的 SSbetween 被最大化”是说“质心之间的(加权)平方距离集被最大化”的别名。

Ni注意:在 SSbetween 每个质心都由该集群中的点数加权i也就是说,每个质心都被计数Ni例如,数据中有两个质心,12,SSbetween =N1*D1^2+N2*D2^2其中D1D2是质心与总均值的偏差。前一段中“加权”一词的来源。

例子

Data (N=6: N1=3, N2=2, N3=1)
        V1            V2       Group
       2.06          7.73          1
        .67          5.27          1
       6.62          9.36          1
       3.16          5.23          2
       7.66          1.27          2
       5.59          9.83          3

SSdeviations
        V1            V2            Overall
SSt   37.82993333 + 51.24408333 = 89.07401666
SSw   29.50106667 + 16.31966667 = 45.82073333
SSb   8.328866667 + 34.92441667 = 43.25328333

SSt is directly related to the squared Euclidean distances between the data points:

Matrix of squared Euclidean distances
     .00000000    7.98370000   23.45050000    7.46000000   73.09160000   16.87090000 
    7.98370000     .00000000   52.13060000    6.20170000   64.86010000   45.00000000 
   23.45050000   52.13060000     .00000000   29.02850000   66.52970000    1.28180000 
    7.46000000    6.20170000   29.02850000     .00000000   35.93160000   27.06490000 
   73.09160000   64.86010000   66.52970000   35.93160000     .00000000   77.55850000 
   16.87090000   45.00000000    1.28180000   27.06490000   77.55850000     .00000000 

Its sum/2, the sum of the distances = 534.4441000
534.4441000 / N = 89.07401666 = SSt

The same reasoning holds for SSb.

Matrix of squared Euclidean distances between the 3 group centroids (see https://stats.stackexchange.com/q/148847/3277)
     .00000000   22.92738889   11.76592222 
   22.92738889     .00000000   43.32880000 
   11.76592222   43.32880000     .00000000 

3 centroids are 3 points, but SSb is based on N points (propagated centroids):
N1 points representing centroid1, N2 points representing centroid2 and N3 representing centroid3.
Therefore the sum of the distances must be weighted accordingly:
N1*N2*22.92738889 + N1*N3*11.76592222 + N2*N3*43.32880000 = 259.51969998
259.51969998 / N = 43.25328333 = SSb

Moral in words: maximizing SSb is equivalent to maximizing
the weighted sum of pairwise squared distances between the centroids.
(And maximizing SSb corresponds to minimizing SSw, since SSt is constant.)

您正在寻找的是König-Huygens 定理