How do you set sigma for the Gaussian similarity kernel?

data-mining similarity graphs variance

2021-10-15 15:02:26

Suppose we have $n$ two-dimensional vectors:

$\mathbf{x}_1,\dots,\mathbf{x}_i,\dots,\mathbf{x}_n=(x_{1_1},x_{1_2})^T,\dots,(x_{i_1},x_{i_2})^T,\dots,(x_{n_1},x_{n_2})^T$

How do you set $\sigma$ for the Gaussian similarity kernel:

$s(\mathbf{x}_i,\mathbf{x}_j)=\exp\left(-\frac{\|\mathbf{x}_i-\mathbf{x}_j\|^2}{2\sigma^2}\right)$

2 Answers

Updated answer

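According to the reference paper on Spectral Clustering (von Luxburg), $\sigma$ is simply set to 1. Further tuning can be done by visual inspection, but I did not find any discussion about how to set this parameter.

You can see the effect with the code snippet below: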

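import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

def gausker(x1, x2, sigma):
    # Gaussian similarity between two points
    dist = np.linalg.norm(x1 - x2)
    return np.exp(-dist**2 / (2. * sigma**2))

# Two well-separated groups of four points each
data = np.array([[0,0],[1,1],[1,0],[0,1],[10,10],[10,9],[9,10],[9,9]])
plt.figure()
plt.plot(data[:,0], data[:,1], 'o', ms=20)
plt.show()

# Standard deviation of all pairwise distances, tried as one candidate sigma
s = np.std(pdist(data))
n = len(data)
for sigma in [1, s, 10, 100]:
    # Fill the n x n similarity matrix for this sigma
    gaus = np.zeros((n, n))
    for ii in range(n):
        for jj in range(n):
            gaus[ii, jj] = gausker(data[ii,:], data[jj,:], sigma)
    # Show the similarity matrix as an image, titled with the sigma used
    plt.figure()
    plt.imshow(gaus, extent=[0, 1, 0, 1])
    plt.colorbar()
    plt.title(str(sigma))
    plt.show()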

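For a machine learning algorithm, more contrast between the similarity values is preferable. The Gaussian similarity kernel captures local similarity. The images show that the kernel with $\sigma=1$ is conceptually similar to a k-nearest-neighbor graph, because it takes the local neighborhood into account and almost ignores the relationship between two nodes that are far apart.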

Figures: similarity matrices for $\sigma=1$, $\sigma=10$, and $\sigma=100$.
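To put numbers on the contrast argument without relying on the plots, here is a minimal sketch that builds the same similarity matrix with NumPy broadcasting and prints the weakest within-group similarity next to the strongest between-group similarity for each $\sigma$. The function name `gaussian_similarity` and the `labels` array (which simply marks the first four and last four points as two groups) are my own assumptions for illustration, not part of the original snippet.

```python
import numpy as np

def gaussian_similarity(data, sigma):
    """Return the n x n matrix s_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    # Pairwise differences via broadcasting: shape (n, n, d)
    diff = data[:, None, :] - data[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

data = np.array([[0, 0], [1, 1], [1, 0], [0, 1],
                 [10, 10], [10, 9], [9, 10], [9, 9]], dtype=float)

# Assumed grouping, used only to summarise the matrix
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(data), dtype=bool)

for sigma in [1, 10, 100]:
    S = gaussian_similarity(data, sigma)
    within = S[same & off_diag].min()   # weakest link inside a group
    between = S[~same].max()            # strongest link across groups
    print(f"sigma={sigma}: min within={within:.3g}, max between={between:.3g}")
```

On this toy data the gap between the two numbers shrinks as $\sigma$ grows, which is the loss of contrast described above.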

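If you are doing spectral clustering, you may find the following paper interesting:

https://papers.nips.cc/paper/2619-self-tuning-spectral-clustering.pdf

The authors use a sigma that adapts locally to the neighborhood of each sample. In my experience, this is useful when dealing with multi-scale data.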
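As a rough sketch of the local scaling idea, as I understand it from the paper: each point gets its own $\sigma_i$, taken as the distance to its k-th nearest neighbor, and the affinity becomes $\exp(-d_{ij}^2/(\sigma_i\sigma_j))$. The function name `local_scaling_affinity`, the choice `k=2`, and the toy data (one tight group and one loose group) are assumptions made for this illustration. The sketch also assumes there are no duplicate points, since those could make a local scale zero.

```python
import numpy as np

def local_scaling_affinity(data, k=2):
    """Affinity exp(-d_ij^2 / (sigma_i * sigma_j)) with per-point scales."""
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    # In each sorted row, column 0 is the point itself (distance 0),
    # so column k is the distance to the k-th nearest neighbour.
    local_sigma = np.sort(dist, axis=1)[:, k]
    return np.exp(-dist ** 2 / np.outer(local_sigma, local_sigma))

# Toy multi-scale data: a tight group (spacing 0.1) and a loose group (spacing 5)
tight = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
loose = np.array([[20.0, 20.0], [25.0, 20.0], [20.0, 25.0], [25.0, 25.0]])
data = np.vstack([tight, loose])

A = local_scaling_affinity(data, k=2)
print(np.round(A, 3))
```

With a single global $\sigma$, a value small enough to resolve the tight group would leave the loose group almost disconnected, while the per-point scale gives neighbors in both groups comparable affinities.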
