How do I apply Gaussian radial basis function (RBF) kernel PCA to nonlinear data?

Tags: machine learning, algorithm validation, principal component analysis, Python, scikit-learn, kernel trick

2022-04-04 10:01:06

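My task is to implement Gaussian radial basis function kernel principal component analysis (RBF kernel PCA), and I am running into some difficulties. It would be great if someone could point me in the right direction, because I am clearly doing something wrong here.

If I understand it correctly, the RBF kernel is defined as: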

$K(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{exp}\left(- \gamma \|\mathbf{x}_i - \mathbf{x}_j\|^{2}_{2} \right)=\mathrm{exp}\left(- \frac{\|\mathbf{x}_i - \mathbf{x}_j\|^{2}_{2}}{2\sigma^2} \right),$

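where $\|\mathbf{x}_i - \mathbf{x}_j\|^{2}_{2} = \sum_k (x_{ik} - x_{jk})^2$ is the squared Euclidean distance between the two data points $\mathbf{x}_i$ and $\mathbf{x}_j$, and $\gamma = \frac{1}{2\sigma^2}$ is a free parameter. One can choose $\sigma^2$ as the variance of the Euclidean distances between all pairs of data points.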
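The kernel definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original question: the helper name `rbf_kernel_matrix` and the small random array `X_demo` are made up for the example, and `sklearn.metrics.pairwise.rbf_kernel` is used only as an independent reference.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import rbf_kernel


def rbf_kernel_matrix(X, gamma):
    """Hypothetical helper: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    # Pairwise squared Euclidean distances as a symmetric n x n matrix.
    sq_dists = squareform(pdist(X, "sqeuclidean"))
    # Note the minus sign in the exponent.
    return np.exp(-gamma * sq_dists)


# Small made-up dataset, used only to exercise the helper.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 2))

K_demo = rbf_kernel_matrix(X_demo, gamma=15)
K_ref = rbf_kernel(X_demo, gamma=15)  # scikit-learn's RBF kernel for comparison
```

Every diagonal entry of `K_demo` is 1, because each point has zero distance to itself, and all off-diagonal entries lie between 0 and 1.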

To compare my approach with the scikit-learn implementation, I created a simple nonlinear dataset:

Example dataset

import matplotlib.pyplot as plt

from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, random_state=123)

plt.figure(figsize=(8,6))

plt.scatter(X[y==0, 0], X[y==0, 1], color='red')
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue')

plt.title('A nonlinear 2D dataset')
plt.ylabel('y coordinate')
plt.xlabel('x coordinate')
plt.show()


scikit-learn RBF kernel PCA

When I use the scikit-learn implementation to reduce the data to a single component axis, the classes separate very well.

import numpy as np
from sklearn.decomposition import KernelPCA

scikit_kpca = KernelPCA(n_components=1, kernel='rbf', gamma=15)
X_skernpca = scikit_kpca.fit_transform(X)

plt.figure(figsize=(8,6))
plt.scatter(X_skernpca[y==0, 0], np.zeros((50,1)), color='red', alpha=0.5)
plt.scatter(X_skernpca[y==1, 0], np.zeros((50,1)), color='blue', alpha=0.5)

plt.title('First component after RBF Kernel PCA')
plt.show()


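My approach

Somehow I cannot reproduce these results. As I understand it, I have to compute all pairwise distances in order to compute the kernel matrix, then center the kernel matrix and extract the eigenvector that corresponds to the largest eigenvalue. This is what I have done so far: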

import numpy as np
from numpy import exp  # scipy.exp is deprecated in newer SciPy releases
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import KernelCenterer


# pdist calculates the squared Euclidean distances for every pair of points
# in the 100x2 dimensional dataset.
sq_dists = pdist(X, 'sqeuclidean')

# Variance of the squared Euclidean distances between all pairs of data points.
variance = np.var(sq_dists)

# squareform converts the pairwise distances into a symmetric 100x100 matrix
mat_sq_dists = squareform(sq_dists)

# set the gamma parameter equal to the one I used in scikit-learn KernelPCA
gamma = 15

# Compute the 100x100 kernel matrix
K = exp(gamma * mat_sq_dists)

# Center the kernel matrix
kern_cent = KernelCenterer()
K = kern_cent.fit_transform(K)

# Get the eigenvector with largest eigenvalue
eigvals, eigvecs = np.linalg.eig(K)
eigvals, eigvecs = zip(*sorted(zip(eigvals, eigvecs), reverse=True))
X_pc1 = eigvecs[0]


Edit

Many thanks to @Kirill! He spotted my mistakes, and the problem is now solved. Here is the corrected version for future reference:

import numpy as np
from numpy import exp  # scipy.exp is deprecated in newer SciPy releases
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import KernelCenterer


# pdist calculates the squared Euclidean distances for every pair of points
# in the 100x2 dimensional dataset.
sq_dists = pdist(X, 'sqeuclidean')

# Variance of the squared Euclidean distances between all pairs of data points.
variance = np.var(sq_dists)

# squareform converts the pairwise distances into a symmetric 100x100 matrix
mat_sq_dists = squareform(sq_dists)

# set the gamma parameter equal to the one I used in scikit-learn KernelPCA
gamma = 15

# Compute the 100x100 kernel matrix (note the negative exponent)
K = exp(-gamma * mat_sq_dists)

# Center the kernel matrix
kern_cent = KernelCenterer()
K = kern_cent.fit_transform(K)

# Get eigenvalues in ascending order with corresponding
# eigenvectors (stored as columns) from the symmetric matrix
eigvals, eigvecs = eigh(K)

# Get the eigenvector that corresponds to the highest eigenvalue
X_pc1 = eigvecs[:,-1]
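For readers wondering what the centering step does, here is a rough sketch of the usual kernel-centering formula, K' = K - 1_N K - K 1_N + 1_N K 1_N, where 1_N is the N x N matrix with every entry equal to 1/N. The function name `center_kernel` and the random test matrix are illustrative assumptions; `KernelCenterer` is used only to cross-check the result.

```python
import numpy as np
from sklearn.preprocessing import KernelCenterer


def center_kernel(K):
    """Hypothetical helper: center a square kernel matrix in feature space."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)  # N x N matrix with all entries 1/N
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n


# Made-up symmetric positive semi-definite matrix standing in for a kernel.
rng = np.random.default_rng(1)
B = rng.normal(size=(6, 3))
K_demo = B @ B.T

K_manual = center_kernel(K_demo)
K_sklearn = KernelCenterer().fit_transform(K_demo)
```

After centering, every row and every column of the matrix sums to (numerically) zero, which corresponds to the mapped data having zero mean in feature space.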


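1 Answer

The first problem seems to be that the sign of gamma is wrong: it should be negative ($-15$), as in the definition of the kernel, rather than positive as in your code. Alternatively, keep gamma = 15 and use exp(-gamma * mat_sq_dists).

The second problem is that your call to zip, made when you sort the lists, breaks up the eigenvectors. The $i$-th eigenvector is eigvecs[:,i], not eigvecs[i,:], as documented for scipy.linalg.eigh (also: you should prefer eigh to eig, since you have a real symmetric matrix).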
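To illustrate the point about columns, here is a small sketch on a made-up symmetric matrix (the matrix `A` and all variable names are assumptions for the example). It checks that the last column returned by `eigh` satisfies the eigenvector equation, and shows one way to sort the output of `np.linalg.eig` without splitting the eigenvectors apart: sort the eigenvalue indices with `argsort` and reorder the columns.

```python
import numpy as np
from scipy.linalg import eigh

# Made-up real symmetric matrix.
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T

# eigh returns eigenvalues in ascending order; eigenvectors are the COLUMNS.
eigvals, eigvecs = eigh(A)
top_val = eigvals[-1]
top_vec = eigvecs[:, -1]

# Residual of the eigenvector equation A v = lambda v for the last column.
residual_column = np.linalg.norm(A @ top_vec - top_val * top_vec)

# np.linalg.eig gives no ordering guarantee. Iterating over V (as zip does)
# would yield rows, so sort the indices and reorder the columns instead.
w, V = np.linalg.eig(A)
order = np.argsort(w)[::-1]  # indices from largest to smallest eigenvalue
w_sorted = w[order]
V_sorted = V[:, order]
```

Here `V_sorted[:, 0]` is the eigenvector for the largest eigenvalue and agrees with `top_vec` up to sign.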

Replace

< gamma = 15
> gamma = -15

and (to get ordered, real eigenvalues)

< eigvals, eigvecs = np.linalg.eig(K)
> eigvals, eigvecs = scipy.linalg.eigh(K)

and

< eigvals, eigvecs = zip(*sorted(zip(eigvals, eigvecs), reverse=True))
< X_pc1 = eigvecs[0]
> X_pc1 = eigvecs[:,99]

Finally, you can check your own implementation against scikit-learn's.
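One possible way to run such a check is sketched below, using the dataset and gamma value from the question. It relies on the assumption that scikit-learn's `KernelPCA` returns the training projection as the unit-norm eigenvector scaled by the square root of its eigenvalue, so the manual eigenvector is scaled the same way before comparing. The sign of an eigenvector is arbitrary, so the two results are aligned by sign first. Variable names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import KernelCenterer

X, y = make_moons(n_samples=100, random_state=123)
gamma = 15

# Manual RBF kernel PCA: kernel matrix, centering, eigendecomposition.
K = np.exp(-gamma * squareform(pdist(X, "sqeuclidean")))
K = KernelCenterer().fit_transform(K)
eigvals, eigvecs = eigh(K)  # ascending eigenvalues, eigenvectors as columns

# Scale the top eigenvector by sqrt(eigenvalue) to mimic the projection
# (assumption about how KernelPCA scales its output, see the text above).
manual_pc1 = eigvecs[:, -1] * np.sqrt(eigvals[-1])

# scikit-learn reference.
sk_pc1 = KernelPCA(n_components=1, kernel="rbf", gamma=gamma).fit_transform(X)[:, 0]

# Align the arbitrary sign, then measure the largest absolute difference.
sign = np.sign(manual_pc1 @ sk_pc1)
max_abs_diff = np.max(np.abs(sign * manual_pc1 - sk_pc1))
```

If the two implementations agree, `max_abs_diff` should be close to floating-point precision. Without the square-root scaling, the manual eigenvector and the scikit-learn output would differ by a constant factor but would still order the samples identically.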
