Why do the eigenvectors produced by np.linalg.eig differ from the PCA components stored in an instance of the PCA object?

data-mining machine-learning scikit-learn pca
2022-02-24 14:23:33

I am trying to understand why eVec (produced by np.linalg.eig) differs from pca.components_.T of an instance of the PCA class. As I understand it, the eigenvectors of the covariance matrix, sorted in descending order of their eigenvalues, are the principal components.

A simple explanation would be much appreciated.

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(
'https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv'
)

df = df.drop(['model', 'vs', 'am'], axis = 1)

df = df.apply(lambda x: pd.to_numeric(x))

M = df.to_numpy()

Mnorm = M-np.mean(M, axis=0)

Mnorm = Mnorm/np.std(M, axis=0)
#  This is the normalized source data.
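#  Note: np.std defaults to ddof=0 (population std), which is also what StandardScaler uses.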

C = (Mnorm.T @ Mnorm) / (Mnorm.shape[0] - 1)
#  This is the Covariance Matrix without bias.
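#  Equivalently, np.cov(Mnorm, rowvar=False) produces the same matrix.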

eVal1, eVec1 = np.linalg.eig(C)

eVal = eVal1[np.flip(np.argsort(eVal1))]
#  eVal holds the eigenvalues sorted in descending order.

eVec = eVec1[np.flip(np.argsort(eVal1))]
#  The same sort order as above is applied to the eigenvectors.

### From sklearn:

scaler = StandardScaler()

scaler = scaler.fit(df.to_numpy())

Anorm = scaler.transform(df.to_numpy())   

pca = PCA(n_components=9)

pca_transform = pca.fit_transform(Anorm)

assert (Mnorm == Anorm).all().all()
#  This tests that Mnorm was probably constructed correctly.

assert (C.round(10) == pca.get_covariance().round(10)).all().all()
#  This indicates that the Covariance Matrix (C) was constructed correctly - the rounding is arbitrary.

assert (eVec.round(5) == pca.components_.T.round(5)).all().all()
#  However, eVec and pca.components_.T are not equal.
1 Answer

Two issues:

  1. Your sort is incorrect:

eVec = eVec1[np.flip(np.argsort(eVal1))]

sorts the rows of the matrix, but you want to sort the columns. Replacing it with

eVec = eVec1[:, np.flip(np.argsort(eVal1))]

fixes this.

  2. The signs of the eigenvectors are sometimes flipped. (That's fine: an eigenvector is only defined up to scale, and while both np.linalg and sklearn return unit-length eigenvectors, that still leaves two possible choices per vector. I'm not sure how each package ultimately picks one.) See the sketch below for a comparison that accounts for this.
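
A minimal sketch combining both fixes (reusing eVec1, eVal1, and pca from the question; np is NumPy):

order = np.flip(np.argsort(eVal1))

eVec = eVec1[:, order]
#  Columns are now ordered by descending eigenvalue.

signs = np.sign(np.sum(eVec * pca.components_.T, axis=0))
#  Column-wise dot products reveal which columns point opposite to sklearn's; signs flips those columns.

assert np.allclose(eVec * signs, pca.components_.T)
#  After the column sort and sign alignment, the eigenvectors match the principal components up to numerical precision.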