Cross Validated - What is the difference between PCA and an autoencoder? - 吾爱随笔录

Cross Validated, machine learning, principal component analysis, neural networks, autoencoders

2022-01-24 23:24:07

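Both PCA and autoencoders can perform dimensionality reduction, so what is the difference between them? In which situations should I use one rather than the other?

4 Answers

PCA is restricted to a linear map, while autoencoders can have nonlinear encoders/decoders.

A single-layer autoencoder with a linear transfer function is nearly equivalent to PCA, where "nearly" means that the $W$ found by the AE and the $W$ found by PCA will not necessarily be the same, but the subspaces spanned by the respective $W$'s will be.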
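The "same subspace, different $W$" statement can be checked numerically. The following is a minimal sketch, not the answer author's code: the toy data, the helper names `make_data` and `fit_linear_autoencoder`, and the use of alternating least squares (instead of the gradient-based training normally used for autoencoders) are all assumptions made here to keep the example short and deterministic. The data are centered, so the bias terms are omitted.

```python
import numpy as np

def make_data(n_samples=500, seed=0):
    """Centered toy data in 5 dimensions; columns are samples (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    scales = np.array([5.0, 3.0, 1.0, 0.5, 0.2])       # well-separated spreads
    Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))         # random orthonormal axes
    X = Q @ (scales[:, None] * rng.normal(size=(5, n_samples)))
    return X - X.mean(axis=1, keepdims=True)

def fit_linear_autoencoder(X, m, n_iter=200, seed=1):
    """Minimize ||X - W2 @ W1 @ X||_F^2 by alternating least squares.

    This is a stand-in for gradient training: each step solves exactly
    for one weight matrix while the other is held fixed.
    """
    rng = np.random.default_rng(seed)
    W2 = rng.normal(size=(X.shape[0], m))                # decoder, n x m
    for _ in range(n_iter):
        W1 = np.linalg.pinv(W2)                          # best encoder for this decoder
        H = W1 @ X                                       # hidden codes, m x N
        W2 = np.linalg.solve(H @ H.T, H @ X.T).T         # best decoder for these codes
    return W1, W2

X = make_data()
m = 2

# PCA: leading left singular vectors of the centered data matrix.
U, _, _ = np.linalg.svd(X, full_matrices=False)
U_m = U[:, :m]
P_pca = U_m @ U_m.T                                      # projector onto the PCA subspace

# Linear autoencoder: orthonormalize the decoder columns, then build the projector.
W1, W2 = fit_linear_autoencoder(X, m)
Q_ae, _ = np.linalg.qr(W2)
P_ae = Q_ae @ Q_ae.T

print("same subspace:     ", np.allclose(P_ae, P_pca, atol=1e-8))
print("W2 is orthonormal: ", np.allclose(W2.T @ W2, np.eye(m), atol=1e-6))
```

Comparing projectors rather than the weight matrices themselves is the design choice that matters here: the projector depends only on the spanned subspace, so it ignores the arbitrary invertible mixing that distinguishes the autoencoder's $W$ from the PCA basis.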

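As bayerj pointed out, PCA is a method that assumes a linear system, whereas autoencoders (AE) do not. If no nonlinear function is used in the AE and the number of neurons in the hidden layer is smaller than the dimension of the input, then PCA and the AE can yield the same result. Otherwise the AE may find a different subspace.

One thing to note is that the hidden layer in an AE can have a larger dimension than the input. In that case the AE may not be doing dimensionality reduction at all. Instead, we regard it as a transformation from one feature space to another, in which the data in the new feature space disentangle the factors of variation.

Regarding your question, in your reply to bayerj, about whether multiple layers mean a very complex nonlinearity: depending on what you mean by "very complex nonlinearity", this could be true. However, what depth really offers is better generalization. Many methods require a number of samples equal to the number of regions. It turns out, however, that "a very large number of regions, e.g., $O(2^N)$, can be defined with $O(N)$ examples", according to Bengio et al. This is a result of the representational complexity that arises from composing lower-level features from the lower layers of the network.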

The currently accepted answer by @bayerj states that the weights of a linear autoencoder span the same subspace as the principal components found by PCA, but that they are not the same vectors. In particular, they are not an orthogonal basis. This is true; however, we can easily recover the principal component loading vectors from the autoencoder weights. A little bit of notation: let $\{\mathbf{x}_i \in \mathbb{R}^n \}_{i=1}^N$ be a set of $N$ $n$-dimensional vectors for which we wish to compute the PCA, and let $X$ be the matrix whose columns are $\mathbf{x}_1,\dots,\mathbf{x}_N$. Then, let's define a linear autoencoder as the one-hidden-layer neural network defined by the following equations:

$$\begin{aligned} \mathbf{h}_1 & = \mathbf{W}_1\mathbf{x} + \mathbf{b}_1 \\ \hat{\mathbf{x}} & = \mathbf{W}_2\mathbf{h}_1 + \mathbf{b}_2 \end{aligned}$$

where $\hat{\mathbf{x}}$ is the output of the (linear) autoencoder, denoted with a hat in order to stress the fact that the output of an autoencoder is a "reconstruction" of the input. Note that, as is most common with autoencoders, the hidden layer has fewer units than the input layer, i.e., $W_1\in \mathbb{R}^{m \times n}$ and $W_2\in \mathbb{R}^{n \times m}$ with $m < n$ (these are the shapes the equations above require for $\mathbf{x} \in \mathbb{R}^n$).

Now, after training the linear autoencoder, compute the first $m$ singular vectors of $W_2$. It is possible to prove that these singular vectors are actually the first $m$ principal components of $X$, and the proof is in Plaut, E., From Principal Subspaces to Principal Components with Linear Autoencoders, arXiv:1804.10253.
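The SVD step can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: instead of training a network, the decoder is constructed by hand as $W_2 = U_m C$ with an arbitrary invertible $C$ (the form an optimal linear decoder takes up to that mixing), taking $W_2$ to be $n \times m$ as the equation $\hat{\mathbf{x}} = \mathbf{W}_2\mathbf{h}_1 + \mathbf{b}_2$ requires. The data, sizes, and the matrix `C` are made up for the example. The sketch checks only that the left singular vectors of $W_2$ form an orthonormal basis of the principal subspace; whether each individual vector coincides with the corresponding principal direction depends on the conditions of the cited proof, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, m = 6, 300, 2                                   # illustrative sizes (assumed)

# Centered toy data; columns are samples.
scales = np.array([4.0, 3.0, 1.0, 0.8, 0.5, 0.2])
X = scales[:, None] * rng.normal(size=(n, N))
X = X - X.mean(axis=1, keepdims=True)

# PCA loading vectors: leading left singular vectors of X.
U, _, _ = np.linalg.svd(X, full_matrices=False)
U_m = U[:, :m]

# ASSUMPTION: stand-in for a trained decoder. Its columns span the principal
# subspace but are mixed by an invertible matrix C, so they are not orthonormal.
C = np.array([[2.0, 1.0],
              [0.5, 1.5]])
W2 = U_m @ C

# SVD of the small n x m decoder matrix instead of the n x N data matrix.
U_w, s_w, _ = np.linalg.svd(W2, full_matrices=False)

columns_orthonormal_before = np.allclose(W2.T @ W2, np.eye(m))
columns_orthonormal_after = np.allclose(U_w.T @ U_w, np.eye(m))
same_subspace = np.allclose(U_w @ U_w.T, U_m @ U_m.T, atol=1e-10)

print("decoder columns orthonormal:       ", columns_orthonormal_before)
print("singular vectors orthonormal:      ", columns_orthonormal_after)
print("singular vectors span PCA subspace:", same_subspace)
```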

Since SVD is actually the algorithm commonly used to compute PCA, it could seem pointless to first train a linear autoencoder and then apply SVD to $W_2$ in order to recover the first $m$ loading vectors, rather than directly applying SVD to $X$. The point is that $X$ is an $n \times N$ matrix, while $W_2$ only has dimensions $n$ and $m$ ($n \times m$ under the equations above). Now, the time complexity of the SVD of $W_2$ is $O(m^2n)$, while for $X$ it is $O(n^2N)$, with $m < n$, so some saving could be attained (even if not as big as claimed by the author of the paper I link). Of course, there are other, more useful approaches to computing the PCA of Big Data (randomized online PCA comes to mind), but the main point of this equivalence between linear autoencoders and PCA is not to find a practical way to compute PCA for huge data sets: it is more about giving us an intuition on the connections between autoencoders and other statistical approaches to dimension reduction.
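To get a feel for the two operation counts, take some illustrative sizes (assumed here, not taken from the answer or the paper): $n = 10^3$ input dimensions, $N = 10^6$ samples, and $m = 10$ hidden units. Then

$$m^2 n = 10^2 \cdot 10^3 = 10^5, \qquad n^2 N = 10^6 \cdot 10^6 = 10^{12}.$$

These counts cover only the SVD step. The cost of training the autoencoder, which has to pass over all $N$ samples, is not included in the first figure, which is one reason the overall saving is smaller than the ratio suggests.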

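The general answer is that auto-associative neural networks can perform nonlinear dimensionality reduction. Training the network is typically not as fast as PCA, so the trade-off is computational resources versus expressive power.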
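A small example of what "nonlinear dimensionality reduction" buys: for points on a circle, no one-dimensional linear code can reconstruct the data, while a one-dimensional nonlinear code (the angle) can. In this sketch the nonlinear encoder and decoder are written by hand (`arctan2` and `cos`/`sin`); they are an assumption standing in for a nonlinear map, not a trained network, and the data set is made up for the illustration.

```python
import numpy as np

# 400 evenly spaced points on the unit circle; columns are samples.
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
X = np.stack([np.cos(theta), np.sin(theta)])          # shape (2, 400)
mean = X.mean(axis=1, keepdims=True)
X_centered = X - mean

# Linear 1-D code: project onto the first principal direction (PCA).
U, _, _ = np.linalg.svd(X_centered, full_matrices=False)
u1 = U[:, :1]                                          # shape (2, 1)
X_pca = u1 @ (u1.T @ X_centered) + mean
pca_mse = np.mean(np.sum((X - X_pca) ** 2, axis=0))

# Nonlinear 1-D code: encode each point as its angle, decode with cos/sin.
code = np.arctan2(X[1], X[0])                          # hand-written "encoder"
X_nl = np.stack([np.cos(code), np.sin(code)])          # hand-written "decoder"
nl_mse = np.mean(np.sum((X - X_nl) ** 2, axis=0))

print(f"1-D PCA reconstruction error:       {pca_mse:.6f}")
print(f"1-D nonlinear reconstruction error: {nl_mse:.2e}")
```

The PCA error stays at half of the total variance because the circle has equal variance in every direction, so any single line captures only half of it.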

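However, there is a confusion in the details, which is a common misconception. It is true that auto-associative networks with linear activation functions agree with PCA, regardless of the number of hidden layers. However, if there is only one hidden layer (input-hidden-output), the optimal auto-associative network still agrees with PCA, even with nonlinear activation functions. For the original proof, see the 1988 paper by Bourlard and Kamp. Chris Bishop's book has a nice summary of the situation in chapter 12.4.2:

It might be thought that the limitations of a linear dimensionality reduction could be overcome by using nonlinear (sigmoidal) activation functions for the hidden units in the network in Figure 12.18. However, even with nonlinear hidden units, the minimum error solution is again given by the projection onto the principal component subspace (Bourlard and Kamp, 1988). There is therefore no advantage in using two-layer neural networks to perform dimensionality reduction.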
