机器算法验证 - 加权主成分分析 - 吾爱随笔录

加权主成分分析

机器算法验证主成分分析测量误差加权数据

2022-02-11 01:18:05

经过一番搜索，我发现很少将观察权重/测量误差纳入主成分分析。我所发现的往往依赖于迭代方法来包括权重（例如，here）。我的问题是为什么这种方法是必要的？为什么我们不能使用加权协方差矩阵的特征向量？

2个回答

这取决于您的权重到底适用于什么。

行权重

让 $\mathbf{X}$ 是在列中具有变量的数据矩阵，并且 $n$ 观察 $\mathbf x_i$ 在行中。如果每个观测值都有一个相关的权重 $w_i$ ，那么将这些权重合并到 PCA 中确实很简单。

首先，需要计算加权平均值 $\boldsymbol \mu = \frac{1}{\sum w_i}\sum w_i \mathbf x_i$ 并从数据中减去它以使其居中。

然后我们计算加权协方差矩阵 $\frac{1}{\sum w_i}\mathbf X^\top \mathbf W \mathbf X$ ，在哪里 $\mathbf W = \operatorname{diag}(w_i)$ 是权重的对角矩阵，并应用标准 PCA 对其进行分析。

细胞权重

您发现Tamuz 等人在 2013 年发表的论文考虑了不同权重时的更复杂情况 $w_{ij}$ 应用于数据矩阵的每个元素。那么确实没有解析解，必须使用迭代方法。请注意，正如作者所承认的那样，他们重新发明了轮子，因为之前肯定已经考虑过这样的一般权重，例如，在Gabriel 和 Zamir，1979 年，Lower Rank Approximation of Matrices by Least Squares With Any Choice of Weights中。这也在这里讨论过。

作为补充说明：如果权重 $w_{ij}$ 随变量和观测值而变化，但是是对称的，因此 $w_{ij}=w_{ji}$ ，则解析解再次成为可能，请参阅Koren 和 Carmel，2004 年，稳健的线性降维。

非常感谢变形虫对行权重的见解。我知道这不是 stackoverflow，但是我很难找到行加权 PCA 的解释和解释，因为这是谷歌搜索加权 PCA 时的第一个结果，我认为附上我的解决方案会很好，也许它可以帮助处于相同情况的其他人。在这个 Python2 代码片段中，使用上述 RBF 内核加权的 PCA 用于计算 2D 数据集的切线。我会很高兴听到一些反馈！

def weighted_pca_regression(x_vec, y_vec, weights):
    """
    Given three real-valued vectors of same length, corresponding to the coordinates
    and weight of a 2-dimensional dataset, this function outputs the angle in radians
    of the line that aligns with the (weighted) average and main linear component of
    the data. For that, first a weighted mean and covariance matrix are computed.
    Then u,e,v=svd(cov) is performed, and u * f(x)=0 is solved.
    """
    input_mat = np.stack([x_vec, y_vec])
    weights_sum = weights.sum()
    # Subtract (weighted) mean and compute (weighted) covariance matrix:
    mean_x, mean_y =  weights.dot(x_vec)/weights_sum, weights.dot(y_vec)/weights_sum
    centered_x, centered_y = x_vec-mean_x, y_vec-mean_y
    matrix_centered = np.stack([centered_x, centered_y])
    weighted_cov = matrix_centered.dot(np.diag(weights).dot(matrix_centered.T)) / weights_sum
    # We know that v rotates the data's main component onto the y=0 axis, and
    # that u rotates it back. Solving u.dot([x,0])=[x*u[0,0], x*u[1,0]] gives
    # f(x)=(u[1,0]/u[0,0])x as the reconstructed function.
    u,e,v = np.linalg.svd(weighted_cov)
    return np.arctan2(u[1,0], u[0,0]) # arctan more stable than dividing


# USAGE EXAMPLE:
# Define the kernel and make an ellipse to perform regression on:
rbf = lambda vec, stddev: np.exp(-0.5*np.power(vec/stddev, 2))
x_span = np.linspace(0, 2*np.pi, 31)+0.1
data_x = np.cos(x_span)[:-1]*20-1000
data_y = np.sin(x_span)[:-1]*10+5000
data_xy = np.stack([data_x, data_y])
stddev = 1 # a stddev of 1 in this context is highly local
for center in data_xy.T:
    # weight the  points based on their euclidean distance to the current center
    euclidean_distances = np.linalg.norm(data_xy.T-center, axis=1)
    weights = rbf(euclidean_distances, stddev)
    # get the angle for the regression in radians
    p_grad = weighted_pca_regression(data_x, data_y, weights)
    # plot for illustration purposes
    line_x = np.linspace(-5,5,10)
    line_y = np.tan(p_grad)*line_x
    plt.plot(line_x+center[0], line_y+center[1], c="r")
    plt.scatter(*data_xy)
    plt.show()

和一个示例输出（它对每个点都做同样的事情）：

干杯，
安德烈斯

其它你可能感兴趣的问题

上一篇什么是回归中的零模型，它与零假设有何关系？下一篇当数据有偏差时应该使用平均值吗？