机器算法验证 - 为 PCA 图构建人工略微重叠的数据 - 吾爱随笔录

为 PCA 图构建人工略微重叠的数据

机器算法验证 r 主成分分析数据集

2022-03-25 18:03:31

我正在尝试构建在 PCA 图中显示两个不同组的人工数据。但是，这两组仍应略有重叠。以下方法最接近，但我仍然不满意。即使更改参数rnorm并runif没有成功。一组总是比另一组大得多。

d1 <- {}
d2 <- {}
for(i in 1:20) {    
  d1 <- cbind(d1, rnorm(20, mean=100, sd=10))   
  d2 <- cbind(d2, rnorm(20, mean=110, sd=5))    
  # d1 <- rbind(d1, runif(100, 1,4))    
  # d2 <- rbind(d2, runif(100, 1,3))
}

d <- as.data.frame(rbind(d1,d2))
res <- prcomp(d, center = TRUE, scale = TRUE, na.action = na.omit)
d$group <- as.factor(c(rep("g1", 20), rep("g2", 20))) 
ggplot(d, aes(V1,V2)) +                 
  geom_point(aes(color = group), size=4)

1个回答

要模拟 PCA，请从所需结果开始并向后工作：在正交方向上添加一些随机误差并随机旋转。

下面的例子规定了一个带有两个“斑点”的二维结果（即两个主成分）；它很容易扩展到更多的 blob。（更多的维度需要更多的修改工作sigma以适应更高维度的协方差矩阵。）

让我们从一个随机旋转矩阵开始。

set.seed(17)
p <- 5     # dimensions
rot <- qr.Q(qr(matrix(rnorm(p^2), p)))

我们从不同的多元正态分布中分别生成 blob。参数（它们的均值和协方差矩阵）隐藏在mvrnorm参数中。为了使指定这种分布的形状变得容易和可靠，我们创建了一个小函数sigma来将主轴的角度和两个方差转换为协方差矩阵。

sigma <- function(theta=0, lambda=c(1,1)) {
  cos.t <- cos(theta); sin.t <- sin(theta)
  a <- matrix(c(cos.t, sin.t, -sin.t, cos.t), ncol=2)
  t(a) %*% diag(lambda) %*% a
}
library(MASS)
n1 <- 50   # First group population
n2 <- 75   # Second group population
x <- rbind(mvrnorm(n1, c(-2,-1), sigma(0, c(1/2,1))),
           mvrnorm(n2, c(0,1), sigma(pi/3, c(1, 1/3))))

邻接正交误差并旋转：

eps <- 0.25  # Error SD should be small compared to the SDs for the blobs
x <- cbind(x, matrix(rnorm(dim(x)[1]*(p-2), sd=eps), ncol=p-2))
y <- x %*% rot

那是模拟数据集。要检查它，请应用 PCA：

fit <- prcomp(y)          # PCA
summary(fit)              # Brief summary showing two principal components
par(mfrow=c(2,2))         # Prepare to plot
plot(fit$x[, 1:2], asp=1) # Display the first two components $
plot(x[, 1:2], asp=1);    # Display the original data for comparison
points(x[1:n1,1:2], col="Blue") #...distinguish one of the blobs
screeplot(fit)            # The usual screeplot, supporting the summary
zapsmall(rot %*% fit$rotation, digits=2) # Compare the fitted and simulated rotations.

最后的检查生成一个矩阵，其块形式（上部 2 x 2 块和下部 3 x 3 块，其他地方接近零）证实了 PCA 估计的准确性：

       PC1   PC2   PC3   PC4   PC5
[1,]  0.58  0.81  0.05  0.00 -0.01
[2,]  0.81 -0.58  0.00  0.00  0.00
[3,] -0.01 -0.01  0.23 -0.62 -0.75
[4,]  0.03  0.03 -0.96 -0.28 -0.06
[5,]  0.00  0.00  0.18 -0.73  0.66

（上面的 2 x 2 块，由前两个特征值适当缩放，大致描述了“拟合”图中的点与“原始组件”图中的点之间的关系。这个看起来像旋转和反射。）

其它你可能感兴趣的问题

上一篇理解 Behrens-Fisher 问题下一篇在测试多重相关性时，使用引导程序是否会改变您处理 I 类错误问题的方式？