为什么当空值为真时,我观察到混合分布的两样本检验的非均匀分布(负偏态)p 值?

机器算法验证 r p 值 模拟 两个样本
2022-04-16 08:24:24

我有兴趣生成高斯混合分布作为一系列两样本测试模拟的零分布。当原假设为真时,p 值遵循均匀分布是一个公认的事实,并且Stack Exchange 上已经提供了几个很好的解释。我在两个高斯分布之间进行测试时观察到这一点,但是一旦我开始在两个多峰高斯混合分布之间进行测试,p 值的分布就会偏离均匀性并表现出负偏度。为什么我要观察这种行为?

例子

下面我演示了使用的两个样本Anderson-Darling 检验的预期均匀分布的 p 值。kSamples R(请原谅代码中效率低下的部分。)

library(kSamples)

numsims <- 1000
repNums <- 1:numsims
pvalDF <- data.frame(ii = rep(0, length(repNums)),
                     pvals = rep(0, length(repNums)),
                     seed = rep(0, length(repNums)))

for (ii in 1:numsims) {
  set.seed(123 + ii)
 
  sample1 <- rnorm(50, mean = 0.5, sd = 0.05)
  sample2 <- rnorm(50, mean = 0.5, sd = 0.05)
 
  n1 = length(sample1)
  n2 = length(sample2)
  samples <- scale(c(sample1, sample2))
  sample1 <- samples[1:n1]
  sample2 <- samples[n1+1:n2]
 
  ad_out <- ad.test(sample1, sample2)
 
  pvalDF$ii[ii] <- ii
  pvalDF$pvals[ii] <- ad_out$ad["version 1:", 3]
  pvalDF$seed[ii] <- 321 + ii
}

hist(pvalDF$pvals, xlim = c(0, 1))

近似均匀分布的 p 值的直方图

但是,一旦将零分布构造为两个高斯分布的混合,p 值的分布就会偏离均匀性:

numsims <- 1000
repNums <- 1:numsims
pvalDF <- data.frame(ii = rep(0, length(repNums)),
                     pvals = rep(0, length(repNums)),
                     seed = rep(0, length(repNums)))

for (ii in 1:numsims) {
  set.seed(123 + ii)
 
  sample1.1 <- rnorm(25, mean = 0.5, sd = 0.05)  # Here are the new two samples each
  sample1.2 <- rnorm(25, mean = 0.75, sd = 0.05) # generated as a mixture from two
  sample1 <- c(sample1.1, sample1.2)             # identical Gaussian distributions.
  sample2.1 <- rnorm(25, mean = 0.5, sd = 0.05)  #
  sample2.2 <- rnorm(25, mean = 0.75, sd = 0.05) #
  sample2 <- c(sample2.1, sample2.2)             #
 
  n1 = length(sample1)
  n2 = length(sample2)
  samples <- scale(c(sample1, sample2))
  sample1 <- samples[1:n1]
  sample2 <- samples[n1+1:n2]
 
  ad_out <- ad.test(sample1, sample2)
 
  pvalDF$ii[ii] <- ii
  pvalDF$pvals[ii] <- ad_out$ad["version 1:", 3]
  pvalDF$seed[ii] <- 321 + ii
}

hist(pvalDF$pvals, xlim = c(0, 1))

非均匀分布 p 值的直方图

这对我来说似乎违反直觉。两个零假设(两个样本的基本分布相等)都是正确的,但是对于后一个双峰样本,p 值不再均匀分布。

虽然有些多余,但通过向两个样本添加额外的高斯分布,这种效果被进一步夸大了:

numsims <- 1000
repNums <- 1:numsims
pvalDF <- data.frame(ii = rep(0, length(repNums)),
                     pvals = rep(0, length(repNums)),
                     seed = rep(0, length(repNums)))

for (ii in 1:numsims) {
  set.seed(123 + ii)
 
  sample1.1 <- rnorm(10, mean = 0.5, sd = 0.05)
  sample1.2 <- rnorm(10, mean = 0.75, sd = 0.05)
  sample1.3 <- rnorm(10, mean = 1, sd = 0.05)
  sample1.4 <- rnorm(10, mean = 1.25, sd = 0.05)
  sample1.5 <- rnorm(10, mean = 1.5, sd = 0.05)
  sample1 <- c(sample1.1, sample1.2, sample1.3, sample1.4, sample1.5)
  sample2.1 <- rnorm(10, mean = 0.5, sd = 0.05)
  sample2.2 <- rnorm(10, mean = 0.75, sd = 0.05)
  sample2.3 <- rnorm(10, mean = 1, sd = 0.05)
  sample2.4 <- rnorm(10, mean = 1.25, sd = 0.05)
  sample2.5 <- rnorm(10, mean = 1.5, sd = 0.05)
  sample2 <- c(sample2.1, sample2.2, sample2.3, sample2.4, sample2.5)
 
  n1 = length(sample1)
  n2 = length(sample2)
  samples <- scale(c(sample1, sample2))
  sample1 <- samples[1:n1]
  sample2 <- samples[n1+1:n2]
 
  ad_out <- ad.test(sample1, sample2)
 
  pvalDF$ii[ii] <- ii
  pvalDF$pvals[ii] <- ad_out$ad["version 1:", 3]
  pvalDF$seed[ii] <- 321 + ii
}

hist(pvalDF$pvals, xlim = c(0, 1))

p 值的负偏斜直方图

是什么导致了这种影响?我不相信这与这篇文章有关,因为这些 p 值可以采用更多的可能值。我还使用为双变量样本设计的不同测试重现了这种效果。

1个回答

您不会从高斯混合中生成两个独立观察样本,因为您正在固定从每个组件中获取的数字,而不是使其随机。

如果观测值进行采样,不是当您将每个分量的观测值固定在时,每个样本中的观测值不是独立的,因此这两个样本比您对独立观测值的预期更相似,并且 p 值大于XN(.5,0.052)N(0.075,0.052)n(n,0.5)n/2n/2U[0,1]

我修改了您的代码以随机抽样这两个组件

numsims <- 1000
repNums <- 1:numsims
pvalDF <- data.frame(ii = rep(0, length(repNums)),
                     pvals = rep(0, length(repNums)),
                     seed = rep(0, length(repNums)))

for (ii in 1:numsims) {
  set.seed(123 + ii)
 
  m1<-rbinom(1,50,.5)
  sample1.1 <- rnorm(m1, mean = 0.5, sd = 0.05)  # Here are the new two samples each
  sample1.2 <- rnorm(50-m1, mean = 0.75, sd = 0.05) # generated as a mixture from two
  sample1 <- c(sample1.1, sample1.2)             # identical Gaussian distributions.
  
  m2<-rbinom(1,50,.5)
  sample2.1 <- rnorm(m2, mean = 0.5, sd = 0.05)  #
  sample2.2 <- rnorm(50-m2, mean = 0.75, sd = 0.05) #
  sample2 <- c(sample2.1, sample2.2)             #
 
  n1 = length(sample1)
  n2 = length(sample2)
  samples <- scale(c(sample1, sample2))
  sample1 <- samples[1:n1]
  sample2 <- samples[n1+1:n2]
 
  ad_out <- ad.test(sample1, sample2)
 
  pvalDF$ii[ii] <- ii
  pvalDF$pvals[ii] <- ad_out$ad["version 1:", 3]
  pvalDF$seed[ii] <- 321 + ii
}

hist(pvalDF$pvals, xlim = c(0, 1))

p值分布又无聊了

在此处输入图像描述