I am interested in using Gaussian mixture distributions as the null distribution in a series of two-sample test simulations. It is a well-established fact that p-values follow a uniform distribution when the null hypothesis is true, and several good explanations of this already exist on Stack Exchange. I observe this when testing between two Gaussian samples, but as soon as I start testing between two multimodal Gaussian mixtures, the distribution of p-values departs from uniformity and shows negative skew. Why am I seeing this behaviour?
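For reference, the textbook argument behind that fact, assuming the test statistic has a continuous null distribution: if $T$ has null CDF $F$ and the p-value is $p = 1 - F(T)$, then under the null

$$P(p \le u) = P\bigl(F(T) \ge 1 - u\bigr) = 1 - (1 - u) = u, \qquad 0 \le u \le 1,$$

so $p$ is uniform on $[0, 1]$.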
Example
Below I demonstrate the expected uniform distribution of p-values for the two-sample Anderson-Darling test from the kSamples package.
(Please forgive any inefficiencies in the code.)
library(kSamples)
numsims <- 1000
repNums <- 1:numsims
pvalDF <- data.frame(ii = rep(0, length(repNums)),
                     pvals = rep(0, length(repNums)),
                     seed = rep(0, length(repNums)))
for (ii in 1:numsims) {
  set.seed(123 + ii)
  sample1 <- rnorm(50, mean = 0.5, sd = 0.05)
  sample2 <- rnorm(50, mean = 0.5, sd = 0.05)
  n1 <- length(sample1)
  n2 <- length(sample2)
  samples <- scale(c(sample1, sample2))    # centre and rescale the pooled sample
  sample1 <- samples[1:n1]
  sample2 <- samples[(n1 + 1):(n1 + n2)]
  ad_out <- ad.test(sample1, sample2)
  pvalDF$ii[ii] <- ii
  pvalDF$pvals[ii] <- ad_out$ad["version 1:", 3]   # asymptotic p-value
  pvalDF$seed[ii] <- 123 + ii                      # record the seed actually used
}
hist(pvalDF$pvals, xlim = c(0, 1))
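As a sanity check, the uniformity can also be assessed more formally than by eyeballing the histogram. A minimal sketch, assuming the pvalDF produced by the loop above is still in the workspace (base R only):

# One-sample Kolmogorov-Smirnov test of the simulated p-values against U(0, 1);
# a large p-value here is consistent with uniformity (ties, if any, only produce a warning).
ks.test(pvalDF$pvals, "punif")

# The empirical CDF should hug the 45-degree reference line if the p-values are uniform.
plot(ecdf(pvalDF$pvals), main = "ECDF of simulated p-values")
abline(0, 1, lty = 2)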
However, as soon as the null distribution is constructed as a mixture of two Gaussians, the distribution of p-values departs from uniformity:
numsims <- 1000
repNums <- 1:numsims
pvalDF <- data.frame(ii = rep(0, length(repNums)),
                     pvals = rep(0, length(repNums)),
                     seed = rep(0, length(repNums)))
for (ii in 1:numsims) {
  set.seed(123 + ii)
  sample1.1 <- rnorm(25, mean = 0.5, sd = 0.05)   # Each sample is now drawn from the
  sample1.2 <- rnorm(25, mean = 0.75, sd = 0.05)  # same two-component Gaussian mixture,
  sample1 <- c(sample1.1, sample1.2)              # so the null hypothesis of equal
  sample2.1 <- rnorm(25, mean = 0.5, sd = 0.05)   # distributions still holds.
  sample2.2 <- rnorm(25, mean = 0.75, sd = 0.05)
  sample2 <- c(sample2.1, sample2.2)
  n1 <- length(sample1)
  n2 <- length(sample2)
  samples <- scale(c(sample1, sample2))
  sample1 <- samples[1:n1]
  sample2 <- samples[(n1 + 1):(n1 + n2)]
  ad_out <- ad.test(sample1, sample2)
  pvalDF$ii[ii] <- ii
  pvalDF$pvals[ii] <- ad_out$ad["version 1:", 3]
  pvalDF$seed[ii] <- 123 + ii
}
hist(pvalDF$pvals, xlim = c(0, 1))
This seems counter-intuitive to me. In both cases the null hypothesis (that the underlying distributions of the two samples are equal) is true, yet for the latter, bimodal samples the p-values are no longer uniformly distributed.
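To make the departure concrete, the skew can be quantified rather than judged from the histogram alone. A minimal sketch, assuming pvalDF now holds the p-values from the mixture simulation above (base R only, with a hand-rolled moment-based skewness so no extra package is needed):

# Under uniformity the mean should be near 0.5 and the skewness near 0;
# a negative skewness indicates mass piled up near 1.
summary(pvalDF$pvals)
z <- pvalDF$pvals - mean(pvalDF$pvals)
mean(z^3) / mean(z^2)^1.5   # moment-based skewness estimate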
Although somewhat redundant, the effect becomes even more exaggerated when additional Gaussian components are added to both samples:
numsims <- 1000
repNums <- 1:numsims
pvalDF <- data.frame(ii = rep(0, length(repNums)),
                     pvals = rep(0, length(repNums)),
                     seed = rep(0, length(repNums)))
for (ii in 1:numsims) {
  set.seed(123 + ii)
  sample1.1 <- rnorm(10, mean = 0.5, sd = 0.05)   # Five equally weighted components,
  sample1.2 <- rnorm(10, mean = 0.75, sd = 0.05)  # identical for both samples.
  sample1.3 <- rnorm(10, mean = 1, sd = 0.05)
  sample1.4 <- rnorm(10, mean = 1.25, sd = 0.05)
  sample1.5 <- rnorm(10, mean = 1.5, sd = 0.05)
  sample1 <- c(sample1.1, sample1.2, sample1.3, sample1.4, sample1.5)
  sample2.1 <- rnorm(10, mean = 0.5, sd = 0.05)
  sample2.2 <- rnorm(10, mean = 0.75, sd = 0.05)
  sample2.3 <- rnorm(10, mean = 1, sd = 0.05)
  sample2.4 <- rnorm(10, mean = 1.25, sd = 0.05)
  sample2.5 <- rnorm(10, mean = 1.5, sd = 0.05)
  sample2 <- c(sample2.1, sample2.2, sample2.3, sample2.4, sample2.5)
  n1 <- length(sample1)
  n2 <- length(sample2)
  samples <- scale(c(sample1, sample2))
  sample1 <- samples[1:n1]
  sample2 <- samples[(n1 + 1):(n1 + n2)]
  ad_out <- ad.test(sample1, sample2)
  pvalDF$ii[ii] <- ii
  pvalDF$pvals[ii] <- ad_out$ad["version 1:", 3]
  pvalDF$seed[ii] <- 123 + ii
}
hist(pvalDF$pvals, xlim = c(0, 1))
What is causing this effect? I don't believe it is related to that post, since these p-values can take many more possible values. I have also reproduced the effect using other tests designed for two-sample comparisons.
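To illustrate what I mean by swapping in a different test, here is a stripped-down version of the two-component mixture simulation with base R's two-sample Kolmogorov-Smirnov test in place of ad.test; this is purely an illustrative substitute, not necessarily the alternative test I actually used, and the scaling step is omitted for brevity:

# Stripped-down rerun of the mixture simulation with ks.test (illustrative only).
numsims <- 1000
kspvals <- numeric(numsims)
for (ii in 1:numsims) {
  set.seed(123 + ii)
  sample1 <- c(rnorm(25, mean = 0.5, sd = 0.05), rnorm(25, mean = 0.75, sd = 0.05))
  sample2 <- c(rnorm(25, mean = 0.5, sd = 0.05), rnorm(25, mean = 0.75, sd = 0.05))
  kspvals[ii] <- ks.test(sample1, sample2)$p.value
}
hist(kspvals, xlim = c(0, 1))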