If I run the R code below, it generates two independent vectors and then tests them to see whether they are related in some way (i.e., p-value < 0.05).
If I repeat this 1000 times, then 50 of them (5%) will be false positives with a p-value < 0.05. This is a Type I error.
If we increase sampleSize to 1,000, or even 100,000, the result is the same (5% false positives).
I have trouble understanding this, because I expected that if we collected enough samples, the chance of a false positive would drop to 0 (just as the estimated correlation itself converges to 0).
So I suppose my question is: "How can the number of false positives, based on p-values generated by comparing two independent data sets, be independent of the sample size?"
# R code to demonstrate that with a large dataset, we can still
# get significant p-values, purely by chance.
# Change this to 1000, and we still get the same number of false positives (50, or 5%).
sampleSize <- 20
cat("Sample size:", sampleSize, "\n")
set.seed(1010093)
n <- 1000
pValues <- rep(NA, n)
for (i in 1:n) {
  y <- rnorm(sampleSize)
  x <- rnorm(sampleSize)
  pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
}
# Count the tests that came out "significant" purely by chance
fp <- sum(pValues < 0.05)
cat("Out of ", n, " tests, ", fp, " had a p-value < 0.05, purely by chance.\n", sep = "")
----output----
Running "true positives - none.R" ...
Sample size: 1000
Out of 1000 tests, 52 had a p-value < 0.05, purely by chance.
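For reference, here is a sketch of the same experiment in Python (not from the original R script; the helper name `null_p_values` and the use of `scipy.stats.pearsonr` are my own choices). Besides counting false positives at two sample sizes, it also checks the fact behind the behaviour: when the null hypothesis is true, the p-value is itself Uniform(0, 1), so P(p < 0.05) = 0.05 no matter how large the sample is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def null_p_values(sample_size, n_tests=2000):
    """p-values from testing correlation between two independent samples."""
    out = []
    for _ in range(n_tests):
        x = rng.normal(size=sample_size)
        y = rng.normal(size=sample_size)
        _, pval = stats.pearsonr(x, y)  # null hypothesis is true here
        out.append(pval)
    return np.array(out)

for n in (20, 1000):
    p = null_p_values(n)
    # False positive rate stays near 5%, and the p-values look Uniform(0, 1)
    print(f"sample size {n}: FP rate = {np.mean(p < 0.05):.3f}, "
          f"KS test vs Uniform(0,1): p = {stats.kstest(p, 'uniform').pvalue:.2f}")
```

A larger sample only tightens the estimate of the (zero) correlation; it does not change the distribution of the p-value under the null, which is exactly what the significance level alpha = 0.05 fixes.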