Empirically testing the "p-test" (many fair coin flips)

Cross Validated: hypothesis testing, p-value, simulation

2022-04-05 08:20:55

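In a previous question I asked how to test whether a coin is fair. Now I want to check empirically whether that test works.

One answer was that programs such as R and Python have built-in "binomial p-tests" that can be called to do this.

Here is an example of Python code that runs such a p-test for a single case of 1000 flips of a fair coin: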

import numpy as np
from scipy import stats


def flipAndDoPTest(numberOfFlips, weight):
    # Draw numberOfFlips Bernoulli outcomes (n=1 trial each): 1 = heads, 0 = tails
    flippedCoinSimulation = np.random.binomial(1, weight, numberOfFlips)
    numberOfHeads = np.sum(flippedCoinSimulation == 1)
    # Two-sided exact binomial test of H0: P(heads) = weight.
    # stats.binomtest (SciPy >= 1.7) replaces stats.binom_test,
    # which newer SciPy releases no longer provide.
    pvalue = stats.binomtest(int(numberOfHeads), numberOfFlips, weight).pvalue
    return pvalue

numberOfFlips = 1000
weight = .50
ptestvalue = flipAndDoPTest(numberOfFlips, weight)
print("the ptest has a value of:", ptestvalue)
if ptestvalue < .05:
    print("The null hypothesis can be rejected at the 5% level of significance because the returned p-value is less than 5%.")
else:
    # also covers the boundary case ptestvalue == .05, which the original two ifs skipped
    print("The null hypothesis cannot be rejected at the 5% level of significance because the returned p-value is not less than 5%.")

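Now I want to test empirically what this "5% level of significance" means. There seems to be a lot of disagreement about how to interpret p-values, so I just want to simulate what happens in my case.

First, I want to test whether a fair coin has a 5% probability of producing a p-value below 0.05. To do this, I repeat the p-test 1000 times (each p-test is for an event of 10000 coin flips) and count how many times the p-value is below 0.05. The code is here: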

numberOfFlips = 10000
weight = .50
numberOfTests = 1000
StatisticalSignificanceProbability = .05
pTestList = np.zeros(numberOfTests) #initialization
for i in range(numberOfTests):
    #for each i in the loop, do a p-test of 10,000 fair coin flips and add it to a list
    ithPTest = flipAndDoPTest(numberOfFlips, weight)
    pTestList[i] = ithPTest
#take this list and count all of the times there are cases below .05
numberOfSheerCoincidences = sum(pTestList<StatisticalSignificanceProbability)
expectedNumberOfSheerCoincidences = numberOfTests*StatisticalSignificanceProbability

print("numberOfSheerCoincidences: ", numberOfSheerCoincidences)
print("expectedNumberOfSheerCoincidences: ", expectedNumberOfSheerCoincidences)

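I expect 5% of my 1000 p-tests to fall below 0.05 (so 0.05*1000 = 50). But every time I run this, I get a number noticeably smaller than 50. This count is itself random, so I then wrote code that repeats the whole procedure and collects the counts into a histogram: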

import matplotlib.pyplot as plt  # needed for plt.hist below

numberOfFlips = 100
weight = .50
numberOfDataPoints = 1000
pTestResultsDataPoints = np.zeros(numberOfDataPoints) #initialization
for j in range(numberOfDataPoints):
    #repeating this collection of p-test to get a range of different values
    numberOfTests = 1000
    StatisticalSignificanceProbability = .05
    pTestList = np.zeros(numberOfTests) #initialization
    for i in range(numberOfTests):
        ithPTest = flipAndDoPTest(numberOfFlips, weight)
        pTestList[i] = ithPTest
    numberOfSheerCoincidences = sum(pTestList<StatisticalSignificanceProbability)
    expectedNumberOfSheerCoincidences = numberOfTests*StatisticalSignificanceProbability
    pTestResultsDataPoints[j] = numberOfSheerCoincidences

n, bins, patches = plt.hist(pTestResultsDataPoints, 50)
plt.show()

With this I get a distribution centered at 35 rather than 50.

Is this result expected? I was expecting a roughly normal distribution around 50.

2 Answers

In general, there is no binomial test with significance level exactly $\alpha = 0.05,$ because of the discreteness of the binomial distribution.

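For a test at exact level $\alpha = 0.05$ based on a continuous test statistic, the distribution of the P-value when $H_0$ is true is standard uniform, and the probability that the P-value falls below $0.05$ is exactly $0.05.$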
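As a rough illustration of the continuous case, the sketch below simulates a two-sided z-test on normal data with known standard deviation. The setup (30 observations per test, standard normal data, the seed) is an assumption chosen only for illustration and does not come from the thread. Under $H_0$ the fraction of P-values below 0.05 should land close to 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2021)      # arbitrary seed, for reproducibility only
n_tests, n_obs = 100_000, 30           # illustrative sizes, not from the thread

# Each row is one experiment drawn under H0: mean = 0, known sd = 1
samples = rng.standard_normal((n_tests, n_obs))

# z statistic for each experiment (continuous test statistic)
z = samples.mean(axis=1) * np.sqrt(n_obs)

# Two-sided P-values; under H0 these are standard uniform
p_values = 2 * stats.norm.sf(np.abs(z))

frac_below = np.mean(p_values < 0.05)
print("fraction of P-values below 0.05:", frac_below)
```

Because the z statistic can take any real value, every rejection probability is attainable, which is exactly what fails for a binomial count.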

If $n = 100,$ testing $H_0: p = 0.5$ against $H_a: p \ne 0.5,$ the attainable level closest to 5% (without exceeding 5%) is $0.0352 = 3.52\%.$

2*(1 - pbinom(60, 100, .5))
[1] 0.0352002
2*(1 - pbinom(59, 100, .5))
[1] 0.05688793

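[Using a normal approximation at the nominal 5% level does not help, because z-values close to $\pm 1.96$ are not attainable. For $\mathsf{Binom}(100,.5)$ the normal approximation is very accurate, so it makes no difference whether one runs the "exact" binomial test or the approximate normal test.]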
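To see why $\pm 1.96$ cannot be reached, here is a small sketch that lists the z-values available near the cutoff for $n = 100$ and $p = 0.5$. The continuity correction of 0.5 is included as an assumption about how the approximate test is run (it is the default behaviour of R's prop.test); the range of head counts shown is arbitrary.

```python
import numpy as np

n, p0 = 100, 0.5
mean = n * p0                          # 50 expected heads
sd = np.sqrt(n * p0 * (1 - p0))        # 5

heads = np.arange(58, 63)              # head counts around the rejection boundary

# z without and with a continuity correction of 0.5
z_plain = (heads - mean) / sd
z_corrected = (np.abs(heads - mean) - 0.5) / sd

for k, zp, zc in zip(heads, z_plain, z_corrected):
    print(f"k={k}: z={zp:.1f}, with continuity correction z={zc:.1f}")
```

Consecutive head counts move z in steps of 0.2, so the statistic jumps over 1.96 instead of landing on it, and the actual rejection probability is fixed by whichever count is the first to cross the cutoff.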

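Below I simulate 100,000 tests in R, each with $n=100$ observations, then summarize the P-values and plot their histograms. The simulated probabilities should be accurate to about two places. Both the binomial test "binom.test" and the approximate normal test "prop.test" give P-values that behave as anticipated.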

set.seed(2021);  n = 100
pv.b = replicate(10^5, 
        binom.test(rbinom(1,n,.5),n,.5)$p.val)
mean(pv.b < 0.05)
[1] 0.03605        # aprx 0.0352
2*sd(pv.b < 0.05)/sqrt(10^5)
[1] 0.001178995    # aprx 95% margin of sim error

set.seed(2021);  n = 100
pv.n = replicate(10^5, 
        prop.test(rbinom(1,n,.5),n,.5)$p.val)
mean(pv.n < 0.05)
[1] 0.03605
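For readers following the question's Python code, a comparable simulation can be sketched as below. It assumes SciPy 1.7 or later for `stats.binomtest`, and the seed and variable names are illustrative. Since a test with $n = 100$ has only 101 possible outcomes, the sketch computes one P-value per possible head count and looks the simulated counts up in that table rather than calling the test 100,000 times.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2021)      # arbitrary seed; results differ from the R run
n, p0, n_tests, alpha = 100, 0.5, 100_000, 0.05

# One exact two-sided P-value for every possible number of heads (0..n)
p_table = np.array([stats.binomtest(k, n, p0).pvalue for k in range(n + 1)])

# Simulate the number of heads in each experiment, then look up its P-value
head_counts = rng.binomial(n, p0, size=n_tests)
p_values = p_table[head_counts]

rejection_rate = np.mean(p_values < alpha)
margin = 2 * np.std(p_values < alpha) / np.sqrt(n_tests)   # approx. 95% simulation error
print("rejection rate:", rejection_rate, "+/-", margin)
```

The rejection rate should come out near 0.035 rather than 0.05, in line with the R results above.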

The figures below show the simulated distributions of P-values under $H_0.$

R code for the figures:

par(mfrow=c(1,2))
hist(pv.b, prob=T, xlim=c(-.01,1.01), col="skyblue2")
 abline(v = .05, col="red")
 curve(dunif(x), add=T, n=10001, lwd=2)
hist(pv.n, prob=T, xlim=c(-.01,1.01), col="skyblue2")
 abline(v = .05, col="red")
 curve(dunif(x), add=T, n=10001, lwd=2)
par(mfrow=c(1,1))

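Notes: (1) At the resolution of the plots, the two histograms look the same, but the distributions differ slightly. The binomial and approximate normal P-values are exactly equal in only 8070 of the 100,000 tests and differ slightly in the rest (mean absolute difference about 0.00018). But the two tests never disagree about rejection at the 5% level.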

sum(pv.b == pv.n)
[1] 8070
mean(abs(pv.b - pv.n))
[1] 0.0001792384
sum((pv.b <= .05) == (pv.n <= .05))
[1] 100000

(2) Without using a randomized test, the attainable $\alpha$ closest to 5% without exceeding it for $n=1000$ is $0.046 = 4.6\%:$

2*(1-pbinom(531,1000,.5))
[1] 0.0462912
2*(1-pbinom(530,1000,.5))
[1] 0.05367785
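The two attainable levels quoted in this answer can be reproduced, and extended to other sample sizes, with a short exact calculation. The sketch below is one possible way to do it: the helper name `attainable_size` is hypothetical, and it assumes the two-sided P-value is twice the smaller tail probability, which matches the `2*(1 - pbinom(...))` expressions above when $p = 0.5$ but is not necessarily what every library does for other null values.

```python
import numpy as np
from scipy import stats

def attainable_size(n, p0=0.5, alpha=0.05):
    """Exact probability that a two-sided binomial test rejects under H0.

    Hypothetical helper: assumes P-value = 2 * min(lower tail, upper tail),
    capped at 1, which is appropriate for the symmetric case p0 = 0.5.
    """
    k = np.arange(n + 1)
    pmf = stats.binom.pmf(k, n, p0)
    lower_tail = stats.binom.cdf(k, n, p0)        # P(X <= k)
    upper_tail = stats.binom.sf(k - 1, n, p0)     # P(X >= k)
    p_values = np.minimum(1.0, 2 * np.minimum(lower_tail, upper_tail))
    # Sum the H0 probability of every outcome that would be rejected
    return pmf[p_values < alpha].sum()

for n in (100, 1000, 10000):
    size = attainable_size(n)
    print(f"n={n}: attainable level {size:.4f}, "
          f"expected rejections in 1000 tests {1000 * size:.1f}")
```

For $n = 100$ this gives about 35 expected rejections per 1000 tests, which agrees with the histogram centered near 35 described in the question.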

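A test statistic with a discrete distribution can (almost) never attain a significance level of exactly 5% (or any other prescribed value). For that reason, many so-called "exact" tests use the largest attainable significance level that is below $\alpha$ (5% in your case). That is what you are seeing.

Here is a reference that explains this in detail: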

https://www.jstor.org/stable/2684683?origin=crossref
