
Sample size needed to estimate the probability of "success" in a Bernoulli trial

Tags: logistic, mathematical-statistics, sample-size, statistical-power, bernoulli-distribution

2022-03-04 08:52:46

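Suppose a game offers an event that, upon completion, either gives a reward or gives nothing. The exact mechanism for deciding whether the reward is given is unknown, but I assume a random number generator is used, and if the result is greater than some hard-coded value, you get the reward.

If I want to reverse-engineer the value the programmers used to determine how often the reward is given (estimated at 15-30%), how do I calculate the number of samples I need?

I started with the "Estimator of true probability" section here: Checking_whether_a_coin_is_fair, but I'm not sure I'm on the right path. I was getting results of about 1000 samples for a maximum error of 3% at 95% confidence.

Ultimately, here is what I'm trying to solve:

Event #1 gives reward 1.0R, X% of the time
Event #2 gives reward 1.4R, Y% of the time

I'd like to estimate X and Y accurately enough to determine which event is more efficient. Large sample sizes are a problem, since I can get at most 1 sample every 20 minutes.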

3 Answers

Assuming your individual trials are independent, you observe a binomial variate $X \sim \text{Bin}(n,p)$, where you decide on $n$ and want to estimate $p$. The maximum likelihood estimator of $p$, the sample fraction $\hat{p}=X/n$, has variance $\frac{p\cdot(1-p)}{n}\le \frac1{4n}$, with equality at $p=\frac12$. So the standard error is $\le 1/\sqrt{4 n} = \frac1{2\sqrt{n}}$. A large-sample approximate confidence interval has a half-width of about 2 standard errors, so to keep that at a maximum of $0.03$, say, you have to solve $\frac2{2\sqrt{n}} \le 0.03$, which gives $n \ge 1112$. You can solve for other half-width requirements in the same way. If you know (or are willing to assume) that $p$ is bounded away from 0.5, you can make do with somewhat fewer observations.
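As a rough sketch of the calculation above, the bound can be wrapped in a small R function. The function name `required_n` and its argument `p_worst` (the plausible value of $p$ closest to 0.5) are illustrative choices, not part of the original answer; `z = 2` mirrors the "about 2 standard errors" rule used there.

```r
# Smallest n such that z standard errors fit inside the target half-width.
# p_worst: the value of p in the plausible range that is closest to 0.5
#          (0.5 itself is the worst case when nothing is assumed about p).
required_n <- function(half_width, p_worst = 0.5, z = 2) {
  ceiling((z / half_width)^2 * p_worst * (1 - p_worst))
}

required_n(0.03)                  # worst case p = 0.5: 1112
required_n(0.03, p_worst = 0.30)  # assuming p is at most 0.30: 934
```

With the question's guess that the reward rate lies between 15% and 30%, the second call shows how much the requirement drops once $p$ is assumed to stay at or below 0.30.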

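I know it is less elegant, but I had to simulate it. Not only did I build a very simple simulation, it is also inelegant and slow to run. It is good enough, though. One advantage is that, as long as some of the basics are right, it will tell me when the elegant approach breaks down.

The sample size will vary as a function of the hard-coded value.

So here is the code: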

    # Main code
    # Goal: the interval for the estimated prevalence should extend
    # no more than 3 percentage points either side of the estimate.
    # Expected prevalence: about 15% to 30%.
    # Initial guess for the sample size: ~1000.

    my_prev <- seq(from = 0.15, to = 0.30, by = 0.002)
    samp_sizes <- seq(from = 400, to = 800, by = 1)
    N_loops <- 2000

    # One row per (prevalence, sample size) pair:
    # column 1 = prevalence, 2 = sample size, 3 = full interval width
    store <- matrix(0, nrow = length(my_prev) * length(samp_sizes), ncol = 3)
    count <- 1

    for (i in seq_along(my_prev)) {          # for each prevalence
      for (j in seq_along(samp_sizes)) {     # for each sample size
        # N_loops simulated sample proportions. A Binomial(n, p) count
        # divided by n is the mean of n Bernoulli(p) draws, so this replaces
        # the inner loop over rbinom(n, size = 1, ...) draws.
        temp <- rbinom(n = N_loops, size = samp_sizes[j],
                       prob = my_prev[i]) / samp_sizes[j]

        # Full width between the 5% and 95% quantiles of the estimates.
        # NOTE: these quantiles bound a central 90% interval;
        # c(0.025, 0.975) would give a central 95% interval instead.
        width <- diff(quantile(x = temp, probs = c(0.05, 0.95)))

        store[count, ] <- c(my_prev[i], samp_sizes[j], width[[1]])
        count <- count + 1
      }
    }

    # For each prevalence, keep the sample size whose interval width is
    # as close as possible to 2 * 3% without exceeding it.
    store2 <- rep(NA_real_, length(my_prev))

    for (i in seq_along(my_prev)) {
      # drop = FALSE keeps a matrix even when only one row matches
      temp <- store[store[, 1] == my_prev[i], , drop = FALSE]
      temp2 <- temp[temp[, 3] <= 0.03 * 2, , drop = FALSE]

      if (nrow(temp2) > 0) {
        store2[i] <- temp2[which.max(temp2[, 3]), 2]
      }
    }

    # Plot it
    plot(x = my_prev, y = store2,
         xlab = "prevalence",
         ylab = "sample size")
    ok <- !is.na(store2)  # skip prevalences with no qualifying sample size
    lines(smooth.spline(x = my_prev[ok], y = store2[ok]), col = "red")
    grid()

And here is the plot of sample size vs. prevalence such that uncertainty in 95% CI for prevalence is as close as possible to $\pm$3% without going over it.

[![sample size vs prevalence][1]][1]

Away from 50%, somewhat fewer observations seem to be required, as kjetil suggested.

I think that you can get a decent estimate of prevalence before 400 samples, and adjust your sampling strategy as you go.  I don't think there should be a jog in the middle, and so you might bump N_loops up to 10e3, and bump the "by" in "my_prev" down to 0.001.
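One way to act on the "adjust as you go" idea is to recompute an interval from the data collected so far. The sketch below uses base R's `binom.test`, which returns an exact (Clopper-Pearson) interval; the helper name `interval_so_far` and the counts in the usage line are made up for illustration and do not come from the question.

```r
# Interval for the reward rate given the trials observed so far.
# successes: number of rewards seen; trials: number of events completed.
interval_so_far <- function(successes, trials, conf_level = 0.95) {
  ci <- as.numeric(binom.test(x = successes, n = trials,
                              conf.level = conf_level)$conf.int)
  list(estimate   = successes / trials,
       lower      = ci[1],
       upper      = ci[2],
       half_width = (ci[2] - ci[1]) / 2)
}

# Hypothetical running tally: 45 rewards in 200 events
interval_so_far(45, 200)
```

Checking `half_width` after each batch shows how far the estimate still is from the 3% target; keep in mind that repeatedly checking and stopping as soon as the target is met can make the nominal confidence level slightly optimistic.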

  [1]: https://i.stack.imgur.com/vNYcH.png

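It seems you want to estimate the value of $X$ for event #1 and of $Y$ for event #2. You can easily use Hoeffding's inequality here to determine the bounds; it gives an additive (absolute-error) bound. If you want a multiplicative (relative-error) bound instead, you can use a Chernoff bound.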
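As a sketch of the Hoeffding route: for independent 0/1 trials, the two-sided inequality $P(|\hat{p}-p|\ge\varepsilon)\le 2e^{-2n\varepsilon^2}$ gives $n \ge \ln(2/\alpha)/(2\varepsilon^2)$ for an absolute error of at most $\varepsilon$ with probability at least $1-\alpha$. The function name `hoeffding_n` is an illustrative choice.

```r
# Distribution-free sample size from the two-sided Hoeffding inequality:
# guarantees P(|p_hat - p| >= eps) <= alpha for any true p.
hoeffding_n <- function(eps, alpha = 0.05) {
  ceiling(log(2 / alpha) / (2 * eps^2))
}

hoeffding_n(0.03)  # 3% absolute error at 95% confidence: 2050
```

The bound holds for any $p$ and any $n$ without a normal approximation, which is why it asks for more samples than the approximate calculation in the first answer.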
