机器算法验证 - 如果未指定层但函数按层分隔数据集，R 中的引导包如何处理收集引导样本？ - 吾爱随笔录

如果未指定层但函数按层分隔数据集，R 中的引导包如何处理收集引导样本？

机器算法验证 r 引导程序分层

2022-04-05 09:25:22

我目前的理解是 1）如果未指定层，则引导从整个数据集中随机选择替换行。如果数据集实际上是分层的，那么引导通常会返回不均匀的样本大小。相反 2) 如果指定了层，则引导从每个层中随机选择具有替换的行，并且独立于其他层。在这种情况下，boot 将始终返回相同的样本大小。

我的统计数据是来自两个不同组的平均值的比率，因此我使用 Strata 设置了我的引导调用，我相信它会在运行我的函数之前从每个组内收集引导样本。然而，为了仔细检查我的理解，我在有和没有地层的情况下运行了 boot，并惊讶地发现结果非常相似。我是否误解了引导如何处理地层？或者这些方法会很少有不同，例如当 boot 随机收集一组非常不均匀的 bootstrap 样本时？

任何解释为什么这两种方法会产生相似的结果都值得赞赏。

下面是一个示例数据集的 R 代码，它表示我的统计数据的真实值为 10。我在指定和不指定层的情况下运行引导。我发现当指定层时估计的置信区间会稍微窄一些，但我预计会有更大的差异。

require(boot)
    # relative yield takes a matrix or dataframe and finds the ratio
    # of the means: treatmentMean/controlMean. 
    # data structure:
    # first column is strata, control = 1 and treatment = 2
    # second column is response, or the data to be bootstrapped
    rel.yield <- function(D,i) {
      trt <- D[i,1]
      resp <- D[i,2]
      mean(resp[trt==2]) / mean(resp[trt==1])
    }

    # some data that has a true rel.yield of 10
    sub.pop <- matrix(data = c(rep(1,15),rep(2,15),rnorm(15,2,1),rnorm(15,20,1)),
                      nrow = 30, ncol = 2, dimnames = list((1:30),c('trt','resp')))

    # with strata specified
    b <- boot(sub.pop, rel.yield, R = 1000, strata = sub.pop[,1])
    #without strata specified
    c <- boot(sub.pop, rel.yield, R = 1000)

    # note the distributions of t* are similar
    par(mfrow=c(1,2))
    hist(b$t)
    abline(v=mean(b$t))
    hist(c$t)
    abline(v=mean(c$t))

    # and the CI estimates are also similar
    boot.ci(b)
    boot.ci(c)

2个回答

您的理解 1) 和 2) 是正确的；Strata 选项在每个层中独立地制作引导样本，而非阶层选项对所有数据进行引导样本，这意味着引导样本不会包含来自每个层内的相同数量的样本。

至于为什么它们相似；这只是样本量的问题。运行上述示例时（种子 42 和 10,000 个引导复制），我获得

分层引导估计器的标准误差：0.8502739
估计器的标准误差：0.8691592

将样本量增加到 10（不是 30）后，我得到（种子 42）

分层引导估计器的标准误差：1.032356
估计器的标准误差：1.131476

如果有人希望看到代码，这就是我从上面的 svendvn 解释正确答案的方式：

require(boot)
# relative yield takes a matrix or dataframe and finds the ratio
# of the means: treatmentMean/controlMean. 
# data structure:
# first column is strata, control = 1 and treatment = 2
# second column is response, or the data to be bootstrapped
rel.yield <- function(D,i) {
  trt <- D[i,1]
  resp <- D[i,2]
  mean(resp[trt==2]) / mean(resp[trt==1])
}

# some data that has a true rel.yield of 10
set.seed(42)
nTrt <- 15 
sub.pop <- matrix(data = c(rep(1,nTrt),rep(2,nTrt),rnorm(nTrt,2,1),rnorm(nTrt,20,1)),
                  nrow = nTrt*2, ncol = 2, dimnames = list((1:(nTrt*2)),c('trt','resp')))

(estRelYeild <- mean(sub.pop[sub.pop[,1]==2,2])/mean(sub.pop[sub.pop[,1]==1,2]))
# with strata specified
(b <- boot(sub.pop, rel.yield, R = 1E4, strata = sub.pop[,1]))
#without strata specified
(c <- boot(sub.pop, rel.yield, R = 1E4))


# After decreasing the sample size to 10 (not 30), I obtain (with seed 42)
set.seed(42)
nTrt <- 5 
sub.pop <- matrix(data = c(rep(1,nTrt),rep(2,nTrt),rnorm(nTrt,2,1),rnorm(nTrt,20,1)),
                  nrow = nTrt*2, ncol = 2, dimnames = list((1:(nTrt*2)),c('trt','resp')))
(estRelYeild <- mean(sub.pop[sub.pop[,1]==2,2])/mean(sub.pop[sub.pop[,1]==1,2]))
# with strata specified
(b <- boot(sub.pop, rel.yield, R = 1E4, strata = sub.pop[,1]))
#without strata specified
(c <- boot(sub.pop, rel.yield, R = 1E4))

其它你可能感兴趣的问题

上一篇具有大量零值的数据的箱线图下一篇如何为有状态的学习系统确定合适的滞后特征？