How do I obtain p-values for the coefficients from a bootstrapped regression?

r regression p-value bootstrap
2022-03-26 22:04:14

From Robert Kabacoff's Quick-R I have

# Bootstrap 95% CI for regression coefficients 
library(boot)
# function to obtain regression weights 
bs <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample 
  fit <- lm(formula, data=d)
  return(coef(fit)) 
} 
# bootstrapping with 1000 replications 
results <- boot(data=mtcars, statistic=bs, 
     R=1000, formula=mpg~wt+disp)

# view results
results
plot(results, index=1) # intercept 
plot(results, index=2) # wt 
plot(results, index=3) # disp 

# get 95% confidence intervals 
boot.ci(results, type="bca", index=1) # intercept 
boot.ci(results, type="bca", index=2) # wt 
boot.ci(results, type="bca", index=3) # disp

How do I obtain p-values for the bootstrapped regression coefficients, testing $H_0: \beta_j = 0$?

3 Answers

The community and @BrianDiggs may correct me if I am wrong, but I believe you can obtain a p-value for this problem as follows. The p-value of a two-sided test is defined as

$2\min\left[P(X \le x \mid H_0),\; P(X \ge x \mid H_0)\right]$

So if you sort the bootstrapped coefficients by size and then determine the proportions lying above and below zero, the smaller of the two proportions multiplied by 2 should give you a p-value.
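In symbols, writing $B$ for the number of bootstrap replicates and $\hat{\beta}_j^{*}$ for the bootstrapped values of coefficient $j$ (notation I am introducing just for this explanation), this amounts to

$p_j = 2\,\min\!\left(\frac{\#\{\hat{\beta}_j^{*} > 0\}}{B},\ \frac{\#\{\hat{\beta}_j^{*} < 0\}}{B}\right)$

which is exactly what the function below computes.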

In such situations I usually use the following function:

twosidep <- function(data) {
  # proportion of bootstrap replicates above zero
  p1 <- sum(data > 0) / length(data)
  # proportion of bootstrap replicates below zero
  p2 <- sum(data < 0) / length(data)
  # two-sided p-value: twice the smaller tail proportion
  p <- min(p1, p2) * 2
  return(p)
}
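For example, applied to the results object from the code in the question, where the boot object stores its bootstrap replicates as the matrix results$t (one column per coefficient), something along these lines should work:

# apply twosidep to each column of bootstrap replicates
# (column 1 = intercept, column 2 = wt, column 3 = disp)
apply(results$t, 2, twosidep)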

Just another, somewhat simpler, variant that I think gets the message across without explicitly using the boot library, whose syntax may confuse some people.

We have a linear model: $y = X\beta + \epsilon$, $\quad \epsilon \sim N(0,\sigma^2)$.

The following is a parametric bootstrap of that linear model, meaning we do not resample the original data but instead generate new data from the fitted model. In addition, we assume that the bootstrapped distribution of the regression coefficients $\beta$ is symmetric and translation invariant. (Very roughly speaking, we can shift its axis without affecting its properties.) The idea behind this is that the fluctuations in the $\beta$'s are due to $\epsilon$, so with enough samples they should provide a good approximation of the true distribution of the $\beta$'s. As before, we again test $H_0: \beta_j = 0$, and we define our p-value as "the probability, under the null hypothesis about the data's probability distribution, of obtaining a result as extreme as or more extreme than the one observed" (the observed results here being the $\beta$'s we got from our original model). So here goes:

# Number of bootstrap replications
N           <- 2^12
# Linear model to bootstrap
Model2Boot  <- lm(mpg ~ wt + disp, mtcars)
# Values of the model coefficients
Betas       <- coefficients(Model2Boot)
# Number of coefficients to test against
M           <- length(Betas)
# Matrix of M columns to hold the bootstrapping results
BtStrpRes   <- matrix(rep(0, M * N), ncol = M)

for (i in 1:N) {
  # Simulate data N times from the model we assume to be true
  # and save the resulting coefficients in the i-th row of BtStrpRes
  BtStrpRes[i,] <- coefficients(lm(unlist(simulate(Model2Boot)) ~ wt + disp, mtcars))
}

# Get the p-value for each coefficient
P_val1 <- mean(abs(BtStrpRes[,1] - mean(BtStrpRes[,1])) > abs(Betas[1]))
P_val2 <- mean(abs(BtStrpRes[,2] - mean(BtStrpRes[,2])) > abs(Betas[2]))
P_val3 <- mean(abs(BtStrpRes[,3] - mean(BtStrpRes[,3])) > abs(Betas[3]))

# and some parametric bootstrap confidence intervals (2.5%, 97.5%)
ConfInt1 <- quantile(BtStrpRes[,1], c(0.025, 0.975))
ConfInt2 <- quantile(BtStrpRes[,2], c(0.025, 0.975))
ConfInt3 <- quantile(BtStrpRes[,3], c(0.025, 0.975))

As mentioned before, the whole idea is that the bootstrapped distribution of the $\beta$'s approximates their true one. (Clearly this code is not optimized for speed but for readability. :) )
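As a rough sanity check (a comparison I am adding here, not part of the code above), these parametric-bootstrap p-values and intervals can be put next to the usual t-based output of lm, which should be broadly similar for this model:

# classical t-test p-values and confidence intervals from lm, for comparison
summary(Model2Boot)$coefficients[, "Pr(>|t|)"]
confint(Model2Boot, level = 0.95)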

The bootstrap can be used to calculate $p$-values, but it would need a substantial change to your code. As I am not familiar with R, I can only give you a reference in which you can look up what you would need to do: chapter 4 of (Davison and Hinkley 1997).

Davison, A. C. and Hinkley, D. V. 1997. Bootstrap Methods and Their Application. Cambridge: Cambridge University Press.