How to: prediction intervals for linear regression via bootstrapping

Tags: cross-validated, regression, bootstrap, prediction-interval

2022-01-27 18:24:18

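I am having trouble understanding how to use bootstrapping to calculate prediction intervals for a linear regression model. Can somebody outline a step-by-step procedure? I searched with Google, but nothing I found made sense to me.

I do understand how to use bootstrapping to calculate confidence intervals for the model parameters.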

2 Answers

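Confidence intervals take account of the estimation uncertainty. Prediction intervals add the fundamental uncertainty to this. R's predict.lm will give you the prediction interval for a linear model. From there, all you have to do is run it repeatedly on bootstrapped samples.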

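n <- 100      # sample size
n.bs <- 30    # number of bootstrap replicates

x <- runif(n)
dat <- data.frame( x=x, y=x+runif(n) )
plot(y~x,data=dat)


# Fit on the (possibly resampled) data, then predict at the original x values
# so that the rows line up across bootstrap replicates
regressAndPredict <- function( fitDat, newDat=dat ) {
  model <- lm( y~x, data=fitDat )
  predict( model, newdata=newDat, interval="prediction" )
}

regressAndPredict(dat)

replicate( n.bs, regressAndPredict(dat[ sample(seq(n),replace=TRUE) ,]) )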

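The result of replicate is a 3-dimensional array (n x 3 x n.bs). The dimension of length 3 holds, for each data element, the fitted value and the lower and upper bounds of the 95% prediction interval.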
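If you want a single interval that comes out of the bootstrap itself rather than a stack of predict.lm intervals, one common variant is to add a resampled residual to each bootstrap prediction and take percentiles. The following is only a rough sketch of that variant, not the method described above: the point x0, the number of replicates, and the helper name bootPredict are made up for illustration, and it assumes the residuals are exchangeable (roughly constant error variance).

```r
set.seed(1)                                   # reproducible draws
n   <- 100
dat <- data.frame(x = runif(n))
dat$y <- dat$x + runif(n)                     # same toy data-generating process as above

x0   <- data.frame(x = 0.5)                   # hypothetical point at which to predict
n.bs <- 2000                                  # hypothetical number of bootstrap replicates

# One bootstrap draw of a future response at x0 (hypothetical helper)
bootPredict <- function(dat, x0) {
  idx   <- sample(nrow(dat), replace = TRUE)  # resample rows: estimation uncertainty
  model <- lm(y ~ x, data = dat[idx, ])
  eps   <- sample(residuals(model), 1)        # one resampled residual: fundamental uncertainty
  unname(predict(model, newdata = x0) + eps)
}

y.star   <- replicate(n.bs, bootPredict(dat, x0))
pred.int <- quantile(y.star, c(0.025, 0.975)) # percentile 95% prediction interval
pred.int
```

Because the residuals are resampled rather than assumed normal, the interval can be asymmetric around the fitted value, which is the main reason to bother with this over predict.lm alone.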

Gary King method

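Depending on your needs, there is a cool method by King, Tomz, and Wittenberg. It is relatively easy to implement, and it avoids the problems that bootstrapping has with certain estimates (e.g. max(Y)).

I'll quote from their definition of fundamental uncertainty here, since it is reasonably nice:

The second form of variability, the fundamental uncertainty represented by the stochastic component (the distribution f) in Equation 1, results from innumerable chance events such as weather or illness that may influence Y but are not included in X. Even if we knew the exact values of the parameters (thereby eliminating estimation uncertainty), fundamental uncertainty would prevent us from predicting Y without error.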
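To make the two kinds of uncertainty concrete, here is a rough simulation sketch in the spirit of that approach, not a reproduction of the authors' software. It assumes normal errors, draws the coefficients from their estimated asymptotic normal distribution, and (as a simplification of mine) holds the residual standard deviation fixed at its estimate. The toy data and all variable names are made up for illustration.

```r
set.seed(1)
n   <- 100
dat <- data.frame(x = runif(n))
dat$y <- dat$x + rnorm(n, sd = 0.3)           # hypothetical toy data with normal errors

model     <- lm(y ~ x, data = dat)
beta.hat  <- coef(model)                      # point estimates
V         <- vcov(model)                      # estimated covariance of the estimates
sigma.hat <- summary(model)$sigma             # residual sd, held fixed (simplifying assumption)

n.sim <- 5000                                 # hypothetical number of simulations
x0    <- c(1, 0.5)                            # intercept term and x = 0.5

# Estimation uncertainty: draw coefficients from N(beta.hat, V).
# chol(V) is upper triangular with t(chol(V)) %*% chol(V) == V.
z          <- matrix(rnorm(n.sim * length(beta.hat)), nrow = length(beta.hat))
beta.draws <- beta.hat + t(chol(V)) %*% z     # one coefficient draw per column

mu.draws <- as.vector(x0 %*% beta.draws)      # expected value of Y at x0 for each draw

# Fundamental uncertainty: draw Y around each simulated expected value
y.draws <- rnorm(n.sim, mean = mu.draws, sd = sigma.hat)

conf.int <- quantile(mu.draws, c(0.025, 0.975))  # estimation uncertainty only
pred.int <- quantile(y.draws,  c(0.025, 0.975))  # estimation + fundamental uncertainty
conf.int
pred.int
```

The interval from y.draws is wider than the one from mu.draws, which is the difference between a prediction interval and a confidence interval for the expected value.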

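Bootstrapping does not assume any knowledge of the form of the underlying parent distribution from which the sample arose. Traditional classical parameter estimates are based on a normality assumption. The bootstrap copes with non-normality and, when that assumption fails, is often more accurate in practice than the classical methods.

Bootstrapping substitutes the raw computing power of computers for rigorous theoretical analysis. It provides an estimate of the sampling distribution of a statistic computed from the data set. In the example here, bootstrapping consists of: resampling the data set a specified number of times, calculating the mean of each resample, and finding the standard error of the mean.

The following R code demonstrates the concept:

This practical example demonstrates the usefulness of bootstrapping and estimates the standard error. The standard error is needed to calculate the confidence interval.

Let us assume you have a skewed data set "a":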

a<-rexp(395, rate=0.1)          # Create skewed data

Visualization of the skewed data set:

plot(a,type="l")                # Line plot of the skewed data, in index order
boxplot(a)                      # Box plot of the skewed data
hist(a)                         # Histogram of the skewed data

Perform the bootstrap procedure:

n <- length(a)                  # each bootstrap sample has the same size as the original data set
xbarstar <- numeric(1000)       # preallocate "xbarstar", which will hold the mean of every bootstrap iteration
for (i in 1:1000) {             # perform 1000 bootstrap iterations
    boot.samp <- sample(a, n, replace=TRUE) # "sample" draws n elements from the original data set, with replacement
    xbarstar[i] <- mean(boot.samp)          # "xbarstar" collects the 1000 bootstrap means
}

plot(xbarstar)                  # Scatter plot of the bootstrapped means
boxplot(xbarstar)               # Box plot of the bootstrapped means
hist(xbarstar)                  # Histogram of the bootstrapped means

meanOfMeans <- mean(xbarstar)
standardError <- sd(xbarstar)    # the standard error of the mean is the standard deviation of the bootstrap means
confidenceIntervalAboveTheMean <- meanOfMeans + 1.96 * standardError # upper bound: 1.96 standard errors above the mean
confidenceIntervalBelowTheMean <- meanOfMeans - 1.96 * standardError # lower bound: 1.96 standard errors below the mean
confidenceInterval <- c(confidenceIntervalBelowTheMean, confidenceIntervalAboveTheMean) # 95% interval as (lower, upper)
confidenceInterval
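The interval above relies on a normal approximation (mean plus or minus 1.96 standard errors). Since the bootstrap means are already in hand, you could also read the interval straight off their empirical quantiles, which is usually called the percentile method. A minimal sketch, regenerating the same kind of skewed data so that it runs on its own (the seed and the names normalInterval and percentileInterval are mine):

```r
set.seed(1)                                   # reproducible draws
a <- rexp(395, rate = 0.1)                    # skewed data, as in the example above
n <- length(a)

# 1000 bootstrap means, each from a resample of size n drawn with replacement
xbarstar <- replicate(1000, mean(sample(a, n, replace = TRUE)))

# Normal-approximation interval: mean of the bootstrap means +/- 1.96 standard errors
normalInterval <- mean(xbarstar) + c(-1.96, 1.96) * sd(xbarstar)

# Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap means
percentileInterval <- quantile(xbarstar, c(0.025, 0.975))

normalInterval
percentileInterval
```

With a few hundred observations the two intervals tend to be close, because the sampling distribution of the mean is already nearly normal; they can differ more for small samples or for statistics other than the mean.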
