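This is a short simulation to check the coverage of the random forest confidence intervals introduced in the paper below, when they are used as prediction intervals:
S. Wager, T. Hastie and B. Efron. Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research, 15, pp. 1625-1651 (2014)
The data are simulated from the process described in Section 4.3 of
J. Friedman. Multivariate Adaptive Regression Splines. Annals of Statistics, 19(1), pp. 1-67 (1991)
There are six independent predictors $X_1, \dots, X_6$, each with a $U[0, 1]$ distribution, and an independent random variable $\epsilon$ with a $N(0, 1)$ distribution. The response variable is defined as
$$Y = 10 \sin(\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \epsilon$$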
friedman <- function(n, p = 6) {
X <- matrix(runif(n*p), nrow = n, ncol = p, dimnames = list(1:n, paste0("x_", 1:p)))
y <- 10*sin(pi*X[, 1]*X[, 2]) + 20*(X[, 3] - 0.5)^2 + 10*X[, 4] + 5*X[, 5] + rnorm(n)
data.frame(cbind(y, X))
}
We generate a training sample of size 1,000 and a test sample of size 100,000.
set.seed(42)
n <- 10^3
training <- friedman(n)
n_tst <- 10^5
test <- friedman(n_tst)
library(ranger)
rf <- ranger(y ~ ., data = training, num.trees = 10^3, keep.inbag = TRUE)
pred <- predict(rf, data = test, type = "se", se.method = "infjack")
y_hat <- pred$predictions
Lower <- y_hat - 1.96 * pred$se
Upper <- y_hat + 1.96 * pred$se
mean((Lower <= test$y) & (test$y <= Upper))
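We are targeting a nominal coverage of 95%, but the simulation yields an effective coverage of about 20.2%. The question is why the effective coverage obtained in this simulation is so low.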
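To make the check easier to repeat, the coverage computation can be wrapped in a small helper. This is only a sketch: `interval_summary` is a hypothetical name (it is not part of ranger), and it assumes symmetric Gaussian intervals of the form center ± z × se, as in the code above. It also reports the mean interval width, which is not in the original code.

```r
# Hypothetical helper (not part of ranger): empirical coverage and mean width
# of symmetric Gaussian intervals center +/- z * se.
interval_summary <- function(target, center, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for level = 0.95
  lower <- center - z * se
  upper <- center + z * se
  c(coverage   = mean(lower <= target & target <= upper),
    mean_width = mean(upper - lower))
}

# Toy check with made-up numbers: two of the three targets fall inside
# the interval [-1.96, 1.96].
toy <- interval_summary(target = c(0, 1, 5), center = c(0, 0, 0), se = c(1, 1, 1))
print(toy)
```

With the objects from the simulation, the call would be `interval_summary(test$y, pred$predictions, pred$se)`. Comparing `mean_width` with the noise standard deviation of 1 used in `friedman()` may help to judge how narrow the intervals are relative to the noise in `y`.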
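Notes:
We compute the confidence intervals at the 95% level as $\hat{y} \pm 1.96\,\hat{\sigma}$, where $\hat{\sigma}$ is the estimated standard error (`pred$se` in the code above), as described in the footnote on p. 3 of the paper.
As usεr11852 points out in the comments below, the effective coverage tends to decrease as we increase the number of trees in the forest, although not monotonically. Examples: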
num.trees = 50 => effective coverage = 0.94855
num.trees = 100 => effective coverage = 0.76876
num.trees = 150 => effective coverage = 0.68959
num.trees = 200 => effective coverage = 0.56038
num.trees = 250 => effective coverage = 0.55393
num.trees = 300 => effective coverage = 0.32304
num.trees = 350 => effective coverage = 0.55413
num.trees = 400 => effective coverage = 0.26372
num.trees = 450 => effective coverage = 0.26232
num.trees = 500 => effective coverage = 0.23139
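The figures above can be regenerated with a loop along the following lines. This is a sketch under assumptions rather than the exact code behind the table: `coverage_by_trees` is a hypothetical helper, the data are drawn once and reused for every forest, and the default test size is smaller than in the post to keep the run time down, so the numbers it returns will not match the table exactly.

```r
library(ranger)

# Data generator copied from the post (Friedman 1991, Section 4.3).
friedman <- function(n, p = 6) {
  X <- matrix(runif(n * p), nrow = n, ncol = p,
              dimnames = list(1:n, paste0("x_", 1:p)))
  y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
    10 * X[, 4] + 5 * X[, 5] + rnorm(n)
  data.frame(cbind(y, X))
}

# Hypothetical helper: fit one forest per value in tree_grid and return the
# share of test responses falling inside the infinitesimal jackknife intervals.
coverage_by_trees <- function(tree_grid, n_train = 10^3, n_test = 10^4,
                              seed = 42) {
  set.seed(seed)
  training <- friedman(n_train)
  test <- friedman(n_test)
  z <- qnorm(0.975)  # two-sided 95% Gaussian quantile, about 1.96
  cov <- vapply(tree_grid, function(b) {
    rf <- ranger(y ~ ., data = training, num.trees = b, keep.inbag = TRUE)
    pred <- predict(rf, data = test, type = "se", se.method = "infjack")
    lower <- pred$predictions - z * pred$se
    upper <- pred$predictions + z * pred$se
    mean(lower <= test$y & test$y <= upper)
  }, numeric(1))
  setNames(cov, paste0("num.trees=", tree_grid))
}
```

For example, `coverage_by_trees(seq(50, 500, by = 50))` covers the same grid of forest sizes as the list above.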