Data mining: xgboost cannot recover a perfectly fitting regression line

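For a data set, I would like to use xgboost to optimise an ensemble of forecasts instead of simply combining them with their arithmetic mean. I found that the forecasts xgboost produces are worse than the arithmetic mean of the $n$ models it can choose from to combine.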

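I do not know why this happens. To illustrate the observation, I created the toy data set below. The artificial target variable is constructed as $y = \frac{x_1 + x_2}{2}$ with $x_1, x_2 \sim N(0,1)$. Because the relationship between $y$ and the two explanatory variables $x_1$ and $x_2$ is deterministic, xgboost should be able to produce perfect forecasts, but it does not. A linear model does so easily. Since this is the simplest multivariate linear regression model I can think of, and xgboost fails on it, I wonder what the implications are.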

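Why is this the case? What are the limitations of regression tree models?
If xgboost cannot reproduce the MSE-minimising arithmetic mean as the optimal combination scheme, why use xgboost for stacking and ensembling forecasts at all?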

Note that the xgboost parameters do not change this. I tried many parameter settings, and the result was never perfect.

Data generation

library(tidyverse)
library(xgboost)
n <- 1000
param0 <- list("objective"  = "reg:linear", "eval_metric" = "rmse")
set.seed(1)
df <- tibble(x1 = rnorm(n), x2 = rnorm(n), y = (x1+x2)/2)

xgboost

xgtrain <- xgb.DMatrix(as.matrix(df[1:900,c("x1","x2")]), label = df$y[1:900], missing = NA)
xgtest <- xgb.DMatrix(as.matrix(df[901:1000,c("x1","x2")]), missing = NA)
#Cross-validation just to illustrate that the algorithm 
#learns something that is not correct, since the test data 
#cannot be forecast with zero error. 
#xgb.cv(nrounds = 100,nfold = 10, params = param0, data = xgtrain)  
#nrounds and other parameters do not get you to the perfect forecast
model <- xgb.train(nrounds = 100, params = param0, data = xgtrain)  
preds_xgb <- predict(model, xgtest)
#no perfect forecasts
sqrt(mean((preds_xgb-df$y[901:1000])^2))
0.04654448
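
The statement that tuning does not help can be checked with a small parameter sweep. The following is a minimal sketch, not the settings actually tried: the grid values and the helper `fit_rmse` are illustrative assumptions, and it uses `reg:squarederror`, the current name for the squared-error objective that `reg:linear` refers to.

```r
library(xgboost)

# Rebuild the toy data with base R (same construction as in the post)
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- (x1 + x2) / 2
X  <- cbind(x1 = x1, x2 = x2)
train <- 1:900
test  <- 901:1000

dtrain <- xgb.DMatrix(X[train, ], label = y[train])
dtest  <- xgb.DMatrix(X[test, ])

# Hypothetical helper: fit one configuration and return the test RMSE
fit_rmse <- function(max_depth, eta, nrounds) {
  params <- list(objective = "reg:squarederror",
                 max_depth = max_depth, eta = eta)
  model <- xgb.train(params = params, data = dtrain,
                     nrounds = nrounds, verbose = 0)
  preds <- predict(model, dtest)
  sqrt(mean((preds - y[test])^2))
}

# Illustrative grid (an assumption, not the settings the author tried)
grid <- expand.grid(max_depth = c(2, 6), eta = c(0.1, 0.3), nrounds = 200)
grid$rmse <- mapply(fit_rmse, grid$max_depth, grid$eta, grid$nrounds)
print(grid)
```

If the observation above holds, every row should report an RMSE far above the floating-point error of an exact fit.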

Linear regression

model <- lm(y ~ x1+x2, data = df[1:900,])
#0.5 and 0.5 for x1 and x2 as expected
model$coefficients 
preds_lm <- predict(model, df[901:1000,c("x1","x2")])
#perfect forecasts
sqrt(mean((preds_lm-df$y[901:1000])^2))
1.389314e-15
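
The near-zero error of the linear model follows directly from how $y$ is constructed. As a short worked derivation (the matrix notation $X$, $\beta$ is introduced here for illustration and is not part of the original post), and assuming $X$ has full column rank, which holds almost surely for independent normal draws:

```latex
y = X\beta, \qquad
X = \begin{pmatrix} \mathbf{1} & x_1 & x_2 \end{pmatrix}, \qquad
\beta = \begin{pmatrix} 0 \\ \tfrac{1}{2} \\ \tfrac{1}{2} \end{pmatrix}

\hat{\beta} = (X^\top X)^{-1} X^\top y
            = (X^\top X)^{-1} X^\top X \beta
            = \beta
```

Ordinary least squares therefore recovers the coefficients $0.5$ and $0.5$ exactly, and the remaining error of order $10^{-15}$ is floating-point rounding.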