Does XGBoost handle multicollinearity by itself?

Tags: data-mining, feature-selection, correlation, xgboost, gbm

2021-09-18 20:39:27

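I'm currently using XGBoost on a dataset with 21 features (selected from a list of about 150), which I then one-hot encoded to obtain about 98 features. A few of these 98 features are somewhat redundant. For example, a variable (feature) $A$ also appears as $\frac{B}{A}$ and $\frac{C}{A}$.

My questions are:

How (if at all) do boosted decision trees handle multicollinearity?
How would the presence of multicollinearity affect prediction if it is not handled?

From what I understand, the model learns more than one tree, and the final prediction is based on something like a "weighted sum" of the individual predictions. If that is correct, then boosted decision trees should be able to handle co-dependence between variables.

Also, on a related note: how does the variable importance object in XGBoost work?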

4 Answers

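Decision trees are by nature immune to multicollinearity. For example, if you have 2 features that are 99% correlated, the tree will choose only one of them when deciding on a split. Other models, such as logistic regression, would use both features.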
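To make the "the tree will choose only one of them" point concrete, here is a minimal base-R sketch. It assumes a plain variance-reduction split criterion, which is a simplification and not XGBoost's regularised gain; `best_split_gain` and the simulated data are hypothetical names introduced only for illustration. A feature and a perfectly correlated copy of it induce the same ordering of rows, so they offer exactly the same best split.

```r
# Best variance-reduction gain over all thresholds of one numeric feature.
# Simplified criterion for illustration; XGBoost uses a regularised gain.
best_split_gain <- function(feature, target) {
  ord <- order(feature)
  f <- feature[ord]
  y <- target[ord]
  n <- length(y)
  total_sse <- sum((y - mean(y))^2)
  best <- 0
  for (i in seq_len(n - 1)) {
    if (f[i] == f[i + 1]) next          # cannot split between equal values
    left  <- y[1:i]
    right <- y[(i + 1):n]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    best <- max(best, total_sse - sse)
  }
  best
}

set.seed(42)
a <- rnorm(200)
b <- 2 * a + 3                           # perfectly correlated copy of a
target <- a^2 + rnorm(200, sd = 0.1)

gain_a <- best_split_gain(a, target)
gain_b <- best_split_gain(b, target)
print(c(a = gain_a, b = gain_b))
```

Because both features yield the same candidate partitions, the tree gains nothing from having the second one; which of the two ends up in the model depends only on how ties are broken.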

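Since boosted trees are built from individual decision trees, they are also unaffected by multicollinearity. However, it is good practice to remove redundant features from any dataset used for training, irrespective of the model's algorithm. In your case, since you are deriving new features, you could take this approach: evaluate each feature's importance and retain only the best features for your final model.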

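The importance matrix of an xgboost model is actually a data.table object whose first column lists the names of all the features actually used in the boosted trees. The second column is the Gain metric, which expresses the relative contribution of the corresponding feature to the model, calculated by taking each feature's contribution for each tree in the model. A higher value of this metric compared to another feature implies that the feature is more important for generating a prediction.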
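As a rough illustration of how such a table can be put together, the sketch below sums the gain of every split per feature and normalises the result so the Gain column adds up to 1; Frequency is the share of splits that use the feature. The `splits` data frame and its numbers are hypothetical, and this is an approximation of the idea rather than the package's actual implementation.

```r
# Hypothetical per-split records: one row per split across all trees.
splits <- data.frame(
  tree    = c(1, 1, 1, 2, 2, 3),
  feature = c("x", "carat", "x", "x", "depth", "carat"),
  gain    = c(40, 10, 20, 15, 5, 10),
  stringsAsFactors = FALSE
)

# Sum the gain per feature, then normalise so the column sums to 1.
per_feature <- aggregate(gain ~ feature, data = splits, FUN = sum)
per_feature$Gain <- per_feature$gain / sum(per_feature$gain)

# Share of all splits that use each feature.
split_counts <- table(splits$feature)
per_feature$Frequency <-
  as.vector(split_counts[as.character(per_feature$feature)]) / nrow(splits)

importance <- per_feature[order(-per_feature$Gain),
                          c("feature", "Gain", "Frequency")]
print(importance)
```

Since Gain is a share of a fixed total, any new feature that receives some gain necessarily lowers the shares of the others.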

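I was curious about this and ran a few tests.

I trained a model on the diamonds dataset and observed that the variable "x" is the most important for predicting whether the price of a diamond is above a certain threshold. Then I added several columns highly correlated with x, ran the same model, and observed the same values.

It seems that when the correlation between two columns is 1, xgboost drops the extra column before fitting the model, so the importance is not affected. However, when you add a column that is only partially correlated with another, and therefore has a lower coefficient, the importance of the original variable x goes down.

For example, if I add a variable xy = x + y, the importance of both x and y decreases. Similarly, the importance of x decreases if I add new variables with r = 0.4, 0.5 or 0.6, although only by a little.

I think collinearity is not a problem for boosting when you measure the accuracy of the model, because a decision tree doesn't care which of the variables is used. However, it may affect variable importance, because removing one of two correlated variables does not have a big impact on the accuracy of the model, given that the other one contains similar information.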

library(tidyverse)
library(xgboost)

evaluate_model = function(dataset) {
    print("Correlation matrix")
    dataset %>% select(-cut, -color, -clarity, -price) %>% cor %>% print

    print("running model")
    diamond.model = xgboost(
        data=dataset %>% select(-cut, -color, -clarity, -price) %>% as.matrix, 
        label=dataset$price > 400, 
        max.depth=15, nrounds=30, nthread=2, objective = "binary:logistic",
        verbose=F
        )

    print("Importance matrix")
    importance_matrix <- xgb.importance(model = diamond.model)
    importance_matrix %>% print
    xgb.plot.importance(importance_matrix)
    }

> diamonds %>% head
carat   cut color   clarity depth   table   price   x   y   z
0.23    Ideal   E   SI2 61.5    55  326 3.95    3.98    2.43
0.21    Premium E   SI1 59.8    61  326 3.89    3.84    2.31
0.23    Good    E   VS1 56.9    65  327 4.05    4.07    2.31
0.29    Premium I   VS2 62.4    58  334 4.20    4.23    2.63
0.31    Good    J   SI2 63.3    58  335 4.34    4.35    2.75
0.24    Very Good   J   VVS2    62.8    57  336 3.94    3.96    2.48

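Evaluate a model on the diamonds data

We predict whether the price is higher than 400, given all the available numeric variables (carat, depth, table, x, y, z).

Note that x is the most important variable, with an importance gain score of 0.375954.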

evaluate_model(diamonds)
    [1] "Correlation matrix"
               carat       depth      table           x           y          z
    carat 1.00000000  0.02822431  0.1816175  0.97509423  0.95172220 0.95338738
    depth 0.02822431  1.00000000 -0.2957785 -0.02528925 -0.02934067 0.09492388
    table 0.18161755 -0.29577852  1.0000000  0.19534428  0.18376015 0.15092869
    x     0.97509423 -0.02528925  0.1953443  1.00000000  0.97470148 0.97077180
    y     0.95172220 -0.02934067  0.1837601  0.97470148  1.00000000 0.95200572
    z     0.95338738  0.09492388  0.1509287  0.97077180  0.95200572 1.00000000
    [1] "running model"
    [1] "Importance matrix"
       Feature       Gain      Cover  Frequency
    1:       x 0.37595419 0.54788335 0.19607102
    2:   carat 0.19699839 0.18015576 0.04873442
    3:   depth 0.15358261 0.08780079 0.27767284
    4:       y 0.11645929 0.06527969 0.18813751
    5:   table 0.09447853 0.05037063 0.17151492
    6:       z 0.06252699 0.06850978 0.11786929

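Model trained on Diamonds, adding a variable with r=1 to x

Here we add a new column that carries no new information, because it is perfectly correlated with x: runif(1, -1, 1) draws a single number, so xx is simply x shifted by a constant.

Note that this new variable is not present in the output. It seems that xgboost automatically drops perfectly correlated variables before starting the calculation. The importance gain of x is the same, 0.3759.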

diamonds_xx = diamonds %>%
    mutate(xx = x + runif(1, -1, 1))
evaluate_model(diamonds_xx)
[1] "Correlation matrix"
           carat       depth      table           x           y          z
carat 1.00000000  0.02822431  0.1816175  0.97509423  0.95172220 0.95338738
depth 0.02822431  1.00000000 -0.2957785 -0.02528925 -0.02934067 0.09492388
table 0.18161755 -0.29577852  1.0000000  0.19534428  0.18376015 0.15092869
x     0.97509423 -0.02528925  0.1953443  1.00000000  0.97470148 0.97077180
y     0.95172220 -0.02934067  0.1837601  0.97470148  1.00000000 0.95200572
z     0.95338738  0.09492388  0.1509287  0.97077180  0.95200572 1.00000000
xx    0.97509423 -0.02528925  0.1953443  1.00000000  0.97470148 0.97077180
               xx
carat  0.97509423
depth -0.02528925
table  0.19534428
x      1.00000000
y      0.97470148
z      0.97077180
xx     1.00000000
[1] "running model"
[1] "Importance matrix"
   Feature       Gain      Cover  Frequency
1:       x 0.37595419 0.54788335 0.19607102
2:   carat 0.19699839 0.18015576 0.04873442
3:   depth 0.15358261 0.08780079 0.27767284
4:       y 0.11645929 0.06527969 0.18813751
5:   table 0.09447853 0.05037063 0.17151492
6:       z 0.06252699 0.06850978 0.11786929
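The correlation of exactly 1 reported for xx follows from how it was built. The sketch below, using a simulated stand-in for diamonds$x (the mean and standard deviation are assumptions, not values from the dataset), contrasts a single random draw recycled over every row with a separate draw per row.

```r
set.seed(1)
x <- rnorm(1000, mean = 5.7, sd = 1.1)      # simulated stand-in for diamonds$x

shift <- runif(1, -1, 1)                     # ONE draw, recycled over every row
xx_constant <- x + shift                     # what mutate(xx = x + runif(1, -1, 1)) does
xx_noisy    <- x + runif(length(x), -1, 1)   # a separate draw per row

print(cor(x, xx_constant))                   # constant shift: correlation stays at 1
print(cor(x, xx_noisy))                      # per-row noise: correlation drops below 1
print(identical(order(x), order(xx_constant)))  # row ordering is preserved
```

A tree only looks at the ordering of a feature's values, so a constant shift gives it nothing new to split on. To test a nearly but not perfectly correlated copy, the per-row version would be the one to use.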

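Model trained on Diamonds, adding a column for x + y

We add a new column xy = x + y. It is correlated with both x and y, but not perfectly (r is about 0.99 with each).

Note that the importance of x and y decreases slightly: x goes from 0.3759 to 0.3592, and y from 0.116 to 0.079.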

diamonds_xy = diamonds %>%
    mutate(xy=x+y)
evaluate_model(diamonds_xy)

[1] "Correlation matrix"
           carat       depth      table           x           y          z
carat 1.00000000  0.02822431  0.1816175  0.97509423  0.95172220 0.95338738
depth 0.02822431  1.00000000 -0.2957785 -0.02528925 -0.02934067 0.09492388
table 0.18161755 -0.29577852  1.0000000  0.19534428  0.18376015 0.15092869
x     0.97509423 -0.02528925  0.1953443  1.00000000  0.97470148 0.97077180
y     0.95172220 -0.02934067  0.1837601  0.97470148  1.00000000 0.95200572
z     0.95338738  0.09492388  0.1509287  0.97077180  0.95200572 1.00000000
xy    0.96945349 -0.02750770  0.1907100  0.99354016  0.99376929 0.96744200
              xy
carat  0.9694535
depth -0.0275077
table  0.1907100
x      0.9935402
y      0.9937693
z      0.9674420
xy     1.0000000
[1] "running model"
[1] "Importance matrix"
   Feature       Gain      Cover  Frequency
1:       x 0.35927767 0.52924339 0.15952849
2:   carat 0.17881931 0.18472506 0.04793713
3:   depth 0.14353540 0.07482622 0.24990177
4:   table 0.09202059 0.04714548 0.16267191
5:      xy 0.08203819 0.04706267 0.13555992
6:       y 0.07956856 0.05284980 0.13595285
7:       z 0.06474029 0.06414738 0.10844794
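One way to read the two importance tables side by side is to join them on the feature name and look at the change in Gain. This is a small base-R sketch; the Gain values are copied from the outputs above, while the data frame and column names are my own.

```r
# Gain values copied from the two importance tables above.
baseline <- data.frame(
  Feature = c("x", "carat", "depth", "y", "table", "z"),
  Gain    = c(0.37595419, 0.19699839, 0.15358261,
              0.11645929, 0.09447853, 0.06252699)
)
with_xy <- data.frame(
  Feature = c("x", "carat", "depth", "table", "xy", "y", "z"),
  Gain    = c(0.35927767, 0.17881931, 0.14353540, 0.09202059,
              0.08203819, 0.07956856, 0.06474029)
)

# Outer join keeps features that appear in only one run (here: xy).
cmp <- merge(baseline, with_xy, by = "Feature", all = TRUE,
             suffixes = c("_before", "_after"))
cmp$Gain_before[is.na(cmp$Gain_before)] <- 0   # xy did not exist before
cmp$delta <- cmp$Gain_after - cmp$Gain_before
print(cmp[order(cmp$delta), ])
```

Because Gain is normalised to sum to 1 in each run, the 0.082 that goes to xy has to come out of the other features; the largest share of it comes from y.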

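Model trained on Diamonds data, modified by adding redundant columns

We add three new columns that are correlated with x (r = 0.4, 0.5 and 0.6) and see what happens.

Note that the importance of x is reduced, dropping from 0.3759 to 0.279.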

#' given a vector of values (e.g. diamonds$x), calculate three new vectors correlated to it
#' 
#' Source: https://stat.ethz.ch/pipermail/r-help/2007-April/128938.html
calculate_correlated_vars = function(x1) {

    # create the initial x variable
    #x1 <- diamonds$x

    # x2, x3, and x4 in a matrix, these will be modified to meet the criteria
    x234 <- scale(matrix( rnorm(length(x1) * 3), ncol=3 ))

    # put all into 1 matrix for simplicity
    x1234 <- cbind(scale(x1),x234)

    # find the current correlation matrix
    c1 <- var(x1234)

    # cholesky decomposition to get independence
    chol1 <- solve(chol(c1))

    newx <-  x1234 %*% chol1 

    # check that we have independence and x1 unchanged
    zapsmall(cor(newx))
    all.equal( x1234[,1], newx[,1] )

    # create new correlation structure (zeros can be replaced with other r vals)
    newc <- matrix( 
    c(1  , 0.4, 0.5, 0.6, 
      0.4, 1  , 0  , 0  ,
      0.5, 0  , 1  , 0  ,
      0.6, 0  , 0  , 1  ), ncol=4 )

    # check that it is positive definite
    eigen(newc)

    chol2 <- chol(newc)

    finalx <- newx %*% chol2 * sd(x1) + mean(x1)

    # verify success
    mean(x1)
    colMeans(finalx)

    sd(x1)
    apply(finalx, 2, sd)

    zapsmall(cor(finalx))
    #pairs(finalx)

    all.equal(x1, finalx[,1])
    finalx
}
finalx = calculate_correlated_vars(diamonds$x)
diamonds_cor = diamonds
diamonds_cor$x5 = finalx[,2]
diamonds_cor$x6 = finalx[,3]
diamonds_cor$x7 = finalx[,4]
evaluate_model(diamonds_cor)
[1] "Correlation matrix"
           carat        depth       table           x           y          z
carat 1.00000000  0.028224314  0.18161755  0.97509423  0.95172220 0.95338738
depth 0.02822431  1.000000000 -0.29577852 -0.02528925 -0.02934067 0.09492388
table 0.18161755 -0.295778522  1.00000000  0.19534428  0.18376015 0.15092869
x     0.97509423 -0.025289247  0.19534428  1.00000000  0.97470148 0.97077180
y     0.95172220 -0.029340671  0.18376015  0.97470148  1.00000000 0.95200572
z     0.95338738  0.094923882  0.15092869  0.97077180  0.95200572 1.00000000
x5    0.39031255 -0.007507604  0.07338484  0.40000000  0.38959178 0.38734145
x6    0.48879000 -0.016481580  0.09931705  0.50000000  0.48835896 0.48487442
x7    0.58412252 -0.013772440  0.11822089  0.60000000  0.58408881 0.58297414
                 x5            x6            x7
carat  3.903125e-01  4.887900e-01  5.841225e-01
depth -7.507604e-03 -1.648158e-02 -1.377244e-02
table  7.338484e-02  9.931705e-02  1.182209e-01
x      4.000000e-01  5.000000e-01  6.000000e-01
y      3.895918e-01  4.883590e-01  5.840888e-01
z      3.873415e-01  4.848744e-01  5.829741e-01
x5     1.000000e+00  5.925447e-17  8.529781e-17
x6     5.925447e-17  1.000000e+00  6.683397e-17
x7     8.529781e-17  6.683397e-17  1.000000e+00
[1] "running model"
[1] "Importance matrix"
   Feature       Gain      Cover  Frequency
1:       x 0.27947762 0.51343709 0.09748172
2:   carat 0.13556427 0.17401365 0.02680747
3:      x5 0.13369515 0.05267688 0.18155971
4:      x6 0.12968400 0.04804315 0.19821284
5:      x7 0.10600238 0.05148826 0.16450041
6:   depth 0.07087679 0.04485760 0.11251015
7:       y 0.06050565 0.03896716 0.08245329
8:   table 0.04577057 0.03135677 0.07554833
9:       z 0.03842355 0.04515944 0.06092608
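If only one correlated column is needed, a shorter construction than the Cholesky-based calculate_correlated_vars is possible. The sketch below is a different technique, not the author's method: it mixes the standardised input with noise that has been made orthogonal to it. `correlated_with` is a hypothetical helper name, and the simulated x stands in for diamonds$x.

```r
# Build a vector whose sample correlation with x is r (up to rounding error).
correlated_with <- function(x, r) {
  stopifnot(abs(r) <= 1)
  noise <- rnorm(length(x))
  # Residuals of noise ~ x are uncorrelated with x within the sample.
  ortho <- resid(lm(noise ~ x))
  zx <- as.vector(scale(x))
  zo <- as.vector(scale(ortho))
  new <- r * zx + sqrt(1 - r^2) * zo   # unit variance, correlation r with zx
  new * sd(x) + mean(x)                # put it back on the scale of x
}

set.seed(7)
x  <- rnorm(500, mean = 5.7, sd = 1.1)  # simulated stand-in for diamonds$x
x5 <- correlated_with(x, 0.4)
print(round(cor(x, x5), 6))
```

Unlike the Cholesky version, repeated calls do not guarantee that the new columns are uncorrelated with each other, so the original function remains the better fit when several columns with a fixed joint correlation structure are required.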

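Tianqi Chen (2018) gives an answer.

This difference has an impact on a corner case in feature importance analysis: correlated features. Imagine two features that are perfectly correlated, feature A and feature B. For one specific tree, if the algorithm needs one of them, it will choose randomly (this is true of both boosting and Random Forests™).

However, in Random Forests™ this random choice is made for each tree, because each tree is independent of the others. Therefore, approximately and depending on your parameters, 50% of the trees will choose feature A and the other 50% will choose feature B. So the importance of the information contained in A and B (which is the same, because they are perfectly correlated) is diluted between A and B. So you won't easily know that this information is important for predicting what you want to predict! It is even worse when you have 10 correlated features…

In boosting, when a specific link between a feature and the outcome has already been learned by the algorithm, it will try not to refocus on it (in theory that is what happens; reality is not always that simple). Therefore, all the importance will be on feature A or on feature B (but not both). You will know that one feature has an important role in the link between the observations and the label. It is still up to you to search for the features correlated with the one detected as important if you need to know all of them.

To sum up, Xgboost does not use the correlated features at random in each tree, whereas a random forest model is subject to exactly that situation.

Reference:

Chen, Tianqi, Michaël Benesty, and Tong He. 2018. "Understand Your Dataset with Xgboost." https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html#numeric-vs-categorical-variables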

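A remark on Sandeep's answer: assuming 2 of your features are highly collinear (say, equal 99% of the time), it is true that only 1 feature is selected at each split, but for the next split xgb can select the other feature. Therefore, the xgb feature ranking will probably rank the 2 collinear features equally. Without some prior knowledge or other feature processing, you have almost no means of detecting from the ranking alone that the 2 features are collinear.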
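Since the ranking alone does not reveal collinearity, one simple complementary check is to scan the correlation matrix for pairs above a threshold. The sketch below uses the correlations printed by evaluate_model(diamonds) earlier in this thread, rounded to four decimals; `correlated_pairs` and the 0.9 threshold are hypothetical choices for illustration.

```r
# Correlation matrix copied (rounded) from the evaluate_model(diamonds) output.
vars <- c("carat", "depth", "table", "x", "y", "z")
cor_mat <- matrix(c(
  1.0000,  0.0282,  0.1816,  0.9751,  0.9517, 0.9534,
  0.0282,  1.0000, -0.2958, -0.0253, -0.0293, 0.0949,
  0.1816, -0.2958,  1.0000,  0.1953,  0.1838, 0.1509,
  0.9751, -0.0253,  0.1953,  1.0000,  0.9747, 0.9708,
  0.9517, -0.0293,  0.1838,  0.9747,  1.0000, 0.9520,
  0.9534,  0.0949,  0.1509,  0.9708,  0.9520, 1.0000
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# List each unordered pair whose absolute correlation exceeds the threshold.
correlated_pairs <- function(cm, threshold = 0.9) {
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    feature_1 = rownames(cm)[idx[, "row"]],
    feature_2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

pairs_found <- correlated_pairs(cor_mat)
print(pairs_found)
```

On real data the matrix would come from cor() on the numeric columns. Pearson correlation only captures linear relationships, so derived features such as ratios may need a rank-based measure instead.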

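Now, as for the relative importance that xgboost outputs, it should be very similar (or perhaps exactly the same) as the ranking from sklearn's gradient boosted trees. For an explanation, see here.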
