What is an intuitive interpretation of the leaf values in XGBoost base learners?

Cross Validated: machine-learning, boosting

2022-03-28 06:43:25

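I am learning XGBoost. Below is the code I used, followed by trees #0 and #1 from the XGBoost model I built.

I am having a hard time understanding what the leaf values mean. Some answers I found say that the value at a leaf is the "conditional probability" of a data sample that lands on that leaf.

But I also found negative values on some leaves. How can a probability be negative?

Can someone provide an intuitive explanation of the leaf values?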

# prepare dataset
import numpy as np
import pandas as pd

train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None)
test_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test',
                      skiprows = 1, header = None) # Make sure to skip a row for the test set

# since the downloaded data has no header, I need to add the headers manually
col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 
              'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
             'wage_class']
train_set.columns = col_labels
test_set.columns = col_labels

# 1. replace ' ?' with nan
# 2. drop all nan
train_noNan = train_set.replace(' ?', np.nan).dropna()
test_noNan  = test_set.replace(' ?', np.nan).dropna()

# replace ' <=50K.' with ' <=50K', and ' >50K.' with ' >50K' in wage_class
test_noNan['wage_class'] = test_noNan.wage_class.replace(
  {' <=50K.'  : ' <=50K',
  ' >50K.'    : ' >50K'
  })

# encode training and test dataset together
combined_set = pd.concat([train_noNan, test_noNan], axis=0)
#
for feature in combined_set.columns:
  # categorical feature columns will have dtype = object
  if combined_set[feature].dtype == 'object':
    combined_set[feature] = pd.Categorical(combined_set[feature]).codes # replace string with integer; this simply counts the # of unique values in a column and maps it to an integer
combined_set.head()

# separate train and test (copy, so that pop() below does not act on a slice of combined_set)
final_train = combined_set[:train_noNan.shape[0]].copy()
final_test  = combined_set[train_noNan.shape[0]:].copy()

# separate feature and label
y_train = final_train.pop('wage_class')
y_test  = final_test.pop('wage_class')

import xgboost as xgb
from xgboost import plot_tree
from sklearn.model_selection import GridSearchCV

# XGBoost has built-in CV, which can use early stopping to prevent overfitting and therefore improve accuracy
## if not using sklearn, I can convert the data into a DMatrix, an XGBoost-specific data structure for training and testing. DMatrix is said to improve the efficiency of the algorithm
xgdmat = xgb.DMatrix(final_train, y_train)

our_params = {
  'eta'             : 0.1,      # aka. learning_rate
  'seed'            : 0, 
  'subsample'       : 0.8, 
  'colsample_bytree': 0.8, 
  'objective'       : 'binary:logistic', 
  'max_depth'       : 3,        # maximum depth of each tree
  'min_child_weight':1} 
# Grid Search CV optimized settings

# create XGBoost object using the parameters
final_gb = xgb.train(our_params, xgdmat, num_boost_round = 432)

import seaborn as sns
sns.set(font_scale = 1.5)

xgb.plot_importance(final_gb)
# after printing the importance of the features, we need to put human insights and try to explain why each feature is important/not important

# visualize the tree
# import matplotlib.pyplot as plt
# xgb.plot_tree(final_gb, num_trees = 0)
# plt.rcParams['figure.figsize'] = [600, 300]  # define the figure size...
# plt.show()
graph_to_save = xgb.to_graphviz(final_gb, num_trees = 0)
graph_to_save.format = 'png'            
graph_to_save.render('tree_0_saved')      # tree_0_saved.png will be saved in the current working directory

graph_to_save = xgb.to_graphviz(final_gb, num_trees = 1)
graph_to_save.format = 'png'            
graph_to_save.render('tree_1_saved')

Below are the dumped trees #0 and #1.

2 Answers

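A gradient boosting machine (GBM), like XGBoost, is an ensemble learning technique where the results of each base learner are combined to generate the final estimate. That said, when performing a binary classification task, XGBoost by default treats it as a logistic regression problem. As a consequence, the raw leaf estimates seen here are log-odds and can be negative.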

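Refresher: within the context of logistic regression, the mean of the binary response is of the form $\mu(X) = \Pr(Y = 1 \mid X)$ and relates to the predictors $X_1, \dots, X_p$ through the logit function: $\log\left(\frac{\mu(X)}{1-\mu(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$. As a consequence, to get probability estimates we need to use the inverse logit (i.e. the logistic) link $\frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}}$. In addition to that, we need to remember that boosting can be presented as a generalised additive model (GAM). In the case of a simple GAM our final estimates are of the form $g[\mu(X)] = \alpha + f_1(X_1) + \dots + f_p(X_p)$, where $g$ is our link function and the $f_j$ are a set of elementary basis functions (usually cubic splines). When boosting, though, we change $f$: instead of some particular family of basis functions, we use the individual base learners we mentioned originally! (See Hastie et al. 2009, Elements of Statistical Learning, Chapter 4.4 "Logistic Regression" and Chapter 10.2 "Boosting Fits an Additive Model" for more details.)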
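To make the logit/logistic relationship concrete, here is a minimal sketch using only the Python standard library. The helper names `logit` and `logistic` are my own and are not part of XGBoost. It shows that a probability below 1/2 corresponds to negative log-odds, and that the two functions undo each other.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Inverse of logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities below 0.5 give negative log-odds, 0.5 gives zero,
# and probabilities above 0.5 give positive log-odds.
for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(f"p={p:.1f} -> log-odds={z:+.4f} -> back to p={logistic(z):.1f}")
```

So a negative number on the log-odds scale is nothing unusual: it just says the event is less likely than not.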

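In the case of a GBM, therefore, the results from each individual tree are indeed combined together, but they are not probabilities (yet). They are the estimates of the score before the logistic transformation that is applied in logistic regression. For that reason, the individual as well as the combined estimates can naturally be negative; the negative sign simply implies "less" chance. OK, talk is cheap, show me the code.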

Assume we have only two base learners, which are simple stumps:

our_params = {
  'eta'             : 0.1,      # aka. learning_rate
  'seed'            : 0, 
  'subsample'       : 0.8, 
  'colsample_bytree': 0.8, 
  'objective'       : 'binary:logistic', 
  'max_depth'       : 1,         # Stumps
  'min_child_weight': 1} 

# create XGBoost object using the parameters
final_gb = xgb.train(our_params, xgdmat, num_boost_round = 2)

Our goal is to predict the first four entries of our test set.

xgdmat4 = xgb.DMatrix(final_test.iloc[0:4,:], y_test[0:4])
mypreds4 = final_gb.predict(data = xgdmat4)
# array([0.43447325, 0.46945405, 0.46945405, 0.5424156 ], dtype=float32)

Plotting the two (sole) trees used:

graph_to_save = xgb.to_graphviz(final_gb, num_trees = 0)
graph_to_save.format = 'png'            
graph_to_save.render('tree_0_saved')

graph_to_save = xgb.to_graphviz(final_gb, num_trees = 1)
graph_to_save.format = 'png'            
graph_to_save.render('tree_1_saved')

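This gives us the following two tree diagrams:

Based on these diagrams, we can check which leaf each entry of our initial sample falls into: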

final_test.iloc[0:4,:][['capital_gain','relationship']]
#       capital_gain  relationship
#0             0             3
#1             0             0
#2             0             0
#3          7688             0

We can compute our own estimates by hand, directly from the logistic function:

1/(1+ np.exp(-(-0.115036212 + -0.148587108))) # First entry 
# 0.4344732254087043
1/(1+ np.exp(-(-0.115036212 + -0.007299904))) # Second entry
# 0.4694540577007751
1/(1+ np.exp(-(-0.115036212 + -0.007299904))) # Third entry
# 0.4694540577007751
1/(1+ np.exp(-(+0.177371055 + -0.007299904))) # Fourth entry
# 0.5424156005710725

It can easily be seen that our manual estimates match (up to 7 digits) the ones we got directly from predict.
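One thing the manual calculation leaves implicit is the intercept. The sketch below assumes the model was trained with base_score = 0.5, which for binary:logistic corresponds to a margin of logit(0.5) = 0 and therefore drops out of the sum. That is an assumption about the setup used here. If base_score has a different value (newer XGBoost releases may estimate it from the data), its logit has to be added to the leaf values before applying the logistic function. The helper names are my own.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumption: base_score of 0.5, so the intercept on the log-odds scale is 0.
base_score = 0.5

# Leaf values reached by the first test entry in tree #0 and tree #1
# (taken from the manual calculation above).
leaves_first_entry = [-0.115036212, -0.148587108]

margin = logit(base_score) + sum(leaves_first_entry)  # raw score (log-odds)
prob = logistic(margin)                               # probability of the positive class

print(margin)
print(prob)  # approximately 0.43447, matching the first prediction
```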

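To recap, the leaves contain the estimates from their respective base learner, on the domain of the function where the gradient boosting procedure takes place. For the binary classification task presented, the link used is the logit, so these estimates represent log-odds; in terms of log-odds, negative values are perfectly normal. To get probability estimates we simply apply the logistic function, which is the inverse of the logit function. Finally, note that we need to compute our final estimate in the gradient boosting domain first and only then transform it back. Transforming the output of each base learner individually and then combining these outputs is wrong, because the linear relation shown does not (necessarily) hold in the domain of the response variable.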
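The last point can be checked numerically. The following sketch reuses the two leaf values of the fourth test entry from the manual calculation and compares the correct order of operations (sum the log-odds, then transform) with two tempting but wrong alternatives (transform each leaf, then sum or average). The variable names are only illustrative.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Leaf values reached by the fourth test entry in tree #0 and tree #1.
leaves_fourth_entry = [0.177371055, -0.007299904]

# Correct: add up the log-odds first, then map to a probability.
right = logistic(sum(leaves_fourth_entry))

# Wrong: map each leaf to a "probability" first, then combine.
per_tree = [logistic(v) for v in leaves_fourth_entry]
wrong_sum = sum(per_tree)                   # exceeds 1, so not even a valid probability
wrong_mean = sum(per_tree) / len(per_tree)  # valid range, but a different number

print(right)       # approximately 0.54242, matching predict
print(wrong_sum)
print(wrong_mean)
```

Neither of the "transform first" variants reproduces the prediction, because the logistic function is not linear.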

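For more information about the logit, I recommend reading the excellent CV.SE thread on the interpretation of simple predictions to odds ratios in logistic regression.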

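If it is a regression model (the objective can be reg:squarederror), then the leaf value is that tree's contribution to the prediction for the given data point. Depending on your target variable, the leaf value can be negative. The final prediction for the data point is the sum of the leaf values it reaches across all trees (plus the model's global bias, base_score).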

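If it is a classification model (the objective can be binary:logistic), then the leaf value is a raw score that is indicative of the data point belonging to the positive class; it is not itself a probability. The final probability prediction is obtained by summing the leaf values (raw scores) across all trees and then mapping that sum into the range between 0 and 1 with the sigmoid function. The leaf value (raw score) can be negative; a value of 0 corresponds to a probability of 1/2.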
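As a rough sketch of the two cases, the function below combines per-tree leaf values according to the objective. It is not XGBoost's implementation: `combine_leaves` is a made-up helper, the leaf values are hypothetical, and the intercept (base_score) is ignored to keep the example short.

```python
import math

def combine_leaves(leaf_values, objective):
    """Turn the leaf values reached in each tree into a final prediction.

    Simplified illustration only: ignores base_score.
    """
    raw_score = sum(leaf_values)
    if objective == "reg:squarederror":
        # Regression: the summed leaf values are already the prediction.
        return raw_score
    if objective == "binary:logistic":
        # Classification: the sum is a log-odds score; squash it with the sigmoid.
        return 1.0 / (1.0 + math.exp(-raw_score))
    raise ValueError(f"objective not covered by this sketch: {objective}")

# Hypothetical leaf values from three trees; they sum to exactly 0.
hypothetical_leaves = [0.5, -0.25, -0.25]

print(combine_leaves(hypothetical_leaves, "reg:squarederror"))  # 0.0
print(combine_leaves(hypothetical_leaves, "binary:logistic"))   # 0.5
print(combine_leaves([-1.0], "binary:logistic"))                # below 0.5
```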

Please find more details about the parameters and outputs at https://xgboost.readthedocs.io/en/latest/parameter.html
