我已经为回归训练了一个 XGBoost 模型,其中最大深度为 2。
# Create the ensemble
ensemble_size = 200
ensemble = xgb.XGBRegressor(n_estimators=ensemble_size, n_jobs=4, max_depth=2, learning_rate=0.1,
objective='reg:squarederror')
ensemble.fit(train_x, train_y)
我已经绘制了集合中的第一棵树:
# Plot single tree
plot_tree(ensemble, rankdir='LR')
现在我检索 XGBoost 集成模型中第一个训练样本的叶索引:
ensemble.apply(train_x[:1]) # leaf indices in all 200 base learner trees
数组([[6, 6, 4, 6, 4, 6, 5, 5, 4, 5, 4, 3, 5, 4, 5, 3, 6, 3, 5, 5, 3, 3, 3, 5, 4, 4, 3, 4, 3, 6, 6, 6, 4, 6, 6, 3, 5, 3, 5, 4, 6, 4, 4, 6, 3, 3, 6, 3, 6, 3, 4, 3, 6, 6, 3, 6, 5, 3, 6, 6, 3, 4, 6, 5, 3, 3, 3, 6, 3, 4, 3, 6, 3, 6, 3, 3, 3, 4, 6, 3, 4, 4, 6, 3, 3, 6, 3, 6, 6, 3, 3, 4, 4, 4, 3, 3, 6, 6, 3, 3, 6, 3, 3, 3, 6, 6, 6, 4, 4, 3, 5, 3, 3, 3, 4, 5, 3, 3, 6, 3, 3, 6, 3, 4, 5, 3, 6, 3, 5, 3, 4, 4, 3, 3, 4, 6, 6, 6, 6, 3, 4, 4, 3, 5, 6, 6, 3, 5, 3, 3, 6, 6, 3, 3, 6, 3, 3, 4, 4, 3, 4, 3, 5, 3, 3, 3, 3, 3, 4, 4, 6, 3, 6, 4, 4, 5, 6, 3, 4, 5, 6, 3, 4, 3, 4, 5, 6, 6, 5, 4, 3, 3, 6, 6, 3, 6, 5, 4, 3, 3]], dtype=int32)
这是我的问题:
既然第一棵树有四个叶子节点,那么第一个训练样本怎么会有索引6呢?
在apply()的官方文档中,它说“叶子在 [0; 2**(self.max_depth+1)) 内编号,可能在编号中有间隙。” 所以如果 max_depth 为 2,叶子的编号在 0 到 7 之间。既然深度为 2 的二叉树中只有四个叶子,那么叶子的编号不应该在 [0, 4) 范围内吗?设计背后的原因是什么 ?
相关问题:https ://stackoverflow.com/questions/58585537/how-to-interpret-the-leaf-index-in-xgboost-tree

