get_dump() leaf values and AUC

data-mining machine-learning decision-trees xgboost
2022-02-27 02:08:38

I fitted a model with Xgboost that reaches an AUC of about 0.73, and printed my last booster:

booster[599]:
0:[userkn_hometypecnt<22] yes=1,no=2,missing=1
    1:[userkn_60d_opencardniu_days<40] yes=3,no=4,missing=3
        3:[userkn_30d_opencardniu_days<13] yes=7,no=8,missing=7
            7:[userkn_60d_opencardniu_days<24] yes=15,no=16,missing=15
                15:[userkn_timeminperiod_firstday<1029] yes=29,no=30,missing=29
                    29:leaf=0.000352735
                    30:leaf=-0.0100666
                16:[userkn_rate_aopencardniusum_actiondaycnt<0.972506] yes=31,no=32,missing=31
                    31:leaf=0.000398097
                    32:leaf=-0.0129448
            8:[userkn_hometyperate<0.0977183] yes=17,no=18,missing=17
                17:leaf=0.0239075
                18:[userkn_rate_aopencardniusum_actiondaycnt<0.957994] yes=35,no=36,missing=35
                    35:leaf=-0.00201536
                    36:leaf=0.00858442
        4:[userkn_newacitoncntactiondayavg<8.82511] yes=9,no=10,missing=9
            9:[userkn_mingap_importcard_open<297306] yes=19,no=20,missing=19
                19:[userkn_rate_aopencardniusum_actiondaycnt<0.974763] yes=37,no=38,missing=37
                    37:leaf=-0.0138254
                    38:leaf=0.00521038
                20:[userkn_onlinetime_firstday<1961.5] yes=39,no=40,missing=39
                    39:leaf=0.0247849
                    40:leaf=-0.00297016
            10:[userkn_60d_opencardniu_days<59] yes=21,no=22,missing=21
                21:[userkn_rate_repeatcntmaxactionrepeatcnt_actioncnt<0.124787] yes=41,no=42,missing=41
                    41:leaf=0.0101992
                    42:leaf=-0.0222082
                22:leaf=0.0145614
    2:[userkn_hometyperate_firstday<0.25266] yes=5,no=6,missing=5
        5:[userkn_aenterapplyloanpagecntactiondayavg<0.787338] yes=11,no=12,missing=11
            11:[userkn_newacitoncntactiondayavg<8.48678] yes=23,no=24,missing=23
                23:[userkn_worktimeactionrate<0.36514] yes=43,no=44,missing=43
                    43:leaf=-0.0178327
                    44:leaf=0.0168168
                24:leaf=0.0254048
            12:[userkn_newacitontyperate_firstday<0.794737] yes=25,no=26,missing=25
                25:[userkn_newacitoncntactiondayavg<7.14581] yes=47,no=48,missing=47
                    47:leaf=0.0175715
                    48:leaf=-0.00748876
                26:leaf=0.0174804
        6:[userkn_aopencardniurate_firstday<0.0458042] yes=13,no=14,missing=13
            13:[userkn_avgperday_opencardniu_cnt<7.44167] yes=27,no=28,missing=27
                27:leaf=0.00171541
                28:leaf=-0.0229204
            14:leaf=0.00968641

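If I am right, the leaf values are log-odds, which can be turned into probabilities with the sigmoid function. However, in the last booster, all leaf values turn into a probability of about 0.5.

Does this mean all samples will be labeled as half good and half bad cases? So there is no difference from random guessing in binary classification?

Am I right? Any other opinions are much appreciated!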

1 Answer

Could you clarify what you mean by "However, in the last booster, all leaf values turn into a probability of about 0.5"?

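My understanding is that when computing the predicted probability, you need to add the base score (default = 0.5) to the estimated weight parameter (the leaf score), like this: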

$$\hat{p} = \frac{\exp(0.5 + w)}{1 + \exp(0.5 + w)}$$

where $w$ is the estimated leaf score.
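To make this concrete: the formula above is written for a single leaf score, but as far as I know the predicted log-odds is the sum of the leaf scores a sample reaches in every tree (all 600 boosters here), and for binary:logistic the base_score of 0.5 is a probability, so it contributes logit(0.5) = 0 on the log-odds scale. Below is a minimal Python sketch under those assumptions. The leaf values are copied from booster[599] in the question; `predict_proba` and the sample that reaches a leaf of 0.005 in each of 600 trees are hypothetical and only there for illustration.

```python
import math


def sigmoid(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Map a probability to log-odds (inverse of sigmoid)."""
    return math.log(p / (1.0 - p))


# A few leaf values copied from booster[599] in the question.
last_tree_leaves = [0.000352735, -0.0100666, 0.0239075,
                    -0.0222082, 0.0254048, -0.0229204]

# Taken alone, each leaf maps to a probability very close to 0.5.
single_leaf_probs = [sigmoid(w) for w in last_tree_leaves]


def predict_proba(leaf_scores, base_score=0.5):
    """Hypothetical helper (not an xgboost function).

    leaf_scores holds one value per tree: the leaf the sample lands in.
    Assumes base_score is a probability that is converted to log-odds.
    """
    margin = logit(base_score) + sum(leaf_scores)
    return sigmoid(margin)


# Made-up sample: assume it reaches a leaf worth 0.005 in each of 600 trees.
hypothetical_scores = [0.005] * 600
p = predict_proba(hypothetical_scores)

print(single_leaf_probs)
print(p)
```

So leaves near zero in the last tree only mean that this tree makes a small correction; the summed prediction can still be far from 0.5.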

Here is a link to the default xgboost parameters in the Python API: https://xgboost.readthedocs.io/en/latest/python/python_api.html

class xgboost.XGBClassifier(max_depth=3, 
      learning_rate=0.1, n_estimators=100, silent=True, 
      objective='binary:logistic', 
      booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, 
      max_delta_step=0, subsample=1, colsample_bytree=1, 
      colsample_bylevel=1, 
      reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, 
      random_state=0, 
      seed=None, missing=None, **kwargs)

base_score: The initial prediction score of all instances, global bias.
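A small sketch of what that global bias does, again assuming the binary:logistic behaviour in which base_score is given as a probability and converted to log-odds before any leaf score is added. The helper names are mine and not part of xgboost; the leaf value is node 17 of booster[599] in the question, and 0.2 is just an arbitrary alternative base score.

```python
import math


def logit(p):
    """Map a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_after_one_leaf(base_score, leaf):
    """Probability after adding a single leaf score to the initial margin."""
    return sigmoid(logit(base_score) + leaf)


leaf = 0.0239075  # node 17 of booster[599] in the question

# Before any tree is added, the prediction is just the base score.
start_default = sigmoid(logit(0.5))
start_skewed = sigmoid(logit(0.2))

# One leaf nudges the prediction away from whatever the starting point was.
shifted_default = prob_after_one_leaf(0.5, leaf)
shifted_skewed = prob_after_one_leaf(0.2, leaf)

print(start_default, shifted_default)
print(start_skewed, shifted_skewed)
```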

Does this answer your question?