Data mining - Reproducing the cutoff from xgboost.train() with XGBClassifier()

(Cross-posted from https://stackoverflow.com/questions/43415724/reproduce-cutoff-in-xgboost-train-with-xgbclassifier)

I have gotten xgboost to produce good predictions using xgboost.train().

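import xgboost as xgb
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=.6)
xgtrain = xgb.DMatrix(X_train, y_train)
xgtest = xgb.DMatrix(X_test)  # needed by predict() below; not defined in the original snippet

param = {'max_depth': 7, 'silent': 1}  # no objective given, so the default applies
bst = xgb.train(param, xgtrain, num_boost_round=2)
y_pred = bst.predict(xgtest)  # continuous scores, not class labels
y_pred = [1. if y_cont > .28 else 0. for y_cont in y_pred]  # apply the 0.28 cutoff
y_true = y_test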

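This approach did not give good results (I am trying to maximize the F1 score) until I realized that the F1 score increases significantly when I apply a threshold to the output. That threshold turned out to be 0.28. Here are some of the predictions before I applied the cutoff and converted them to 0s and 1s: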

[ 0.25447303  0.25383738  0.24621713 ...,  0.24621713  0.24621713 0.24621713]
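A minimal sketch of how a cutoff like 0.28 could be found by sweeping candidate values and keeping the one with the highest F1 score. The helper names `f1_score_binary` and `best_cutoff` and the toy data are hypothetical and not from the question; the sketch is dependency-free and assumes labels are 0/1 and a prediction is positive when its score is strictly greater than the cutoff.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for 0/1 labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_cutoff(y_true, scores, candidates=None):
    """Return (cutoff, f1) for the candidate cutoff with the highest F1.

    Ties are resolved in favour of the first (lowest) candidate tried.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    best = (None, -1.0)
    for c in candidates:
        y_pred = [1 if s > c else 0 for s in scores]
        score = f1_score_binary(y_true, y_pred)
        if score > best[1]:
            best = (c, score)
    return best


# Toy data (hypothetical): scores cluster around 0.25, as in the output above.
toy_true = [0, 0, 0, 1, 1]
toy_scores = [0.246, 0.254, 0.246, 0.31, 0.29]
cutoff, f1 = best_cutoff(toy_true, toy_scores)
```

Choosing the cutoff on the same data used to report the F1 score tends to give an optimistic estimate, so it is probably safer to pick it on a separate validation split.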

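But now I want to tune my parameters (using GridSearchCV()), which means I need to reproduce what I did above with xgboost.train() using XGBClassifier().

I realize things can get tricky because no objective is set in my xgboost.train() call (so the default is used), whereas for XGBClassifier() it is 'binary:logistic'. XGBClassifier() returns classes rather than probabilities, which is useful in most cases, but not here. I tried predict_proba() with XGBClassifier() and then applied a cutoff, but it seemed useless because the probabilities I get are very close to 0 and 1: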

[[  9.99445975e-01   5.54045662e-04]
 [  9.89062011e-01   1.09380139e-02]
 [  9.95234787e-01   4.76523908e-03]
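To illustrate why the 0.28 cutoff has no effect here, the sketch below applies it to the three rows printed above. It assumes that the second column of predict_proba() holds the probability of the positive class (class 1); the helper name `apply_cutoff` is hypothetical.

```python
def apply_cutoff(proba_rows, cutoff):
    """Turn predict_proba-style rows [P(class 0), P(class 1)] into 0/1 labels."""
    return [1 if row[1] > cutoff else 0 for row in proba_rows]


# The three rows shown above, copied from the question.
rows = [
    [9.99445975e-01, 5.54045662e-04],
    [9.89062011e-01, 1.09380139e-02],
    [9.95234787e-01, 4.76523908e-03],
]
labels = apply_cutoff(rows, cutoff=0.28)
print(labels)  # [0, 0, 0]
```

Every positive-class probability in these rows is below about 0.011, so any cutoff from roughly 0.011 up to 1 gives the same labels for them. These scores are on a different scale from the values near 0.25 returned by xgboost.train(), so a cutoff tuned on one is unlikely to carry over to the other.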

How can I complete the code below so that it is equivalent to the xgboost.train() version but uses XGBClassifier? When I try XGBClassifier without a cutoff, I get a terrible F1 score.

from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=.6)
# Note: 100 boosting rounds and learning_rate=0.1 here, versus num_boost_round=2 in the xgboost.train() call
rf = XGBClassifier(max_depth=7, learning_rate=0.1, n_estimators=100, silent=True, objective='binary:logistic', nthread=-1, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, seed=0, missing=None)
rf = rf.fit(X_train, y_train)
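One possible way to carry a cutoff into GridSearchCV() is a custom scoring callable, since scikit-learn accepts any function with the signature `scorer(estimator, X, y)` as `scoring`. The sketch below is an assumption about how this could be wired up, not a confirmed solution to the question: the factory name `make_cutoff_f1_scorer` is hypothetical, the positive class is assumed to be column 1 of predict_proba(), and F1 is computed by hand only to keep the sketch dependency-free (sklearn.metrics.f1_score could be used instead).

```python
def make_cutoff_f1_scorer(cutoff):
    """Build a scorer(estimator, X, y) that thresholds predict_proba at `cutoff`
    and returns the F1 score of the resulting 0/1 labels."""
    def scorer(estimator, X, y):
        proba = estimator.predict_proba(X)
        # Assumption: column 1 holds the probability of the positive class.
        y_pred = [1 if row[1] > cutoff else 0 for row in proba]
        tp = sum(1 for t, p in zip(y, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y, y_pred) if t == 1 and p == 0)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return scorer


# Hypothetical usage (not executed here; param_grid is a placeholder):
#   from sklearn.model_selection import GridSearchCV
#   search = GridSearchCV(XGBClassifier(), param_grid,
#                         scoring=make_cutoff_f1_scorer(0.28))
#   search.fit(X_train, y_train)
```

With a scorer like this, the grid search would rank parameter settings by F1 at the chosen cutoff rather than at the default 0.5. The cutoff itself would probably need to be re-tuned for the 'binary:logistic' outputs instead of reusing 0.28.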