data mining - Tuning the hyperparameters of a gradient boosting classifier and balancing it

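I am not sure whether this is the right Stack. Maybe I should post my question on Cross Validated.

Nevertheless, I follow these steps to tune the hyperparameters of a gradient boosting model: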

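1. Choose the loss based on the problem at hand. I use the default one, deviance.
2. Pick n_estimators as large as is computationally feasible (e.g. 600).
3. Tune max_depth, learning_rate, min_samples_leaf and max_features via grid search.
4. Increase n_estimators further and tune learning_rate again, keeping the other parameters fixed.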

Scikit-learn provides a convenient API for hyperparameter tuning and grid search.

Let's look at the Python code:

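from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# new_features is the feature matrix and target holds the labels;
# only 10% of the data is used for the grid search
train_gs_X, test_gs_X, train_gs_Y, test_gs_Y = train_test_split(
    new_features, target, random_state=42, train_size=0.1)
gb_grid_params = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
                  'max_depth': [4, 6, 8],
                  'min_samples_leaf': [20, 50, 100, 150],
                  #'max_features': [1.0, 0.3, 0.1]
                  }
print(gb_grid_params)

gb_gs = GradientBoostingClassifier(n_estimators=600)

# Exhaustive search over the grid, scored by ROC AUC with 2-fold cross-validation
clf = GridSearchCV(gb_gs,
                   gb_grid_params,
                   cv=2,
                   scoring='roc_auc',
                   verbose=3,
                   n_jobs=10)
clf.fit(train_gs_X, train_gs_Y)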
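
The last step of the procedure (increasing n_estimators and re-tuning only learning_rate) is not shown above. A minimal sketch of it could look like the following. The synthetic data from make_classification, the `tuned_params` values and the small n_estimators are assumptions chosen so that the example runs quickly; they are not results from my data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real features and target (assumption)
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# Hypothetical outcome of the first grid search; with a fitted search
# object these values would come from its best_params_ attribute
tuned_params = {"max_depth": 4, "min_samples_leaf": 20}

# More trees than in the first search, other parameters held fixed
# (100 here only to keep the sketch fast)
gb_more_trees = GradientBoostingClassifier(n_estimators=100,
                                           random_state=42,
                                           **tuned_params)

# Re-tune learning_rate alone
lr_search = GridSearchCV(gb_more_trees,
                         {"learning_rate": [0.1, 0.05, 0.02]},
                         cv=2,
                         scoring="roc_auc")
lr_search.fit(X, y)

print(lr_search.best_params_)
print("best CV ROC AUC: %0.5f" % lr_search.best_score_)
```

Since more trees usually go together with a smaller learning rate, the grid for this second search can be shifted towards lower values than in the first one.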

Once I have the parameter values, I cross-validate the model to check for overfitting.

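from sklearn.model_selection import cross_val_score

# gb is assumed to be a GradientBoostingClassifier built with the tuned parameters
scores = cross_val_score(gb,
                         all_data, target,
                         scoring="roc_auc",
                         n_jobs=6,
                         cv=3)
# The scores are ROC AUC values, not accuracies
print("ROC AUC: %0.5f (+/- %0.5f)" % (scores.mean(), scores.std()))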
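
The held-out scores alone do not show how far the model is from its training performance. One possible way to make the overfitting check more explicit is to request the training scores as well, using cross_validate with return_train_score=True. This is only a sketch: the synthetic data and the model settings are assumptions, not my actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real features and target (assumption)
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# Illustrative settings, not tuned values
gb = GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=42)

# return_train_score=True also reports the score on each training fold
cv_results = cross_validate(gb, X, y,
                            scoring="roc_auc",
                            cv=3,
                            return_train_score=True)

train_auc = cv_results["train_score"].mean()
test_auc = cv_results["test_score"].mean()
print("train ROC AUC: %0.5f" % train_auc)
print("test  ROC AUC: %0.5f" % test_auc)
print("gap:           %0.5f" % (train_auc - test_auc))
```

A large gap between the training and held-out ROC AUC points to overfitting; a small gap with a low held-out score points to a model that is too constrained.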

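Is my approach adequate? Is this the right way to tune the hyperparameters of boosted decision trees? Do you know how to improve my tuning procedure? I know that methods such as Gaussian processes exist, which are faster in the sense that they can find the best hyperparameter configuration in fewer steps, but that is not the question here. I want to improve performance as measured by ROC AUC.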

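The second question is how to handle imbalanced classes.

I have two ideas:

1. Use equal numbers of signal and background (or whatever you call them) events. The problem with this approach is that it discards a large number of potentially useful events.
2. Use the decision tree parameter class_weight.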

See the code below:

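import numpy
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Ratio of background (target == 0) to signal (target == 1) events
signal_event_no = (target == 1).sum()
background_event_no = (target == 0).sum()
ratio_background_to_signal = numpy.round(float(background_event_no) / signal_event_no, 3)
train_X, test_X, train_Y, test_Y = train_test_split(new_features, target, random_state=42, train_size=0.5)
# GradientBoostingClassifier has no class_weight argument, so the ratio is
# applied as a per-sample weight on the signal events when fitting
gb6 = GradientBoostingClassifier(n_estimators=400, learning_rate=0.2, max_depth=6)
gb6.fit(train_X, train_Y, sample_weight=numpy.where(train_Y == 1, ratio_background_to_signal, 1.0))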
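
For comparison, both ideas can be sketched side by side on synthetic data. The data set, the 90/10 class split and the model settings below are assumptions made only for illustration. The first part undersamples the background to the number of signal events; the second keeps every event and balances the classes through the sample_weight argument of fit, here computed with scikit-learn's compute_sample_weight helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data: roughly 90% background (0), 10% signal (1)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

# Idea 1: keep all signal events and an equally sized random subset
# of the background events
rng = np.random.RandomState(42)
signal_idx = np.flatnonzero(y == 1)
background_idx = np.flatnonzero(y == 0)
kept_background_idx = rng.choice(background_idx, size=len(signal_idx),
                                 replace=False)
balanced_idx = np.concatenate([signal_idx, kept_background_idx])
X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]

gb_undersampled = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_undersampled.fit(X_balanced, y_balanced)

# Idea 2: keep every event and give each class the same total weight
weights = compute_sample_weight("balanced", y)
gb_weighted = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_weighted.fit(X, y, sample_weight=weights)

print("events after undersampling:", len(y_balanced), "of", len(y))
print("total weight per class:", weights[y == 0].sum(), weights[y == 1].sum())
```

The weighted variant uses all events, which addresses the drawback of the undersampling idea, at the cost of longer training on the full data set.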

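Any other ideas?

Last but not least: how does the hyperparameter tuning procedure change for xgboost? Which hyperparameters should I pay attention to? Are the settings the same as for the Gradient Boosted classifier?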