
GradientBoostingClassifier (sklearn) takes a very long time to train

Tags: classification, dataset, scikit-learn, boosting

2022-03-25 16:48:52

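I am working with a dataset of 61879 data points and 102 features. On this dataset, a random forest (sklearn) takes less than 90 seconds to train 100 estimators, whereas GradientBoostingClassifier (sklearn) takes forever to train the same number of estimators. Is there any way to speed up training for GradientBoostingClassifier?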

1 answer

Perhaps a bit late, but here goes.

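1 - sklearn's random forest supports multithreading (through its n_jobs parameter). GradientBoostingClassifier does not, because boosting builds its trees one after another. This can account for an 8x speedup.

2 - sklearn's random forest works on a subset of the total number of features (at least by default), whereas GradientBoostingClassifier uses all features to grow each tree.

If you set the max_features parameter for GBC, you can observe a huge speedup (but the results will differ). From the sklearn documentation: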

max_features : int, float, string or None, optional (default=None)
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.

Choosing max_features < n_features leads to a reduction of variance and 
an increase in bias.
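
The effect of max_features quoted above can be checked with a small timing sketch. It is only an illustration: the data is a synthetic stand-in generated with make_classification (much smaller than the dataset in the question), timed_fit is a hypothetical helper, and the timings will vary by machine. The docstring quoted above comes from an older release, and newer releases may no longer accept "auto", so the sketch uses "sqrt".

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the real data (assumption: sizes chosen so it runs quickly).
X, y = make_classification(
    n_samples=2000, n_features=100, n_informative=20, random_state=0
)


def timed_fit(max_features):
    """Fit a small GBC and return the fitted model with its wall-clock training time."""
    model = GradientBoostingClassifier(
        n_estimators=20, max_features=max_features, random_state=0
    )
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start


# None: every split considers all 100 features (the default).
full_model, full_time = timed_fit(None)
# "sqrt": every split considers only sqrt(n_features) randomly chosen features.
sqrt_model, sqrt_time = timed_fit("sqrt")

print(f"max_features=None   -> {full_model.max_features_} features/split, {full_time:.2f}s")
print(f"max_features='sqrt' -> {sqrt_model.max_features_} features/split, {sqrt_time:.2f}s")
```

Comparing accuracy on a held-out set alongside the timings is worthwhile, since the two models are not equivalent.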

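Point 2 is a trade-off between model choice and performance. As for point 1, a GBC implementation that supports multithreading is now available: xgboost, https://github.com/dmlc/xgboost. I use it from R, but the Python implementation seems easier to use.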
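
As a rough sketch of the xgboost route from Python, the snippet below uses xgboost's scikit-learn style wrapper, XGBClassifier. It assumes the xgboost package is installed, fit_xgb is a hypothetical helper name, and the parameter values are illustrative rather than tuned.

```python
import os


def fit_xgb(X, y, n_estimators=100):
    """Train a gradient-boosted classifier with xgboost on all available cores."""
    # Imported here so the module loads even when xgboost is not installed.
    from xgboost import XGBClassifier

    model = XGBClassifier(
        n_estimators=n_estimators,
        n_jobs=os.cpu_count(),  # number of threads used for tree construction
        tree_method="hist",     # histogram-based split finding
    )
    model.fit(X, y)
    return model
```

Calling fit_xgb(X_train, y_train) returns a fitted model with the usual predict and predict_proba methods, so it can stand in for GradientBoostingClassifier in most sklearn-style code. The results will not be identical, because the two libraries use different defaults and split-finding strategies.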

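Edit. Regarding the training time of various algorithms, you may be interested in learning more about the computational complexity of machine learning methods.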
