Why does GridSearchCV fail to fit?

data-mining machine-learning classification predictive-modeling random-forest cross-validation
2022-03-05 09:33:56

I have already mentioned this post here, but it got no answers.

I am using a random forest classifier for binary classification. My dataset has shape (977, 8) with a class ratio of 77:23. My system has 4 cores and 8 logical processors.

Since my dataset is imbalanced, I used BalancedBaggingClassifier (with a random forest as the base estimator).

So I used GridSearchCV to find the best parameters for the balanced bagging model, trained/fit the model, and then made predictions.

My code looks like this:

from sklearn.model_selection import StratifiedKFold, GridSearchCV

# Hyperparameter grid for the balanced bagging model (rf_boruta)
n_estimators = [100, 300, 500, 800, 1200]
max_samples = [5, 10, 25, 50, 100]
max_features = [1, 2, 5, 10, 13]
hyperbag = dict(n_estimators=n_estimators, max_samples=max_samples,
                max_features=max_features)

skf = StratifiedKFold(n_splits=10, shuffle=False)
gridbag = GridSearchCV(rf_boruta, hyperbag, cv=skf, scoring='f1',
                       verbose=3, n_jobs=-1)
gridbag.fit(ord_train_t, y_train)

However, the log produced in the Jupyter console shows that the GridSearchCV score is nan for some of the CV runs, as shown below.

You can see that for some CV runs the grid score is nan. Can anyone help me? Also, it has been running for more than half an hour and still has not produced any output.

Why does GridSearchCV return nan?

[CV 10/10] END max_features=1, max_samples=25, n_estimators=500;, score=nan total time= 4.5min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.596 total time=10.4min
[CV 5/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.622 total time=10.4min
[CV 6/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.456 total time=10.5min
[CV 9/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.519 total time=10.5min
[CV 5/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 3.3min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 9.9min
[CV 8/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 7.0min
[CV 6/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time=10.7min
[CV 1/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.652 total time=16.4min
[CV 9/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 7.6min
[CV 2/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.528 total time=16.6min
[CV 3/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.571 total time=16.4min
[CV 7/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.553 total time=16.1min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time= 6.7min
[CV 8/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time= 1.7min
[CV 10/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.489 total time=16.0min
[CV 3/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time=18.6min
[CV 1/10] END max_features=1, max_samples=50, n_estimators=100;, score=0.652 total time= 2.4min
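
A nan score means that the fit for that split raised an exception, which GridSearchCV swallows by default (error_score defaults to np.nan in scikit-learn). A traceback like the one in the update below can be surfaced by re-running the search with error_score='raise', reusing the same objects as above:

gridbag = GridSearchCV(rf_boruta, hyperbag, cv=skf, scoring='f1',
                       verbose=3, n_jobs=-1,
                       error_score='raise')  # re-raise the fit error instead of scoring nan
gridbag.fit(ord_train_t, y_train)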

Update - error traceback - why the fit failed

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<timed exec> in <module>

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    889                 return results
    890 
--> 891             self._run_search(evaluate_candidates)
    892 
    893             # multimetric is determined here because in the case of a callable

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
   1390     def _run_search(self, evaluate_candidates):
   1391         """Search all candidates in param_grid"""
-> 1392         evaluate_candidates(ParameterGrid(self.param_grid))
   1393 
   1394 

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params, cv, more_results)
    836                     )
    837 
--> 838                 out = parallel(
    839                     delayed(_fit_and_score)(
    840                         clone(base_estimator),

~\AppData\Roaming\Python\Python39\site-packages\joblib\parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

~\AppData\Roaming\Python\Python39\site-packages\joblib\parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

~\AppData\Roaming\Python\Python39\site-packages\joblib\_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

~\Anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
    443                     raise CancelledError()
    444                 elif self._state == FINISHED:
--> 445                     return self.__get_result()
    446                 else:
    447                     raise TimeoutError()

~\Anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
    388         if self._exception:
    389             try:
--> 390                 raise self._exception
    391             finally:
    392                 # Break a reference cycle with the exception in self._exception

ValueError: The target 'y' needs to have more than 1 class. Got 1 class instead
1 Answer

First, I want to make sure you realize what you are building here. You are doing (balanced) bagging with 100 to 1200 estimators, each of which is a random forest of 300 trees. So each model is built on 100 × 300 = 30k up to 1200 × 300 = 360k trees. Your grid search has 5^3 = 125 hyperparameter combinations, times 10 folds. So you are fitting on the order of 10^8 individual trees.
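
A quick back-of-the-envelope check of that count, under the question's stated assumption that the base estimator is a 300-tree random forest:

# Trees fit by the whole search: sum over the n_estimators grid, times the
# 5 x 5 other hyperparameter values, times 10 CV folds.
n_estimators = [100, 300, 500, 800, 1200]
trees_per_bag = 300                      # trees in each inner random forest
total = sum(n * trees_per_bag for n in n_estimators) * 5 * 5 * 10
print(f"{total:,}")                      # 217,500,000 -- order 10**8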

The grid search splits your data into 10 folds, stratified so that the class balance in each fold should match the full dataset. Now the balanced bagging is set to use only 25 rows, but it also uses the default "not minority" strategy, which means it only tries to downsample the majority class. Those two together are incompatible, so I am not sure what actually ends up happening (I will look into it later if I find the time). Since not all of the scores are nan, it evidently works some of the time. But those scarce 25 rows are then used to train a random forest, so it is conceivable that occasionally a tree in there draws a bag with no examples of one of the classes. I suspect that is the problem.
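
A rough illustration of that suspicion: if a bag of max_samples rows ends up drawn at roughly the dataset's 77:23 ratio (an assumption, since the interaction with the "not minority" strategy is exactly what is unclear here), the chance of a single-class bag is easy to estimate, and one such bag is enough to trigger the ValueError above:

p_majority = 0.77   # majority-class share from the question
for max_samples in [5, 10, 25, 50, 100]:
    # probability that every row in the bag comes from the same class;
    # even a small per-bag probability compounds over hundreds of bags per model
    p_one_class = p_majority ** max_samples + (1 - p_majority) ** max_samples
    print(f"max_samples={max_samples:>3}: P(single-class bag) ~ {p_one_class:.2%}")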

A BalancedBaggingClassifier with a single decision tree as the base estimator is just a fancier version of a (balanced) random forest, so that is my suggestion. You also don't need to set class_weight in the trees, since the balanced bagging already evens out the classes. I would expect better performance with larger max_samples, but even without changing it, you would then expect roughly 12.5 rows per class per tree. If you really do want balanced bagging of random forests, then be sure to increase the number of rows reaching each tree.
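
A minimal sketch of that suggestion, assuming imbalanced-learn is installed and reusing the variable names from the question; the parameter values are illustrative, not tuned:

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Balanced bagging over single decision trees: effectively a balanced
# random forest. No class_weight needed -- each bag is already balanced.
balanced_rf = BalancedBaggingClassifier(
    estimator=DecisionTreeClassifier(),  # `base_estimator=` in older imbalanced-learn versions
    n_estimators=500,                    # illustrative; tune as usual
    random_state=0,
)
balanced_rf.fit(ord_train_t, y_train)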