
Grid searching pool classifiers for a dynamic classifier selection ensemble

Tags: data mining, grid search

2022-02-15 05:12:54

I want to grid search the pool_classifiers hyperparameter of the OLA() (Overall Local Accuracy) model from the deslib Python package.

from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from deslib.dcs.ola import OLA

Then:

X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=999)

model = OLA()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=999)

grid = dict()
grid['pool_classifiers'] = [[LogisticRegression(), DecisionTreeClassifier(), GaussianNB()],
                            [LogisticRegression(), DecisionTreeClassifier()]]

search = GridSearchCV(model, grid, scoring='accuracy', cv=cv)
search_results = search.fit(X, y)

But this raises the following error message:

NotFittedError: This LogisticRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

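This suggests that the models in the pool must be fitted before the grid search, but I assumed the fitting would happen on each training fold during the cv step.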

Does this mean I have to fit each model on the training folds myself?
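One plausible explanation, offered as an assumption rather than a confirmed diagnosis: OLA expects the models in pool_classifiers to be fitted already, and fitting them before the search would not help either, because scikit-learn's clone (which recent versions of GridSearchCV apply to candidate parameters as well as to the estimator) returns unfitted copies of every estimator in a list. The sketch below uses only scikit-learn; `pool` and `is_fitted` are illustrative names.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import check_is_fitted

X, y = make_classification(n_samples=200, n_features=20, random_state=999)

# Fit every model of an illustrative pool up front.
pool = [
    LogisticRegression(max_iter=1000).fit(X, y),
    DecisionTreeClassifier(random_state=0).fit(X, y),
]

# clone() applied to a list clones each element, dropping the fitted state.
cloned_pool = clone(pool)


def is_fitted(model):
    """Return True if scikit-learn considers the model fitted."""
    try:
        check_is_fitted(model)
        return True
    except NotFittedError:
        return False


original_fitted = [is_fitted(m) for m in pool]
cloned_fitted = [is_fitted(m) for m in cloned_pool]
print(original_fitted, cloned_fitted)
```

If this is indeed the cause, the base models have to be fitted inside each training fold, outside of GridSearchCV.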

Thanks for any help on this topic.

1 Answer

Here is my solution to this problem:

Suppose I have the pools of classifiers that I want to cross-validate:

grid = dict()
grid['pool_classifiers'] = [('pool_01', [LogisticRegression(), DecisionTreeClassifier(), GaussianNB()]),
                            ('pool_02', [LogisticRegression(), DecisionTreeClassifier()])]

Then I can use the following function:

import numpy as np
from sklearn.metrics import accuracy_score

def ola_cv(X, y, cv):
    # One list of per-fold scores for each named pool in the global grid.
    scores = dict()
    for pool_classifiers in grid['pool_classifiers']:
        scores[pool_classifiers[0]] = list()

    for train_ix, test_ix in cv.split(X, y):
        X_train = X[train_ix, :]
        y_train = y[train_ix]
        X_test = X[test_ix, :]
        y_test = y[test_ix]

        for pool_classifiers in grid['pool_classifiers']:
            # Refit every base model of the pool on the current training fold.
            for model in pool_classifiers[1]:
                model.fit(X_train, y_train)

            # Build OLA on the fitted pool, then fit it on the same training fold.
            ola = OLA(pool_classifiers=pool_classifiers[1])
            ola.fit(X_train, y_train)

            # Score the ensemble on the held-out fold.
            y_pred = ola.predict(X_test)
            score = accuracy_score(y_test, y_pred)
            scores[pool_classifiers[0]].append(score)

    # Report the mean and standard deviation of the accuracy for each pool.
    for k in scores.keys():
        print(f'pool_classifiers : {k} | accuracy : {np.mean(scores[k])} ({np.std(scores[k])})')

    return scores

This gives, for each pool of classifiers, the mean accuracy (and standard deviation) over 10 * 3 = 30 folds.
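The fold count can be checked directly on the splitter. A quick sketch, assuming the same RepeatedStratifiedKFold settings as in the question (`n_scores` is an illustrative name):

```python
from sklearn.model_selection import RepeatedStratifiedKFold

# 10 splits repeated 3 times, as in the question.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=999)

# Total number of train/test splits, i.e. scores collected per pool.
n_scores = cv.get_n_splits()
print(n_scores)  # 30
```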

Note: the grid should probably be a parameter of the ola_cv function.
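A rough sketch of what that could look like, under a few assumptions that are not part of the answer: the function name `pool_cv`, the `pools` argument (a list of (name, models) tuples) and the `make_ensemble` callable are all hypothetical, and the base models are cloned per fold so that the caller's instances are never modified.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score


def pool_cv(X, y, cv, pools, make_ensemble):
    """Cross-validate several classifier pools (hypothetical helper).

    pools: list of (name, [unfitted estimators]) tuples.
    make_ensemble: callable that takes a fitted pool and returns an
        unfitted ensemble, e.g. lambda pool: OLA(pool_classifiers=pool).
    Returns a dict mapping each pool name to its list of fold accuracies.
    """
    scores = {name: [] for name, _ in pools}

    for train_ix, test_ix in cv.split(X, y):
        X_train, y_train = X[train_ix], y[train_ix]
        X_test, y_test = X[test_ix], y[test_ix]

        for name, models in pools:
            # Fresh clones per fold: the models passed in stay unfitted.
            fitted = [clone(m).fit(X_train, y_train) for m in models]

            # Build and fit the ensemble on the same training fold.
            ensemble = make_ensemble(fitted)
            ensemble.fit(X_train, y_train)

            y_pred = ensemble.predict(X_test)
            scores[name].append(accuracy_score(y_test, y_pred))

    for name, fold_scores in scores.items():
        print(f'{name} | accuracy : {np.mean(fold_scores)} ({np.std(fold_scores)})')

    return scores
```

With deslib installed, a call might look like `pool_cv(X, y, cv, grid['pool_classifiers'], lambda pool: OLA(pool_classifiers=pool))`. Passing the constructor as a callable is only a design choice of this sketch; it keeps the helper independent of any particular dynamic selection method.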
