Nested cross-validation and selecting the best regression model - is this the right SKLearn process?

data-mining python scikit-learn cross-validation model-selection
2021-09-16 08:17:39

If I understand correctly, nested CV can help me evaluate which model and hyperparameter tuning process is best. The inner loop (GridSearchCV) finds the best hyperparameters, and the outer loop (cross_val_score) evaluates the hyperparameter tuning algorithm. I then choose from the outer loop the tuning/model combination that minimizes mse (I'm working with regressors) for my final model test.

I have read the questions/answers on nested cross-validation, but haven't seen an example of a full pipeline that uses it. So, does my code below (please ignore the actual hyperparameter ranges - this is just for example) and thought process make sense?

from sklearn.cross_validation import cross_val_score, train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.datasets import make_regression
import numpy as np

# create some regression data
X, y = make_regression(n_samples=1000, n_features=10)
params = [{'C':[0.01,0.05,0.1,1]},{'n_estimators':[10,100,1000]}]

# setup models, variables
mean_score = []
models = [SVR(), RandomForestRegressor()]

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3)

# estimate performance of hyperparameter tuning and model algorithm pipeline
for idx, model in enumerate(models):
    clf = GridSearchCV(model, params[idx], scoring='mean_squared_error')

    # this performs a nested CV in SKLearn
    score = cross_val_score(clf, X_train, y_train, scoring='mean_squared_error')

    # get the mean MSE across each fold
    mean_score.append(np.mean(score))
    print('Model:', model, 'MSE:', mean_score[-1])

# estimate generalization performance of the best model selection technique
best_idx = mean_score.index(max(mean_score)) # because SKLearn flips MSE signs, max works OK here
best_model = models[best_idx]

clf_final = GridSearchCV(best_model, params[best_idx])
clf_final.fit(X_train, y_train)

y_pred = clf_final.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print('Final Model:', best_model, 'Final model RMSE:', rmse)
3 Answers

Yours is not an example of nested cross-validation.

Nested cross-validation is useful for determining whether, say, a random forest or an SVM is better suited for your problem. Nested CV only outputs a score; it does not output a model like your code does.

This would be an example of nested cross-validation:

from sklearn.datasets import load_boston
from sklearn.cross_validation import KFold
from sklearn.metrics import mean_squared_error
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import numpy as np

params = [{'C': [0.01, 0.05, 0.1, 1]}, {'n_estimators': [10, 100, 1000]}]
models = [SVR(), RandomForestRegressor()]

df = load_boston()
X = df['data']
y = df['target']

# outer loop: each KFold split is used to estimate generalization error
cv = [[] for _ in range(len(models))]
for tr, ts in KFold(len(X)):
    # inner loop: GridSearchCV tunes hyperparameters on the training fold only
    for i, (model, param) in enumerate(zip(models, params)):
        best_m = GridSearchCV(model, param)
        best_m.fit(X[tr], y[tr])
        # score the tuned model on the held-out outer fold
        s = mean_squared_error(y[ts], best_m.predict(X[ts]))
        cv[i].append(s)
# mean MSE across the outer folds, one value per model
print(np.mean(cv, 1))

A couple of side thoughts, by the way:

  • I don't think grid-searching over n_estimators for your random forest serves any purpose. Obviously, more is better. Something like max_depth is the kind of regularization you want to optimize instead. The nested-CV error of RandomForest was much higher because you were not optimizing the right hyperparameters, not necessarily because it is a worse model.
  • You might also want to try gradient boosted trees (see the sketch after this list).
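
To make both points concrete, here is a minimal sketch of how the candidate list from the question could be adjusted; the grid values are illustrative assumptions, not tuned choices:

from sklearn.ensemble import GradientBoostingRegressor

# tune regularization (max_depth) instead of n_estimators, and add
# gradient boosted trees as a third candidate (illustrative grids only)
models = [SVR(), RandomForestRegressor(), GradientBoostingRegressor()]
params = [{'C': [0.01, 0.05, 0.1, 1]},
          {'max_depth': [3, 5, 10, None]},
          {'max_depth': [2, 3, 5]}]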

Nested cross-validation estimates the generalization error of a model, so it is a good way to choose the best model from a list of candidate models and their associated parameter grids. The original post is close to doing nested CV: rather than doing a single train-test split, you should instead use a second cross-validation splitter. That is, you "nest" an "inner" cross-validation splitter inside an "outer" cross-validation splitter.

The inner cross-validation splitter is used to choose hyperparameters. The outer cross-validation splitter averages the test error over multiple train-test splits. Averaging the generalization error over multiple train-test splits gives a more reliable estimate of the model's accuracy on unseen data.

I modified the original post's code to update it to the latest version of sklearn (replacing sklearn.cross_validation with sklearn.model_selection and 'mean_squared_error' with 'neg_mean_squared_error'), and I used two KFold cross-validation splitters to select the best model. To learn more about nested cross-validation, see the sklearn example on nested cross-validation.

from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import numpy as np

# `outer_cv` creates 3 folds for estimating generalization error
outer_cv = KFold(3)

# when we train on a certain fold, we use a second cross-validation
# split in order to choose hyperparameters
inner_cv = KFold(3)

# create some regression data
X, y = make_regression(n_samples=1000, n_features=10)

# give shorthand names to models and use those as dictionary keys mapping
# to models and parameter grids for that model
models_and_parameters = {
    'svr': (SVR(),
            {'C': [0.01, 0.05, 0.1, 1]}),
    'rf': (RandomForestRegressor(),
           {'max_depth': [5, 10, 50, 100, 200, 500]})}

# we will collect the average of the scores on the 3 outer folds in this dictionary
# with keys given by the names of the models in `models_and_parameters`
average_scores_across_outer_folds_for_each_model = dict()

# find the model with the best generalization error
for name, (model, params) in models_and_parameters.items():
    # this object is a regressor that also happens to choose
    # its hyperparameters automatically using `inner_cv`
    regressor_that_optimizes_its_hyperparams = GridSearchCV(
        estimator=model, param_grid=params,
        cv=inner_cv, scoring='neg_mean_squared_error')

    # estimate generalization error on the 3-fold splits of the data
    scores_across_outer_folds = cross_val_score(
        regressor_that_optimizes_its_hyperparams,
        X, y, cv=outer_cv, scoring='neg_mean_squared_error')

    # get the mean MSE across each of outer_cv's 3 folds
    average_scores_across_outer_folds_for_each_model[name] = np.mean(scores_across_outer_folds)
    error_summary = 'Model: {name}\nMSE in the 3 outer folds: {scores}.\nAverage error: {avg}'
    print(error_summary.format(
        name=name, scores=scores_across_outer_folds,
        avg=np.mean(scores_across_outer_folds)))
    print()

print('Average score across the outer folds: ',
      average_scores_across_outer_folds_for_each_model)

many_stars = '\n' + '*' * 100 + '\n'
print(many_stars + 'Now we choose the best model and refit on the whole dataset' + many_stars)

best_model_name, best_model_avg_score = max(
    average_scores_across_outer_folds_for_each_model.items(),
    key=(lambda name_averagescore: name_averagescore[1]))

# get the best model and its associated parameter grid
best_model, best_model_params = models_and_parameters[best_model_name]

# now we refit this best model on the whole dataset so that we can start
# making predictions on other data, and now we have a reliable estimate of
# this model's generalization error and we are confident this is the best model
# among the ones we have tried
final_regressor = GridSearchCV(best_model, best_model_params, cv=inner_cv)
final_regressor.fit(X, y)

print('Best model: \n\t{}'.format(best_model), end='\n\n')
print('Estimation of its generalization error (negative mean squared error):\n\t{}'.format(
    best_model_avg_score), end='\n\n')
print('Best parameter choice for this model: \n\t{params}'
      '\n(according to cross-validation `{cv}` on the whole dataset).'.format(
      params=final_regressor.best_params_, cv=inner_cv))
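
Once fitted, final_regressor behaves like any other sklearn estimator. A hypothetical usage sketch (X_new is made-up data with the same 10 features as above):

# hypothetical new observations with the same 10 features
X_new = np.random.randn(5, 10)

# GridSearchCV delegates predict() to the refitted best estimator
print(final_regressor.predict(X_new))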

You don't need

# this performs a nested CV in SKLearn
score = cross_val_score(clf, X_train, y_train, scoring='mean_squared_error')

GridSearchCV does this for you. To gain intuition about the grid search process, try GridSearchCV(..., verbose=3).

To extract scores for each fold, see this example in the scikit-learn documentation.
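
As a rough illustration of both tips, here is a minimal sketch assuming the modern sklearn.model_selection API, where the per-fold scores live under the 'split<k>_test_score' keys of cv_results_:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10)

# verbose=3 prints the progress of every individual fit
grid = GridSearchCV(SVR(), {'C': [0.1, 1]}, cv=3,
                    scoring='neg_mean_squared_error', verbose=3)
grid.fit(X, y)

# per-fold test scores for each parameter setting in the grid
for k in range(3):
    print(grid.cv_results_['split%d_test_score' % k])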