Nested cross-validation estimates the generalization error of a model, so it is a good way to choose the best model from a list of candidate models and their associated parameter grids. The original post is close to doing nested CV: instead of a single train-test split, use a second cross-validation splitter. That is, you "nest" an "inner" cross-validation splitter inside an "outer" cross-validation splitter.
The inner cross-validation splitter is used to choose hyperparameters. The outer cross-validation splitter averages the test error over multiple train-test splits. Averaging the generalization error over multiple train-test splits gives a more reliable estimate of how accurate the model will be on unseen data.
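In scikit-learn terms, that nesting is just a GridSearchCV (driven by the inner splitter) passed to cross_val_score (driven by the outer splitter). Here is a minimal sketch of the pattern; the toy dataset and the single SVR grid are only placeholders, and the full version applied to several candidate models follows below:

from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.svm import SVR
from sklearn.datasets import make_regression

X_toy, y_toy = make_regression(n_samples=100, n_features=5)
inner_cv = KFold(3)   # chooses hyperparameters
outer_cv = KFold(3)   # estimates generalization error

# the tuned estimator (inner loop) is what the outer loop scores
tuned_svr = GridSearchCV(SVR(), {'C': [0.1, 1]}, cv=inner_cv,
                         scoring='neg_mean_squared_error')
nested_scores = cross_val_score(tuned_svr, X_toy, y_toy, cv=outer_cv,
                                scoring='neg_mean_squared_error')
print(nested_scores.mean())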
I modified the code from the original post to bring it up to date with the current version of sklearn (sklearn.cross_validation has been replaced by sklearn.model_selection, and 'mean_squared_error' by 'neg_mean_squared_error'), and I used two KFold cross-validation splitters to select the best model. To read more about nested cross-validation, see sklearn's example on nested cross-validation.
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import numpy as np
# `outer_cv` creates 3 folds for estimating generalization error
outer_cv = KFold(3)
# when we train on a certain fold, we use a second cross-validation
# split in order to choose hyperparameters
inner_cv = KFold(3)
# create some regression data
X, y = make_regression(n_samples=1000, n_features=10)
# give shorthand names to models and use those as dictionary keys mapping
# to models and parameter grids for that model
models_and_parameters = {
    'svr': (SVR(),
            {'C': [0.01, 0.05, 0.1, 1]}),
    'rf': (RandomForestRegressor(),
           {'max_depth': [5, 10, 50, 100, 200, 500]})}
# we will collect the average of the scores on the 3 outer folds in this dictionary
# with keys given by the names of the models in `models_and_parameters`
average_scores_across_outer_folds_for_each_model = dict()
# find the model with the best generalization error
for name, (model, params) in models_and_parameters.items():
    # this object is a regressor that also happens to choose
    # its hyperparameters automatically using `inner_cv`
    regressor_that_optimizes_its_hyperparams = GridSearchCV(
        estimator=model, param_grid=params,
        cv=inner_cv, scoring='neg_mean_squared_error')

    # estimate generalization error on the 3-fold splits of the data
    scores_across_outer_folds = cross_val_score(
        regressor_that_optimizes_its_hyperparams,
        X, y, cv=outer_cv, scoring='neg_mean_squared_error')

    # get the mean MSE across each of outer_cv's 3 folds
    average_scores_across_outer_folds_for_each_model[name] = np.mean(scores_across_outer_folds)

    error_summary = 'Model: {name}\nMSE in the 3 outer folds: {scores}.\nAverage error: {avg}'
    print(error_summary.format(
        name=name, scores=scores_across_outer_folds,
        avg=np.mean(scores_across_outer_folds)))
    print()
print('Average score across the outer folds: ',
      average_scores_across_outer_folds_for_each_model)
many_stars = '\n' + '*' * 100 + '\n'
print(many_stars + 'Now we choose the best model and refit on the whole dataset' + many_stars)
best_model_name, best_model_avg_score = max(
    average_scores_across_outer_folds_for_each_model.items(),
    key=(lambda name_averagescore: name_averagescore[1]))
# get the best model and its associated parameter grid
best_model, best_model_params = models_and_parameters[best_model_name]
# now we refit this best model on the whole dataset so that we can start
# making predictions on other data, and now we have a reliable estimate of
# this model's generalization error and we are confident this is the best model
# among the ones we have tried
# use the same scoring here as in the model-selection loop above, so the
# final hyperparameter choice is made with the same criterion
final_regressor = GridSearchCV(best_model, best_model_params, cv=inner_cv,
                               scoring='neg_mean_squared_error')
final_regressor.fit(X, y)
print('Best model: \n\t{}'.format(best_model), end='\n\n')
print('Estimation of its generalization error (negative mean squared error):\n\t{}'.format(
    best_model_avg_score), end='\n\n')
print('Best parameter choice for this model: \n\t{params}'
      '\n(according to cross-validation `{cv}` on the whole dataset).'.format(
          params=final_regressor.best_params_, cv=inner_cv))
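Once final_regressor has been refit on the whole dataset, it behaves like any fitted scikit-learn estimator, so you can score new observations directly. A small hypothetical follow-up (X_new is made-up data, not part of the original post; it just needs the same 10 features as X):

# hypothetical unseen data with the same 10 features as `X`
X_new = np.random.RandomState(0).normal(size=(5, 10))
predictions = final_regressor.predict(X_new)  # uses the best estimator found by the final grid search
print(predictions)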