After reading many articles about cross-validation, I am now confused. I understand that cross-validation is used to estimate model performance and to choose the best algorithm from among several candidates. After selecting the best model (by looking at the mean and standard deviation of the CV scores), we train that model on the entire dataset (training and validation sets) and use it for real-world predictions.
Suppose that out of the 3 algorithms used in cross-validation, I pick the best one. What I don't get is: at what point in this process do we tune the hyperparameters? Do we tune them during the cross-validation itself, using nested cross-validation, or do we first select the best-performing algorithm via cross-validation and then tune the hyperparameters only for that algorithm?
PS: I split my dataset into a training set, a test set, and a validation set. I use the training and test sets to build and test my model (this includes all preprocessing steps and the nested CV), and I use the validation set to evaluate my final model.
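To make the PS concrete, here is a minimal sketch of the split I am describing (the proportions and the make_regression data are just placeholders, not my real dataset):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10)
# hold out the validation set that only the final model will ever see
X_rest, X_valid, y_rest, y_valid = train_test_split(X, y, test_size=0.2, random_state=69)
# split the remainder into the train and test sets used for preprocessing and nested CV
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=69)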
Edit 1: Below are the two ways I am performing nested cross-validation. Which one is the correct approach, i.e. which one does not lead to data leakage / overfitting / bias?
Approach 1: perform nested CV for multiple algorithms and their hyperparameters at the same time:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
# create some regression data
X, y = make_regression(n_samples=1000, n_features=10)
# setup models, variables
results = pd.DataFrame(columns = ['model', 'params', 'mean_mse', 'std_mse'])
models = [SVR(), RandomForestRegressor(random_state = 69)]
params = [{'C':[0.01,0.05]},{'n_estimators':[10,100]}]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3)
# estimate performance of hyperparameter tuning and model algorithm pipeline
for idx, model in enumerate(models):
    # inner loop: hyperparameter tuning for this algorithm
    clf = GridSearchCV(model, params[idx], cv=3, scoring='neg_mean_squared_error')
    clf.fit(X_train, y_train)
    # outer loop: passing the GridSearchCV object to cross_val_score makes this a nested CV in scikit-learn
    score = cross_val_score(clf, X_train, y_train, cv=3, scoring='neg_mean_squared_error')
    row = {'model': model,
           'params': clf.best_params_,
           'mean_mse': score.mean(),
           'std_mse': score.std()}
    # append the results to the dataframe (DataFrame.append was removed in pandas 2.x, so use concat)
    results = pd.concat([results, pd.DataFrame([row])], ignore_index=True)
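After this loop I would pick the winner from results and refit it, with its tuned hyperparameters, on all of the training data. This is only a sketch of what I have in mind; best_idx, best_model and final_model are illustrative names, not part of the code above:

# scores are negative MSE, so the best mean score is the largest one
best_idx = results['mean_mse'].astype(float).idxmax()
best_model = models[best_idx]
best_params = results.loc[best_idx, 'params']
# refit the chosen algorithm with its tuned hyperparameters on the full training split
final_model = best_model.set_params(**best_params).fit(X_train, y_train)
print('held-out MSE:', mean_squared_error(y_test, final_model.predict(X_test)))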
Approach 2: perform nested CV for a single algorithm and its hyperparameters:
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, train_test_split
import numpy as np
# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
train_x, test_x, train_y, test_y = train_test_split(X_iris, y_iris, test_size=0.2, random_state=69)
# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10], "gamma": [0.01, 0.1]}
# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")
# Choose cross-validation techniques for the inner and outer loops,
# independently of the dataset.
# E.g "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
inner_cv = KFold(n_splits=4, shuffle=True, random_state=69)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=69)
# Nested CV with parameter optimization
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
clf.fit(train_x, train_y)  # this fit is not reused below: cross_val_score clones and refits clf on each outer fold
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)  # note: the outer loop here runs on the full dataset, not just the training split
nested_scores_mean = nested_score.mean()
nested_scores_std = nested_score.std()
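For completeness, this is how I would then report the outer-loop estimate and check the final tuned model once on the held-out split; again just a sketch of my intent, not part of the scikit-learn example:

print('Nested CV accuracy: %.3f +/- %.3f' % (nested_scores_mean, nested_scores_std))
# refit the tuned model on the training split and evaluate it once on the held-out split
clf.fit(train_x, train_y)
print('Held-out accuracy:', clf.score(test_x, test_y))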