Cross-validation and hyperparameter tuning workflow

data mining, Python, cross-validation, hyperparameter tuning

2022-03-08 11:34:17

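After reading a lot of articles about cross-validation, I am now confused. I know that cross-validation is used to evaluate model performance and to select the best algorithm among several. After selecting the best model (by checking the mean and standard deviation of the CV scores), we train that model on the whole dataset (train and validation sets) and use it for real-world predictions.

Let's say that, out of 3 algorithms used in cross-validation, I select the best one. What I don't get is: at which point in this process do we tune the hyperparameters? Do we use nested cross-validation to tune the hyperparameters during the cross-validation process, or do we first select the best-performing algorithm via cross-validation and then tune the hyperparameters for that algorithm only?

PS: I am splitting my dataset into training, testing and valid sets, where I use the training and testing sets to build and test my model (this includes all the preprocessing steps and the nested CV) and use the valid set to test my final model.

Edit 1: Below are two ways of performing nested cross-validation. Which one is the correct method, i.e. which one does not lead to data leakage / overfitting / bias?

Method 1: Perform nested CV for multiple algorithms and their hyperparameters at the same time: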

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd

# create some regression data
X, y = make_regression(n_samples=1000, n_features=10)

# setup models, variables
results = pd.DataFrame(columns = ['model', 'params', 'mean_mse', 'std_mse'])
models = [SVR(), RandomForestRegressor(random_state = 69)]
params = [{'C':[0.01,0.05]},{'n_estimators':[10,100]}]

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3)

# estimate performance of hyperparameter tuning and model algorithm pipeline
for idx, model in enumerate(models):
    
    # perform hyperparameter tuning
    clf = GridSearchCV(model, params[idx], cv = 3, scoring='neg_mean_squared_error')
    clf.fit(X_train, y_train)

    # this performs a nested CV in SKLearn
    score = cross_val_score(clf, X_train, y_train, cv = 3, scoring='neg_mean_squared_error')
    
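    row = {'model' : model,
           'params' : clf.best_params_,
           # scores are negative MSE, so flip the sign to report MSE
           'mean_mse' : -score.mean(),
           'std_mse' : score.std()}

    # append the row to the results dataframe
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    results = pd.concat([results, pd.DataFrame([row])], ignore_index = True)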

Method 2: Perform nested CV for a single algorithm and its hyperparameters:

from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, train_test_split
import numpy as np

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

train_x, test_x, train_y ,test_y = train_test_split(X_iris, y_iris, test_size = 0.2, random_state = 69)

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10], "gamma": [0.01, 0.1]}

# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")

# Choose cross-validation techniques for the inner and outer loops,
# independently of the dataset.
# E.g "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
inner_cv = KFold(n_splits=4, shuffle=True, random_state=69)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=69)
    
# Nested CV with parameter optimization
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
clf.fit(train_x, train_y)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
        
nested_scores_mean = nested_score.mean()
nested_scores_std = nested_score.std()

1 Answer

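Suppose you have two models to choose from, $m_1$ and $m_2$. For a given problem, each of the two models has a best set of hyperparameters (with which it performs as well as it can), say $m_1^*$ and $m_2^*$. Now say $Acc(m_1^*) > Acc(m_2^*)$, i.e. model 1 is better than model 2.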

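Now suppose you have tuned model 2 (or you happen to have "good" hyperparameters for it), but you use poor hyperparameters for model 1; call that suboptimal version $m_1^s$. You may end up finding that $Acc(m_1^s) < Acc(m_2^*)$ (i.e. "choose model 2"), while the truly best choice would be "use tuned model 1".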

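So, to make an informed decision, you need to tune both models and compare the tuned models, each with its best hyperparameters. What I often do is define test and train data, tune the candidate models using cross-validation (on the train data only!), and evaluate the performance of the tuned models on the test set.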
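The workflow described above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the original answer: the synthetic data, the two candidate models (borrowed from the question), the small parameter grids, and the names `candidates`, `test_mse` and `best_model` are all made up for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Hold out a test set that is never touched during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Candidate models with deliberately small, hypothetical grids
candidates = {
    "svr": (SVR(), {"C": [0.1, 1, 10, 100]}),
    "random_forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [10, 50]},
    ),
}

best_params = {}
test_mse = {}
for name, (estimator, grid) in candidates.items():
    # Tune each model with cross-validation on the train data only
    search = GridSearchCV(estimator, grid, cv=3, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)  # refits the best setting on all of X_train
    best_params[name] = search.best_params_

    # Compare the *tuned* models on the held-out test set
    test_mse[name] = mean_squared_error(y_test, search.predict(X_test))

# Pick the tuned model with the lowest test error
best_model = min(test_mse, key=test_mse.get)
print(best_params)
print(test_mse)
print("selected:", best_model)
```

Because the test set is used here to choose between the tuned models, its score for the selected model is slightly optimistic. A further untouched set (the "valid" set in the question) is what gives an unbiased estimate for the final model.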

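In addition, you may want to do feature engineering / feature generation. This should be done before tuning the models, since different data can lead to different optimal hyperparameters. For example, in the case of a random forest, the best number of split candidates per split may depend on the number and quality of the features.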
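A minimal sketch of that ordering, assuming scikit-learn's `PolynomialFeatures` as a stand-in for whatever feature generation you actually use (the data, grid values and forest size are likewise hypothetical). The features are generated first, and the grid for `max_features` (the number of split candidates per split) is then defined relative to the resulting number of columns:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Feature generation first: squares and pairwise interactions.
# This transform does not learn anything from the data values,
# so applying it before the split does not leak information.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_feat = poly.fit_transform(X)  # 5 original + 15 generated = 20 columns

X_train, X_test, y_train, y_test = train_test_split(
    X_feat, y, test_size=0.3, random_state=0
)

# The sensible range for max_features depends on how many features exist,
# so the grid is built only after the features have been generated.
n_features = X_train.shape[1]
grid = {"max_features": [2, n_features // 2, n_features]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
print(n_features, search.best_params_)
```

Transformations that do learn from the data (scalers, target-based feature selection, and so on) are a different case: those should be fitted on the training portion only, for instance by placing them in a `Pipeline` that is passed to `GridSearchCV`.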
