Implementing nested cross-validation

machine-learning cross-validation python scikit-learn
2022-03-11 11:09:03

I am trying to figure out whether my understanding of nested cross-validation is correct, so I wrote this toy example to check:

import operator
import numpy as np
from sklearn import ensemble
from sklearn.datasets import load_boston
from sklearn.model_selection import KFold

# set random state
state = 1

# load the boston dataset (note: load_boston was removed in scikit-learn 1.2;
# on newer versions substitute another regression dataset)
boston = load_boston()

X = boston.data
y = boston.target

outer_scores = []

# outer cross-validation
outer = KFold(n_splits=3, shuffle=True, random_state=state)
for fold, (train_index_outer, test_index_outer) in enumerate(outer.split(X)):
    X_train_outer, X_test_outer = X[train_index_outer], X[test_index_outer]
    y_train_outer, y_test_outer = y[train_index_outer], y[test_index_outer]

    inner_mean_scores = []

    # define the explored parameter space.
    # the procedure below should be equivalent to GridSearchCV
    tuned_parameter = [1000, 1100, 1200]
    for param in tuned_parameter:

        inner_scores = []

        # inner cross-validation
        inner = KFold(n_splits=3, shuffle=True, random_state=state)
        for train_index_inner, test_index_inner in inner.split(X_train_outer):
            # split the training data of outer CV
            X_train_inner, X_test_inner = X_train_outer[train_index_inner], X_train_outer[test_index_inner]
            y_train_inner, y_test_inner = y_train_outer[train_index_inner], y_train_outer[test_index_inner]

            # fit extremely randomized trees regressor to training data of inner CV
            clf = ensemble.ExtraTreesRegressor(n_estimators=param, n_jobs=-1, random_state=1)
            clf.fit(X_train_inner, y_train_inner)
            inner_scores.append(clf.score(X_test_inner, y_test_inner))

        # calculate mean score for inner folds
        inner_mean_scores.append(np.mean(inner_scores))

    # get maximum score index
    index, value = max(enumerate(inner_mean_scores), key=operator.itemgetter(1))

    print('Best parameter of fold %i: %i' % (fold + 1, tuned_parameter[index]))

    # fit the selected model to the training set of outer CV
    # for prediction error estimation
    clf2 = ensemble.ExtraTreesRegressor(n_estimators=tuned_parameter[index], n_jobs=-1, random_state=1)
    clf2.fit(X_train_outer, y_train_outer)
    outer_scores.append(clf2.score(X_test_outer, y_test_outer))

# show the prediction error estimate produced by nested CV
print('Unbiased prediction error: %.4f' % np.mean(outer_scores))

# finally, fit the selected model to the whole dataset
clf3 = ensemble.ExtraTreesRegressor(n_estimators=tuned_parameter[index], n_jobs=-1, random_state=1)
clf3.fit(X, y)

Any thoughts are appreciated.

3 Answers

Oops, the code is wrong, but in a very subtle way!

a) Splitting the training set into an inner training set and an inner test set is fine.

b) The problem is the last two lines, which reflect a subtle misunderstanding of the purpose of nested cross-validation. The purpose of nested CV is not to select the hyperparameters, but to produce an unbiased estimate of the expected accuracy of the algorithm, in this case ensemble.ExtraTreesRegressor on this data with the best hyperparameters, whatever they may be.

And that is what your code computes correctly:

    print('Unbiased prediction error: %.4f' % np.mean(outer_scores))

It uses nested CV to compute an unbiased estimate of the classifier's prediction error. But notice that each pass of the outer loop may produce a different best hyperparameter, as you were aware when you wrote this line:

    print('Best parameter of fold %i: %i' % (fold + 1, tuned_parameter[index]))
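
Incidentally, if you want to inspect that per-fold variability with scikit-learn's built-in helpers, cross_validate with return_estimator=True keeps the GridSearchCV fitted on each outer fold so you can read off its best_params_. A minimal sketch, assuming a recent scikit-learn where GridSearchCV and cross_validate live in sklearn.model_selection:

    from sklearn.model_selection import GridSearchCV, cross_validate

    search = GridSearchCV(ensemble.ExtraTreesRegressor(random_state=1),
                          {'n_estimators': [1000, 1100, 1200]}, cv=3)
    # return_estimator=True exposes the grid search fitted on each outer fold
    result = cross_validate(search, X, y, cv=3, return_estimator=True)
    for fold, fitted in enumerate(result['estimator']):
        print('Best parameter of fold %i: %i' % (fold + 1, fitted.best_params_['n_estimators']))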

So now you need a standard CV loop to select the final best hyperparameter, using all the folds:

tuned_parameter = [1000, 1100, 1200]
mean_scores = []
for param in tuned_parameter:

    scores = []

    # normal cross-validation
    kfolds = KFold(n_splits=3, shuffle=True, random_state=state)
    for train_index, test_index in kfolds.split(X):
        # split the training data
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # fit extremely randomized trees regressor to training data
        clf2_5 = ensemble.ExtraTreesRegressor(n_estimators=param, n_jobs=-1, random_state=1)
        clf2_5.fit(X_train, y_train)
        scores.append(clf2_5.score(X_test, y_test))

    # calculate mean score for folds
    mean_scores.append(np.mean(scores))

# get maximum score index
index, value = max(enumerate(mean_scores), key=operator.itemgetter(1))

print('Best parameter: %i' % tuned_parameter[index])

This is your code, with the references to "inner" removed (and with the missing mean_scores initialization added).

Now the best parameter is tuned_parameter[index], and you can learn the final classifier clf3 just as in your code.
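
As the comment in the question already hints, this whole nested procedure is what GridSearchCV automates when combined with cross_val_score: the grid search plays the role of the inner loop, and cross_val_score plays the role of the outer loop. A minimal sketch, assuming a recent scikit-learn:

    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

    param_grid = {'n_estimators': [1000, 1100, 1200]}
    inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
    outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)

    # inner loop: the grid search selects n_estimators on each outer training split
    search = GridSearchCV(ensemble.ExtraTreesRegressor(random_state=1),
                          param_grid, cv=inner_cv)

    # outer loop: scoring the tuned search object yields the nested-CV estimate
    nested_scores = cross_val_score(search, X, y, cv=outer_cv)
    print('Unbiased prediction error: %.4f' % nested_scores.mean())

    # fitting the search on all data selects the final parameter and, since
    # refit=True by default, also trains the final model (your clf3)
    search.fit(X, y)
    print('Best parameter: %i' % search.best_params_['n_estimators'])

Note that cross_val_score clones the search object for each outer fold, so the hyperparameter is re-selected per fold, exactly as in the explicit loops above.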

I have published a package that helps with implementing nested cross-validation in Python (at the moment it only works for binary classifiers). If you want to check it out, it is here:

https://github.com/JaimeArboleda/nestedcvtraining

It is my first Python package, so any comments, suggestions, or criticism are welcome!

I am posting this as an answer because the nested cross-validation is performed inside the main function, so you do not have to worry about how to implement it. It comes with many options that should be enough for many common settings, I think.

To summarize Jacques's answer:

Nested CV is required for an unbiased error estimate of a model, and it lets us compare the scores of different models on that basis. Using this information, we can then run a separate K-fold CV loop to tune the parameters of the selected model.
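
For illustration, here is a minimal sketch of that two-step procedure with two candidate models. The Ridge candidate and its alpha grid are hypothetical additions for the sake of the comparison, not part of the original question:

    from sklearn import ensemble, linear_model
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

    # candidate models, each with its own hyperparameter grid
    candidates = {
        'extra_trees': (ensemble.ExtraTreesRegressor(random_state=1),
                        {'n_estimators': [1000, 1100, 1200]}),
        'ridge': (linear_model.Ridge(), {'alpha': [0.1, 1.0, 10.0]}),  # hypothetical second candidate
    }

    inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
    outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)

    # step 1: nested CV gives each model an unbiased score for comparison
    nested_scores = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=inner_cv)
        nested_scores[name] = cross_val_score(search, X, y, cv=outer_cv).mean()
        print('%s: %.4f' % (name, nested_scores[name]))

    # step 2: a separate, non-nested K-fold CV tunes the selected model on all data
    best_name = max(nested_scores, key=nested_scores.get)
    estimator, grid = candidates[best_name]
    final_search = GridSearchCV(estimator, grid, cv=inner_cv).fit(X, y)
    print('Selected %s with parameters %s' % (best_name, final_search.best_params_))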