Data mining - Cross-validation on highly imbalanced data with undersampling

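In my problem I am dealing with a highly imbalanced dataset, e.g. 10000 negatives for every positive. The usual starting point for training a model on such data is to undersample it. When doing so, it is important to train the model on the undersampled data but to evaluate it on a hold-out set drawn from the original data (without undersampling).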
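To make the point about the hold-out set concrete, here is a small arithmetic sketch. The classifier is hypothetical (the 90% true positive rate and 1% false positive rate are made-up numbers, not results from my data), and `precision_at_prevalence` is just an illustrative helper. It shows how a prevalence-dependent metric such as precision can look very different on a balanced test set than at the original class ratio.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by a fixed TPR/FPR at a given positive-class prevalence."""
    true_pos = tpr * prevalence            # fraction of all samples that are true positives
    false_pos = fpr * (1.0 - prevalence)   # fraction of all samples that are false positives
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: finds 90% of the positives, wrongly flags 1% of the negatives.
TPR, FPR = 0.90, 0.01

# Test set undersampled to 1:1 versus the original 1 positive per 10000 negatives.
balanced = precision_at_prevalence(TPR, FPR, prevalence=0.5)
original = precision_at_prevalence(TPR, FPR, prevalence=1 / 10001)

print(f"precision on a balanced test set: {balanced:.4f}")
print(f"precision at the original ratio:  {original:.4f}")
```

Metrics built only from TPR and FPR (ROC AUC, for instance) do not depend on prevalence in this way, so how much the test-set ratio matters depends on the metric of interest.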

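Now here is the problem. If I undersample first, K-fold cross-validation splits the undersampled training set into K folds and holds one of them out as the test set, which is therefore also undersampled. I believe that for model evaluation we actually need to compute the metric of interest on a test set that has not been undersampled (right? or am I misunderstanding something here?). If so, is it valid to do cross-validation as follows?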

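1. Split the data into K folds.
2. Take the first fold as the test set and undersample the remaining folds (e.g. with K=5, fold 1 is the test set and folds 2, 3, 4, 5 form the training set).
3. Fit the model on the undersampled training data and compute the metric of interest on the test set.
4. Take another fold as the test set (this time, say, fold 2) and the rest as the training set (folds 1, 3, 4, 5). Undersample the training set and repeat step 3.
5. Continue this process for the remaining folds.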

Is this a correct way to do cross-validation when we undersample the data? If so, can it be done with standard libraries?
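The five steps above can be sketched in plain Python as follows. This is only a minimal illustration of the index bookkeeping, not a library API: `undersampled_cv_splits`, its `ratio` argument (positives per negative kept in each training set) and the toy labels are all names I made up for the example, and the model fitting itself is left as a comment.

```python
import random


def undersampled_cv_splits(y, n_splits=5, ratio=1.0, seed=0):
    """Yield (train_idx, test_idx) pairs where only the training part is undersampled.

    ratio is the target n_positive / n_negative inside each training set.
    """
    rng = random.Random(seed)
    indices = list(range(len(y)))
    rng.shuffle(indices)
    # Step 1: split the shuffled indices into K folds.
    folds = [indices[i::n_splits] for i in range(n_splits)]

    for k in range(n_splits):
        # Steps 2, 4, 5: fold k is the test set and keeps its original class ratio.
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]

        # Undersample the negatives of the training folds only.
        pos = [i for i in train_idx if y[i] == 1]
        neg = [i for i in train_idx if y[i] == 0]
        n_neg = min(len(neg), int(round(len(pos) / ratio)))
        kept_neg = rng.sample(neg, n_neg)

        yield sorted(pos + kept_neg), sorted(test_idx)


# Toy labels: 20 positives and 2000 negatives (made-up data for illustration).
y_toy = [1] * 20 + [0] * 2000
splits = list(undersampled_cv_splits(y_toy, n_splits=5, ratio=0.5, seed=0))

for train_idx, test_idx in splits:
    n_pos = sum(y_toy[i] for i in train_idx)
    # Step 3 would go here: fit on train_idx, compute the metric on test_idx.
    print(f"train: {n_pos} pos / {len(train_idx) - n_pos} neg, test size: {len(test_idx)}")
```

The important property is that every sample appears in exactly one test fold and no test fold is ever resampled.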

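Edit, Feb 21: Thanks to @Wes. I would like to know whether the following code is a correct implementation of K-fold cross-validation on a highly imbalanced dataset.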

import numpy as np
from statistics import mean, stdev
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline

# initial parameters
RANDOM_STATE = 42
RATIO = 0.033
N_SAMPLES = 1000000
K_FOLD = 5

# Generate the dataset
X, y = make_classification(n_classes=2, class_sep=1, 
                           n_features=10, n_redundant=2,
                           weights=[0.9999, 0.0001], n_informative=5,
                           flip_y=0.0, n_samples=N_SAMPLES,
                           random_state=RANDOM_STATE)

print('Number of samples in each class %s' % Counter(y))

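# Undersampling happens inside the pipeline, so it is applied only to the data passed to fit().
# `sampling_strategy` replaces the `ratio` argument, which newer imbalanced-learn releases removed.
rus = RandomUnderSampler(random_state=RANDOM_STATE, sampling_strategy=RATIO)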
rfc = RandomForestClassifier(random_state=RANDOM_STATE, n_estimators=100)
pipeline = make_pipeline(rus, rfc)

auc_roc = []
kf = KFold(n_splits=K_FOLD)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit() undersamples the training folds only; the test fold is scored as-is
    y_pred = pipeline.fit(X_train, y_train).predict_proba(X_test)[:,1]

    auc_roc.append(roc_auc_score(y_test, y_pred))


print('ROC_AUC = {} +/- {}'.format(np.round(mean(auc_roc),4),
                                   np.round(stdev(auc_roc),4)))

With this I get the following result: ROC_AUC = 0.9374 +/- 0.037.