Train/Test/Validation Set Split in Sklearn

data-mining machine-learning scikit-learn cross-validation
2021-09-30 19:26:42

How can I use scikit-learn to randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test, y_val?

As far as I know, sklearn.cross_validation.train_test_split can only split into two parts, not three...

4 Answers

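You can simply call sklearn.model_selection.train_test_split twice. First split the data into train and test, then split the train portion again into train and validation. Something like this: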

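from sklearn.model_selection import train_test_split

# First split: hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Second split: 25% of the remaining 80% becomes validation (0.25 x 0.8 = 0.2)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)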
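The second test_size follows from simple arithmetic: it is the desired validation fraction divided by whatever is left after the test set is removed. A minimal sketch that checks the resulting sizes, assuming toy data (the arrays and the names test_fraction, val_fraction and second_test_size are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 samples, 3 features
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

test_fraction = 0.2
val_fraction = 0.2
# Share of the remaining data that must go to validation: 0.2 / 0.8 = 0.25
second_test_size = val_fraction / (1 - test_fraction)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_fraction, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=second_test_size, random_state=1)

# Report how many samples ended up in each subset
print(len(X_train), len(X_val), len(X_test))
```

Printing the lengths is a quick way to confirm that the two-step split gives the proportions you intended.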

There is a great answer on SO that uses numpy and pandas.

The command (see the answer for discussion):

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

This produces a 60%, 20%, 20% split for the training, validation and test sets.
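The same idea can be written out step by step, which makes the cut points explicit. This is only a sketch under assumptions: the toy DataFrame, the fixed random_state and the use of iloc slicing in place of np.split are illustrative choices, not part of the linked answer.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed for illustration): 10 rows
df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) % 2})

# Shuffle all rows; random_state makes the shuffle reproducible
shuffled = df.sample(frac=1, random_state=42)

# Cut points at 60% and 80% of the row count
n = len(shuffled)
train_end = int(0.6 * n)
val_end = int(0.8 * n)

train = shuffled.iloc[:train_end]
validate = shuffled.iloc[train_end:val_end]
test = shuffled.iloc[val_end:]

print(len(train), len(validate), len(test))
```

Because int() truncates, any rows left over from rounding end up in the last (test) slice.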

Adding to @hh32's answer, while respecting any predefined proportions such as (75, 15, 10):

from sklearn.model_selection import train_test_split

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

# train is now 75% of the entire data set;
# x_test and y_test temporarily hold the remaining 25%
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)

# test is now 10% of the initial data set
# validation is now 15% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio)) 

print(x_train, x_val, x_test)
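If you need this in more than one place, the two calls can be wrapped in a small helper. A minimal sketch: the function name train_val_test_split and its parameters are hypothetical (scikit-learn does not ship such a function), and the demo arrays are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, train_ratio=0.75, val_ratio=0.15,
                         test_ratio=0.10, random_state=None):
    """Hypothetical helper: three-way split built from two train_test_split calls."""
    if not np.isclose(train_ratio + val_ratio + test_ratio, 1.0):
        raise ValueError("train_ratio, val_ratio and test_ratio must sum to 1")

    # Step 1: separate the training portion from everything else
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=1 - train_ratio, random_state=random_state)

    # Step 2: divide the remainder between validation and test
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=test_ratio / (test_ratio + val_ratio),
        random_state=random_state)

    return X_train, X_val, X_test, y_train, y_val, y_test


# Demo on toy data (assumed for illustration): 200 samples, 2 features
data_X = np.arange(400).reshape(200, 2)
data_y = np.arange(200)
parts = train_val_test_split(data_X, data_y, random_state=0)
X_tr, X_va, X_te, y_tr, y_va, y_te = parts
print(len(X_tr), len(X_va), len(X_te))
```

scikit-learn rounds the subset sizes to whole samples, so on small data sets the realized proportions can differ slightly from the requested ones.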

You can use train_test_split twice. I think this is the most straightforward approach.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

This way, the train, val and test sets will be 60%, 20% and 20% of the dataset, respectively.
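If y holds class labels, the same two-call pattern can also keep the class proportions roughly equal across the three subsets by passing the stratify parameter of train_test_split. A sketch on assumed toy data (a balanced binary label vector invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 samples, two balanced classes
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# First split: 20% test, stratified on the full label vector
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Second split: 25% of the remainder as validation, stratified on what is left
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1, stratify=y_train)

# Show the size and per-class counts of each subset
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(labels), np.bincount(labels))
```

Note that the second call stratifies on y_train, the labels that remain after the first split, not on the original y.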