How do I use scikit-learn to randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test, and y_val?
As far as I know, sklearn.cross_validation.train_test_split only splits into two, not into three...
You can just use sklearn.model_selection.train_test_split twice: first split into train and test, then split the train set again into validation and train. Something like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)  # 0.25 x 0.8 = 0.2
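To make the arithmetic concrete, here is a minimal runnable sketch of the two-stage split above, using hypothetical toy arrays (the data `X`, `y` and the sizes are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 4 features
X = np.arange(400).reshape(100, 4)
y = np.arange(100)

# First split off 20% as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Then split 25% of the remaining 80% off as validation (0.25 x 0.8 = 0.2)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Note that the second call's test_size is relative to the already-reduced train set, which is why 0.25 (not 0.2) yields a 20% validation slice of the original data.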
There is a great answer on SO that uses numpy and pandas. The command (see that answer for the discussion):

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

produces a 60%, 20%, 20% split for the training, validation, and test sets.
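As a quick check of that one-liner, here is a self-contained sketch on a small hypothetical DataFrame (the column names and row count are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with 10 rows
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

# Shuffle all rows, then cut at the 60% and 80% marks
train, validate, test = np.split(
    df.sample(frac=1, random_state=42),
    [int(.6 * len(df)), int(.8 * len(df))],
)

print(len(train), len(validate), len(test))  # 6 2 2
```

df.sample(frac=1) shuffles the rows, and np.split cuts the shuffled frame at the given indices, so each piece comes back as a DataFrame.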
Adding to @hh32's answer, this also respects any predefined proportions, such as (75, 15, 10):

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

# train is now 75% of the entire data set
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)

# test is now 10% of the initial data set
# validation is now 15% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))

print(x_train, x_val, x_test)
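To verify the ratio arithmetic, here is a runnable sketch with hypothetical data (dataX, dataY and the sample count are assumptions): the second test_size works out to 0.10 / (0.10 + 0.15) = 0.4 of the 25% holdout, i.e. 10% of the original data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

train_ratio, validation_ratio, test_ratio = 0.75, 0.15, 0.10

# Hypothetical data: 200 samples
dataX = np.arange(200).reshape(200, 1)
dataY = np.arange(200)

# Split off 25% (validation + test combined)
x_train, x_test, y_train, y_test = train_test_split(
    dataX, dataY, test_size=1 - train_ratio, random_state=0)

# Carve the 25% holdout into 15% validation and 10% test
x_val, x_test, y_val, y_test = train_test_split(
    x_test, y_test,
    test_size=test_ratio / (test_ratio + validation_ratio),
    random_state=0)

print(len(x_train), len(x_val), len(x_test))  # 150 30 20
```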
You can use train_test_split twice. I think this is the most straightforward way:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

This way, the train, val, and test sets will be 60%, 20%, and 20% of the dataset, respectively.