How do I correctly split my dataset into training, test, and validation sets?

data-mining machine-learning python scikit-learn
2022-02-13 23:25:21

I am trying to split my dataset into 70% training, 15% test, and 15% validation.

train_X, test_X, train_Y, test_Y = train_test_split(data, labels, test_size=0.3, train_size=0.7,random_state=1,stratify = labels)

test_X, val_X, test_Y, val_Y = train_test_split(test_X, test_Y, test_size=0.5,
                                                    random_state=1,stratify = labels)

But I'm not sure whether this code actually splits the test set in half. I also keep getting this error:

     29 def main():
     30     data, labels = load_data()
---> 31     train_X, train_Y, val_X, val_Y, test_X, test_Y = process_data(data, labels)
     32 
     33     best_model, best_k = select_knn_model(train_X, val_X, train_Y, val_Y)

/tmp/ipykernel_50/3409802801.py in process_data(data, labels)
     45     X_counts = vectorizer.fit_transform(train_X)
     46     X_count = vectorizer.transform(test_X)
---> 47     Xval = vectorizer.transform(Val_X)
     48     # Return the training, validation, and test set inputs and labels
     49 

NameError: name 'Val_X' is not defined

How can I fix this?

1 Answer

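Don't split the test set in half with a second train_test_split. Instead, first split the whole dataset into train and test sets. Then split the train set into train and validation sets, as shown below.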

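from sklearn.model_selection import train_test_split

# First split: hold out 20% of the full dataset as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Second split: 0.25 x 0.8 = 0.2, so validation is 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)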
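The fractions above give a 60/20/20 split. To get the 70/15/15 split from the question with the same two-step approach, the sizes could be adjusted as in the sketch below: hold out 15% for test, then take 0.15 / 0.85 of the remainder for validation. The helper name split_70_15_15 and the synthetic data from make_classification are assumptions for illustration only. Note that stratify should receive the labels being split in that particular call; in the question's second call, the full labels array no longer matches test_X in length, so test_Y would be the argument to pass there.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def split_70_15_15(X, y, seed=1):
    """Hypothetical helper: stratified 70/15/15 train/validation/test split."""
    # Step 1: hold out 15% of all samples as the test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.15, random_state=seed, stratify=y)
    # Step 2: 0.15 / 0.85 of the remaining 85% is 15% of the full dataset.
    # Stratify on y_rest (the labels being split here), not the full label array.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.15 / 0.85, random_state=seed,
        stratify=y_rest)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data (an assumption; swap in the real features and labels).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, X_test, y_train, y_val, y_test = split_70_15_15(X, y)

# Print the three set sizes to confirm the proportions are roughly 70/15/15.
print(len(X_train), len(X_val), len(X_test))
```

Printing the lengths is also a quick way to answer the "did it really split in half?" doubt for any variant of the split.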

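As for the error: you defined val_X in the second split, but you are using Val_X when you call the vectorizer. Python names are case-sensitive, so just change the uppercase V to lowercase and it will work.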
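A minimal sketch of the corrected vectorizing step, with consistent lowercase names. The question does not show which vectorizer is used, so CountVectorizer is an assumption here, and the tiny text lists are made-up stand-ins for the real splits.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpora standing in for the three splits (illustrative only).
train_X = ["the cat sat", "the dog barked", "a cat meowed"]
val_X = ["the dog sat"]
test_X = ["a dog meowed"]

# CountVectorizer is an assumption; any vectorizer with fit_transform/transform
# would follow the same pattern.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_X)  # fit on training data only
X_val_counts = vectorizer.transform(val_X)          # lowercase val_X, as defined
X_test_counts = vectorizer.transform(test_X)

# All three matrices share the vocabulary learned from the training set.
print(X_train_counts.shape, X_val_counts.shape, X_test_counts.shape)
```

Fitting only on the training set and reusing the fitted vectorizer for validation and test keeps the feature columns aligned across the three matrices.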