Which is the correct way to standardize data: before or after the train/test split?
Standardization before the split
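import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Scale every row first, so the scaler's mean and variance include the future test rows
normalized_X_features = pd.DataFrame(
    StandardScaler().fit_transform(X_features),
    columns=X_features.columns
)
# Split the already-scaled data
x_train, x_test, y_train, y_test = train_test_split(
    normalized_X_features,
    Y_feature,
    test_size=0.20,
    random_state=4
)
# Train on the scaled training rows and predict on the scaled test rows
LR = LogisticRegression(
    C=0.01,
    solver='liblinear'
).fit(x_train, y_train)
y_test_pred = LR.predict(x_test)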
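To make the difference concrete, here is a minimal sketch in plain Python (no scikit-learn) of what changes when the scaling statistics are computed before the split. The numbers are hypothetical and not taken from the post; the last two values stand in for the test rows. It uses the population standard deviation, which is what StandardScaler uses.

```python
from statistics import fmean, pstdev

# Hypothetical single feature column; the last two values play the role of test rows.
values = [1.0, 2.0, 3.0, 4.0, 100.0, 200.0]
train, test = values[:4], values[4:]

# Scaling before the split: mean and standard deviation come from all rows.
mean_all, std_all = fmean(values), pstdev(values)

# Scaling after the split: statistics come from the training rows only.
mean_train, std_train = fmean(train), pstdev(train)

# The same training rows end up with different scaled values in the two setups.
scaled_train_before = [(v - mean_all) / std_all for v in train]
scaled_train_after = [(v - mean_train) / std_train for v in train]

print("mean used before split:", mean_all)
print("mean used after split: ", mean_train)
```

In the first setup the scaled training values depend on the test rows, which is the usual argument made against scaling before the split.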
Standardization after the split
# Split the raw data first
x_train, x_test, y_train, y_test = train_test_split(
    X_features,
    Y_feature,
    test_size=0.20,
    random_state=4
)
# Fit a scaler on the training rows and scale them
normalized_x_train = pd.DataFrame(
    StandardScaler().fit_transform(x_train),
    columns=x_train.columns
)
LR = LogisticRegression(
    C=0.01,
    solver='liblinear'
).fit(normalized_x_train, y_train)
# Scale the test rows with a second StandardScaler fitted on the test rows themselves
normalized_x_test = pd.DataFrame(
    StandardScaler().fit_transform(x_test),
    columns=x_test.columns
)
y_test_pred = LR.predict(normalized_x_test)
So far, I have seen both approaches used.
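For comparison, a third variant that often comes up is to split first, fit the scaler on the training rows only, and reuse that same fitted scaler for the test rows. The sketch below is only an illustration under assumptions: the data is synthetic (generated with make_classification as a stand-in for X_features and Y_feature, which the post does not show), and the column names f0 to f4 are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_features / Y_feature (an assumption, not the asker's data).
X_raw, y_raw = make_classification(n_samples=200, n_features=5, random_state=4)
X_features = pd.DataFrame(X_raw, columns=[f"f{i}" for i in range(5)])
Y_feature = pd.Series(y_raw, name="target")

x_train, x_test, y_train, y_test = train_test_split(
    X_features, Y_feature, test_size=0.20, random_state=4
)

# Fit the scaler on the training rows only.
scaler = StandardScaler().fit(x_train)
normalized_x_train = pd.DataFrame(
    scaler.transform(x_train), columns=x_train.columns, index=x_train.index
)
# Reuse the same fitted scaler (training mean and variance) for the test rows.
normalized_x_test = pd.DataFrame(
    scaler.transform(x_test), columns=x_test.columns, index=x_test.index
)

LR = LogisticRegression(C=0.01, solver="liblinear").fit(normalized_x_train, y_train)
y_test_pred = LR.predict(normalized_x_test)
```

With this variant the test rows influence neither the scaler nor the model, and they are transformed with exactly the statistics the model was trained under.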