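I am training a logistic regression model on a dataset that has only numerical features. I performed the following steps:
1.) Heatmap to remove collinearity between variables
2.) Scaling using StandardScaler
3.) Cross-validation after splitting, for my baseline model
4.) Fitting and predicting
Below is my code: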
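# IMPORTS
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
# SPLITTING
train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.2, random_state = 69)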
#MODEL INSTANCE
model = LogisticRegression(random_state = 69)
# SCALING
train_x2 = train_x.copy(deep = True)
test_x2 = test_x.copy(deep = True)
s_scaler = StandardScaler()
s_scaler.fit(train_x2)
s_scaled_train = s_scaler.transform(train_x2)
s_scaled_test = s_scaler.transform(test_x2)
# BASELINE MODEL
cross_val_model2 = -1 * cross_val_score(model, s_scaled_train, train_y, cv = 5,
                                        n_jobs = -1, scoring = 'neg_mean_squared_error')
s_score = cross_val_model2.mean()
# FITTING AND PREDICTING
model.fit(s_scaled_train, train_y)
pred = model.predict(s_scaled_test)
mse = mean_squared_error(test_y, pred)
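The CV score (mean MSE across the folds) is 0.06, and the score after fitting and predicting on the test set is 0.23. I find this strange, because CV is supposed to measure how the model performs. So I should at least get a score equal to the CV score, right?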
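For anyone who wants to reproduce the comparison, here is a minimal sketch of the same workflow with the scaler wrapped in a Pipeline, so that it is re-fit inside each CV fold rather than once on the whole training set. Since data2 and y are not shown here, the data below is a synthetic stand-in from make_classification (an assumption, not my real dataset), so the numbers it prints will not match the 0.06 and 0.23 above. Note that with 0/1 labels, the MSE of predicted labels is simply the misclassification rate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data2 / y (assumption: binary 0/1 target, numeric features)
X, y = make_classification(n_samples=500, n_features=10, random_state=69)

train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=69
)

# Scaler + model in one estimator, so scaling is fit only on each fold's training part
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=69))

# Per-fold MSE (sign flipped because scikit-learn reports negated errors)
cv_mse = -cross_val_score(
    pipe, train_x, train_y, cv=5, scoring="neg_mean_squared_error"
)
print("CV MSE per fold:", cv_mse)
print("CV MSE mean / std:", cv_mse.mean(), cv_mse.std())

# Fit on the full training set and evaluate once on the held-out test set
pipe.fit(train_x, train_y)
test_mse = mean_squared_error(test_y, pipe.predict(test_x))
print("Test MSE:", test_mse)
```

Printing the per-fold values and their standard deviation, rather than only the mean, gives a rough sense of how much the score moves from one split to another, which is useful context when comparing it against a single test-set score.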