Linear regression and k-fold cross-validation

data-mining python scikit-learn linear-regression cross-validation
2022-02-16 23:40:08

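I am completely new to data science. With the help of the following resources, I have managed to do a very simple and basic linear regression on a train dataset.

The Python code that actually performs the computation (written as an iPython notebook) looks like this: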

### Stage 0: "Import some stuff"
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression

### Stage 1: "Prepare train dataset"
my_train_dataset = pd.read_csv("../train.csv")

### remove categorical cols
only_numerical_train_dataset = my_train_dataset.loc[:, my_train_dataset.dtypes!=object]

### remove 'Id' and 'SalePrice' columns
my_train_dataset_X = only_numerical_train_dataset.drop(['Id','SalePrice'], axis = 1)

### insert median into cells with missing values
print("Before: Number of cells with missing values in train data: " + str(np.sum(np.sum(my_train_dataset_X.isnull()))))
null_values_per_col = np.sum(my_train_dataset_X.isnull(), axis=0)
cols_to_impute = []
for key in null_values_per_col.keys():
    if null_values_per_col.get(key) != 0: 
        cols_to_impute.append(key)
print("Before: Need to replace values in the columns in train data: " + str(cols_to_impute) + "\n")
imputation_val_for_na_cols = dict()
for col in cols_to_impute:
    if (my_train_dataset_X[col].dtype == 'float64' ) or  (my_train_dataset_X[col].dtype == 'int64'):
        #numerical col
        imputation_val_for_na_cols[col] = np.nanmedian(my_train_dataset_X[col]) #with median
for key, val in imputation_val_for_na_cols.items():
    my_train_dataset_X[key].fillna(value= val, inplace = True)
print("After: Number of cells with missing values in train data: " + str(np.sum(np.sum(my_train_dataset_X.isnull()))))
null_values_per_col = np.sum(my_train_dataset_X.isnull(), axis=0)
cols_to_impute = []
for key in null_values_per_col.keys():
    if null_values_per_col.get(key) != 0: 
        cols_to_impute.append(key)
print("After: Need to replace values in the columns in train data: " + str(cols_to_impute) + "\n")

### Stage 2: "Sanity Check - the better the quality, the higher the price?"
plt.scatter(my_train_dataset.OverallQual, my_train_dataset.SalePrice)
plt.xlabel("Overall Quality of the house")
plt.ylabel("Price of the house")
plt.title("Relationship between Price and Quality")
plt.show()

### Stage 3: "Prepare the test dataset"
my_test_dataset = pd.read_csv("../test.csv")

### remove categorical cols
only_numerical_test_dataset = my_test_dataset.loc[:, my_test_dataset.dtypes!=object]

### remove 'Id' column
my_test_dataset_X = only_numerical_test_dataset.drop(['Id'], axis = 1)

### insert median into cells with missing values
print("Before: Number of cells with missing values in test data: " + str(np.sum(np.sum(my_test_dataset_X.isnull()))))
null_values_per_col = np.sum(my_test_dataset_X.isnull(), axis=0)
cols_to_impute = []
for key in null_values_per_col.keys():
    if null_values_per_col.get(key) != 0: 
        cols_to_impute.append(key)
print("Before: Need to replace values in the columns in test data: " + str(cols_to_impute) + "\n")
imputation_val_for_na_cols = dict()
for col in cols_to_impute:
    if (my_test_dataset_X[col].dtype == 'float64' ) or  (my_test_dataset_X[col].dtype == 'int64'):
        #numerical col
        imputation_val_for_na_cols[col] = np.nanmedian(my_test_dataset_X[col]) #with median
for key, val in imputation_val_for_na_cols.items():
    my_test_dataset_X[key].fillna(value= val, inplace = True)
print("After: Number of cells with missing values in test data: " + str(np.sum(np.sum(my_test_dataset_X.isnull()))))
null_values_per_col = np.sum(my_test_dataset_X.isnull(), axis=0)
cols_to_impute = []
for key in null_values_per_col.keys():
    if null_values_per_col.get(key) != 0: 
        cols_to_impute.append(key)
print("After: Need to replace values in the columns in test data: " + str(cols_to_impute) + "\n")

### Stage 4: "Apply the model"
lm = LinearRegression()
lm.fit(my_train_dataset_X, my_train_dataset.SalePrice)

### Stage 5: "Sanity Check - the better the quality, the higher the predicted SalesPrice?"
plt.scatter(my_test_dataset.OverallQual, lm.predict(my_test_dataset_X))
plt.xlabel("Overall Quality of the house in test data")
plt.ylabel("Price of the house in test data")
plt.title("Relationship between Price and Quality in test data")
plt.show()

### Stage 6: "Check the performance of the Prediction"
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lm, my_train_dataset_X,  lm.predict(my_test_dataset_X), cv=10)
print("scores = " + str(scores))

My questions are:

1. Why do I get an error in Stage 6, and how do I fix it?


ValueError Traceback (most recent call last)
<ipython-input-2-700c31f0d410> in <module>()
     85 ### test the performance of the model
     86 from sklearn.model_selection import cross_val_score
---> 87 scores = cross_val_score(lm, my_train_dataset_X,  lm.predict(my_test_dataset_X), cv=10)
     88 print("scores = " + str(scores))
     89 
ValueError: Found input variables with inconsistent numbers of samples: [1460, 1459]

2. Is there anything fundamentally wrong with my simple, basic approach to linear regression?


Edits in response to comments:

@CalZ - first comment:

my_test_dataset_X.shape = (1459, 36)
my_train_dataset_X.shape = (1460, 36)

@CalZ - second comment: Once I am sure there is nothing fundamentally wrong with my approach, I will consider refactoring the code.


1 Answer
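  1. As the error message says, the call to cross_val_score fails because the first dimensions of the arguments' shapes differ (1460 vs. 1459). This matches the number of rows in the two CSV files. The underlying problem, however, is that you are mixing the test set and the training set. You should call it with the test set only: cross_val_score(lm, my_test_dataset_X, lm.predict(my_test_dataset_X), cv=10). UPDATE: My original suggestion was incorrect; you cannot validate against your own predictions! You should hold out a subset of the labeled data and use it to compute the cross-validation.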

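  2. Your code is more than just linear regression. Most of it handles data manipulation (feature selection, data imputation) rather than the regression itself. In fact, you are reusing scikit-learn's linear regression implementation rather than writing your own. If you want a code review of your snippet, maybe you should try http://codereview.stackexchange.com (I don't know whether it fits there either; you had better check their help center).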
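
To make the point about labeled data concrete, here is a minimal sketch of cross-validating on features and true targets that come from the same labeled set. In the question, only train.csv carries SalePrice, so the equivalent call would presumably be cross_val_score(lm, my_train_dataset_X, my_train_dataset.SalePrice, cv=10). The synthetic data from make_regression and the names X_train and y_train are assumptions made only so the example runs without the CSV files.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in (assumption) for my_train_dataset_X and
# my_train_dataset.SalePrice: features and targets with the same row count.
X_train, y_train = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

lm = LinearRegression()

# cross_val_score splits the labeled data into cv folds, fits a clone of
# the estimator on cv-1 folds, and scores it on the held-out fold.
# For regressors the default score is R^2.
scores = cross_val_score(lm, X_train, y_train, cv=10)

print("scores =", scores)
print("mean R^2 =", scores.mean())
```

Both arguments must have the same number of rows, which is exactly what the ValueError in the question complains about. The unlabeled test set plays no role in cross-validation; it is only used for the final predictions.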

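UPDATE: As for whether your code is sound from a data science point of view, in my opinion (after a quick review) you are doing sensible things. There are some things that could be improved, such as handling only float64 and int64 (though you can do it as described here), imputing only NaNs and Nones (in some cases there may be other values that should be imputed, such as outliers), or blindly imputing with the median (a safe decision, but one that should be evaluated with the nature of each variable in mind). But in general it seems OK.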
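
As one possible way to tidy up the imputation step, the sketch below selects numeric columns with select_dtypes and puts a median SimpleImputer in a scikit-learn pipeline in front of the regression. This is an alternative to the question's hand-written loops, not the asker's method: the medians are learned inside each training fold rather than computed separately on the test data. The toy DataFrame, its values, and the injected missing cells are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy frame standing in for train.csv (values are invented).
rng = np.random.default_rng(0)
n = 100
train = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n),
    "LotFrontage": rng.normal(70.0, 10.0, size=n),
    "Street": ["Pave"] * n,  # categorical column, dropped below
})
train["SalePrice"] = (
    20000.0 * train["OverallQual"]
    + 500.0 * train["LotFrontage"]
    + rng.normal(0.0, 5000.0, size=n)
)
# Inject missing values after the target has been computed.
train.loc[::7, "LotFrontage"] = np.nan

# Keep every numeric dtype, then separate features from the target.
X = train.select_dtypes(include="number").drop(columns=["SalePrice"])
y = train["SalePrice"]

# The imputer is refit on the training part of every fold, so the
# held-out fold never influences the medians used to fill it.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
scores = cross_val_score(model, X, y, cv=5)
print("R^2 per fold:", scores)
```

After cross-validation, calling model.fit(X, y) and then model.predict on the numeric test columns would fill test-set gaps with the training medians. Whether the median is the right choice for each column is still a per-variable judgment, as noted above.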