Linear regression and k-fold cross-validation

Tags: data-mining, python, scikit-learn, linear-regression, cross-validation

2022-02-16 23:40:08

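I am completely new to data science. With the help of the following resources, I think I have managed to run a very simple, basic linear regression on the train dataset: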

The Python code that actually performs the computation (written as an IPython notebook) is shown below:

### Stage 0: "Import some stuff"
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression

### Stage 1: "Prepare train dataset"
my_train_dataset = pd.read_csv("../train.csv")

### remove categorical cols
only_numerical_train_dataset = my_train_dataset.loc[:, my_train_dataset.dtypes!=object]

### remove 'Id' and 'SalePrice' columns
my_train_dataset_X = only_numerical_train_dataset.drop(['Id','SalePrice'], axis = 1)

### insert median into cells with missing values
print("Before: Number of cells with missing values in train data: " + str(np.sum(np.sum(my_train_dataset_X.isnull()))))
null_values_per_col = np.sum(my_train_dataset_X.isnull(), axis=0)
cols_to_impute = []
for key in null_values_per_col.keys():
    if null_values_per_col.get(key) != 0: 
        cols_to_impute.append(key)
print("Before: Need to replace values in the columns in train data: " + str(cols_to_impute) + "\n")
imputation_val_for_na_cols = dict()
for col in cols_to_impute:
    if (my_train_dataset_X[col].dtype == 'float64' ) or  (my_train_dataset_X[col].dtype == 'int64'):
        #numerical col
        imputation_val_for_na_cols[col] = np.nanmedian(my_train_dataset_X[col]) #with median
for key, val in imputation_val_for_na_cols.items():
    my_train_dataset_X[key].fillna(value= val, inplace = True)
print("After: Number of cells with missing values in train data: " + str(np.sum(np.sum(my_train_dataset_X.isnull()))))
null_values_per_col = np.sum(my_train_dataset_X.isnull(), axis=0)
cols_to_impute = []
for key in null_values_per_col.keys():
    if null_values_per_col.get(key) != 0: 
        cols_to_impute.append(key)
print("After: Need to replace values in the columns in train data: " + str(cols_to_impute) + "\n")

### Stage 2: "Sanity Check - the better the quality, the higher the price?"
plt.scatter(my_train_dataset.OverallQual, my_train_dataset.SalePrice)
plt.xlabel("Overall Quality of the house")
plt.ylabel("Price of the house")
plt.title("Relationship between Price and Quality")
plt.show()

### Stage 3: "Prepare the test dataset"
my_test_dataset = pd.read_csv("../test.csv")

### remove categorical cols
only_numerical_test_dataset = my_test_dataset.loc[:, my_test_dataset.dtypes!=object]

### remove 'Id' column
my_test_dataset_X = only_numerical_test_dataset.drop(['Id'], axis = 1)

### insert median into cells with missing values
print("Before: Number of cells with missing values in test data: " + str(np.sum(np.sum(my_test_dataset_X.isnull()))))
null_values_per_col = np.sum(my_test_dataset_X.isnull(), axis=0)
cols_to_impute = []
for key in null_values_per_col.keys():
    if null_values_per_col.get(key) != 0: 
        cols_to_impute.append(key)
print("Before: Need to replace values in the columns in test data: " + str(cols_to_impute) + "\n")
imputation_val_for_na_cols = dict()
for col in cols_to_impute:
    if (my_test_dataset_X[col].dtype == 'float64' ) or  (my_test_dataset_X[col].dtype == 'int64'):
        #numerical col
        imputation_val_for_na_cols[col] = np.nanmedian(my_test_dataset_X[col]) #with median
for key, val in imputation_val_for_na_cols.items():
    my_test_dataset_X[key].fillna(value= val, inplace = True)
print("After: Number of cells with missing values in test data: " + str(np.sum(np.sum(my_test_dataset_X.isnull()))))
null_values_per_col = np.sum(my_test_dataset_X.isnull(), axis=0)
cols_to_impute = []
for key in null_values_per_col.keys():
    if null_values_per_col.get(key) != 0: 
        cols_to_impute.append(key)
print("After: Need to replace values in the columns in test data: " + str(cols_to_impute) + "\n")

### Stage 4: "Apply the model"
lm = LinearRegression()
lm.fit(my_train_dataset_X, my_train_dataset.SalePrice)

### Stage 5: "Sanity Check - the better the quality, the higher the predicted SalesPrice?"
plt.scatter(my_test_dataset.OverallQual, lm.predict(my_test_dataset_X))
plt.xlabel("Overall Quality of the house in test data")
plt.ylabel("Price of the house in test data")
plt.title("Relationship between Price and Quality in test data")
plt.show()

### Stage 6: "Check the performance of the Prediction"
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lm, my_train_dataset_X,  lm.predict(my_test_dataset_X), cv=10)
print("scores = " + str(scores))

My questions are:

1. Why do I get an error at Stage 6, and how can I fix it?

ValueError Traceback (most recent call last)
<ipython-input-2-700c31f0d410> in <module>()
     85 ### test the performance of the model
     86 from sklearn.model_selection import cross_val_score
---> 87 scores = cross_val_score(lm, my_train_dataset_X,  lm.predict(my_test_dataset_X), cv=10)
     88 print("scores = " + str(scores))
     89 
ValueError: Found input variables with inconsistent numbers of samples: [1460, 1459]

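2. Is there anything fundamentally wrong with my simple, basic linear regression approach?

Edits in response to comments:

@CalZ - first comment: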

my_test_dataset_X.shape = (1459, 36)
my_train_dataset_X.shape = (1460, 36)

@CalZ - second comment: Once I am sure that my approach is not fundamentally wrong, I will look into refactoring the code.

1 Answer

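As the error message states, the call to cross_val_score fails because the first dimensions of its array arguments differ (1460 vs. 1459). This matches the number of rows in your CSV files. The underlying problem, however, is that you are mixing the test set and the training set. You should call it with the test set only: cross_val_score(lm, my_test_dataset_X, lm.predict(my_test_dataset_X), cv=10). Update: my initial suggestion was incorrect, you cannot validate against your own predictions! Scoring needs true labels, and the test set has none, so you should set aside a subset of your labeled data and use it to compute the cross-validation.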
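
To make the update concrete, here is a minimal sketch of cross-validating on labeled data only. The synthetic data from make_regression is an assumption that stands in for the labeled rows of train.csv, and the hold-out split sizes are illustrative. In the asker's notebook the equivalent call would presumably be cross_val_score(lm, my_train_dataset_X, my_train_dataset.SalePrice, cv=10), since both arguments then have 1460 rows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Assumption: synthetic data standing in for the labeled training set.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Set aside part of the labeled data for a final check after cross-validation.
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 10-fold cross-validation: features and TRUE labels come from the same rows,
# so the two arrays always have the same number of samples.
scores = cross_val_score(LinearRegression(), X_fit, y_fit, cv=10)
print("R^2 per fold:", np.round(scores, 3))

# Final check on rows the model never saw during fitting.
model = LinearRegression().fit(X_fit, y_fit)
holdout_r2 = model.score(X_holdout, y_holdout)
print("hold-out R^2:", round(holdout_r2, 3))
```

For a regressor, cross_val_score uses the estimator's own score method by default, which for LinearRegression is R^2. It refits a fresh copy of the model on each fold, so the estimator does not need to be fitted beforehand.
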
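Your code is more than just linear regression. Most of it handles data manipulation (feature selection, data imputation) rather than the regression itself. In fact, you are reusing scikit-learn's implementation of linear regression, not writing your own. If you want a code review of your snippet, you could try http://codereview.stackexchange.com (I am not sure it fits there either, so you had better check their help center).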

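Update: as to whether your code is sound from a data science point of view, in my opinion (after a quick review) you are doing reasonable things. Some points could be improved: you handle only float64 and int64 columns (although you could do it as described here), you impute only NaN and None values (although in some cases other values, such as outliers, may need imputing too), and you impute blindly with the median (a safe choice, but one that should be evaluated against the nature of each variable). In general, though, it seems OK.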
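
As one possible way to address the first and third points, the sketch below selects every numeric column with select_dtypes instead of checking for float64 and int64 by hand, and moves the median imputation into a scikit-learn pipeline. The toy DataFrame and its values are hypothetical and only borrow a few column names from the question. This is an illustration under those assumptions, not necessarily the approach the answer's link pointed to.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; only the column names echo the question.
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n),
    "LotArea": rng.normal(10000, 2000, size=n),
    "Street": rng.choice(["Pave", "Grvl"], size=n),  # categorical column
})
toy["SalePrice"] = (
    20000 * toy["OverallQual"] + 5 * toy["LotArea"] + rng.normal(0, 5000, size=n)
)
# Knock out some feature values to simulate missing data.
toy.loc[toy.index % 7 == 0, "LotArea"] = np.nan

# Keep all numeric columns, whatever their exact integer or float dtype.
X = toy.drop(columns=["SalePrice"]).select_dtypes(include="number")
y = toy["SalePrice"]

# Median imputation followed by linear regression, evaluated as one unit.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
scores = cross_val_score(model, X, y, cv=5)
print("R^2 per fold:", np.round(scores, 3))
```

Because the imputer is part of the pipeline, its medians are recomputed from the training portion of each fold, so the validation rows do not influence the imputed values. Whether the median is the right fill value for a given column is still a per-variable judgment, as noted above.
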
