在回归任务中,我得到了低 R^2 值,但测试集和预测值之间的元素差异很大

数据挖掘 机器学习 Python 预测建模 回归 随机森林
2022-02-16 02:29:30

我正在对我的数据集(大约有 15 个输入特征和 1 个目标特征)进行随机森林回归。对于训练集和测试集,我的 R^2 都非常低,<1(如果 <1 不是足够好的 R^2 分数,请告诉我)。

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# load dataset
df = pd.read_csv('Dataset.csv')

# split into input (X) and output (Y) variables
X = df.drop(['ID_COLUMN', 'TARGET_COLUMN'], axis=1)
Y = df.TARGET_COLUMN

# Split the data into 67% for training and 33% for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)

# Fitting the regression model to the dataset
regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)
regressor.fit(X_train, Y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message


print("Predicting Values:")
y_pred = regressor.predict(X_test)

print("Getting Model Performance...")

# Get regression scores
print("R^2 train = ", regressor.score(X_train, Y_train))
print("R^2 test = ", regressor.score(X_test, Y_test))

这将输出以下内容:

Predicting Values:
Getting Model Performance...
R^2 train =  0.9791000275450427
R^2 test = 0.8577464692386905

然后,我检查了测试数据集中的实际目标列值与预测值之间的差异,如下所示:

diff = []
for i in range(len(y_pred)):
    if Y_test.values[i]!=0: # a few values were 0 which was causing the corresponding diff value to become inf
        diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i]) # element-wise percentage error

我发现大多数元素方面的差异在 40-60% 之间,它们的平均值几乎是 50%!

np.mean(diff)
>>> 49.07580695857447

那么,哪一个是正确的?回归分数是否正确并且我的模型对这些数据有好处,或者我计算的元素错误是否正确并且模型对这些数据表现不佳?如果是后者,请告知如何提高预测准确性。


我还检查了 rmse 分数:

import math
rmse = math.sqrt(np.mean((np.array(Y_test) - y_pred)**2))
rmse
>>> 3.67328471827293

模型做得很好,这似乎相当高,但如果我错了,请纠正我。

我还检查了不同数量的估计器的 R^2 分数:

import matplotlib.pyplot as plt
model = RandomForestRegressor(n_jobs=-1)
# Try different numbers of n_estimators
estimators = np.arange(10, 200, 10)
scores = []
for n in estimators:
    model.set_params(n_estimators=n)
    model.fit(X_train, Y_train)
    scores.append(model.score(X_test, Y_test))
plt.title("Effect of n_estimators")
plt.xlabel("n_estimator")
plt.ylabel("score")
plt.plot(estimators, scores)

在此处输入图像描述

请指教。


我首先尝试使用线性回归,并获得了非常高的 MSE(这就是我尝试随机森林的原因):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# The coefficients
print('Coefficients: \n', lr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))


Coefficients: 
 [ 1.93829229e-01 -4.68738825e-01  2.01635420e-01  6.35902010e-01
  6.57354434e-03  5.13180293e-03  2.84015810e-01 -1.31469084e-06
  1.95335035e+00]
Mean squared error: 86.92
Variance score: 0.08
1个回答

这条线在我看来是错误的:

diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i])

腹肌不应该围绕整个计算吗?

diff.append(100*np.abs((y_pred[i]-Y_test.values[i])/Y_test.values[i]))

除此之外,RMSE 计算看起来很准确,并且在误差范围内,并且R2很棒,所以在所有条件相同的情况下,我会倾向于寻找您在评估错误时做错的事情。这就是为什么我专注于你的计算。

另一种想法是,您检查过异常值吗?这可能会严重影响某些措施,而不会影响其他措施。