我正在对我的数据集(大约有 15 个输入特征和 1 个目标特征)进行随机森林回归。对于训练集和测试集,我的 R^2 都非常低,<1(如果 <1 不是足够好的 R^2 分数,请告诉我)。
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# load dataset
df = pd.read_csv('Dataset.csv')
# split into input (X) and output (Y) variables
X = df.drop(['ID_COLUMN', 'TARGET_COLUMN'], axis=1)
Y = df.TARGET_COLUMN
# Split the data into 67% for training and 33% for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)
# Fitting the regression model to the dataset
regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)
regressor.fit(X_train, Y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message
print("Predicting Values:")
y_pred = regressor.predict(X_test)
print("Getting Model Performance...")
# Get regression scores
print("R^2 train = ", regressor.score(X_train, Y_train))
print("R^2 test = ", regressor.score(X_test, Y_test))
这将输出以下内容:
Predicting Values:
Getting Model Performance...
R^2 train = 0.9791000275450427
R^2 test = 0.8577464692386905
然后,我检查了测试数据集中的实际目标列值与预测值之间的差异,如下所示:
diff = []
for i in range(len(y_pred)):
if Y_test.values[i]!=0: # a few values were 0 which was causing the corresponding diff value to become inf
diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i]) # element-wise percentage error
我发现大多数元素方面的差异在 40-60% 之间,它们的平均值几乎是 50%!
np.mean(diff)
>>> 49.07580695857447
那么,哪一个是正确的?回归分数是否正确并且我的模型对这些数据有好处,或者我计算的元素错误是否正确并且模型对这些数据表现不佳?如果是后者,请告知如何提高预测准确性。
我还检查了 rmse 分数:
import math
rmse = math.sqrt(np.mean((np.array(Y_test) - y_pred)**2))
rmse
>>> 3.67328471827293
模型做得很好,这似乎相当高,但如果我错了,请纠正我。
我还检查了不同数量的估计器的 R^2 分数:
import matplotlib.pyplot as plt
model = RandomForestRegressor(n_jobs=-1)
# Try different numbers of n_estimators
estimators = np.arange(10, 200, 10)
scores = []
for n in estimators:
model.set_params(n_estimators=n)
model.fit(X_train, Y_train)
scores.append(model.score(X_test, Y_test))
plt.title("Effect of n_estimators")
plt.xlabel("n_estimator")
plt.ylabel("score")
plt.plot(estimators, scores)
请指教。
我首先尝试使用线性回归,并获得了非常高的 MSE(这就是我尝试随机森林的原因):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
# The coefficients
print('Coefficients: \n', lr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))
Coefficients:
[ 1.93829229e-01 -4.68738825e-01 2.01635420e-01 6.35902010e-01
6.57354434e-03 5.13180293e-03 2.84015810e-01 -1.31469084e-06
1.95335035e+00]
Mean squared error: 86.92
Variance score: 0.08
