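I am a beginner in machine learning and I want to build a model to predict house prices. I prepared a dataset by scraping a local housing website; it contains 1000 samples and only 4 features (latitude, longitude, area, and number of rooms).
I tried the RandomForestRegressor and LinearSVR models in sklearn, but I cannot train either model properly and the MSE is extremely high.
The MSE is almost 90,000,000 (the true prices range from 5,000,000 to 900,000,000).
Here is my code: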
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVR

# Load the scraped data; every column except 'price' is used as a feature.
df = pd.read_csv('dataset.csv', index_col=False)
X = df.drop('price', axis=1)
X_data = X.values
Y_data = df.price.values
X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=0.2, random_state=5)

rgr = RandomForestRegressor(n_estimators=100)
svr = LinearSVR()
rgr.fit(X_train, Y_train)
svr.fit(X_train, Y_train)

# scikit-learn scorers follow "higher is better", so MSE is exposed as
# 'neg_mean_squared_error' and the returned scores are negative.
MSEs = cross_val_score(estimator=rgr,
                       X=X_train,
                       y=Y_train,
                       scoring='neg_mean_squared_error',
                       cv=5)
MSEsSVR = cross_val_score(estimator=svr,
                          X=X_train,
                          y=Y_train,
                          scoring='neg_mean_squared_error',
                          cv=5)

# Flip the sign back to a positive MSE, then take the root (random forest only).
MSEs *= -1
RMSEs = np.sqrt(MSEs)
print("Root mean squared error with 95% confidence interval:")
print("{:.3f} (+/- {:.3f})".format(RMSEs.mean(), RMSEs.std()*2))
print("")
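For a sense of scale, the errors can be compared against a naive baseline that always predicts the mean price. The following is a minimal sketch, not the script above: it assumes synthetic data in place of dataset.csv (the coordinate ranges and the price rule are made up), and it wraps LinearSVR in StandardScaler and TransformedTargetRegressor on the assumption that unscaled features and very large price values may be part of the problem.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Synthetic stand-in for dataset.csv (assumption: the real data is not available here).
rng = np.random.default_rng(5)
n = 1000
lat = rng.uniform(35.6, 35.8, n)     # hypothetical latitude range
lon = rng.uniform(51.2, 51.6, n)     # hypothetical longitude range
area = rng.uniform(40, 300, n)       # hypothetical area range
rooms = rng.integers(1, 6, n)        # 1 to 5 rooms
X = np.column_stack([lat, lon, area, rooms])
# Hypothetical price rule, only so the example has something to learn.
y = 2_000_000 * area + 10_000_000 * rooms + rng.normal(0, 20_000_000, n)


def cv_rmse(model, X, y):
    """Mean 5-fold cross-validated RMSE, in the same units as y."""
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    return float(np.sqrt(-neg_mse).mean())


# LinearSVR is sensitive to scale, so both the features and the target are standardized.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), LinearSVR(max_iter=10000, random_state=5)),
    transformer=StandardScaler(),
)

models = {
    "baseline (predict mean)": DummyRegressor(strategy="mean"),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=5),
    "scaled LinearSVR": scaled_svr,
}

results = {name: cv_rmse(model, X, y) for name, model in models.items()}
for name, rmse in results.items():
    print("{:<25s} RMSE = {:,.0f}".format(name, rmse))
```

An RMSE is only "high" relative to the baseline and to the price range, so a model whose RMSE is well below the baseline is learning something even if the raw number looks huge.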
Is there a problem with my dataset or the number of features? How can I build a predictive model with this kind of dataset?