I am trying to build a multiple linear regression model with 3 different methods, and each method gives a different result. I expected all of them to give the same result, so where does this difference come from?
Using GridSearchCV
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(data, ground_truth_data,
                                                    test_size=0.3, random_state=1)
model = linear_model.LinearRegression()
# 'normalize' was removed from LinearRegression in recent scikit-learn versions
parameters = {'fit_intercept': [True, False], 'copy_X': [True, False]}
grid = GridSearchCV(model, parameters, cv=None)
grid.fit(X_train, y_train)
print("r2 / variance : ", grid.best_score_)
print("Residual sum of squares: %.2f"
      % np.mean((grid.predict(X_test) - y_test) ** 2))
The output is:
r2 / variance : 0.823041227357
Residual sum of squares: 0.18
Using linear regression without GridSearchCV
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, ground_truth_data,
                                                    test_size=0.3, random_state=1)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("r2 / variance : ", model.score(X_test, y_test))
print("Residual sum of squares: %.2f"
      % np.mean((model.predict(X_test) - y_test) ** 2))
The output is:
r2 / variance : 0.883799174674
Residual sum of squares: 0.18
Using the statsmodels OLS method
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, ground_truth_data,
                                                    test_size=0.3, random_state=1)
x_train = sm.add_constant(X_train)
model = sm.OLS(y_train, x_train)
results = model.fit()
print("r2 / variance : ", results.rsquared)
The output is:
r2/variance : 0.893686634315
I am confused about three different points:
- Why doesn't using GridSearchCV increase the r2 score, and why is the residual sum of squares the same in both cases?
My guess is that GridSearchCV performs some cross-validation (probably k-fold), so the r2 score decreases when we use it, but I am not clear on this point.
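To test this guess, I tried a small sketch on synthetic data (`make_regression` is just a stand-in for my actual data): as far as I understand, `grid.best_score_` is the mean cross-validated r2 over training folds, while `model.score(X_test, y_test)` is computed on the held-out test set, so the two numbers would generally differ even for the same model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data standing in for `data` / `ground_truth_data` (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

model = LinearRegression().fit(X_train, y_train)

# Mean k-fold r2 on the training data -- analogous to grid.best_score_.
cv_r2 = cross_val_score(model, X_train, y_train, cv=5).mean()

# r2 on the held-out test set -- analogous to model.score(X_test, y_test).
test_r2 = model.score(X_test, y_test)

print(cv_r2, test_r2)
```

Is this the right way to think about why the two r2 values disagree while the test-set residual sum of squares stays the same?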
- What is the difference between scikit-learn and statsmodels OLS?
> My guess is that statsmodels OLS looks at the training error while scikit-learn looks at the test error, so I think using scikit-learn's OLS is more rational.
- When and how should we use GridSearchCV with a regression model?
> I don't have much of a guess here.
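One thing I noticed while experimenting: plain LinearRegression has almost nothing to tune, so GridSearchCV seems more useful with a regularized regressor such as Ridge, where `alpha` is a real hyperparameter. A sketch on synthetic data (the data and the alpha grid are my own assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for my actual data (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# Search over the regularization strength with 5-fold cross-validation.
grid = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)           # alpha with the best mean CV score
print(grid.score(X_test, y_test))  # refit best model, scored on held-out data
```

Is this the intended usage pattern, i.e. is GridSearchCV mainly worthwhile once the model has hyperparameters that actually change the fit?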
Thanks for any and all ideas.