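I have tried several things to find the best regression model with the best parameters, but I can't get past 40% correct predictions.
I have 67,741 rows in an Excel file. The cleaned data looks like this (only 4 columns, is that enough?):
I'll try to explain my process.
I went to https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html. According to that chart, the models that should fit my data are Lasso and ElasticNet, but I got very bad scores with this code: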
from sklearn import linear_model
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# max_iter was raised because of a warning saying it should be increased
classifiers = [
    ElasticNetCV(cv=5, random_state=0, max_iter=40000),
    linear_model.Lasso(alpha=0.1, max_iter=40000),
]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

for clf in classifiers:
    name = clf.__class__.__name__
    clf.fit(X_train, y_train)
    print("=" * len(name))
    print(name)
    print(clf.score(X_test, y_test))  # for regressors, score() is R^2 on the test set
Scores:
============
ElasticNetCV
0.002404878871672067
=====
Lasso
0.0066801704903023396
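A note on reading these numbers: for scikit-learn regressors, `score` returns the coefficient of determination R², not a fraction of correct predictions. The sketch below is only an illustration on synthetic data (the arrays `X_demo` and `y_demo` are made up and are not the data from the Excel file); it shows that `score` and `r2_score` agree.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 features, noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

model = Lasso(alpha=0.1, max_iter=40000).fit(X_tr, y_tr)

# Both values are R^2 computed on the same held-out rows.
score_from_model = model.score(X_te, y_te)
score_from_metric = r2_score(y_te, model.predict(X_te))
print(score_from_model, score_from_metric)
```

Under this reading, a value near 0 means the model does about as well as always predicting the mean of `y`, which is a different statement from "0% of predictions are correct".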
Then I tried several other models, and I finally got something:
BaggingRegressor(base_estimator=DecisionTreeRegressor(),
max_features=0.5, max_samples=0.5)
Score:
================
BaggingRegressor
0.3460147571634854
So I used GridSearchCV and then the cross-validation score to find the best parameters, but every time I run it I get a different result:
BaggingRegressor(base_estimator=DecisionTreeRegressor(), bootstrap=True,
bootstrap_features=False,
max_features=0.2, max_samples=0.7, n_estimators=10,
n_jobs=None, oob_score=False, random_state=None, verbose=0,
warm_start=False)
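One possible reason for the run-to-run differences is visible in the output above: `random_state=None`, so the bootstrap samples and feature subsets are drawn differently on every run. The sketch below is a minimal illustration on synthetic data (an assumption, not the original dataset; `fit_and_score` is a hypothetical helper) of fixing the seed so that two fits give the same score.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 features, nonlinear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

def fit_and_score(seed):
    # BaggingRegressor uses a decision tree as its default base estimator.
    model = BaggingRegressor(max_features=0.5, max_samples=0.5,
                             n_estimators=10, random_state=seed)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

# Same seed twice: the resampling is identical, so the scores match.
seeded_a = fit_and_score(0)
seeded_b = fit_and_score(0)
print(seeded_a, seeded_b)
```

Passing an estimator built with a fixed `random_state` into `GridSearchCV` should make the search repeatable in the same way, although close scores between neighbouring parameter settings may still mean the "best" choice is not very meaningful.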
This is how I used it:
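from sklearn.model_selection import GridSearchCV

# 1.0 (float) means "all samples/features"; the integer 1 would mean exactly one
param_grid = [{'max_samples': [0.1, 0.2, 0.5, 0.7, 1.0],
               'max_features': [0.1, 0.2, 0.5, 0.7, 1.0],
               'n_estimators': [5, 10, 15, 20, 25]}]

def grid_search(clf, param_grid):
    name = clf.__class__.__name__
    search = GridSearchCV(clf, param_grid, cv=5)
    search.fit(X, y)
    print(search.best_estimator_)
    print("=" * len(name))
    print(search.best_params_)
    print("=" * len(name))
    print(search.best_score_)  # mean cross-validated R^2 of the best setting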
And cross-validation like this:
scores = cross_val_score(clf, X, y, cv=10)
Sorry if this is a bit long. With 67,000 rows, why can't I get more than 30% correct predictions? What is wrong?
Thanks
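For reference, with an integer `cv` and a regressor, scikit-learn splits the rows with `KFold` without shuffling, so a sorted spreadsheet can give folds that look quite different from each other. A minimal sketch, on synthetic data (an assumption, not the original dataset), of a shuffled and seeded split that reports the spread across folds:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 4 features, interaction target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

# Shuffle the rows before splitting and fix the seed so the folds are repeatable.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
model = BaggingRegressor(n_estimators=10, random_state=0)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
print("R^2 per fold:", np.round(scores, 3))
print("mean = %.3f, std = %.3f" % (scores.mean(), scores.std()))
```

Looking at the mean together with the standard deviation gives a rough idea of how much of a difference between two parameter settings is just noise.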

