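I want to run cross-validation to find the regularization parameter for Lasso. I am using the scikit-learn library in Python. I first generate a dataset and then perform k-fold cross-validation. Here is my code (most of it comes from an example on the scikit-learn website):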
# generate some sparse data to play with
import numpy as np
n_samples, n_features = 5000, 200
X = np.random.randn(n_samples, n_features)
coef = 3 * np.random.randn(n_features)
coef[10:] = 0 # sparsify coef
y = np.dot(X, coef)
# add noise
y += 0.01 * np.random.normal(size=n_samples)  # pass the shape as size=; a positional tuple is read as the mean
# Split data in train set and test set
n_samples = X.shape[0]
X_train, y_train = X[:n_samples // 2], y[:n_samples // 2]  # integer division so the slice index is an int
X_test, y_test = X[n_samples // 2:], y[n_samples // 2:]
###############################################################################
# Lasso
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
from matplotlib import pyplot as plt
kf = KFold(n_splits=10)
alphas = np.logspace(-16, 3, num=50, base=2)
e_alphas = list()    # holds the mean squared error, averaged over folds
e_alphas_r = list()  # holds the R^2 score, averaged over folds
for alpha in alphas:
    lasso = Lasso(alpha=alpha)
    err = list()
    err_2 = list()
    for tr_idx, tt_idx in kf.split(X_train):
        # both the fitting part and the held-out fold are taken from the training set
        X_tr, X_tt = X_train[tr_idx], X_train[tt_idx]
        y_tr, y_tt = y_train[tr_idx], y_train[tt_idx]
        lasso.fit(X_tr, y_tr)
        y_hat = lasso.predict(X_tt)
        err_2.append(lasso.score(X_tt, y_tt))
        err.append(np.average((y_hat - y_tt)**2))
    e_alphas.append(np.average(err))
    e_alphas_r.append(np.average(err_2))
fig = plt.figure(figsize=(15, 10))  # assigning plt.figsize has no effect; pass figsize here instead
ax = fig.add_subplot(111)
ax.plot(alphas, e_alphas, 'b-')
ax.plot(alphas, e_alphas_r, 'g--')
ax.set_xlabel("alpha")
plt.show()
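The resulting error curves are shown in the figure below:
I know that scikit-learn offers other ways to do this, such as LassoCV, but I just want to know how to choose the parameter given the kind of plot I get. Thanks for your replies.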
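
For context, the built-in route mentioned above would look roughly like the sketch below. It assumes a recent scikit-learn in which LassoCV exposes the attributes alphas_, mse_path_ and alpha_. The smaller data sizes, the fixed seed and max_iter=10000 are illustrative assumptions, not the values used in the code above. It also shows one common convention for reading such a curve, which is to take the alpha with the lowest mean cross-validated squared error. This is a sketch of that convention, not the only reasonable rule.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic sparse data; sizes and seed are illustrative assumptions
rng = np.random.RandomState(0)
n_samples, n_features = 500, 50
X = rng.randn(n_samples, n_features)
coef = 3 * rng.randn(n_features)
coef[10:] = 0  # only the first 10 features carry signal
y = X @ coef + 0.01 * rng.randn(n_samples)

# Same alpha grid as in the question, evaluated with 10-fold cross-validation
alphas = np.logspace(-16, 3, num=50, base=2)
model = LassoCV(alphas=alphas, cv=10, max_iter=10000).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds); its rows follow model.alphas_,
# which is stored in decreasing order
mean_mse = model.mse_path_.mean(axis=1)
best_alpha = model.alphas_[np.argmin(mean_mse)]

print("alpha with lowest mean CV error:", best_alpha)
print("alpha selected by LassoCV:      ", model.alpha_)
```

The same argmin rule could be applied to the manually computed list e_alphas from the loop above, since it holds one averaged squared error per candidate alpha.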

