In Ridge regression and LASSO, why is a smaller β better?

regression lasso ridge-regression regularization
2022-03-27 14:03:11

Can anyone provide an intuitive view on why it is better to have smaller beta?

For LASSO I can understand it: there is a feature selection component. Fewer features make the model simpler and therefore less likely to overfit.

However, for ridge, all the features (factors) are kept; only their values are smaller (in the L2-norm sense). How does this make the model simpler?

Can anyone provide an intuitive view on this?

3 Answers

TL;DR - Same principle applies to both LASSO and Ridge

Fewer features make the model simpler and therefore less likely to overfit

The same intuition applies to ridge regression: we still prevent the model from over-fitting the data, but instead of targeting small, potentially spurious variables (whose coefficients get reduced to zero in LASSO), we target the biggest coefficients, which might be overstating the case for their respective variables.

The L2 penalty generally prevents the model from placing "too much" importance on any one variable, because large coefficients are penalized more than small ones.
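One way to see concretely why large coefficients bear most of the L2 penalty (my own illustration, using the $\beta$, $\lambda$ notation that appears further down):

$$
\frac{\partial}{\partial \beta_j}\bigl(\lambda \beta_j^2\bigr) = 2\lambda\beta_j,
$$

so the marginal cost of a coefficient grows in proportion to its current size: increasing an already-large $\beta_j$ by a small amount costs far more penalty than increasing a small one by the same amount, which is why (in the simplest, orthonormal case) the largest coefficients are shrunk the most in absolute terms.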

This might not seem like it "simplifies" the model, but it does a similar task of preventing the model from over-fitting to the data at hand.

An example to build intuition

Take a concrete example - you might be trying to predict hospital readmissions based on patient characteristics.

In this case, you might have a relatively rare variable (such as an uncommon disease) that happens to be very highly correlated with readmission in your training set. In a dataset of 10,000 patients, you might see this disease only 10 times, with 9 readmissions (an extreme example, to be sure).

As a result, its coefficient might be massive relative to the coefficients of other variables. Because ridge regression minimizes both the MSE and the L2 penalty, this coefficient is a prime candidate to be "shrunk" towards a smaller value: the variable is rare, so shrinking its coefficient barely hurts the MSE, while the coefficient itself is extreme, so shrinking it greatly reduces the penalty.
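A minimal simulation sketch of this scenario (my own illustration, not from the answer; it uses a continuous outcome and scikit-learn's `LinearRegression`/`Ridge` for simplicity, with invented variable names):

```python
# Hypothetical sketch: a rare indicator that is spuriously associated with the
# outcome in the training sample gets a large OLS coefficient; ridge shrinks it
# sharply while barely touching the common features.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 10_000

X_common = rng.normal(size=(n, 3))          # common patient characteristics
beta_common = np.array([0.5, -0.3, 0.2])    # their true effects
noise = rng.normal(size=n)
y = X_common @ beta_common + noise          # the rare disease has NO true effect

# By bad luck, the 10 patients with the rare disease happen to have the largest
# noise terms, so the indicator looks highly predictive in this training set.
rare = np.zeros(n)
rare[np.argsort(noise)[-10:]] = 1.0
X = np.column_stack([X_common, rare])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
# Expected pattern: the last (rare) coefficient is large under OLS and shrunk by
# roughly half under ridge (its sum of squares is ~10, comparable to alpha),
# while the common coefficients (sum of squares ~ n) are barely shrunk at all.
```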

There's no guarantee that having smaller weights is actually better. Lasso and ridge regression work by imposing prior knowledge/assumptions/constraints on the solution. This approach will work well if the prior/assumptions/constraints are well suited to the actual distribution that generated the data, and may not work well otherwise. Regarding simplicity/complexity, it's not the individual models that are simpler or more complex. Rather, it's the family of models under consideration.

From a geometric perspective, lasso and ridge regression impose constraints on the weights. For example, the common penalty/Lagrangian form of ridge regression:

$$
\min_\beta \;\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
$$

can be re-written in the equivalent constraint form:

$$
\min_\beta \;\|y - X\beta\|_2^2 \quad \text{s.t. } \|\beta\|_2^2 \le c
$$

This makes it clear that ridge regression constrains the weights to lie within a hypersphere whose radius is governed by the regularization parameter. Similarly, lasso constrains the weights to lie within a polytope whose size is governed by the regularization parameter. These constraints mean that most of the original parameter space is off-limits, and we search for the optimal weights within a much smaller subset of it. This smaller set can be considered less 'complex' than the full space.
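A quick numerical illustration of the shrinking feasible region (my own sketch, assuming the closed-form ridge solution $\hat\beta_\lambda = (X^TX + \lambda I)^{-1}X^Ty$ on synthetic data): as $\lambda$ grows, the norm of the fitted weights falls, i.e. the implied radius $c$ of the hypersphere shrinks.

```python
# Sketch: the closed-form ridge solution's L2 norm shrinks as lambda grows,
# i.e. the radius of the feasible hypersphere gets smaller.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(f"lambda = {lam:7.1f}   ||beta_hat||_2 = {np.linalg.norm(beta_hat):.3f}")
```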

From a Bayesian perspective, one can think about the posterior distribution over all possible choices of weights. Both lasso and ridge regression are equivalent to MAP estimation after placing a prior on the weights (lasso uses a Laplacian prior and ridge regression uses a Gaussian prior). A narrower posterior corresponds to greater restriction and less complexity, because high posterior density is given to a smaller set of parameters. For example, multiplying the likelihood function by a narrow Gaussian prior (which corresponds to a large ridge penalty) produces a narrower posterior.
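To make the MAP equivalence concrete, here is the standard sketch of the derivation (assuming Gaussian noise with variance $\sigma^2$ and an independent $\mathcal{N}(0, \tau^2)$ prior on each weight; $\tau$ is my notation, not part of the answer above):

$$
\hat\beta_{\text{MAP}}
= \arg\max_{\beta}\; p(y \mid X, \beta)\, p(\beta)
= \arg\min_{\beta}\; \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{2\tau^2}\|\beta\|_2^2
= \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2,
\qquad \lambda = \frac{\sigma^2}{\tau^2}.
$$

A narrower prior (smaller $\tau^2$) therefore corresponds to a larger $\lambda$, which is exactly the "large ridge penalty gives a narrower posterior" statement above.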

One of the primary reasons to impose constraints/priors is that choosing the optimal model from a more restricted family is less likely to overfit than choosing it from a less restricted family. This is because the less restricted family affords 'more' ways to fit the data, and it's increasingly likely that one of them will be able to fit random fluctuations in the training set. For a more formal treatment, see the bias-variance tradeoff. This doesn't necessarily mean that choosing a model from a more restricted family will work well. Getting good performance requires that the restricted family actually contains good models. This means we have to choose a prior/constraint that's well-matched to the specific problem at hand.

Though the question asked for an intuitive explanation, there is actually a rigorous derivation of the mean squared error (MSE) for ridge regression showing that there exist values of $\lambda$ achieving a lower MSE than ordinary least squares.

Recall: $\operatorname{MSE}(\hat\beta) = E\bigl[(\hat\beta - \beta)(\hat\beta - \beta)^T\bigr]$. Call $\hat\beta_\lambda$ the ridge estimator of $\beta$ with shrinkage parameter $\lambda$, and define $M(\lambda) = \operatorname{MSE}(\hat\beta_\lambda)$.

Therefore $M(0)$ is the MSE of the ordinary (unpenalized) linear regression.

Following these course notes one can show that:

$$
M(0) - M(\lambda) = \lambda\,(X^TX + \lambda I)^{-1}\bigl(2\sigma^2 I + \lambda\sigma^2 (X^TX)^{-1} - \lambda\beta\beta^T\bigr)\bigl\{(X^TX + \lambda I)^{-1}\bigr\}^T
$$

The factors $(X^TX + \lambda I)^{-1}$ are positive definite, and for $\lambda < 2\sigma^2(\beta^T\beta)^{-1}$ the middle term $2\sigma^2 I + \lambda\sigma^2(X^TX)^{-1} - \lambda\beta\beta^T$ is positive definite as well. For these values of $\lambda$ we have $M(0) \succ M(\lambda)$ (in the positive-definite ordering), showing that ridge regression achieves a smaller mean squared error than ordinary least squares.
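A small Monte Carlo check of this result (my own sketch, not part of the original derivation): since $M(0) - M(\lambda)$ is positive definite for such $\lambda$, the trace inequality $E\|\hat\beta_\lambda - \beta\|^2 < E\|\hat\beta_0 - \beta\|^2$ must also hold, which is easy to verify numerically with an invented design.

```python
# Sketch: Monte Carlo comparison of E||beta_hat - beta||^2 (the trace of the
# MSE matrix) for OLS (lambda = 0) and for a ridge penalty inside the range
# lambda < 2*sigma^2 / (beta^T beta) guaranteed by the result above.
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 30, 5, 2.0
X = rng.normal(size=(n, p))              # fixed design across replications
beta = np.full(p, 0.3)
lam = sigma**2 / (beta @ beta)           # halfway into the guaranteed range

XtX = X.T @ X
sq_err = {0.0: [], lam: []}
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    for l in sq_err:
        b_hat = np.linalg.solve(XtX + l * np.eye(p), X.T @ y)
        sq_err[l].append(np.sum((b_hat - beta) ** 2))

for l, errs in sq_err.items():
    print(f"lambda = {l:6.3f}   mean ||beta_hat - beta||^2 = {np.mean(errs):.3f}")
```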