How defensible is it to choose λ in a LASSO model so that it yields the number of nonzero predictors one desires?


When I determine my lambda through cross-validation, all coefficients become zero. But I have some hints from the literature that some of the predictors should definitely affect the outcome. Is it rubbish to arbitrarily choose lambda so that there is just as much sparsity as one desires?

I want to select the top 10 or so predictors out of 135 for a Cox model, and the effect sizes are unfortunately small.
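For concreteness, here is roughly the kind of search I have in mind: scan the lasso path and pick the penalty that leaves about 10 coefficients nonzero. The sketch below uses scikit-learn's `lasso_path` (which calls λ "alpha") on synthetic Gaussian data purely as a stand-in, since my actual model is a Cox model and would need a Cox-capable lasso (e.g. glmnet).

```python
# Sketch with synthetic data: scan a lasso path and pick the penalty that
# gives roughly the desired number of nonzero coefficients. Plain Gaussian
# lasso via scikit-learn, only to illustrate the idea being asked about.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p, target_nonzero = 200, 135, 10

X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 0.3                                 # a few weak "true" effects
y = X @ beta + rng.standard_normal(n)

# lasso_path returns the coefficients at a grid of penalty values.
alphas, coefs, _ = lasso_path(X, y, n_alphas=200)
n_nonzero = (coefs != 0).sum(axis=0)           # support size at each penalty

# Choose the penalty whose support size is closest to the target.
best = np.argmin(np.abs(n_nonzero - target_nonzero))
print(f"alpha={alphas[best]:.4f} gives {n_nonzero[best]} nonzero coefficients")
```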

3 Answers

If you want to have at least a definite number of predictors, with some range of values suggested by the literature, why choose the pure-LASSO approach to begin with? As @probabilityislogic suggested, you should be using informative priors on those variables about which you have some knowledge. If you want to retain some of the LASSO properties for the rest of the predictors, you could use a prior with a double-exponential distribution for each of the other inputs, i.e., use a density of the form

$$p(\beta_i) = \frac{\lambda}{2}\exp(-\lambda\,|\beta_i|),$$
where $\lambda$ is the Lagrange multiplier corresponding to the pure-LASSO solution. This last statement comes from the fact that, in the absence of the variables with informative priors, this is another way of deriving the LASSO (by maximizing the posterior mode under normality assumptions for the residuals).
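For concreteness, here is a minimal sketch of that mixed-prior idea under a plain Gaussian likelihood. The prior means, scales, and λ below are made-up illustrations, not values you should use, and a generic optimizer will not produce the exact zeros that a dedicated lasso solver would.

```python
# Sketch: informative Gaussian priors on a few coefficients, Laplace
# (double-exponential) priors on the rest, Gaussian residuals. Maximizing
# the log-posterior is then a penalized least-squares problem.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.standard_normal((n, p))
y = X[:, 0] * 0.5 + rng.standard_normal(n)

informative = {0: (0.5, 0.2)}   # index -> (prior mean, prior sd); illustrative only
lam = 5.0                       # Laplace rate for all other coefficients
sigma2 = 1.0                    # residual variance, fixed here for simplicity

def neg_log_posterior(beta):
    nll = np.sum((y - X @ beta) ** 2) / (2 * sigma2)        # Gaussian residuals
    pen = 0.0
    for j in range(p):
        if j in informative:
            m, s = informative[j]
            pen += (beta[j] - m) ** 2 / (2 * s ** 2)        # informative Gaussian prior
        else:
            pen += lam * abs(beta[j])                       # double-exponential prior -> L1
    return nll + pen

# Derivative-free optimizer, since the L1 term is not differentiable at zero.
beta_map = minimize(neg_log_posterior, np.zeros(p), method="Powell").x
print(np.round(beta_map, 2))
```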

There is a nice way to perform LASSO-style selection with a fixed number of predictors: least angle regression (LAR or LARS), described in Efron's paper. During its iterative procedure it creates a sequence of linear models, each with one more predictor than the last, so you can select the one with the desired number of predictors.
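As a sketch, scikit-learn's LARS implementation lets you stop the path once a chosen number of predictors has entered. This is ordinary least-squares LARS on synthetic data, not a Cox-model version, so for the Cox setting it only illustrates the idea.

```python
# Sketch: LARS stopped after a fixed number of predictors have entered.
import numpy as np
from sklearn.linear_model import Lars

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 135))
y = X[:, :5] @ np.full(5, 0.3) + rng.standard_normal(200)

model = Lars(n_nonzero_coefs=10).fit(X, y)   # stop once 10 predictors are active
selected = np.flatnonzero(model.coef_)
print("selected predictors:", selected)
```

If you want to inspect every step of the path rather than a single stopping point, scikit-learn's `lars_path` returns the full sequence of models.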

Another way is ℓ1 or ℓ2 regularization. As Nestor mentioned, using appropriate priors you can incorporate prior knowledge into the model. The so-called relevance vector machine by Tipping can be useful here.
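scikit-learn does not ship Tipping's RVM itself, but its `ARDRegression` uses the same automatic-relevance-determination type of prior, so a rough sketch of the idea (on synthetic data) looks like this:

```python
# Sketch: automatic relevance determination, the prior idea underlying the
# relevance vector machine. Coefficients of irrelevant inputs are shrunk
# toward zero by their individual precision parameters.
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200)

ard = ARDRegression().fit(X, y)
print(np.round(ard.coef_, 3))   # only the first two inputs keep sizable weights
```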

No, that is not defensible. The great hurdle that model selection procedures are designed to overcome is that the cardinality of the true support $|S| = |\{j : \beta_j \neq 0\}|$ is unknown. (Here $\beta$ is the "true" coefficient vector.) Because $|S|$ is unknown, a model selection procedure has to search exhaustively over all $2^p$ possible models; if we did know $|S|$, we could just check the $\binom{p}{|S|}$ models, which are far fewer.
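To make that gap concrete with the numbers from the question (p = 135 candidate predictors, and supposing the true support really had size 10):

```python
# Counting the candidate models with and without knowledge of |S|.
import math

p, s = 135, 10
print(f"all subsets:        2^{p} = {2**p:.3e}")                   # ~4.4e40
print(f"subsets of size {s}: C({p},{s}) = {math.comb(p, s):.3e}")  # ~3.9e14
```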

The theory of the lasso relies on the regularization parameter $\lambda$ being sufficiently large to make the selected model sufficiently sparse. It could be that your 10 features are too many or too few, since it isn't trivial to turn a lower bound on $\lambda$ into an upper bound on $|S|$.

Let $\hat{\beta}$ be our data-driven estimate for $\beta$, and put $\hat{S} = \{j : \hat{\beta}_j \neq 0\}$. Then perhaps you're trying to ensure that $S \subseteq \hat{S}$, so that you've recovered at least the relevant features? Or maybe you're trying to establish that $\hat{S} \subseteq S$, so that the features you've found are all worthwhile? In either case, your procedure would be more justified if you had prior information on the size of $S$.

Also, note that you can leave some coefficients unpenalized when performing the lasso in, for instance, glmnet (via its penalty.factor argument).