What is the problem with p > n?

dimensionality-reduction linear matrix linear-algebra regression-strategies
2022-03-13 15:55:33

I know that this boils down to solving a system of linear equations.

But my question is: why is it a problem when the number of observations is lower than the number of predictors, and how does that situation even arise?

Doesn't data collection follow from a careful survey or experimental design, to the extent that researchers would at least have thought about this?

If a researcher wants to collect 45 variables for a study, why would they collect fewer than 45 observations? Did I miss something? And doesn't the model-selection step also eliminate variables that do not improve prediction of the response, so that the number of retained variables ends up below 45 (p < 45)?

So why would we face a non-unique solution in such cases?

2 Answers

This is a very good question. When the number of candidate predictors p approaches or exceeds the number of observations n, several problems arise:

  • If you think about the number of non-redundant linear combinations of variables that can be analyzed, this number is min(n, p).
  • With p = n you can obtain an apparent R² = 1.0 predicting y even when there is no real relationship between x and y, i.e., the true R² is zero (see the sketch after this list).
  • If you use any feature-selection algorithm, such as the dreaded stepwise regression, the list of features "selected" will essentially be a random set of features with no hope of replicating in another sample. This is especially true if there are correlations among the candidate features, i.e., collinearity.
  • The value of n needed for reliable estimation grows with the number of parameters you try to estimate; with n no larger than p there is nothing left over for reliable estimation.
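
The perfect-fit point in the second bullet above is easy to demonstrate. Here is a minimal Python/numpy sketch (my own illustrative setup, using n = p = 45 pure-noise data, not anything from the original answer), showing that ordinary least squares reaches an in-sample R² of 1.0 even though the predictors carry no information about y:

```python
import numpy as np

rng = np.random.default_rng(0)
n = p = 45
X = rng.normal(size=(n, p))   # pure-noise predictors (illustrative)
y = rng.normal(size=n)        # response unrelated to X

# With p = n and X of full rank, least squares interpolates y exactly.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
r2 = 1.0 - residuals @ residuals / ((y - y.mean()) @ (y - y.mean()))
print(f"In-sample R^2 on pure noise with p = n: {r2:.4f}")  # prints ~1.0000
```

On held-out data, of course, such a fit would predict no better than chance, which is exactly the overfitting problem described above.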

In general, a study that intends to analyze 45 variables on 45 subjects is poorly planned, and the only ways to rescue it that I know of are

  • Pre-specify one or two predictors to analyze and ignore the rest
  • Use penalized estimation such as ridge regression to fit all the variables but take the coefficients with a grain of salt (heavy discounting)
  • Use data reduction, e.g., principal components, variable clustering, or sparse principal components (my favorite), as discussed in my RMS book and course notes. This involves combining variables that are hard to separate rather than trying to estimate separate effects for them. For n = 45 you may only get away with 2 collapsed scores to play against y. Data reduction (unsupervised learning) is more interpretable than most other methods. (Both the penalization and the data-reduction options are sketched after this list.)
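
To make the second and third rescue options concrete, here is a hedged numpy sketch (the simulated data, the penalty value lam = 10, and the choice of 2 components are all illustrative assumptions, not recommendations). Ridge regression keeps the normal equations invertible even when p ≥ n, and unsupervised principal-component scores collapse the 45 candidate variables into 2 before any supervised fitting:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 45, 45
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [2.0, -1.0]            # only two real effects among 45 candidates
y = X @ beta_true + rng.normal(size=n)

# Penalized estimation: the penalty lam * I keeps X'X + lam*I invertible
# even when p >= n, at the cost of shrinking (discounting) every coefficient.
lam = 10.0                             # illustrative value, not tuned
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Unsupervised data reduction: collapse the 45 variables into 2
# principal-component scores (computed without looking at y), then fit
# ordinary least squares on just those scores.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                 # n x 2 matrix of collapsed scores
coef_pc, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), scores]), y, rcond=None)

print("ridge coefficients (first 5):", np.round(beta_ridge[:5], 2))
print("coefficients on the 2 PC scores:", np.round(coef_pc[1:], 2))
```

The ridge coefficients should be interpreted with the heavy discounting mentioned above; the two collapsed scores trade separate effect estimates for a fit the sample size can actually support.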

A technical detail: if you use one of the best combined variable-selection/penalization methods, such as the lasso or elastic net, you can lower the chance of overfitting, but you will ultimately be disappointed because the list of selected features is highly unstable and will not replicate in other datasets.
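
That instability is easy to see by resampling. Here is a small sketch assuming scikit-learn is available (the alpha value and the data-generating setup are arbitrary illustrations, not tuned choices); refitting the lasso on bootstrap resamples typically selects a different feature set each time:

```python
import numpy as np
from sklearn.linear_model import Lasso   # assumes scikit-learn is installed

rng = np.random.default_rng(2)
n, p = 45, 45
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # only features 0 and 1 matter

# Refit the lasso on a few bootstrap resamples and compare which features
# receive non-zero coefficients; the selected sets usually differ run to run.
for b in range(5):
    idx = rng.integers(0, n, size=n)
    coef = Lasso(alpha=0.2, max_iter=10_000).fit(X[idx], y[idx]).coef_
    print(f"bootstrap {b}: selected features {np.flatnonzero(coef).tolist()}")
```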

This can occur in many scenarios; a few examples are:

  1. Medical data analysis at hospitals. Medical researchers studying a particular cancer can usually collect data only at their own hospital, and it is not a bad thing that they try to collect as many variables as possible for each patient, such as age, gender, tumour size, and MRI or CT volumes.
  2. Microarray or microplate-reader studies in bioinformatics. It is often the case that you do not have many samples, but you want to be able to test for as many effects as possible.
  3. Analysis of images. A single image often has 16 million pixels, while it is very difficult to collect and store that many images.
  4. MRI reconstruction is often a similar problem that requires sparse regression techniques, and improving these is a central question in MRI research.

The solution, really, is to look at the regression literature and find what works best for your application.

  1. If you have domain knowledge, incorporate it into your prior distribution and take a Bayesian approach with Bayesian linear regression (see the sketch after this list).

  2. If you want to find a sparse solution, automatic relevance determination’s empirical Bayes approach could be the way to go.

  3. If you think that, for your problem, a notion of probabilities is inappropriate (as when simply solving a linear system of equations), it might be worth looking at the Moore-Penrose pseudoinverse (also sketched after this list).

  4. You can approach it from a feature-selection perspective and reduce the number of predictors p until the problem is well posed.
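
As a rough illustration of options 1 and 3 (the data shapes and the noise/prior variances below are assumptions chosen purely for demonstration): numpy's pinv returns the minimum-norm solution among the infinitely many coefficient vectors that fit an underdetermined system exactly, and under a Gaussian prior the Bayesian posterior mean takes a ridge-like closed form, so domain knowledge enters through the prior scale.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 45                           # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Option 3: the Moore-Penrose pseudoinverse picks, among the infinitely many
# coefficient vectors that reproduce y exactly, the one with minimum norm.
beta_min_norm = np.linalg.pinv(X) @ y
print("residual norm:", np.linalg.norm(X @ beta_min_norm - y))   # ~0

# Option 1: with prior beta ~ N(0, tau^2 I) and noise ~ N(0, sigma^2), the
# posterior mean is (X'X + (sigma^2 / tau^2) I)^(-1) X'y, which is exactly
# the ridge formula with penalty sigma^2 / tau^2.
sigma2, tau2 = 1.0, 0.5                 # illustrative values, not estimated
beta_post_mean = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(p), X.T @ y)
print("posterior-mean coefficient norm:", np.linalg.norm(beta_post_mean))
```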