How to avoid collinearity of categorical variables in logistic regression?

Cross Validated: regression, logistic, multiple-regression, multicollinearity
2022-03-09 14:38:56

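I have the following problem: I'm performing a multiple logistic regression on several variables, each of which is measured on a nominal scale. I want to avoid multicollinearity in my regression. If the variables were continuous, I could compute the variance inflation factor (VIF) and look for variables with a high VIF. If the variables were ordinally scaled, I could compute Spearman's rank correlation coefficient for each pair of variables and compare the computed value with some threshold. But what do I do if the variables are only nominally scaled? One idea would be to perform pairwise chi-square tests of independence, but the variables don't all have the same co-domain (they differ in their sets of levels), so that would be another problem. Is there a way to solve this?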

4 Answers

I would second @EdM's comment (+1) and suggest using a regularised regression approach.

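I think that an elastic-net/ridge regression approach should allow you to deal with collinear predictors. Just be careful to normalise your feature matrix X appropriately before using it, otherwise you risk regularising each feature disproportionately (yes, I mean the 0/1 columns: you should scale them so that each column has mean 0 and unit variance).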
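
To make the scaling advice concrete, here is a minimal sketch under stated assumptions, not a definitive implementation. It dummy-codes two strongly associated nominal variables, standardises the 0/1 columns, and fits a ridge-penalised logistic regression with plain Newton steps. The data are simulated, and the helper names (`dummies`, `standardize`, `fit_ridge_logistic`) and the penalty strength `lam` are hypothetical choices made for illustration; in practice one would more likely use an existing penalised-regression package and pick the penalty by cross-validation.

```python
import numpy as np

def dummies(codes):
    """One-hot encode a nominal variable, dropping the first level as reference."""
    levels = np.unique(codes)
    return (codes[:, None] == levels[None, 1:]).astype(float)

def standardize(D):
    """Centre each column to mean 0 and scale it to unit variance."""
    return (D - D.mean(axis=0)) / D.std(axis=0)

def fit_ridge_logistic(X, y, lam=1.0, n_iter=100, tol=1e-10):
    """Newton iterations for logistic regression with penalty lam/2 * sum(slope^2).

    The intercept (first coefficient) is left unpenalised.
    """
    n, p = X.shape
    Xa = np.column_stack([np.ones(n), X])
    penalty = lam * np.eye(p + 1)
    penalty[0, 0] = 0.0
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Xa @ beta)))
        grad = Xa.T @ (y - mu) - penalty @ beta
        hess = Xa.T @ (Xa * (mu * (1.0 - mu))[:, None]) + penalty
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data (an assumption for illustration): b copies a about 90% of the time.
rng = np.random.default_rng(0)
n = 500
a = rng.integers(0, 3, size=n)
b = np.where(rng.random(n) < 0.1, rng.integers(0, 3, size=n), a)
logit = -0.3 + 0.9 * (a == 1) - 0.7 * (a == 2)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

# Standardise the dummy columns so the penalty treats them on a common scale.
X = standardize(np.column_stack([dummies(a), dummies(b)]))
beta = fit_ridge_logistic(X, y, lam=5.0)
print(np.round(beta, 3))  # intercept followed by four penalised slopes
```

Without the standardisation step, a dummy column for a rare level has a much smaller variance than one for a common level, so the same penalty would shrink their effects by different amounts.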

Clearly, you would have to cross-validate your results to ensure some notion of stability. Let me also note that instability is not a huge problem in itself, because it actually suggests that there is no obvious solution or inferential result, and that simply interpreting the output of the GLM procedure as "ground truth" is incoherent.

The VIF is still a useful measure in your case, but the condition number of your design matrix is a more common approach for categorical data.

The original reference is here:

Belsley, David A.; Kuh, Edwin; Welsch, Roy E. (1980). "The Condition Number". Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons. pp. 100–104.

And here are more useful links:
https://en.wikipedia.org/wiki/Condition_number

https://epub.ub.uni-muenchen.de/2081/1/report008_statistics.pdf
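
As a rough illustration of the condition-number check described in this answer, the sketch below builds a dummy-coded design matrix, scales each column to unit length (the scaling Belsley, Kuh and Welsch recommend), and computes condition indices from the singular values. The simulated data and the helper names (`dummies`, `design`, `condition_indices`) are assumptions made for this example. A commonly quoted rule of thumb treats condition indices above roughly 30 as a sign of serious collinearity, but that cut-off is a guideline, not a test.

```python
import numpy as np

def dummies(codes):
    """One-hot encode a nominal variable, dropping the first level as reference."""
    levels = np.unique(codes)
    return (codes[:, None] == levels[None, 1:]).astype(float)

def design(*factors):
    """Intercept column followed by the dummy columns of every factor."""
    n = len(factors[0])
    return np.column_stack([np.ones(n)] + [dummies(f) for f in factors])

def condition_indices(X):
    """Condition indices of X after scaling each column to unit Euclidean length."""
    Xs = X / np.linalg.norm(X, axis=0)
    sv = np.linalg.svd(Xs, compute_uv=False)
    return sv.max() / sv

# Simulated nominal variables (an assumption for illustration).
rng = np.random.default_rng(0)
n = 400
a = rng.integers(0, 3, size=n)
b_indep = rng.integers(0, 4, size=n)                    # unrelated to a
b_assoc = np.where(rng.random(n) < 0.05,                # copies a ~95% of the time
                   rng.integers(0, 3, size=n), a)

kappa_indep = condition_indices(design(a, b_indep)).max()
kappa_assoc = condition_indices(design(a, b_assoc)).max()
print(f"condition number, unrelated factors:  {kappa_indep:.1f}")
print(f"condition number, associated factors: {kappa_assoc:.1f}")
```

The largest condition index is the condition number of the scaled design matrix; comparing the two printed values shows how a strong association between two nominal predictors inflates it.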

Another approach would be to perform Multiple Correspondence Analysis (MCA) on your multicollinear independent variables (IVs). You will end up with orthogonal (mutually uncorrelated) components, which you can use as IVs in your model. No collinearity will be present, but it will be hard to interpret the effects of your original variables. On the other hand, if there is multicollinearity, MCA will combine the effects of your correlated IVs into more general effects, which you may find even more interpretable and plausible.
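
The MCA idea can be sketched with a plain singular value decomposition of the indicator matrix. This is a bare-bones version assuming simulated data; the function names (`indicator`, `mca_row_coordinates`) and the choice of three components are hypothetical, and a dedicated MCA package would normally add corrections and diagnostics that are omitted here.

```python
import numpy as np

def indicator(codes):
    """Full 0/1 indicator matrix for one nominal variable (one column per level)."""
    levels = np.unique(codes)
    return (codes[:, None] == levels[None, :]).astype(float)

def mca_row_coordinates(factors, n_components):
    """MCA as correspondence analysis of the stacked indicator matrix.

    Returns the row principal coordinates and the principal inertias.
    """
    Z = np.column_stack([indicator(f) for f in factors])
    P = Z / Z.sum()                     # correspondence matrix
    r = P.sum(axis=1)                   # row masses
    c = P.sum(axis=0)                   # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardised residuals
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    F = (U * s) / np.sqrt(r)[:, None]   # row principal coordinates
    return F[:, :n_components], s[:n_components] ** 2

# Simulated nominal predictors (an assumption for illustration).
rng = np.random.default_rng(0)
n = 300
a = rng.integers(0, 3, size=n)
b = np.where(rng.random(n) < 0.1, rng.integers(0, 3, size=n), a)  # mostly copies a
c3 = rng.integers(0, 4, size=n)                                   # unrelated

components, inertia = mca_row_coordinates([a, b, c3], n_components=3)
corr = np.corrcoef(components, rowvar=False)
print(np.round(corr, 6))   # off-diagonal entries are zero up to rounding
```

The columns of `components` can then replace the original dummy variables as predictors in the logistic regression, at the cost of the interpretability trade-off described above.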

You can check bivariate association between categorical variables with a rank-order or other non-parametric test. It is the same idea as checking the correlation matrix for a group of continuous variables, just with a different test.
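
One possible way to build such an association matrix for purely nominal variables is Cramér's V, a chi-square-based measure that is normalised by the table dimensions and therefore stays comparable across pairs of variables with different numbers of levels. Note that Cramér's V is a substitution made for this sketch and is not named in the answer; the data and the helper names (`contingency`, `cramers_v`) are likewise assumptions, and the uncorrected statistic is used for simplicity.

```python
import numpy as np

def contingency(x, y):
    """Cross-tabulate two nominal variables into a table of counts."""
    x_levels, xi = np.unique(x, return_inverse=True)
    y_levels, yi = np.unique(y, return_inverse=True)
    table = np.zeros((x_levels.size, y_levels.size))
    np.add.at(table, (xi, yi), 1)
    return table

def cramers_v(x, y):
    """Cramér's V (no bias correction); assumes each variable has >= 2 observed levels."""
    obs = contingency(x, y)
    n = obs.sum()
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    chi2 = ((obs - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * (min(obs.shape) - 1)))

# Simulated nominal variables with different numbers of levels (illustration only).
rng = np.random.default_rng(0)
n = 400
colour = rng.choice(np.array(["red", "green", "blue"]), size=n)
region = rng.choice(np.array(["north", "south", "east", "west"]), size=n)
# shade follows colour about 90% of the time
shade = np.where(rng.random(n) < 0.1,
                 rng.choice(np.array(["red", "green", "blue"]), size=n),
                 colour)

variables = {"colour": colour, "region": region, "shade": shade}
names = list(variables)
V = np.array([[cramers_v(variables[p], variables[q]) for q in names] for p in names])
print(names)
print(np.round(V, 3))
```

The resulting matrix can be read like a correlation matrix: values near 0 suggest little association, values near 1 suggest that one variable nearly determines the other.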