What causes a high condition number in a Python statsmodels regression?

regression · python · statsmodels · condition-number
2022-03-22 11:02:01

I'm very new to regression analysis, and I'm using Python's statsmodels to study the relationship between GDP / health / social-services expenditure and a health outcome (DALY rate) across OECD countries. Just to get a feel for the data I'm working with, here is a scatter matrix with kernel density estimates on the diagonal:

[Image: scatter matrix of the variables, with kernel density estimates on the diagonal]

When I run a simple regression of dalyrate on social_exp, the results show a warning about a high condition number:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm

poly_1 = smf.ols(formula='dalyrate ~ 1 + social_exp', data=model_df, missing='drop').fit()
print(poly_1.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               dalyrate   R-squared:                       0.253
Model:                            OLS   Adj. R-squared:                  0.248
Method:                 Least Squares   F-statistic:                     46.85
Date:                Fri, 28 Oct 2016   Prob (F-statistic):           2.30e-10
Time:                        12:56:43   Log-Likelihood:                -1336.8
No. Observations:                 140   AIC:                             2678.
Df Residuals:                     138   BIC:                             2683.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   2.705e+04    635.828     42.541      0.000      2.58e+04  2.83e+04
social_exp    -0.9303      0.136     -6.845      0.000        -1.199    -0.662
==============================================================================
Omnibus:                       34.504   Durbin-Watson:                   0.907
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               78.046
Skew:                           1.017   Prob(JB):                     1.13e-17
Kurtosis:                       6.039   Cond. No.                     1.03e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.03e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

From what I've read, multicollinearity shouldn't be an issue in a single-variable regression, so what could be causing it here? It looks like the gdp_cap and dalyrate variables are both positively skewed in the scatter matrix, so do I need to normalize or standardize the data to avoid this warning? Ideally I wouldn't have to, since the units on the coefficients help with interpreting the results, but obviously I will if necessary.
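For reference, centering can be done directly in the formula via patsy's built-in center() transform, which subtracts the column mean without rescaling, so the slope keeps its original units and only the intercept changes meaning. A minimal sketch, assuming the same model_df as above (poly_1c is just an illustrative name):

import statsmodels.formula.api as smf

# center() subtracts the mean of social_exp inside the formula; the
# slope is unchanged, but the intercept now refers to a country with
# average social expenditure rather than zero expenditure.
poly_1c = smf.ols(formula='dalyrate ~ 1 + center(social_exp)',
                  data=model_df, missing='drop').fit()
print(poly_1c.summary())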

I get a similar warning for a multivariate regression as well:

poly_2 = smf.ols(formula='dalyrate ~ 1 + social_exp + health_exp + gdp_cap', data=model_df, missing='drop').fit()
print(poly_2.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               dalyrate   R-squared:                       0.406
Model:                            OLS   Adj. R-squared:                  0.393
Method:                 Least Squares   F-statistic:                     30.98
Date:                Fri, 28 Oct 2016   Prob (F-statistic):           2.51e-15
Time:                        13:04:35   Log-Likelihood:                -1320.8
No. Observations:                 140   AIC:                             2650.
Df Residuals:                     136   BIC:                             2661.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    2.86e+04    631.270     45.308      0.000      2.74e+04  2.98e+04
social_exp     0.0379      0.208      0.182      0.855        -0.373     0.449
health_exp    -1.1555      0.889     -1.300      0.196        -2.914     0.603
gdp_cap       -0.1412      0.053     -2.662      0.009        -0.246    -0.036
==============================================================================
Omnibus:                       53.794   Durbin-Watson:                   0.663
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              157.258
Skew:                           1.490   Prob(JB):                     7.11e-35
Kurtosis:                       7.251   Cond. No.                     7.24e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.24e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
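For what it's worth, one way to check whether a warning like this reflects genuine multicollinearity rather than mere scaling is to compute variance inflation factors for the predictors. A sketch, again assuming the question's model_df:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Build the same design matrix the formula API would use.
exog = sm.add_constant(model_df[['social_exp', 'health_exp', 'gdp_cap']].dropna())

# A VIF much above ~10 is a common rule of thumb for problematic
# collinearity; the VIF of the constant term is not meaningful.
for i, name in enumerate(exog.columns):
    print(name, variance_inflation_factor(exog.values, i))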

Any help would be much appreciated!

1 Answer

I found this page while searching because I had the same problem, but I think I've worked out what is going on.

First, a demonstration of the problem:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

x = np.arange(1000., 1030., 1.)   # 30 points, all far from x = 0
y = 0.5 * x                       # exact line through the origin
X = sm.add_constant(x)            # design matrix with an intercept column
plt.plot(x, y, 'x')
plt.show()

[Image: plot of the demo data, y against x with x running from 1000 to 1030]

mod_ols = sm.OLS(y, X)
res_ols = mod_ols.fit()
print(res_ols.summary())

[Image: OLS summary for this fit, showing Cond. No. 1.19e+05]

Note the very high condition number of 1.19e+05. This arises because we fit a line to the points and then project that line all the way back to the origin (x = 0) to find the y-intercept. That intercept is extremely sensitive to small shifts in the data points. The condition number reflects how sensitive either fitted parameter is to the input data, so it is high when all of the data sit far off to one side of x = 0.
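The number in the summary can be reproduced directly: statsmodels reports the 2-norm condition number of the design matrix, i.e. the square root of the ratio of the largest to the smallest eigenvalue of X'X. A quick check against the demo fit above:

# Both lines print the same value, about 1.19e+05 for the uncentered data.
print(np.linalg.cond(X))          # condition number of the design matrix X
print(res_ols.condition_number)   # the value reported in the summary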

To fix the problem, we simply center the x values:

x_mean = np.average(x)   # save the mean so the shift can be undone later
x -= x_mean              # center x at zero
X = sm.add_constant(x)
plt.plot(x, y, 'x')
plt.show()
res_centered = sm.OLS(y, X).fit()   # refit on the centered data
print(res_centered.summary())

[Image: plot of y against the centered x values]

[Image: OLS summary for the centered fit, showing Cond. No. 8.66]

The condition number is now greatly reduced, to only 8.66. Note that the fitted slope, the computed R², and so on are unchanged.

My conclusion: when regressing against a single variable, don't worry about a high condition number unless you care about how sensitive the y-intercept is to the input data. If you do care, center the x values first.
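And if the uncentered intercept is ever needed, it can be recovered from the centered fit; a short check using res_centered and x_mean from the code above:

icept_centered, slope = res_centered.params   # [const, slope], the order add_constant produces
# Shifting x by x_mean shifts the intercept by slope * x_mean:
icept_original = icept_centered - slope * x_mean
print(icept_original)   # ~0.0 here, matching the uncentered fit of y = 0.5*x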