我对回归分析很陌生,我正在使用 python 的 statsmodels 来研究经合组织的 GDP/健康/社会服务支出与健康结果 (DALY) 之间的关系。只是为了了解我正在使用的数据,这是一个散点矩阵,对角线是核密度估计:
dalyrate
当我运行on的简单回归时social_exp
,结果显示关于高条件数的警告:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm
poly_1 = smf.ols(formula='dalyrate ~ 1 + social_exp', data=model_df, missing='drop').fit()
print poly_1.summary()
OLS Regression Results
==============================================================================
Dep. Variable: dalyrate R-squared: 0.253
Model: OLS Adj. R-squared: 0.248
Method: Least Squares F-statistic: 46.85
Date: Fri, 28 Oct 2016 Prob (F-statistic): 2.30e-10
Time: 12:56:43 Log-Likelihood: -1336.8
No. Observations: 140 AIC: 2678.
Df Residuals: 138 BIC: 2683.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.705e+04 635.828 42.541 0.000 2.58e+04 2.83e+04
social_exp -0.9303 0.136 -6.845 0.000 -1.199 -0.662
==============================================================================
Omnibus: 34.504 Durbin-Watson: 0.907
Prob(Omnibus): 0.000 Jarque-Bera (JB): 78.046
Skew: 1.017 Prob(JB): 1.13e-17
Kurtosis: 6.039 Cond. No. 1.03e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.03e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
根据我的阅读,多重共线性不应该是单变量回归的问题,那么可能导致这个问题的原因是什么?看起来gdp_cap
和dalyrate
变量在散点矩阵中都呈正偏态,所以我是否需要对数据进行标准化或标准化以防止出现此警告?如果我不需要,那将是理想的,因为系数上的单位有助于解释结果,但显然我会在必要时这样做。
对于多元回归,我也收到了类似的警告:
poly_2 = smf.ols(formula='dalyrate ~ 1 + social_exp + health_exp + gdp_cap', data=model_df, missing='drop').fit()
print poly_2.summary()
OLS Regression Results
==============================================================================
Dep. Variable: dalyrate R-squared: 0.406
Model: OLS Adj. R-squared: 0.393
Method: Least Squares F-statistic: 30.98
Date: Fri, 28 Oct 2016 Prob (F-statistic): 2.51e-15
Time: 13:04:35 Log-Likelihood: -1320.8
No. Observations: 140 AIC: 2650.
Df Residuals: 136 BIC: 2661.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.86e+04 631.270 45.308 0.000 2.74e+04 2.98e+04
social_exp 0.0379 0.208 0.182 0.855 -0.373 0.449
health_exp -1.1555 0.889 -1.300 0.196 -2.914 0.603
gdp_cap -0.1412 0.053 -2.662 0.009 -0.246 -0.036
==============================================================================
Omnibus: 53.794 Durbin-Watson: 0.663
Prob(Omnibus): 0.000 Jarque-Bera (JB): 157.258
Skew: 1.490 Prob(JB): 7.11e-35
Kurtosis: 7.251 Cond. No. 7.24e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.24e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
任何帮助,将不胜感激!