How to include x and x² into a regression, and whether to center them?

Tags: regression, multiple-regression, polynomial, centering
2022-03-27 16:01:05

I want to include the term x and its square x² as predictor variables in a regression, because I assume that low values of x have a positive effect on the dependent variable and high values have a negative effect. The x² term should capture the effect of the higher values. I therefore expect the coefficient of x to be positive and the coefficient of x² to be negative. Besides x, I also include other predictor variables.

I read in some posts here that it is a good idea to center the variables in this case to avoid multicollinearity. When conducting multiple regression, when should you center your predictor variables & when should you standardize them?

  1. Should I center both variables separately (at the mean), or should I center only x and then take the square, or should I center only x² and include the original x?

  2. Is it a problem if x is a count variable?

To avoid x being a count variable, I thought about dividing it by a theoretically defined area, for example 5 square kilometers. This would be somewhat similar to a point-density calculation.

However, I am afraid that in this situation my initial assumption about the sign of the coefficients would no longer hold. For example, when x = 2 and x² = 4, dividing by the area gives

x = 2/5 = 0.4 (per km²)

but x² would then be smaller than x, because x² = (2/5)² = 0.16.
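This scaling worry can be checked directly with a small simulation (all numbers below are made up for illustration): dividing x by a constant rescales the estimated coefficients but cannot change their signs or the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical count data: outcome rises for low x, falls for high x
x = rng.integers(0, 10, size=200).astype(float)
y = 3.0 * x - 0.4 * x**2 + rng.normal(0, 1, size=200)

def quad_fit(x, y):
    """OLS fit of y = b0 + b1*x + b2*x^2 via least squares."""
    X = np.column_stack([np.ones_like(x), x, x**2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_raw = quad_fit(x, y)           # original counts
b_scaled = quad_fit(x / 5.0, y)  # counts per 5 km^2

# Coefficients rescale exactly (b1 by 5, b2 by 25); signs are unchanged
print(np.sign(b_raw[1:]) == np.sign(b_scaled[1:]))  # [ True  True]
print(np.allclose(b_scaled[1], 5 * b_raw[1]))       # True
print(np.allclose(b_scaled[2], 25 * b_raw[2]))      # True
```

So the fact that (2/5)² < 2/5 does not matter: the x² coefficient absorbs the factor of 25, and the expected positive/negative pattern survives the rescaling.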

3 Answers

Your question in fact comprises several sub-questions, which I will try to address to the best of my understanding.

  • How to distinguish the dependence on low versus high values of x in a regression?

Considering x and x² is one way of doing it, but are you sure your test is conclusive? Will you be able to conclude something useful for all possible outcomes of the regression? I think posing the question clearly beforehand helps, as does asking similar and related questions. For instance, you can consider a threshold of x above which the regression slope is different. This can be done using moderator variables. If the two slopes (while imposing the same intercept) are compatible, then you have no difference; otherwise you have provided yourself a clear argument for their difference.
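As a sketch of the moderator-variable idea (the threshold value and the data below are made up), a "hinge" interaction term lets the slope change above a chosen threshold c while keeping the fit continuous:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
# Simulated data: slope +2 below x = 5, slope -1 above (hypothetical threshold)
y = np.where(x < 5, 2 * x, 10 - (x - 5)) + rng.normal(0, 0.5, 300)

c = 5.0
d = (x > c).astype(float)  # moderator dummy: 1 above the threshold
# Columns: intercept, x, and a hinge term d*(x - c) that activates above c
X = np.column_stack([np.ones_like(x), x, d * (x - c)])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Slope below c is b1; slope above c is b1 + b2
print(round(b1, 2), round(b1 + b2, 2))  # roughly 2 and -1
```

If the confidence interval for b2 covers zero, the two slopes are compatible; otherwise you have direct evidence that low and high values of x act differently.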

  • When should you center and standardize?

I think this question should not be mixed with the first question and test, and I'm afraid that centering around x or x² beforehand might bias the results. I would advise not to center, at least in a first stage. Remember that you will probably not die of multicollinearity; many authors argue it is just equivalent to working with a smaller sample size (here and here).
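For what it's worth, the collinearity that centering removes is easy to see in a small simulation (the numbers are arbitrary): for a positive predictor, x and x² are almost perfectly correlated, while after centering the correlation nearly vanishes.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 1000)  # a positive predictor, e.g. count-like

# Correlation between the raw variable and its square
r_raw = np.corrcoef(x, x**2)[0, 1]

# Center first, then square: correlation largely disappears
xc = x - x.mean()
r_centered = np.corrcoef(xc, xc**2)[0, 1]

print(round(r_raw, 3), round(r_centered, 3))  # high vs. near zero
```

This is the whole effect of centering here: it rotates the two columns apart numerically without changing what the quadratic model can fit.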

  • Does transforming the discrete count variable into a (continuous) floating-point variable change the interpretation of the results?

Yes it will, but this depends heavily on the first two points, so I would suggest addressing one thing at a time. I see no reason why the regression would not work without this transformation, so I would advise ignoring it for now. Note also that by dividing by a common factor you are changing the point at which x² = x, but there are completely different ways of looking at it, like the one I wrote above, in which such a threshold is considered in a more explicit way.

In general centering could help to reduce multicollinearity, but "you will probably not die of multicollinearity" (see predrofigueira's answer).

Most important, centering is often needed to make the intercept meaningful. In the simple model y_i = α + βx_i + ε, the intercept is defined as the expected outcome for x = 0. If a value of zero is not meaningful for x, neither is the intercept. It is often useful to center the variable x around its mean; in that case the predictor takes the form (x_i − x̄), and the intercept α is the expected outcome for a subject whose value of x_i equals the mean x̄.

In such cases, you must center x and then square it. You cannot center x and x² separately, because you are regressing the outcome on a "new" variable, (x_i − x̄), so you must square this new variable: (x_i − x̄)². What would centering x² separately even mean?
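A quick check with simulated data (the numbers are arbitrary) that center-then-square is only a reparameterization of the raw quadratic model: the fitted values are identical, so nothing is lost.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 3 * x - 0.4 * x**2 + rng.normal(0, 1, 200)

def fitted(cols, y):
    """Fitted values of an OLS regression on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

xc = x - x.mean()
yhat_raw = fitted([x, x**2], y)    # raw x and x^2
yhat_ctr = fitted([xc, xc**2], y)  # center first, then square

# (x - m)^2 = x^2 - 2mx + m^2 lies in the span of {1, x, x^2},
# so both designs span the same space and fit identically
print(np.allclose(yhat_raw, yhat_ctr))  # True
```

Only the coefficients and their interpretation change; the model itself does not.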

You can center a count variable, if its mean is meaningful, but you could also just shift it. For example, if x = 1, 2, 3, 4, 5 and "2" could be a baseline, you can subtract 2: (x_i − 2) = −1, 0, 1, 2, 3. The intercept then becomes the expected outcome for a subject whose value of x_i equals "2", a reference value.
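The same can be verified numerically (simulated counts; the baseline "2" is arbitrary): shifting the predictor leaves the slope untouched and moves the intercept to the fitted value at the reference point.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.integers(1, 6, 100).astype(float)  # hypothetical counts 1..5
y = 1.5 * x + rng.normal(0, 1, 100)

def ols(x, y):
    """OLS of y on an intercept and x; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

a, b = ols(x, y)        # intercept = expected y at x = 0
a2, b2 = ols(x - 2, y)  # intercept = expected y at x = 2

print(np.allclose(b, b2))          # slope unchanged
print(np.allclose(a2, a + 2 * b))  # new intercept = old fit evaluated at x = 2
```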

As to dividing, there is no trouble: your estimated coefficients simply become larger! Gelman and Hill, §4.1, give an example:

earnings = −61000 + 1300·height (in inches) + error
earnings = −61000 + 51·height (in millimeters) + error
earnings = −61000 + 81000000·height (in miles) + error

One inch is 25.4 millimeters, so 51 is 1300/25.4. One inch is 1.6e−5 miles, so 81000000 is 1300/1.6e−5. But these three equations are entirely equivalent.
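The arithmetic can be verified directly:

```python
# Unit conversion of a regression slope: changing the predictor's unit
# rescales the coefficient by the same conversion factor.
b_per_inch = 1300.0

b_per_mm = b_per_inch / 25.4      # 1 inch = 25.4 mm
b_per_mile = b_per_inch / 1.6e-5  # 1 inch is about 1.6e-5 miles

print(round(b_per_mm))    # 51
print(round(b_per_mile))  # 81250000, i.e. about 81,000,000
```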

"I assume that low values of x have a positive effect on the dependent variable and high values have a negative effect."

While I appreciate the others' treatment of centering and of the interpretation of coefficients, what you've described here is simply a linear effect: the outcome is higher at low values of x and lower at high values. In other words, nothing in your description indicates any need to test the square of x.