How to include x and x² into a regression, and whether to center them?

Tags: regression, multiple-regression, polynomial, centering
2022-03-27 16:01:05

I want to include the term x and its square x² as predictor variables in a regression, because I assume that low values of x have a positive effect on the dependent variable and high values have a negative effect. The x² term should capture the effect of the higher values. I therefore expect the coefficient of x to be positive and the coefficient of x² to be negative. Besides x, I also include other predictor variables.

I read in some posts here that it is a good idea to center the variables in this case to avoid multicollinearity. When conducting multiple regression, when should you center your predictor variables & when should you standardize them?

  1. Should I center both variables separately (at the mean), or should I center only x and then take the square, or should I center only x² and include the original x?

  2. Is it a problem if x is a count variable?

To avoid x being a count variable, I thought about dividing it by a theoretically defined area, for example 5 square kilometers. This would be somewhat similar to a point-density calculation.

However, I am afraid that in this situation my initial assumption about the sign of the coefficients would no longer hold. For example, when x = 2 and x² = 4, dividing by the area gives

x = 2/5 = 0.4 (per km²)

but x² would then be smaller than x, because x² = (2/5)² = 0.16.
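This scaling worry can be checked directly with a small simulation (all numbers below are made up for illustration): dividing x by a constant rescales the estimated coefficients but cannot change their signs or the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical count data: outcome rises for low x, falls for high x
x = rng.integers(0, 10, size=200).astype(float)
y = 3.0 * x - 0.4 * x**2 + rng.normal(0, 1, size=200)

def quad_fit(x, y):
    """OLS fit of y = b0 + b1*x + b2*x^2 via least squares."""
    X = np.column_stack([np.ones_like(x), x, x**2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_raw = quad_fit(x, y)           # original counts
b_scaled = quad_fit(x / 5.0, y)  # counts per 5 km^2

# Coefficients rescale exactly (b1 by 5, b2 by 25); signs are unchanged
print(np.sign(b_raw[1:]) == np.sign(b_scaled[1:]))  # [ True  True]
print(np.allclose(b_scaled[1], 5 * b_raw[1]))       # True
print(np.allclose(b_scaled[2], 25 * b_raw[2]))      # True
```

So the fact that (2/5)² < 2/5 does not matter: the x² coefficient absorbs the factor of 25, and the expected positive/negative pattern survives the rescaling.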

3 Answers

Your question in fact comprises several sub-questions, which I will try to address to the best of my understanding.

  • How to distinguish the dependence on low versus high values of x in a regression?

Considering x and x² is one way of doing it, but are you sure your test is conclusive? Will you be able to conclude something useful for all possible outcomes of the regression? I think posing the question clearly beforehand helps, as does asking similar and related questions. For instance, you can consider a threshold of x above which the regression slope is different. This can be done using moderator variables. If the two slopes (while imposing the same intercept) are compatible, then you have no difference; otherwise you have provided yourself a clear argument for their difference.
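As a sketch of the moderator-variable idea (the threshold value and the data below are made up), a "hinge" interaction term lets the slope change above a chosen threshold c while keeping the fit continuous:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
# Simulated data: slope +2 below x = 5, slope -1 above (hypothetical threshold)
y = np.where(x < 5, 2 * x, 10 - (x - 5)) + rng.normal(0, 0.5, 300)

c = 5.0
d = (x > c).astype(float)  # moderator dummy: 1 above the threshold
# Columns: intercept, x, and a hinge term d*(x - c) that activates above c
X = np.column_stack([np.ones_like(x), x, d * (x - c)])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Slope below c is b1; slope above c is b1 + b2
print(round(b1, 2), round(b1 + b2, 2))  # roughly 2 and -1
```

If the confidence interval for b2 covers zero, the two slopes are compatible; otherwise you have direct evidence that low and high values of x act differently.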

  • When should you center and standardize?

I think this question should not be mixed with the first question and test, and I'm afraid that centering around x or x² beforehand might bias the results. I would advise not to center, at least in a first stage. Remember that you will probably not die of multicollinearity; many authors argue it is just equivalent to working with a smaller sample size (here and here).
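For what it's worth, the collinearity that centering removes is easy to see in a small simulation (the numbers are arbitrary): for a positive predictor, x and x² are almost perfectly correlated, while after centering the correlation nearly vanishes.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 1000)  # a positive predictor, e.g. count-like

# Correlation between the raw variable and its square
r_raw = np.corrcoef(x, x**2)[0, 1]

# Center first, then square: correlation largely disappears
xc = x - x.mean()
r_centered = np.corrcoef(xc, xc**2)[0, 1]

print(round(r_raw, 3), round(r_centered, 3))  # high vs. near zero
```

This is the whole effect of centering here: it rotates the two columns apart numerically without changing what the quadratic model can fit.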

  • Does transforming the discrete count variable into a (continuous) floating-point variable change the interpretation of the results?

Yes it will, but this depends heavily on the first two points, so I would suggest addressing one thing at a time. I see no reason why the regression would not work without this transformation, so I would advise ignoring it for now. Note also that by dividing by a common factor you are changing the point at which x² = x, but there are completely different ways of looking at it, like the one I wrote above, in which such a threshold is considered in a more explicit way.

In general centering could help to reduce multicollinearity, but "you will probably not die of multicollinearity" (see predrofigueira's answer).

Most important, centering is often needed to make the intercept meaningful. In the simple model y_i = α + βx_i + ε, the intercept is defined as the expected outcome for x = 0. If a value of zero is not meaningful for x, neither is the intercept. It is often useful to center the variable x around its mean; in that case the predictor takes the form (x_i − x̄), and the intercept α is the expected outcome for a subject whose value of x_i equals the mean x̄.

In such cases, you must center x and then square it. You cannot center x and x² separately, because you are regressing the outcome on a "new" variable, (x_i − x̄), so you must square this new variable: (x_i − x̄)². What would centering x² separately even mean?
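A quick check with simulated data (the numbers are arbitrary) that center-then-square is only a reparameterization of the raw quadratic model: the fitted values are identical, so nothing is lost.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 3 * x - 0.4 * x**2 + rng.normal(0, 1, 200)

def fitted(cols, y):
    """Fitted values of an OLS regression on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

xc = x - x.mean()
yhat_raw = fitted([x, x**2], y)    # raw x and x^2
yhat_ctr = fitted([xc, xc**2], y)  # center first, then square

# (x - m)^2 = x^2 - 2mx + m^2 lies in the span of {1, x, x^2},
# so both designs span the same space and fit identically
print(np.allclose(yhat_raw, yhat_ctr))  # True
```

Only the coefficients and their interpretation change; the model itself does not.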

You can center a count variable, if its mean is meaningful, but you could also just shift it. For example, if x = 1, 2, 3, 4, 5 and "2" could be a baseline, you can subtract 2: (x_i − 2) = −1, 0, 1, 2, 3. The intercept then becomes the expected outcome for a subject whose value of x_i equals "2", a reference value.
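The same can be verified numerically (simulated counts; the baseline "2" is arbitrary): shifting the predictor leaves the slope untouched and moves the intercept to the fitted value at the reference point.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.integers(1, 6, 100).astype(float)  # hypothetical counts 1..5
y = 1.5 * x + rng.normal(0, 1, 100)

def ols(x, y):
    """OLS of y on an intercept and x; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

a, b = ols(x, y)        # intercept = expected y at x = 0
a2, b2 = ols(x - 2, y)  # intercept = expected y at x = 2

print(np.allclose(b, b2))          # slope unchanged
print(np.allclose(a2, a + 2 * b))  # new intercept = old fit evaluated at x = 2
```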

As to dividing, there is no trouble: your estimated coefficients simply become larger! Gelman and Hill, §4.1, give an example:

earnings = −61000 + 1300·height (in inches) + error
earnings = −61000 + 51·height (in millimeters) + error
earnings = −61000 + 81000000·height (in miles) + error

One inch is 25.4 millimeters, so 51 is 1300/25.4. One inch is 1.6e−5 miles, so 81000000 is 1300/1.6e−5. But these three equations are entirely equivalent.
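The arithmetic can be verified directly:

```python
# Unit conversion of a regression slope: changing the predictor's unit
# rescales the coefficient by the same conversion factor.
b_per_inch = 1300.0

b_per_mm = b_per_inch / 25.4      # 1 inch = 25.4 mm
b_per_mile = b_per_inch / 1.6e-5  # 1 inch is about 1.6e-5 miles

print(round(b_per_mm))    # 51
print(round(b_per_mile))  # 81250000, i.e. about 81,000,000
```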

"I assume that low values of x have a positive effect on the dependent variable and high values have a negative effect."

While I appreciate the others' treatment of centering and of the interpretation of coefficients, what you've described here is simply a linear effect: the outcome is higher at low values of x and lower at high values. In other words, nothing in your description indicates any need to test the square of x.