What is the relationship between the function $E(Y \mid X = x)$ and the classical linear regression model?

regression linear-model
2022-03-12 15:21:39

Consider the function

$$r(x) = E(Y \mid X = x)$$

This has been called the regression function in a textbook I'm using. I'm trying to figure out the relationship between this function and the classical linear regression model.

So, I know that it is a theorem* that we may write

$$Y = r(X) + \epsilon$$

for some random variable $\epsilon$ with $E(\epsilon) = 0$.

Now suppose that we have

$$Y = \beta_0 + \beta_1 X + \epsilon$$

This is the classical 1-dimensional regression function (assuming the $\beta_0$, $\beta_1$ …)

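Question: Is it then a mathematical theorem that if $Y = \beta_0 + \beta_1 X + \epsilon$, then

$$r(X) = E(Y \mid X) = \beta_0 + \beta_1 X\,?$$

And is this why the function $E(Y \mid X)$ is called the regression function?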

EDIT: The theorem that I am making use of is as follows (from All of Statistics pg. 89):

Regression models are sometimes written as

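$$Y = r(X) + \epsilon$$

where $E(\epsilon) = 0$. To see this, define $\epsilon = Y - r(X)$, so that $Y = Y + r(X) - r(X) = r(X) + \epsilon$. Moreover,

$$E(\epsilon) = E\big(E(\epsilon \mid X)\big) = E\big(E(Y - r(X) \mid X)\big) = E\big(E(Y \mid X) - r(X)\big) = E\big(r(X) - r(X)\big) = 0.$$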
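The quoted argument can be checked numerically. The following is a minimal sketch, assuming a made-up nonlinear example (discrete $X$, $Y = X^2$ plus noise with mean 1) that does not come from the textbook; the names `r_hat`, `eps` and `mean_eps` are illustrative. It estimates $r(x)$ by the average of $Y$ within each value of $X$ and confirms that $\epsilon = Y - r(X)$ averages to zero even though the relationship is not linear.

```python
import random

rng = random.Random(1)
n = 60_000

# Hypothetical nonlinear example: Y = X^2 + noise, where the noise has mean 1,
# so the true regression function is r(x) = x^2 + 1.
xs = [rng.choice([-1, 0, 2]) for _ in range(n)]
ys = [x * x + rng.expovariate(1.0) for x in xs]

# Estimate r(x) = E(Y | X = x) by the average of Y within each value of X.
groups = {}
for x, y in zip(xs, ys):
    groups.setdefault(x, []).append(y)
r_hat = {x: sum(v) / len(v) for x, v in groups.items()}

# Define eps = Y - r(X), exactly as in the quoted argument.
eps = [y - r_hat[x] for x, y in zip(xs, ys)]
mean_eps = sum(eps) / n

print("estimated r(x):", {x: round(m, 3) for x, m in sorted(r_hat.items())})
print("mean of eps:", mean_eps)
```

In the sample, the mean of `eps` is zero up to floating-point rounding by construction, which mirrors the population identity $E(\epsilon) = 0$.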

1 Answer

Summarizing the question:

Given $Y = \beta_0 + \beta_1 X + \varepsilon$, is it then a mathematical theorem that $r(X) = E(Y \mid X) = \beta_0 + \beta_1 X$?

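Yes, by basic properties of conditional expectation, provided the error satisfies $E(\varepsilon \mid X) = 0$ (the usual regression assumption, which holds for instance when $\varepsilon$ has mean zero and is independent of $X$):

$$
\begin{aligned}
E(Y \mid X) &= E(\beta_0 + \beta_1 X + \varepsilon \mid X) \\
&= E(\beta_0 \mid X) + E(\beta_1 X \mid X) + E(\varepsilon \mid X) && \text{(linearity of expectation)} \\
&= \beta_0 + \beta_1 X + 0 && \text{(noting that $X$ is constant here because we conditioned on it)} \\
&= \beta_0 + \beta_1 X
\end{aligned}
$$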
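A small simulation can make this concrete. This is only a sketch under assumed values: the coefficients `beta0 = 2.0` and `beta1 = 0.5`, the discrete distribution of $X$, and the helper names `simulate` and `conditional_means` are all illustrative, and the noise is drawn independently of $X$ so that it has conditional mean zero. The empirical average of $Y$ within each value of $X$ should then land close to $\beta_0 + \beta_1 x$.

```python
import random

def simulate(beta0, beta1, n, seed=0):
    """Draw n pairs (x, y) with y = beta0 + beta1 * x + eps, eps independent of x."""
    rng = random.Random(seed)
    # X is discrete here so that conditioning on X = x is exact.
    xs = [rng.choice([0, 1, 2, 3]) for _ in range(n)]
    ys = [beta0 + beta1 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def conditional_means(xs, ys):
    """Empirical estimate of r(x) = E(Y | X = x): the average y within each x."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[x] = sums.get(x, 0.0) + y
        counts[x] = counts.get(x, 0) + 1
    return {x: sums[x] / counts[x] for x in sums}

beta0, beta1 = 2.0, 0.5  # illustrative values, not taken from the question
xs, ys = simulate(beta0, beta1, n=200_000)
r_hat = conditional_means(xs, ys)

# Compare the estimated conditional mean with beta0 + beta1 * x for each x.
for x in sorted(r_hat):
    print(x, round(r_hat[x], 3), beta0 + beta1 * x)
```

The two printed columns agree up to sampling noise, which is what the derivation predicts when the error has conditional mean zero.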

The historical reasons for regression being called regression relate to Galton noticing the "regression to the mean" effect, initially in an experiment on plants comparing the seed size of offspring with the seed size of parents. A line through the means of both variables will have slope less than 1 (a slope that can be estimated by what we call linear regression). The smaller the slope, the stronger the "regression" effect. The issue is illustrated by Galton in the linked pdf using the heights of children (as adults) compared to the average heights of their parents (female heights being scaled up by a constant factor of 8% to make them comparable to male heights). The diagrams on the third to fifth pages indicate something of what was observed.

So what came to be called linear regression is a way of estimating the size of this "regression to the mean". Of course there's nothing special going on: the regression to the mean isn't some special biological "drive to mediocrity", as might originally have been guessed, but a fairly simple consequence of the mathematics of the situation, in essentially the same sense that correlations are always between $-1$ and $1$.
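To illustrate the link between the slope and the correlation, here is a rough sketch with simulated data. The value `rho = 0.5` and the standardised "parent" and "child" variables are assumptions made for illustration, not Galton's measurements. The least-squares slope equals the correlation times the ratio of standard deviations, so when both variables have the same spread the slope is just the correlation and cannot exceed 1 in absolute value.

```python
import math
import random

rng = random.Random(2)
rho, n = 0.5, 100_000  # rho is an illustrative choice, not a value from Galton's data

# Standardised "parent" and "child" values with equal variance and correlation rho.
parents = [rng.gauss(0.0, 1.0) for _ in range(n)]
children = [rho * p + math.sqrt(1 - rho**2) * rng.gauss(0.0, 1.0) for p in parents]

def mean(v):
    return sum(v) / len(v)

mp, mc = mean(parents), mean(children)
cov = mean([(p - mp) * (c - mc) for p, c in zip(parents, children)])
sd_p = math.sqrt(mean([(p - mp) ** 2 for p in parents]))
sd_c = math.sqrt(mean([(c - mc) ** 2 for c in children]))

corr = cov / (sd_p * sd_c)
slope = cov / sd_p**2  # least-squares slope of child on parent

print("correlation:", round(corr, 3))
print("slope:", round(slope, 3), "= corr * sd_c / sd_p:", round(corr * sd_c / sd_p, 3))
```

With equal spreads in both generations, a parent one unit above the mean predicts a child only about `rho` units above the mean, which is the "regression" effect described above.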