James-Stein estimator: How did Efron and Morris calculate $\sigma^2$ in the shrinkage factor for their baseball example?

Cross Validated: estimation regularization steins-phenomenon
2022-01-18 19:18:46

I have a question on calculating James-Stein Shrinkage factor in the 1977 Scientific American paper by Bradley Efron and Carl Morris, "Stein's Paradox in Statistics".

I gathered the data for the baseball players and it is given below:

Name, avg45, avgSeason    
Clemente, 0.400, 0.346    
Robinson, 0.378, 0.298    
Howard, 0.356, 0.276    
Johnstone, 0.333, 0.222    
Berry, 0.311, 0.273    
Spencer, 0.311, 0.270    
Kessinger, 0.289, 0.263    
Alvarado, 0.267, 0.210    
Santo, 0.244, 0.269    
Swoboda, 0.244, 0.230    
Unser, 0.222, 0.264    
Williams, 0.222, 0.256    
Scott, 0.222, 0.303    
Petrocelli, 0.222, 0.264    
Rodriguez, 0.222, 0.226    
Campaneris, 0.200, 0.285    
Munson, 0.178, 0.316    
Alvis, 0.156, 0.200

avg45 is the average after 45 at bats and is denoted as y in the article. avgSeason is the end of the season average.

The James-Stein estimator for the average, $z$, is given by

$$z = \bar{y} + c(y - \bar{y}),$$

and the shrinkage factor $c$ is given by (page 5 of the 1977 Scientific American article)

$$c = 1 - \frac{(k-3)\sigma^2}{\sum (y - \bar{y})^2},$$

where $k$ is the number of unknown means. Here there are 18 players, so $k = 18$. I can calculate $\sum (y - \bar{y})^2$ using the avg45 values, but I don't know how to calculate $\sigma^2$. The authors say $c = 0.212$ for this data set.

I tried using both $\sigma_x^2$ and $\sigma_y^2$ for $\sigma^2$, but neither gives the correct answer of $c = 0.212$.

Can anybody be kind enough to let me know how to calculate $\sigma^2$ for this data set?

2 Answers

The parameter $\sigma^2$ is the (unknown) common variance of the vector components, each of which we assume to be normally distributed. For the baseball data we have $45Y_i \sim \text{Binomial}(45, p_i)$, so the normal approximation to the binomial distribution gives (taking $\hat{p}_i = Y_i$)

$$\hat{p}_i \sim \mathcal{N}\bigl(p_i,\; p_i(1-p_i)/45\bigr).$$

Obviously in this case the variances are not equal, yet if they had been equal to a common value then we could estimate it with the pooled estimator

$$\hat{\sigma}^2 = \frac{\hat{p}(1-\hat{p})}{45},$$

where $\hat{p}$ is the grand mean

$$\hat{p} = \frac{1}{18 \cdot 45}\sum_{i=1}^{18} 45Y_i = \bar{Y}.$$
It looks as though this is what Efron and Morris have done (in the 1977 paper).

You can check this with the following R code. Here are the data:

y <- c(0.4, 0.378, 0.356, 0.333, 0.311, 0.311, 0.289, 0.267, 0.244, 0.244, 0.222, 0.222, 0.222, 0.222, 0.222, 0.2, 0.178, 0.156)

and here is the estimate for $\sigma^2$:

s2 <- mean(y)*(1 - mean(y))/45

which is $\hat{\sigma}^2 \approx 0.004332392$. The shrinkage factor in the paper is then

1 - 15*s2/(17*var(y))   # note: 17*var(y) = sum((y - mean(y))^2)

which gives $c \approx 0.2123905$. Note that in the second paper they made a transformation to sidestep the unequal-variance problem (as @Wolfgang said). Also note that in the 1975 paper they used $k-2$, while in the 1977 paper they used $k-3$.
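To finish the calculation (a sketch in the same spirit as the code above, not code from the paper), plugging this $c$ into the estimator $z = \bar{y} + c(y - \bar{y})$ from the question gives the shrunken averages on the raw scale:

```r
# James-Stein estimates on the raw scale: z = ybar + c * (y - ybar)
y  <- c(0.4, 0.378, 0.356, 0.333, 0.311, 0.311, 0.289, 0.267, 0.244,
        0.244, 0.222, 0.222, 0.222, 0.222, 0.222, 0.2, 0.178, 0.156)
s2 <- mean(y) * (1 - mean(y)) / 45                    # pooled binomial variance
c  <- 1 - (length(y) - 3) * s2 / sum((y - mean(y))^2) # shrinkage factor, ~0.212
z  <- mean(y) + c * (y - mean(y))                     # shrink toward the grand mean
round(z[1], 3)                                        # Clemente: 0.294
```

For Clemente this gives $0.294$, matching the value reported in the 1977 article.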

I am not quite sure about the $c = 0.212$, but the following article provides a much more detailed description of these data:

Efron, B., & Morris, C. (1975). Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association, 70(350), 311-319 (link to pdf)

or, in more detail:

Efron, B., & Morris, C. (1974). Data analysis using Stein's estimator and its generalizations. R-1394-OEO, The RAND Corporation, March 1974 (link to pdf).

On page 312, you will see that Efron & Morris use an arcsine transformation of these data, so that the variance of the transformed batting averages is approximately unity:

> dat <- read.table("data.txt", header=T, sep=",")
> yi  <- dat$avg45
> k   <- length(yi)
> yi  <- sqrt(45) * asin(2*yi-1)
> c   <- 1 - (k-3)*1 / sum((yi - mean(yi))^2)
> c
[1] 0.2091971

Then they use $c = 0.209$ for the computation of the $z$ values, which we can easily back-transform:

> zi  <- mean(yi) + c * (yi - mean(yi))
> round((sin(zi/sqrt(45)) + 1)/2,3) ### back-transformation
[1] 0.290 0.286 0.282 0.277 0.273 0.273 0.268 0.264 0.259
[10] 0.259 0.254 0.254 0.254 0.254 0.254 0.249 0.244 0.239

So these are the values of the Stein estimator. For Clemente, we get .290, which is quite close to the .294 from the 1977 article.
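As a final sanity check (my own sketch, not a computation taken from either paper), we can compare the total squared error of the raw 45-at-bat averages and of the back-transformed Stein estimates against the end-of-season averages from the question's table:

```r
# Compare total squared error against the end-of-season averages
y      <- c(0.4, 0.378, 0.356, 0.333, 0.311, 0.311, 0.289, 0.267, 0.244,
            0.244, 0.222, 0.222, 0.222, 0.222, 0.222, 0.2, 0.178, 0.156)
season <- c(0.346, 0.298, 0.276, 0.222, 0.273, 0.270, 0.263, 0.210, 0.269,
            0.230, 0.264, 0.256, 0.303, 0.264, 0.226, 0.285, 0.316, 0.200)
ti <- sqrt(45) * asin(2 * y - 1)                     # arcsine transform
c  <- 1 - (length(ti) - 3) / sum((ti - mean(ti))^2)  # shrinkage factor, ~0.209
zi <- mean(ti) + c * (ti - mean(ti))                 # shrink on transformed scale
z  <- (sin(zi / sqrt(45)) + 1) / 2                   # back-transform
sum((y - season)^2)                                  # raw averages: about 0.075
sum((z - season)^2)                                  # Stein estimates: about 0.022
```

The Stein estimates come out roughly 3.5 times more accurate in total squared error, in line with the improvement factor Efron and Morris report.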