James-Stein estimator: How did Efron and Morris calculate $\sigma^2$ in the shrinkage factor for their baseball example?

Cross Validated: estimation regularization steins-phenomenon
2022-01-18 19:18:46

I have a question on calculating James-Stein Shrinkage factor in the 1977 Scientific American paper by Bradley Efron and Carl Morris, "Stein's Paradox in Statistics".

I gathered the data for the baseball players and it is given below:

Name, avg45, avgSeason    
Clemente, 0.400, 0.346    
Robinson, 0.378, 0.298    
Howard, 0.356, 0.276    
Johnstone, 0.333, 0.222    
Berry, 0.311, 0.273    
Spencer, 0.311, 0.270    
Kessinger, 0.289, 0.263    
Alvarado, 0.267, 0.210    
Santo, 0.244, 0.269    
Swoboda, 0.244, 0.230    
Unser, 0.222, 0.264    
Williams, 0.222, 0.256    
Scott, 0.222, 0.303    
Petrocelli, 0.222, 0.264    
Rodriguez, 0.222, 0.226    
Campaneris, 0.200, 0.285    
Munson, 0.178, 0.316    
Alvis, 0.156, 0.200

avg45 is the average after 45 at bats and is denoted as y in the article. avgSeason is the end of the season average.

The James-Stein estimator for the average, $z$, is given by

$$z = \bar{y} + c(y - \bar{y}),$$

and the shrinkage factor $c$ is given by (page 5 of the 1977 Scientific American article)

$$c = 1 - \frac{(k-3)\sigma^2}{\sum (y - \bar{y})^2},$$

where $k$ is the number of unknown means. Here there are 18 players, so $k = 18$. I can calculate $\sum (y - \bar{y})^2$ using the avg45 values, but I don't know how to calculate $\sigma^2$. The authors say $c = 0.212$ for this data set.

I tried using both $\sigma_x^2$ and $\sigma_y^2$ for $\sigma^2$, but neither gives the correct answer of $c = 0.212$.

Can anybody be kind enough to let me know how to calculate $\sigma^2$ for this data set?

2 Answers

The parameter $\sigma^2$ is the (unknown) common variance of the vector components, each of which we assume to be normally distributed. For the baseball data we have $45Y_i \sim \text{Binomial}(45, p_i)$, so the normal approximation to the binomial distribution gives (taking $\hat{p}_i = Y_i$)

$$\hat{p}_i \sim \mathcal{N}\bigl(p_i,\; p_i(1-p_i)/45\bigr).$$

Obviously in this case the variances are not equal, yet if they had been equal to a common value then we could estimate it with the pooled estimator

$$\hat{\sigma}^2 = \frac{\hat{p}(1-\hat{p})}{45},$$

where $\hat{p}$ is the grand mean

$$\hat{p} = \frac{1}{18 \cdot 45}\sum_{i=1}^{18} 45Y_i = \bar{Y}.$$
It looks as though this is what Efron and Morris have done (in the 1977 paper).

You can check this with the following R code. Here are the data:

y <- c(0.4, 0.378, 0.356, 0.333, 0.311, 0.311, 0.289, 0.267, 0.244, 0.244, 0.222, 0.222, 0.222, 0.222, 0.222, 0.2, 0.178, 0.156)

and here is the estimate for $\sigma^2$:

s2 <- mean(y)*(1 - mean(y))/45

which is $\hat{\sigma}^2 \approx 0.004332392$. The shrinkage factor in the paper is then

1 - 15*s2/(17*var(y))   # note: 17*var(y) = sum((y - mean(y))^2)

which gives $c \approx 0.2123905$. Note that in the second paper they made a transformation to sidestep the unequal-variance problem (as @Wolfgang said). Also note that in the 1975 paper they used $k-2$, while in the 1977 paper they used $k-3$.
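To finish the calculation (a sketch in the same spirit as the code above, not code from the paper), plugging this $c$ into the estimator $z = \bar{y} + c(y - \bar{y})$ from the question gives the shrunken averages on the raw scale:

```r
# James-Stein estimates on the raw scale: z = ybar + c * (y - ybar)
y  <- c(0.4, 0.378, 0.356, 0.333, 0.311, 0.311, 0.289, 0.267, 0.244,
        0.244, 0.222, 0.222, 0.222, 0.222, 0.222, 0.2, 0.178, 0.156)
s2 <- mean(y) * (1 - mean(y)) / 45                    # pooled binomial variance
c  <- 1 - (length(y) - 3) * s2 / sum((y - mean(y))^2) # shrinkage factor, ~0.212
z  <- mean(y) + c * (y - mean(y))                     # shrink toward the grand mean
round(z[1], 3)                                        # Clemente: 0.294
```

For Clemente this gives $0.294$, matching the value reported in the 1977 article.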

I am not quite sure about the $c = 0.212$, but the following article provides a much more detailed description of these data:

Efron, B., & Morris, C. (1975). Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association, 70(350), 311-319 (link to pdf)

or, in more detail:

Efron, B., & Morris, C. (1974). Data analysis using Stein's estimator and its generalizations. R-1394-OEO, The RAND Corporation, March 1974 (link to pdf).

On page 312, you will see that Efron & Morris use an arcsine transformation of these data, so that the variance of the transformed batting averages is approximately unity:

> dat <- read.table("data.txt", header=T, sep=",")
> yi  <- dat$avg45
> k   <- length(yi)
> yi  <- sqrt(45) * asin(2*yi-1)
> c   <- 1 - (k-3)*1 / sum((yi - mean(yi))^2)
> c
[1] 0.2091971

Then they use $c = 0.209$ for the computation of the $z$ values, which we can easily back-transform:

> zi  <- mean(yi) + c * (yi - mean(yi))
> round((sin(zi/sqrt(45)) + 1)/2,3) ### back-transformation
[1] 0.290 0.286 0.282 0.277 0.273 0.273 0.268 0.264 0.259
[10] 0.259 0.254 0.254 0.254 0.254 0.254 0.249 0.244 0.239

So these are the values of the Stein estimator. For Clemente, we get .290, which is quite close to the .294 from the 1977 article.
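As a final sanity check (my own sketch, not a computation taken from either paper), we can compare the total squared error of the raw 45-at-bat averages and of the back-transformed Stein estimates against the end-of-season averages from the question's table:

```r
# Compare total squared error against the end-of-season averages
y      <- c(0.4, 0.378, 0.356, 0.333, 0.311, 0.311, 0.289, 0.267, 0.244,
            0.244, 0.222, 0.222, 0.222, 0.222, 0.222, 0.2, 0.178, 0.156)
season <- c(0.346, 0.298, 0.276, 0.222, 0.273, 0.270, 0.263, 0.210, 0.269,
            0.230, 0.264, 0.256, 0.303, 0.264, 0.226, 0.285, 0.316, 0.200)
ti <- sqrt(45) * asin(2 * y - 1)                     # arcsine transform
c  <- 1 - (length(ti) - 3) / sum((ti - mean(ti))^2)  # shrinkage factor, ~0.209
zi <- mean(ti) + c * (ti - mean(ti))                 # shrink on transformed scale
z  <- (sin(zi / sqrt(45)) + 1) / 2                   # back-transform
sum((y - season)^2)                                  # raw averages: about 0.075
sum((z - season)^2)                                  # Stein estimates: about 0.022
```

The Stein estimates come out roughly 3.5 times more accurate in total squared error, in line with the improvement factor Efron and Morris report.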