Lower than expected coverage for importance sampling with simulation

Cross Validated: r, simulation, exponential-distribution, importance-sampling

2022-03-02 04:51:16

I was trying to answer the question Evaluate integral with Importance sampling method in R. Basically, the user needs to compute

$\int_{0}^{\pi}f(x)\,dx=\int_{0}^{\pi}\frac{1}{\cos(x)^2+x^2}\,dx$

using the exponential distribution as the importance distribution

$q(x)=\lambda e^{-\lambda x}$

and to find the value of $\lambda$ that gives the best approximation to the integral (it is self-study). I recast the problem as the evaluation of the mean value $\mu$ of $f(x)$ over $[0,\pi]$: the integral is then just $\pi\mu$.

Thus, let $p(x)$ be the pdf of $X\sim\mathcal{U}(0,\pi)$, and let $Y=f(X)$: the goal now is to estimate

$\mu=\mathbb{E}[Y]=\mathbb{E}[f(X)]=\int_{\mathbb{R}}f(x)p(x)\,dx=\int_{0}^{\pi}\frac{1}{\cos(x)^2+x^2}\,\frac{1}{\pi}\,dx$

using importance sampling. I performed a simulation in R:

# clear the environment and set the seed for reproducibility
rm(list=ls())
gc()
graphics.off()
set.seed(1)

# function to be integrated
f <- function(x){
    1 / (cos(x)^2+x^2)
}

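# importance sampling: draw x from q = Exp(lambda) and return the weighted
# values f(x) * p(x) / q(x), where p is the U(0, pi) density
importance.sampling <- function(lambda, f, B){
    x <- rexp(B, lambda)
    (f(x) / dexp(x, lambda)) * dunif(x, 0, pi)
}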

# mean value of f
mu.num <- integrate(f,0,pi)$value/pi

# initialize code
means  <- 0
sigmas <- 0
error  <- 0
CI.min <- 0
CI.max <- 0
CI.covers.parameter <- FALSE

# set a value for lambda: we will repeat importance sampling N times to verify
# coverage
N <- 100
lambda <- rep(20,N)

# set the sample size for importance sampling
B <- 10^4

# - estimate the mean value of f using importance sampling, N times
# - compute a confidence interval for the mean each time
# - CI.covers.parameter is set to TRUE if the estimated confidence 
#   interval contains the mean value computed by integrate, otherwise
# is set to FALSE
j <- 0
for(i in lambda){
    I <- importance.sampling(i, f, B)
    j <- j + 1
    mu <- mean(I)
    std <- sd(I)
    lower.CB <- mu - 1.96*std/sqrt(B)  
    upper.CB <- mu + 1.96*std/sqrt(B)  
    means[j] <- mu
    sigmas[j] <- std
    error[j] <- abs(mu-mu.num)
    CI.min[j] <- lower.CB
    CI.max[j] <- upper.CB
    CI.covers.parameter[j] <- lower.CB < mu.num & mu.num < upper.CB
}

# build a dataframe in case you want to have a look at the results for each run
df <- data.frame(lambda, means, sigmas, error, CI.min, CI.max, CI.covers.parameter)

# so, what's the coverage?
mean(CI.covers.parameter)
# [1] 0.19

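The code is basically a straightforward implementation of importance sampling, following the notation used here. The importance sampling is then repeated $N$ times to obtain multiple estimates of $\mu$, and each time a check is made on whether the 95% interval covers the actual mean.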
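In symbols, the quantities computed in each pass of the loop can be written as follows. This is only a restatement of the code above; the names $w_i$, $\hat\mu$ and $\hat\sigma$ are introduced here for illustration and do not appear in the original post.

```latex
\begin{aligned}
x_i &\sim q(x)=\lambda e^{-\lambda x}, \qquad
w_i = \frac{f(x_i)\,p(x_i)}{q(x_i)}, \qquad i=1,\dots,B,\\
\hat\mu &= \frac{1}{B}\sum_{i=1}^{B} w_i, \qquad
\hat\sigma^2 = \frac{1}{B-1}\sum_{i=1}^{B}\left(w_i-\hat\mu\right)^2,\\
\mathrm{CI}_{95\%} &= \hat\mu \pm 1.96\,\frac{\hat\sigma}{\sqrt{B}} .
\end{aligned}
```

Since $\mathbb{E}_q[w_i]=\int_0^\pi f(x)p(x)\,dx=\mu$, the estimator $\hat\mu$ is unbiased. The interval, however, is only as good as $\hat\sigma$ is as an estimate of the standard deviation of the weights.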

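As you can see, for $\lambda=20$ the actual coverage is just 0.19. Increasing $B$ to larger values such as $10^6$ does not help (the coverage is even smaller, 0.15). Why is this happening?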

1 Answer

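Importance sampling is quite sensitive to the choice of importance distribution. Since you chose $\lambda = 20$, the samples that you draw using rexp will have a mean of $1/20$ and a variance of $1/400$. This is the distribution you get: nearly all of its mass lies within a few tenths of 0.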
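To put a number on how far these draws reach, the sketch below uses the standard result that the expected maximum of $B$ independent Exponential($\lambda$) draws is $(1 + 1/2 + \dots + 1/B)/\lambda$. The helper `expected_max()` and the threshold $\pi/2$ are illustrative choices, not part of the original answer; the two values of $B$ are the ones mentioned in the question.

```r
# Reach of B exponential draws with rate lambda = 20.
# expected_max() is a hypothetical helper introduced for illustration.
expected_max <- function(B, lambda) {
  # E[max of B iid Exp(lambda)] = (1 + 1/2 + ... + 1/B) / lambda
  sum(1 / seq_len(B)) / lambda
}

lambda <- 20
B_values <- c(1e4, 1e6)

# log of P(one draw <= pi / 2), computed on the log scale for accuracy
log_p_below <- pexp(pi / 2, rate = lambda, log.p = TRUE)

reach <- data.frame(
  B = B_values,
  expected_max = sapply(B_values, expected_max, lambda = lambda),
  # probability that at least one of the B draws exceeds pi / 2
  p_any_above_half_pi = -expm1(B_values * log_p_below)
)
print(reach)
```

Under these assumptions the expected maximum is roughly 0.49 for $B = 10^4$ and 0.72 for $B = 10^6$, so even the larger sample almost never visits the upper half of $[0,\pi]$.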

However, the integral you want to evaluate goes from 0 to $\pi \approx 3.14$, so you want a $\lambda$ that gives you draws over that range. I use $\lambda = 1$.
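How many draws land outside $[0,\pi]$ for a given rate can be checked directly from the exponential tail probability. This is a minimal sketch; the grid of rates is chosen for illustration only.

```r
# Share of exponential draws that fall beyond pi. Those draws get weight 0
# in the estimator, because dunif(x, 0, pi) is 0 outside [0, pi].
rates <- c(0.00001, 0.1, 1, 20)
wasted <- pexp(pi, rate = rates, lower.tail = FALSE)  # equals exp(-rates * pi)

print(data.frame(lambda = rates, share_beyond_pi = wasted))
```

For $\lambda = 1$ the share is $e^{-\pi}$, a little over 4%, whereas a very small rate such as 0.00001 sends almost every draw beyond $\pi$.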

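Using $\lambda = 1$, I will be able to explore the full integration range from 0 to $\pi$, and it seems that only the few draws above $\pi$ will be wasted. Now I rerun your code, changing only $\lambda = 1$.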

# clear the environment and set the seed for reproducibility
rm(list=ls())
gc()
graphics.off()
set.seed(1)

# function to be integrated
f <- function(x){
  1 / (cos(x)^2+x^2)
}

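# importance sampling: draw x from q = Exp(lambda) and return the weighted
# values f(x) * p(x) / q(x), where p is the U(0, pi) density
importance.sampling <- function(lambda, f, B){
  x <- rexp(B, lambda)
  (f(x) / dexp(x, lambda)) * dunif(x, 0, pi)
}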

# mean value of f
mu.num <- integrate(f,0,pi)$value/pi

# initialize code
means  <- 0
sigmas <- 0
error  <- 0
CI.min <- 0
CI.max <- 0
CI.covers.parameter <- FALSE

# set a value for lambda: we will repeat importance sampling N times to verify
# coverage
N <- 100
lambda <- rep(1,N)

# set the sample size for importance sampling
B <- 10^4

# - estimate the mean value of f using importance sampling, N times
# - compute a confidence interval for the mean each time
# - CI.covers.parameter is set to TRUE if the estimated confidence 
#   interval contains the mean value computed by integrate, otherwise
# is set to FALSE
j <- 0
for(i in lambda){
  I <- importance.sampling(i, f, B)
  j <- j + 1
  mu <- mean(I)
  std <- sd(I)
  lower.CB <- mu - 1.96*std/sqrt(B)  
  upper.CB <- mu + 1.96*std/sqrt(B)  
  means[j] <- mu
  sigmas[j] <- std
  error[j] <- abs(mu-mu.num)
  CI.min[j] <- lower.CB
  CI.max[j] <- upper.CB
  CI.covers.parameter[j] <- lower.CB < mu.num & mu.num < upper.CB
}

# build a dataframe in case you want to have a look at the results for each run
df <- data.frame(lambda, means, sigmas, error, CI.min, CI.max, CI.covers.parameter)

# so, what's the coverage?
mean(CI.covers.parameter)
#[1] .95

If you play around with $\lambda$, you will see that if you make it really small (.00001) or really large, the coverage probabilities will be bad.
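One way to try this is to wrap the experiment in a function and run it over a grid of rates. The sketch below reuses the integrand and the interval construction from the code above; `coverage()` and the grid of rates are illustrative assumptions, not part of the original answer, and the exact numbers will differ from those quoted here because the random draws are consumed in a different order.

```r
set.seed(1)

# integrand and its mean value over [0, pi], as in the code above
f <- function(x) 1 / (cos(x)^2 + x^2)
mu_true <- integrate(f, 0, pi)$value / pi

# coverage() is a hypothetical helper: it repeats importance sampling N times
# and returns the share of 95% intervals that contain mu_true
coverage <- function(lambda, B = 1e4, N = 100) {
  covered <- replicate(N, {
    x <- rexp(B, lambda)
    w <- (f(x) / dexp(x, lambda)) * dunif(x, 0, pi)  # importance weights
    half_width <- 1.96 * sd(w) / sqrt(B)
    abs(mean(w) - mu_true) < half_width
  })
  mean(covered)
}

lambdas <- c(0.00001, 0.1, 1, 5, 20)
result <- data.frame(lambda = lambdas, coverage = sapply(lambdas, coverage))
print(result)
```

With $N = 100$ replications each coverage figure is itself noisy, so small differences between neighbouring rates should not be over-interpreted.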

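EDIT ----

Regarding the coverage probability decreasing once you go from $B = 10^4$ to $B = 10^6$: that is just a random occurrence, owing to the fact that you use only $N = 100$ replications. The confidence interval for the coverage probability at $B = 10^4$ is

$.19 \pm 1.96\sqrt{\dfrac{.19\,(1-.19)}{100}} = .19 \pm .0769 = (.1131, .2669)\,.$

So you cannot really say that increasing to $B = 10^6$ significantly lowers the coverage probability: the 0.15 observed there lies inside this interval.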
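The interval above is the usual normal-approximation (Wald) interval for a proportion. A small sketch of the arithmetic follows; `wald_ci()` is a hypothetical helper name introduced for illustration.

```r
# Normal-approximation (Wald) interval for an estimated proportion.
# wald_ci() is a hypothetical helper, not from the original answer.
wald_ci <- function(p_hat, N, z = 1.96) {
  half_width <- z * sqrt(p_hat * (1 - p_hat) / N)
  c(lower = p_hat - half_width, upper = p_hat + half_width)
}

# coverage 0.19 observed with B = 10^4 and N = 100 replications
ci_1e4 <- wald_ci(0.19, N = 100)
print(round(ci_1e4, 4))

# coverage reported in the question for B = 10^6 with N = 100
cov_1e6 <- 0.15
inside <- cov_1e6 > ci_1e4[["lower"]] && cov_1e6 < ci_1e4[["upper"]]
print(inside)
```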

In fact, if you keep the same seed in your code and change $N = 100$ to $N = 1000$, then with $B = 10^4$ the coverage probability is .123, and with $B = 10^6$ the coverage probability is $.158$.

Now, the confidence interval around .123 is

$.123 \pm 1.96\sqrt{\dfrac{.123\,(1 - .123)}{1000}} \approx .123 \pm .0204 = (.1026, .1434)\,.$

So now, with $N = 1000$ replications, you see that the coverage probability increases significantly: .158 lies above the upper end of this interval.
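A complementary check, not part of the original answer, is to compare the two coverage estimates directly with a two-proportion z statistic. This sketch assumes that the two sets of $N = 1000$ replications can be treated as independent and uses the unpooled standard error; the variable names are illustrative.

```r
p1 <- 0.123  # reported coverage with B = 10^4, N = 1000
p2 <- 0.158  # reported coverage with B = 10^6, N = 1000
N  <- 1000

# unpooled standard error of the difference between the two proportions
# (assumption: the two runs are independent)
se_diff <- sqrt(p1 * (1 - p1) / N + p2 * (1 - p2) / N)
z_stat  <- (p2 - p1) / se_diff
p_value <- 2 * pnorm(-abs(z_stat))

print(round(c(z = z_stat, p_value = p_value), 3))
```

Under these assumptions the statistic exceeds 1.96, in line with the conclusion drawn from the interval around .123.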
