Cross Validated - Estimating a distribution from censored data - 吾爱随笔录

Estimating a distribution from censored data

Cross Validated, distributions, estimation, censoring

2022-04-08 13:18:43

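$X$ is a positive random variable with known support (assume the support is discrete, if that simplifies the solution).

$Y$ is another random variable with the same support.

$X$ and $Y$ are independent.

$Z$ equals $X$ if $X < Y$, and equals $0$ otherwise.

$Y$ and $Z$ are observed; $X$ is not. How can the distribution of $X$ be estimated?

I realize there may be no single objectively optimal answer. Assume a prior estimate is available if one is needed for Bayesian inference.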

1 Answer

In R:

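estimate=function(y,z,u=1e-9){
  # y: observed values of Y; z: observed values of Z (0 means X was hidden)
  # u: convergence tolerance on the root-mean-square change of the estimate
  ys=sort(unique(y))
  # Inf signifies x's never observed (as they are at least max y)
  # assumes at least one z equals 0, so that [-1] drops the level 0
  zs=c(sort(unique(z))[-1],Inf)
  counts=xtabs(~z+y)
  # observed counts per (x level, y level), plus an empty row for Inf
  observed=rbind(counts[-1,,drop=FALSE],rep(0,length(ys)))
  # number of hidden observations (z == 0) for each level of y
  marginalHidden=counts[1,]
  # m[j,i] is TRUE when level zs[j] would be hidden given ys[i];
  # Z equals X only if X < Y, so ties (x == y) count as hidden
  m=sapply(seq_along(ys),function(i)zs>=ys[i])
  # start from a uniform distribution over the levels
  d=rep(1/length(zs),length(zs))
  repeat{
    # allocate hidden data according to current parameters
    p=apply(m*d,2,function(v)v/sum(v))
    # can result in fractional counts
    hidden=sweep(p,2,marginalHidden,'*')
    total=observed+hidden
    # re-estimate the distribution from observed plus allocated counts
    d2=rowSums(total)/sum(total)
    msd=mean((d2-d)^2)
    # stop once the estimate has essentially stopped changing
    if(msd<u^2)
      break
    d=d2
  }
  d
}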

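# simulated example: x is drawn from a known distribution, then censored by y
xSupport=c(3,5,7)
xDistribution=c(1/4,1/2,1/4)
x=sample(xSupport,1000,replace=TRUE,prob=xDistribution)
ySupport=c(4,6)
yDistribution=c(1/2,1/2)
y=sample(ySupport,length(x),replace=TRUE,prob=yDistribution)
# z reveals x only when x < y
z=ifelse(x<y,x,0)

# estimated distribution over the levels 3, 5 and Inf (never observed)
estimate(y,z)
# counts of the true (unobserved) x; divide by length(x) to compare
table(x)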
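
The iteration above reads like an EM-style update. As a sketch only, the equations it appears to implement can be written as follows. The notation is introduced here for illustration and is not from the original answer: $p_x$ is the current estimate of $P(X = x)$, $n_{x,y}$ is the number of observations with $Z = x$ and $Y = y$, $h_y$ is the number of observations with $Z = 0$ and $Y = y$, and $N$ is the sample size. Following the problem statement, a value $x$ can be hidden behind $y$ only when $x \ge y$.

$$\hat h_{x,y} = h_y \cdot \frac{p_x \, \mathbf{1}[x \ge y]}{\sum_{x' \ge y} p_{x'}}, \qquad p_x^{\text{new}} = \frac{1}{N} \sum_y \left( n_{x,y} + \hat h_{x,y} \right)$$

The first expression splits the hidden observations for each $y$ among the values that could have been hidden, in proportion to their current probabilities. The second re-estimates the distribution from the observed and allocated counts. Values of $X$ that never appear in $Z$ are lumped into the single level Inf, and the loop stops when the mean squared change in $p$ falls below $u^2$.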

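Edit

A direct (non-iterative) solution, consistent with the one given above. The idea is to start with the values of $Z$ that are never hidden (those below $\min(Y)$) and estimate their probabilities from the observed proportions. After that, both these values and $\min(Y)$ can be removed from the problem. The problem thus becomes progressively smaller.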

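estimate=function(y,z,u=1e-9){
  # u is unused here, as nothing is iterated to convergence
  ys=sort(unique(y))
  # Inf signifies x's never observed (as they are at least max y)
  z[z==0]=Inf
  zs=sort(unique(z))
  counts=xtabs(~z+y)
  s=c()
  # r is the probability mass not yet assigned to a solved level
  r=1
  while(ncol(counts)>0){
    # zs < min(ys) are all observed, so can be estimated from counts
    # (a logical mask is used because counts[-integer(0),] would drop all rows)
    solved=zs<min(ys)
    ds=r*rowSums(counts[solved,,drop=FALSE])/sum(counts)
    s=c(s,ds)

    # reduce probability remaining for the hidden cases
    r=r-sum(ds)

    # reduce the problem by removing the solved levels of zs, and the min(ys)
    counts=counts[!solved,-1,drop=FALSE]
    zs=zs[!solved]
    ys=ys[-1]
  }
  # solved levels in increasing order, followed by the mass left for Inf
  c(s,r)
}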
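
To make the reduction concrete, here is a minimal sketch that carries out the two steps by hand for the simulated example used earlier (x on 3, 5, 7 and y on 4, 6). The seed, the sample size and the variable names p3, p5, p7 and rest are assumptions chosen for illustration. The second step relies on the independence of $X$ and $Y$: among observations with y = 6 that are not already solved, the share with z = 5 estimates $P(X = 5 \mid X > 3)$.

```r
# Hypothetical worked example: the two reduction steps done by hand
set.seed(1)                        # arbitrary seed, for reproducibility only
n <- 100000                        # large sample so estimates are stable
x <- sample(c(3, 5, 7), n, replace = TRUE, prob = c(1/4, 1/2, 1/4))
y <- sample(c(4, 6), n, replace = TRUE)
z <- ifelse(x < y, x, 0)           # x is revealed only when x < y

# Step 1: x = 3 is below min(y) = 4, so it is never hidden
p3 <- mean(z == 3)

# Step 2: drop x = 3 and y = 4; among the rest, x = 5 is always observed
rest <- y == 6 & z != 3
p5 <- (1 - p3) * mean(z[rest] == 5)

# Whatever mass is left belongs to values never observed (here x = 7)
p7 <- 1 - p3 - p5

round(c(p3, p5, p7), 3)
```

With a sample this large the three numbers should come out close to the true probabilities 1/4, 1/2 and 1/4.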
