缺失数据的期望最大化的 ML 协方差估计

机器算法验证 最大似然 协方差 缺失数据 期望最大化 数据插补
2022-04-18 06:31:29

假设具有缺失数据的多元正态分布,是否有一种直接的方法可以使用期望最大化算法找到协方差的最大似然估计?

注意:这个问题几乎解决了。剩下的唯一问题是每个观察(行)允许多个缺失变量。

例如,考虑一个给出的双变量数据2×25维矩阵Y. 第一列中随机缺少五个观察值,我们希望找到 ML 估计值μ^Σ^使用期望最大化。对数似然函数由下式给出

logL(θ)=.5(nlog(2π)+log|W|+(yXμ)W1(yXμ))
在哪里n是观察的总数,W=ΣI,y是个n-长度向量化形式的观测值Y(NA 值被省略),和X是个2n×2设计零和一的矩阵,表示是否索引iy对应列1或列2行数iX.

如果我理解正确,μ^可以使用期望最大化找到,其中缺失值的插补Y通过线性回归的预测迭代更新。当算法收敛时,缺失的估算值Y值等同于 ML 估算的估计值。然而,增广矩阵的协方差Yaug从插补中向下偏差。这在下面的可重现的 R 示例中得到了证明,其中 ML 协方差使用optim函数进行数值估计,并与 ML 协方差进行比较Yaug从 EM 获得。价值Σ1,1 相对于 ML 估计值太低,而Σ2,1=Σ1,2Σ2,2与 ML 估计值大致相同。有没有办法在避免数值优化的同时纠正向下偏差?

也就是说,我的目标是获得 ML 估计Σ没有数值优化,使用封闭形式的解决方案或 EM 算法本身。

# log likelihood function
LL <- function(sigma,Y)
{
  W <- sigma%x%diag(nrow(Y)) # covariance matrix
  isna <- which(is.na(Y)) # which data are NA
  if(length(isna)>0) W <- W[-isna,-isna] # remove NAs from W
  X <- matrix(0,nrow(Y)*ncol(Y),ncol(Y))
  for(i in 1:ncol(Y)) X[1:nrow(Y)+(i-1)*nrow(Y),i] <- 1 # design matrix
  if(length(isna)>0) X <- X[-isna,] # remove NAs from X

  # mean vector
  mu <- t(solve(t(X)%*%solve(W)%*%X)%*%t(X)%*%solve(W)%*%na.exclude(as.double(Y)))

  # log-likelihood
  logl <- -.5*(length(as.double(Y[!is.na(Y)]))*log(2*pi) + 
         determinant(W)$modulus[[1]] + 
         na.exclude(as.double(Y-matrix(1,nrow(Y))%*%mu)) %*% solve(W) %*% 
         na.exclude(as.double(Y-matrix(1,nrow(Y))%*%mu)))
  logl
}

# build 2x2 positive definite covariance matrix
buildmat <- function(x)
{
  mat <- matrix(0,2,2)
  mat[lower.tri(mat,diag=TRUE)] <- x
  tcrossprod(mat)
}

set.seed(4)
nmissing <- 5 # number of missing observations
dat <- matrix(rnorm(50),ncol=2) # original data
dat_missing <- dat
dat_missing[sample(nrow(dat),nmissing),1] <- NA # missing data in column 1

dat_impute <- dat_missing # data matrix for imputation
missing_ind <- which(is.na(dat_missing[,1])) # index of missing values
dat_impute[missing_ind,1] <- mean(na.exclude(dat_missing[,1]))

# EM imputation
while(sum(abs(dat_impute[missing_ind,1]-predict(lm(dat_impute[,1]~dat_impute[,2]))[missing_ind])>1e-32))
  dat_impute[missing_ind,1] <- predict(lm(dat_impute[,1]~dat_impute[,2]))[missing_ind]

# compare covariance of augmented data to ML estimate
cov(dat_impute)*(nrow(dat)-1)/nrow(dat)
buildmat(optim(chol(cov(na.exclude(dat)))[c(1,3,4)],function(par)
  LL(buildmat(par),dat_missing),control=list(fnscale=-1))$par)

编辑:显然,对于双变量数据有一个封闭形式的解决方案,其中只有一个变量缺少数据。我正在为所有变量都可能缺少数据的任何多元数据寻求更通用的解决方案。我知道应该可以获得校正后的 ML 协方差,因为 R 包norm可以做到这一点(继续上面的 R 示例):

require(norm) s <- prelim.norm(dat_missing) e <- em.norm(s) getparam.norm(s = s,theta = e)

但不幸的是,底层的FORTRAN 例程不够直观。

编辑 2:这是一个非常好的双变量数据教程关于将其推广到多变量的任何想法?

更新这是我最近的尝试。norm但是,我的实现似乎有问题,因为结果与输出不匹配。

进一步更新该代码现在适用于多变量数据,但前提是每个观测值只有一个缺失值(每行一个 NA)。如果给定的观察结果缺少一个以上的变量,则加权协方差似乎存在问题。

实施基于以下论文(通过 Randel):

Beale, EML, & Little, RJA。(1975 年)。多元分析中的缺失值。皇家统计学会杂志。B 系列(方法),37(1),129–145。

set.seed(4)
dat <- matrix(rnorm(100),ncol=5) # original data
nmissing <- 10

dat_missing <- dat
dat_missing[sample(length(dat_missing),nmissing)] <- NA
is_na <- apply(dat_missing,2,is.na) # index if NAs

dat_impute <- dat_missing # data matrix for imputation

# set initial estimates to means from available data
for(i in 1:ncol(dat_impute)) dat_impute[is_na[,i],i] <- colMeans(dat_missing,na.rm = TRUE)[i]

new_dat_impute <- dat_impute

# starting values for EM
means <- colMeans(dat_impute)
# NOTE: multiplying by (nrow-1)/(nrow) to get ML estimate
# For comparability with norm package output
sigma <- cov(dat_impute)*(nrow(dat_impute)-1)/nrow(dat_impute)

# matrix of regression coefficients -- one for each variable
betas <- matrix(0,ncol(sigma)-1,ncol(sigma))

# carry out EM over 1000 iterations
for(j in 1:1000)
{
  # get updated means and covariance matrix
  means <- colMeans(dat_impute)
  new_sigma <- cov(dat_impute)*(nrow(dat_impute)-1)/nrow(dat_impute)

  # correct for bias in covariance matrix
  for(i in 1:ncol(sigma))
  {
    mis <- crossprod(is_na)
    for(ii in 1:ncol(sigma))
    {
      if(i!=ii)
      {
        # partial correlation squared
        ##### This is where the problem is
        R2 <- (solve(sigma)[i,ii] / sqrt(solve(sigma)[i,i]*solve(sigma)[ii,ii]))^2
        new_sigma[i,ii] <- new_sigma[i,ii] + sigma[i,ii]*(1-R2)*(mis[i,ii])/nrow(dat_impute)
      } else
      {
        R2 <- t(sigma[-i,i]) %*% (solve(sigma[-i,-i]) %*% sigma[-i,i]) / sigma[i,i]
        new_sigma[i,ii] <- new_sigma[i,ii] + sigma[i,ii]*(1-R2)*mis[i,ii]/nrow(dat_impute)
      }
    }
  }
  sigma <- new_sigma

  # get estimates of beta and intercept if any missing data in column i
  for(i in 1:ncol(sigma))
  {
    if(any(is_na[,i]))
    {
      betas[,i] <- solve(sigma[-i,-i]) %*% sigma[-i,i]
      intercept <- means[i] - means[-i] %*% betas[,i]
      dat_impute[which(is_na[,i]),i] <- (cbind(1,dat_impute[,-i]) %*% c(intercept,betas[,i]))[which(is_na[,i]),1]
    }
  }  
}

# compare results to norm package output
s <- prelim.norm(dat_missing)
e <- em.norm(s,criterion=1e-32,showits = FALSE)

# compare means
max(abs(getparam.norm(s,e)[[1]] - means))
# compare covariance matrix
max(abs(getparam.norm(s,e)[[2]] - new_sigma))
1个回答

事实证明,该算法相当简单。从最初的估计开始μΣ

  1. 创建一个偏置矩阵B维度的nvar×nvar,用零初始化。
  2. 对于缺少数据的每一行(观察),表示可用索引a和缺失的索引m. 鉴于目前的估计Σμ, 估算缺失值y^m=μm+Σm,aΣa,a1(yaμa). 更新Bm,mBm,m+Σm,mΣm,aΣa,a1Σa,m
  3. 计算μ(i+1)Σbiased来自新估算的数据。调整偏差:Σ(i+1)=Σbiased+Bn1.

重复 1-3 直到收敛。对于受限的最大似然,替换n1(n1)1在协方差计算中。

require(norm)
dat <- matrix(rnorm(1000),ncol=5) # original data
nvar <- ncol(dat)
n <- nrow(dat)
nmissing <- 50

dat_missing <- dat
dat_missing[sample(length(dat_missing),nmissing)] <- NA
is_na <- apply(dat_missing,2,is.na) # index if NAs

dat_impute <- dat_missing # data matrix for imputation

# set initial estimates to means from available data
for(i in 1:ncol(dat_impute)) dat_impute[is_na[,i],i] <- colMeans(dat_missing,na.rm = TRUE)[i]

# starting values for EM
means <- colMeans(dat_impute)
# NOTE: multiplying by (nrow-1)/(nrow) to get ML estimate
# For comparability with norm package output
sigma <- cov(dat_impute)*(nrow(dat_impute)-1)/nrow(dat_impute)

# get estimates from norm package for comparison
s <- prelim.norm(dat_missing)
e <- em.norm(s,criterion=1e-32,showits = FALSE)

# carry out EM over 100 iterations
for(j in 1:100)
{
  bias <- matrix(0,nvar,nvar)
  for(i in 1:n)
  {
    row_dat <- dat_missing[i,]
    avail <- which(!is.na(row_dat))
    if(length(avail)<nvar)
    {
      bias[-avail,-avail] <- bias[-avail,-avail] + sigma[-avail,-avail] - sigma[-avail,avail] %*% solve(sigma[avail,avail]) %*% sigma[avail,-avail]
      dat_impute[i,-avail] <- means[-avail] + (sigma[-avail,avail] %*% solve(sigma[avail,avail])) %*% (row_dat[avail]-means[avail])
    }
  }

  # get updated means and covariance matrix
  means <- colMeans(dat_impute)
  biased_sigma <- cov(dat_impute)*(n-1)/n

  # correct for bias in covariance matrix
  sigma <- biased_sigma + bias/n
}


# compare results to norm package output
# compare means
max(abs(getparam.norm(s,e)[[1]] - means))
# compare covariance matrix
max(abs(getparam.norm(s,e)[[2]] - sigma))