假设具有缺失数据的多元正态分布,是否有一种直接的方法可以使用期望最大化算法找到协方差的最大似然估计?
注意:这个问题几乎解决了。剩下的唯一问题是每个观察(行)允许多个缺失变量。
例如,考虑一个给出的双变量数据维矩阵. 第一列中随机缺少五个观察值,我们希望找到 ML 估计值和使用期望最大化。对数似然函数由下式给出
如果我理解正确,可以使用期望最大化找到,其中缺失值的插补通过线性回归的预测迭代更新。当算法收敛时,缺失的估算值值等同于 ML 估算的估计值。然而,增广矩阵的协方差从插补中向下偏差。这在下面的可重现的 R 示例中得到了证明,其中 ML 协方差使用optim函数进行数值估计,并与 ML 协方差进行比较从 EM 获得。价值
相对于 ML 估计值太低,而和与 ML 估计值大致相同。有没有办法在避免数值优化的同时纠正向下偏差?
也就是说,我的目标是获得 ML 估计没有数值优化,使用封闭形式的解决方案或 EM 算法本身。
# log likelihood function
LL <- function(sigma,Y)
{
W <- sigma%x%diag(nrow(Y)) # covariance matrix
isna <- which(is.na(Y)) # which data are NA
if(length(isna)>0) W <- W[-isna,-isna] # remove NAs from W
X <- matrix(0,nrow(Y)*ncol(Y),ncol(Y))
for(i in 1:ncol(Y)) X[1:nrow(Y)+(i-1)*nrow(Y),i] <- 1 # design matrix
if(length(isna)>0) X <- X[-isna,] # remove NAs from X
# mean vector
mu <- t(solve(t(X)%*%solve(W)%*%X)%*%t(X)%*%solve(W)%*%na.exclude(as.double(Y)))
# log-likelihood
logl <- -.5*(length(as.double(Y[!is.na(Y)]))*log(2*pi) +
determinant(W)$modulus[[1]] +
na.exclude(as.double(Y-matrix(1,nrow(Y))%*%mu)) %*% solve(W) %*%
na.exclude(as.double(Y-matrix(1,nrow(Y))%*%mu)))
logl
}
# build 2x2 positive definite covariance matrix
buildmat <- function(x)
{
mat <- matrix(0,2,2)
mat[lower.tri(mat,diag=TRUE)] <- x
tcrossprod(mat)
}
set.seed(4)
nmissing <- 5 # number of missing observations
dat <- matrix(rnorm(50),ncol=2) # original data
dat_missing <- dat
dat_missing[sample(nrow(dat),nmissing),1] <- NA # missing data in column 1
dat_impute <- dat_missing # data matrix for imputation
missing_ind <- which(is.na(dat_missing[,1])) # index of missing values
dat_impute[missing_ind,1] <- mean(na.exclude(dat_missing[,1]))
# EM imputation
while(sum(abs(dat_impute[missing_ind,1]-predict(lm(dat_impute[,1]~dat_impute[,2]))[missing_ind])>1e-32))
dat_impute[missing_ind,1] <- predict(lm(dat_impute[,1]~dat_impute[,2]))[missing_ind]
# compare covariance of augmented data to ML estimate
cov(dat_impute)*(nrow(dat)-1)/nrow(dat)
buildmat(optim(chol(cov(na.exclude(dat)))[c(1,3,4)],function(par)
LL(buildmat(par),dat_missing),control=list(fnscale=-1))$par)
编辑:显然,对于双变量数据有一个封闭形式的解决方案,其中只有一个变量缺少数据。我正在为所有变量都可能缺少数据的任何多元数据寻求更通用的解决方案。我知道应该可以获得校正后的 ML 协方差,因为 R 包norm可以做到这一点(继续上面的 R 示例):
require(norm)
s <- prelim.norm(dat_missing)
e <- em.norm(s)
getparam.norm(s = s,theta = e)
但不幸的是,底层的FORTRAN 例程不够直观。
编辑 2:这是一个非常好的双变量数据教程。关于将其推广到多变量的任何想法?
更新这是我最近的尝试。norm但是,我的实现似乎有问题,因为结果与输出不匹配。
进一步更新该代码现在适用于多变量数据,但前提是每个观测值只有一个缺失值(每行一个 NA)。如果给定的观察结果缺少一个以上的变量,则加权协方差似乎存在问题。
实施基于以下论文(通过 Randel):
Beale, EML, & Little, RJA。(1975 年)。多元分析中的缺失值。皇家统计学会杂志。B 系列(方法),37(1),129–145。
set.seed(4)
dat <- matrix(rnorm(100),ncol=5) # original data
nmissing <- 10
dat_missing <- dat
dat_missing[sample(length(dat_missing),nmissing)] <- NA
is_na <- apply(dat_missing,2,is.na) # index if NAs
dat_impute <- dat_missing # data matrix for imputation
# set initial estimates to means from available data
for(i in 1:ncol(dat_impute)) dat_impute[is_na[,i],i] <- colMeans(dat_missing,na.rm = TRUE)[i]
new_dat_impute <- dat_impute
# starting values for EM
means <- colMeans(dat_impute)
# NOTE: multiplying by (nrow-1)/(nrow) to get ML estimate
# For comparability with norm package output
sigma <- cov(dat_impute)*(nrow(dat_impute)-1)/nrow(dat_impute)
# matrix of regression coefficients -- one for each variable
betas <- matrix(0,ncol(sigma)-1,ncol(sigma))
# carry out EM over 1000 iterations
for(j in 1:1000)
{
# get updated means and covariance matrix
means <- colMeans(dat_impute)
new_sigma <- cov(dat_impute)*(nrow(dat_impute)-1)/nrow(dat_impute)
# correct for bias in covariance matrix
for(i in 1:ncol(sigma))
{
mis <- crossprod(is_na)
for(ii in 1:ncol(sigma))
{
if(i!=ii)
{
# partial correlation squared
##### This is where the problem is
R2 <- (solve(sigma)[i,ii] / sqrt(solve(sigma)[i,i]*solve(sigma)[ii,ii]))^2
new_sigma[i,ii] <- new_sigma[i,ii] + sigma[i,ii]*(1-R2)*(mis[i,ii])/nrow(dat_impute)
} else
{
R2 <- t(sigma[-i,i]) %*% (solve(sigma[-i,-i]) %*% sigma[-i,i]) / sigma[i,i]
new_sigma[i,ii] <- new_sigma[i,ii] + sigma[i,ii]*(1-R2)*mis[i,ii]/nrow(dat_impute)
}
}
}
sigma <- new_sigma
# get estimates of beta and intercept if any missing data in column i
for(i in 1:ncol(sigma))
{
if(any(is_na[,i]))
{
betas[,i] <- solve(sigma[-i,-i]) %*% sigma[-i,i]
intercept <- means[i] - means[-i] %*% betas[,i]
dat_impute[which(is_na[,i]),i] <- (cbind(1,dat_impute[,-i]) %*% c(intercept,betas[,i]))[which(is_na[,i]),1]
}
}
}
# compare results to norm package output
s <- prelim.norm(dat_missing)
e <- em.norm(s,criterion=1e-32,showits = FALSE)
# compare means
max(abs(getparam.norm(s,e)[[1]] - means))
# compare covariance matrix
max(abs(getparam.norm(s,e)[[2]] - new_sigma))