机器算法验证 - 比较R中负二项分布的两个向量 - 吾爱随笔录

比较R中负二项分布的两个向量

机器算法验证 r 拟合优度负二项分布

2022-04-01 23:32:33

我正在使用 R 并且有两个离散值向量。它们不是严格意义上的分类，因为值本身是在细胞图像上计数的点数（整个向量是图像上的所有细胞）。有两个向量：参考和经过一些扰动后带有点数的向量

我认为这些数据应该遵循负二项分布，并且某种拟合优度应该给出一个 p 值和一些统计数据来描述这两个分布是否显着不同。

人们建议我的是卡方检验可以解决问题，但在我的理解中，卡方仅将所有值视为一个类别，并忽略了这些是数字的事实，如果假设具有 5 个点的单元格数量减少了一点，而增加了 4 个点的单元格与 0 点和 6 点类别发生相同情况的情况不同。

但是我没有找到可以处理负二项分布的测试。我希望我清楚地描述了这个问题。因此，如果有人知道任何可以处理此类数据的测试，或者如果有人认为我的假设是错误的，欢迎您分享您的想法。

示例 1

library(ggplot2)
c.dots = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 3, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
           0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 2, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2,
           0, 0, 0, 1, 0, 0, 1, 0, 0, 3, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0,
           0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 2, 0,
           0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0,
           0, 0, 1, 1, 0, 0, 1, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 0, 1, 1, 0, 0, 3, 0, 0,
           0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0)

w.dots = c(0, 0, 0, 1, 3, 1, 1, 1, 1, 0, 0, 2, 0, 0, 2, 1, 0, 1, 3, 0, 1, 0, 0, 0, 2,
           0, 2, 2, 0, 3, 1, 2, 1, 0, 2, 1, 0, 2, 0, 1, 2, 1, 0, 0, 1, 0, 1, 1, 0, 0,
           0, 1, 1, 0, 2, 0, 0, 1, 3, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1,
           2, 4, 1, 0, 0, 2, 2, 0, 1, 0, 1, 3, 0, 2, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0,
           1, 1, 0, 1, 0, 0, 2, 0, 1, 0, 2, 1, 0, 1, 2, 0, 4, 2, 0, 1, 0, 2, 0, 1, 2,
           1, 1, 2, 1, 1, 3, 1, 0, 1, 0, 1, 2, 0, 1, 2, 0, 1, 1, 2, 2, 0, 3, 0, 1, 1,
           0, 0, 2, 0, 1, 1, 0, 1, 2, 0, 0, 1, 0, 1, 2, 0, 0, 4, 3, 0, 1, 0, 0, 1, 0,
           4, 0, 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 1, 0, 3, 1, 1, 0, 4, 1, 1, 3)

chisq.test(rbind(table(w.dots), table(c.dots)))

nbrand = rnbinom(length(c.dots), mu = 1, size = 1)
ggplot() + 
  geom_density(aes(x=x), data=data.frame(x=c.dots), fill="red", alpha=0.5) + 
  geom_density(aes(x=x), data=data.frame(x=w.dots), fill="blue", alpha=0.5) +
  geom_density(aes(x=x), data=data.frame(x=nbrand), colour="green", alpha=0, linetype=3)

示例 2

library(ggplot2)
c.dots = c(1, 0, 0, 1, 0, 0, 3, 0, 1, 0, 3, 0, 2, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0, 1, 0,
           0, 0, 0, 1, 0, 1, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
           1, 1, 1, 2, 0, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 3, 4,         
           0, 1, 1, 0, 1, 0, 2, 1, 2, 2, 3, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1,
           1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 2, 0, 0, 3, 2, 2, 1, 0, 2, 0, 2, 2, 0, 0, 2,
           1, 0, 2, 0, 0, 2, 2, 1, 0, 0, 0, 0, 0, 1, 0, 3, 0, 1, 0, 0, 1, 0, 0, 0, 0,
           2, 1, 1, 0, 1, 0, 1, 1, 0, 1, 3, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 2, 1, 0,
           1, 2, 0, 0, 3, 3, 0, 1, 2, 0, 0, 1, 1, 0, 1, 1, 3, 1, 3, 0, 2, 0, 0, 0, 0)

w.dots = c(1, 3, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 3, 0, 0, 0, 1, 2, 0,
           1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 5, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 2,
           0, 1, 0, 3, 0, 0, 1, 2, 3, 1, 0, 0, 0, 2, 1, 1, 2, 0, 2, 0, 3, 0, 2, 0, 0,
           0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 1, 2, 3, 0, 2, 2, 1, 0, 1, 0, 0, 1, 0, 1, 0,
           0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 2, 0,
           0, 1, 2, 1, 1, 1, 2, 1, 2, 3, 2, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 1, 1,
           0, 0, 1, 2, 1, 0, 1, 2, 1, 1, 1, 0, 1, 0, 0, 0, 2, 1, 1, 1, 0, 0, 0, 0, 0,
           1, 2, 1, 2, 0, 1, 2, 1, 0, 1, 3, 2, 1, 0, 0, 0, 0, 0, 2, 1, 1, 2, 2, 1, 2)

chisq.test(rbind(table(w.dots), table(c.dots)))

nbrand = rnbinom(length(c.dots), mu = 1, size = 1)
ggplot() + 
  geom_density(aes(x=x), data=data.frame(x=c.dots), fill="red", alpha=0.5) + 
  geom_density(aes(x=x), data=data.frame(x=w.dots), fill="blue", alpha=0.5) +
  geom_density(aes(x=x), data=data.frame(x=nbrand), colour="green", alpha=0, linetype=3)

4个回答

您的因变量是一个计数（“在细胞图像上计数的点数”）。询问两个组的计数分布是否相似在概念上与询问组成员身份是否对计数分布很重要。

我建议将泊松回归作为第一步，用组成员身份对点数进行建模。在第二步中，人们可能会尝试确定“条件方差 = 条件均值”的泊松假设是否被违反，这表明转向准泊松模型，转向具有异方差一致 (HC) 标准误差的泊松模型估计，或负二项式模型。

给定数据c.dots，w.dots如 OP 的示例 1 所示：我们首先创建一个数据框，其中预测变量Y= 点数，预测变量X= 具有组成员资格的因子。然后我们运行一个标准的泊松回归

> dotsDf <- data.frame(Y=c(c.dots, w.dots),
+                      X=factor(rep(c("c", "w"), c(length(c.dots), length(w.dots)))))

> glmFitP <- glm(Y ~ X, family=poisson(link="log"), data=dotsDf)
> summary(glmFitP)                      # Poisson model
Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9289     0.1125  -8.256  < 2e-16 ***
Xw            0.8455     0.1345   6.286 3.26e-10 ***

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 509.01  on 399  degrees of freedom
Residual deviance: 465.90  on 398  degrees of freedom
AIC: 862.16

这表明一个显着的预测变量“组成员资格 = w”是由对分组因子进行虚拟编码产生的（2 个组 => 1 个虚拟预测变量，c是参考水平）。为了比较，我们可以运行准泊松模型，该模型具有额外的条件方差分散参数。

> glmFitQP <- glm(Y ~ X, family=quasipoisson(link="log"), data=dotsDf)
> summary(glmFitQP)                     # quasi-Poisson model
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.9289     0.1298  -7.158 3.96e-12 ***
Xw            0.8455     0.1551   5.450 8.85e-08 ***

(Dispersion parameter for quasipoisson family taken to be 1.330164)

    Null deviance: 509.01  on 399  degrees of freedom
Residual deviance: 465.90  on 398  degrees of freedom
AIC: NA

参数估计值相同，但这些估计值的标准误差稍大。估计的色散参数略大于 1（泊松模型中的值），表明存在一些过度色散。另一种方法是使用具有 HC 一致标准误差的 Poisson 模型：

> library(sandwich)                     # for vcovHC()
> library(lmtest)                       # for coeftest()
> hcSE <- vcovHC(glmFitP, type="HC0")   # HC-consistent standard errors
> coeftest(glmFitP, vcov=hcSE)
z test of coefficients:

            Estimate Std. Error z value  Pr(>|z|)    
(Intercept) -0.92887    0.14084 -6.5952 4.246e-11 ***
Xw           0.84549    0.16033  5.2735 1.339e-07 ***

同样，稍大的标准误差。现在负二项式模型：

> library(MASS)                         # for glm.nb()
> glmFitNB <- glm.nb(Y ~ X, data=dotsDf)
> summary(glmFitNB)                     # negative binomial model
Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9289     0.1192  -7.791 6.66e-15 ***
Xw            0.8455     0.1456   5.806 6.40e-09 ***

(Dispersion parameter for Negative Binomial(3.212) family taken to be 1)

    Null deviance: 427.19  on 399  degrees of freedom
Residual deviance: 391.20  on 398  degrees of freedom
AIC: 856.87

              Theta:  3.21 
          Std. Err.:  1.46 

 2 x log-likelihood:  -850.867

您可以在似然比检验中针对泊松模型检验负二项式模型以进行模型比较：

> library(pscl)                         # for odTest()
> odTest(glmFitNB)
Likelihood ratio test of H0: Poisson, as restricted NB model:
n.b., the distribution of the test-statistic under H0 is non-standard

Critical value of test statistic at the alpha= 0.05 level: 2.7055 
Chi-Square Test Statistic =  7.2978 p-value = 0.003452

这里的结果表明数据不太可能来自泊松模型。

对于 OP 的示例 2，所有这些测试都不重要。

请注意，我稍微缩短了glm()和的输出glm.nb()。

首先，为@caracal 的回答+1。

另一种可能性如下：将 negbin 分布拟合到两个数据集（例如，fitdistr在MASS包中使用）。参数估计是（渐近）正态分布的，我们可以这样得到协方差估计：

vcov(fitdistr(w.dots,densfun="negative binomial"))
vcov(fitdistr(c.dots,densfun="negative binomial"))

不幸的是，我找不到一个很好的方法来测试零假设“两个向量都是由相同的 negbin 参数向量生成的”，这应该是标准一维二维样本 t 检验的多元泛化与合并方差。这可能会有所帮助，但我无权访问。更新：我确实想到了一些东西，见下文。

然而，我们可以做的是引导拟合并绘制引导拟合：

library(boot)
library(MASS)
n.boot <- 1000
w.fit.boot <- boot(w.dots,R=n.boot,
  statistic=function(xx,index)fitdistr(xx[index],densfun="negative binomial")$estimate)
c.fit.boot <- boot(c.dots,R=n.boot,
  statistic=function(xx,index)fitdistr(xx[index],densfun="negative binomial")$estimate)

plot(c.fit.boot$t,pch=21,bg="black",xlab="size",ylab="mu",log="xy",cex=0.5,
  xlim=range(rbind(w.fit.boot$t,c.fit.boot$t)[,1]),
  ylim=range(rbind(w.fit.boot$t,c.fit.boot$t)[,2]))
abline(v=c.fit.boot$t0[1]); abline(h=c.fit.boot$t0[2])
points(w.fit.boot$t,pch=21,bg="red",col="red",cex=0.5)
abline(v=w.fit.boot$t0[1],col="red"); abline(h=w.fit.boot$t0[2],col="red")
legend(x="bottomright",inset=.01,pch=21,col=c("black","red"),pt.bg=c("black","red"),
  legend=c("c.dots","w.dots"))

每个点对应一个自举拟合的 negbin 参数向量。水平线和垂直线给出了原始数据的拟合参数。结果：引导结果示例 1 引导结果示例 2

在我看来，示例 1 中的数据很明显不是来自同一分布，而示例 2 中的数据可能很好。

当然，这一切都取决于 negbin 分布是否正确。但是，可以使用泊松模型或其他解释过度分散的方法进行类似的练习。

更新：现在，我们要测试基于两个原始向量的 negbin 参数估计的联合分布之间的重叠量。让我们从一个简单的一维示例开始。假设我们已经估计了两个正常密度的均值和标准差。在原假设下，这两个密度是相同的。因此，一种可能的测试统计量是密度曲线下的重叠量：

means <- c(1,2)
sds <- c(.2,.3)
xx <- seq(means[1]-3*sds[1],means[2]+3*sds[2],by=.001)
plot(xx,dnorm(xx,means[1],sds[1]),type="l",xlab="",ylab="")
lines(xx,dnorm(xx,means[2],sds[2]))
polygon(x=c(xx,rev(xx)),y=c(rep(0,length(xx)),
  rev(pmin(dnorm(xx,means[1],sds[1]),dnorm(xx,means[2],sds[2])))),col="grey")

一维例子

因此，我们使用简单的数值积分来计算这个区域：

sum(pmin(dnorm(xx,means[1],sds[1]),dnorm(xx,means[2],sds[2])))*mean(diff(xx))
[1] 0.04443207

我们可以很容易地将这种方法推广到二维。在这种情况下，我们必须考虑估计的均值及其协方差椭圆并覆盖二维网格进行积分。我将使用上面的引导值来确定网格尺寸。

library(mvtnorm)
nn <- 100
xx <- seq(min(c(c.fit.boot$t[,1],w.fit.boot$t[,1])),
  max(c(c.fit.boot$t[,1],w.fit.boot$t[,1])),length.out=nn)
yy <- seq(min(c(c.fit.boot$t[,2],w.fit.boot$t[,2])),
  max(c(c.fit.boot$t[,2],w.fit.boot$t[,2])),length.out=nn)
integral <- matrix(NA,nrow=nn,ncol=nn)
for ( ii in 1:nn ) {
  for ( jj in 1:nn ) {
    integral[ii,jj] <-
      min(dmvnorm(c(xx[ii],yy[jj]),mean=c.fit$estimate,sigma=vcov(c.fit)),
        dmvnorm(c(xx[ii],yy[jj]),mean=w.fit$estimate,sigma=vcov(w.fit)))
  }
}
sum(integral)*mean(diff(xx))*mean(diff(yy))
[1] 6.673166e-05  # for the data in example 1

当然，一维和二维积分都可以用笔和纸和/或计算机代数系统进行。integrate()在 R 适用于一维情况下，我想 R 中有工具可以集成二维函数。此外，我假设我会为上面的双循环获得一些反对意见，但我无法让矢量化函数（outer()或mapply()）工作 - 欢迎任何评论。最后，等间距的网格可能不是进行这种集成的最准确方法。

完全矢量化的实现很简单，“积分”不必是 2D 矩阵：

library(mvtnorm)
nn <- 100
xx <- seq(min(c(c.fit.boot$t[,1],w.fit.boot$t[,1])),
  max(c(c.fit.boot$t[,1],w.fit.boot$t[,1])),length.out=nn)
yy <- seq(min(c(c.fit.boot$t[,2],w.fit.boot$t[,2])),
  max(c(c.fit.boot$t[,2],w.fit.boot$t[,2])),length.out=nn)
#vectorized:
xy_pair <- cbind(rep(xx, each=nn, times=1), rep(yy, each=1, times=nn))
integral <- min(dmvnorm(xy_pair,mean=c.fit$estimate,sigma=vcov(c.fit)),
        dmvnorm(xy_pair,mean=w.fit$estimate,sigma=vcov(w.fit)))
sum(integral)*mean(diff(xx))*mean(diff(yy))

目前的答案假设两个分布实际上都是负二项式 (NB)，并且还假设两个样本是独立的。

在此框架中，您有一个具有四个参数的模型，每个分布两个参数，并且想要测试具有两个参数约束的约束。这可以通过标准最大似然 (ML) 理论来完成。似然比检验(LR) 很简单： Stephan Kolassa 在上面使用的fitdistrMASS 函数可以提供 LR 所需的最大似然。另一种测试称为分数测试或拉格朗日乘数测试（LM）；它具有很好的理论特性，但需要一个信息矩阵和一个来自受限 ML 估计的得分向量。第三种可能性是Wald 测试。

设是“c”的参数向量样本，“w”的符号类似。这里代表，而代表R 中的成功概率。完整的参数向量是。您想检验假设。LM 检验基于近似分布其中自由度 $\boldsymbol{\psi}_{\textrm{c}} := [r_{\textrm{c}},\,\pi_{\textrm{c}}]'$ $r$ size $\pi$ prob $\boldsymbol{\theta} := [\boldsymbol{\psi}_{\textrm{c}}',\, \boldsymbol{\psi}_{\textrm{w}}']'$ $H_0:\,\boldsymbol{\psi}_{\textrm{c}} = \boldsymbol{\psi}_{\textrm{w}}$

U^{'} I^{- 1} U \sim χ^{2} (d)

$\mathbf{U}'\mathbf{I}^{-1}\mathbf{U} \sim \chi^2(d)$

d

$d$ 是标量限制的数量，这里是。向量和矩阵是受限估计下的得分向量和信息矩阵，比如。

2

$2$

U

$\mathbf{U}$

I

$\mathbf{I}$

{\hat{θ}}_{0} = [{\hat{ψ}}_{0}^{'}, {\hat{ψ}}_{0}^{'}]^{'}

$\widehat{\boldsymbol{\theta}}_{0}= [{\widehat{\boldsymbol{\psi}}_{0}}',\, {\widehat{\boldsymbol{\psi}}_{0}}']'$

这里是长度为的向量，是。LR 统计量可以通过注意到是块对角线与两个块导致两个样本产生的两个贡献来获得。分数和 观察到的信息矩阵可以以封闭的形式使用。使用该函数，参数化与上面不同，并使用。与其改变参数，我们可以多花一点力气使用集中对数似然。 $\mathbf{U}$ $4$ $\mathbf{I}$ $4 \times 4$ $\mathbf{I}$ $2\times 2$ fitdistr $\mu:= r\times (1-\pi)/\pi$

这里的两个检验具有非常小的值，因此应该明确拒绝原假设。 $p$

使用模拟，似乎信息矩阵在极少数情况下可能是病态的，然后 LM 检验统计量可能会过度变为负数。至少在这种情况下，LR 测试必须优于 LM 测试。 $\mathbf{I}$

## fit NB distr from a sample X using concentrated logLik
fitNB <- function(X) {
  n <- length(X)
  loglik.conc <- function(r) {
    prob <- n*r / (sum(X) + n*r)
    sum( lgamma(r + X) - lgamma(r) - lgamma(X + 1) +
        r * log(prob) + X * log(1 - prob) ) 
  }
  ## find 'r' with an 1D optim...
  res <- optimize(f = loglik.conc, interval = c(0.001, 1000),
                  maximum = TRUE)
  r <- res$maximum[1]
      params <- c(size = r, prob = n*r / (sum(X) + n*r))
      attr(params, "logLik") <- res$objective[1]
  params
}
## compute score vector and info matrix at params 'psi' using closed forms
scoreAndInfo <- function(psi, X) {
  size <- psi[1]; prob <- psi[2]
  n <- length(X)
  U <- c(sum(digamma(size + X) - digamma(size) + log(prob)),  
         sum(size / prob - X / (1-prob) ))
  I <- matrix(c(- sum(trigamma(size + X) - trigamma(size)),  
                -n / prob, -n / prob,  
                sum( size / prob^2  + X / (1-prob)^2)),
              nrow = 2, ncol = 2)
  names(U) <- rownames(I) <- colnames(I) <- c("size", "prob")
  LM <-  as.numeric(t(U) %*% solve(I) %*% U)
  list(score = U, info = I, LM = LM)
}
## continuing on the question code a is for "all" 
c.fit <- fitNB(X = c.dots)
w.fit <- fitNB(X = w.dots)
a.fit <- fitNB(X = c(c.dots, w.dots))
## LR test and p.value
D.LR <- 2 * ( attr(c.fit, "logLik") + attr(w.fit, "logLik") ) -
  2 * attr(a.fit, "logLik") 
p.LR <- pchisq(D.LR, df = 2, lower.tail = FALSE)
## use restricted parameter estimate to compute the LM contributions
c.sI <- scoreAndInfo(psi = a.fit, X = c.dots) 
w.sI <- scoreAndInfo(psi = a.fit, X = w.dots) 
D.LM <- c.sI$LM + w.sI$LM 
p.LM <- pchisq(D.LM, df = 2, lower.tail = FALSE)

您可以使用贝叶斯技术。http://www.indiana.edu/~kruschke/BEST/ 这为您提供了用于比较的样本均值的后验分布。

x<-c(rep(1,200),rep(2,200))
y<-c(c.dots,w.dots)

require(R2jags)  #open jags console.
dataList<-list(x=x,y=y, Ntotal=length(y))

modelstring = "
model {
for ( i in 1:Ntotal ) {
y[i] ~ dnegbin( p[x[i]] , r[x[i]] )
}
for ( j in 1:2) {
p[j] <- r[j]/(r[j]+m[j])
m[j] ~ dgamma(0.01, 0.01)
r[j] ~ dgamma(0.01, 0.01)
v[j] <- r[j]*(1-p[j])/(p[j]*p[j])
}
}
"
writeLines(modelstring,con="model.txt")

parameters=c("m")
adaptSteps = 1000              
burnInSteps = 1000               
nChains = 1                   
numSavedSteps=10000           
thinSteps=1
nPerChain = ceiling( ( numSavedSteps * thinSteps ) / nChains ) 

JagsModel = jags.model( "model.txt" , data=dataList  , 
                        n.chains=nChains , n.adapt=adaptSteps )

codaSamples = coda.samples( JagsModel , variable.names=parameters ,
                            n.iter=nPerChain , thin=thinSteps )

m <- as.matrix(codaSamples)
head(m)
plot(density(m[,1]))
lines(density(m[,2]))

其它你可能感兴趣的问题

上一篇我如何在数学上证明 k 均值聚类收敛到最小平方误差？下一篇区域的时间趋势可视化