How to calculate the overlap between empirical probability densities?

Cross Validated  r  probability  density-function  kernel-smoothing
2022-02-04 15:48:40

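I'm looking for a method to calculate the area of overlap between two kernel density estimates in R, as a measure of similarity between two samples. To clarify, in the following example I would need to quantify the area of the purplish overlapping region: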

library(ggplot2)
set.seed(1234)
d <- data.frame(variable=c(rep("a", 50), rep("b", 30)), value=c(rnorm(50), runif(30, 0, 3)))
ggplot(d, aes(value, fill=variable)) + geom_density(alpha=.4, color=NA)


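A similar question was discussed here, the difference being that I need to do this for arbitrary empirical data rather than predefined normal distributions. The overlap package addresses this question, but apparently only for timestamp data, which doesn't work for me. The Bray-Curtis index (as implemented in the vegdist(method="bray") function of the vegan package) also seems relevant, but again for somewhat different data.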

I'm interested in both the theoretical approach and the R functions I might use to implement it.

4 Answers

The area of overlap of two kernel density estimates may be approximated to any desired degree of accuracy.

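1) Since the original KDEs have probably been evaluated over some grid, if the grid is the same for both (or can easily be made the same), the exercise could be as easy as taking $\min(K_1(x), K_2(x))$ at each point and then applying the trapezoidal rule, or even a midpoint rule.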

If the two are on different grids and can't easily be recalculated on the same grid, interpolation could be used.

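2) You might find the point (or points) of intersection and integrate the lower of the two KDEs over each interval where it is the lower one. In the plot above, you would integrate the blue curve to the left of the intersection and the pink one to the right, by whatever means you like or have available. This can be done essentially exactly by considering the area under each kernel component $\frac{1}{h}K\left(\frac{x-x_i}{h}\right)$ to the left or right of that cut-off point.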
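The second approach can be sketched as follows, assuming Gaussian kernels and the default `bw.nrd0` bandwidth that `density()` uses. The helper names (`kde_pdf`, `kde_cdf`, `diff_fun`), the simulated samples, and the search grid are illustrative choices, not part of the answer. Crossings are bracketed on a grid and refined with `uniroot()`, and each piece of area comes from averaging normal CDFs, so no numerical quadrature is needed.

```r
set.seed(1)
a <- rnorm(100)
b <- rnorm(100, 2)
ha <- bw.nrd0(a)  # default bandwidth rule of density()
hb <- bw.nrd0(b)

# A Gaussian KDE is the average of normal densities centred at the data points,
# so its CDF is the average of the matching normal CDFs.
kde_pdf <- function(x, s, h) sapply(x, function(z) mean(dnorm(z, s, h)))
kde_cdf <- function(x, s, h) sapply(x, function(z) mean(pnorm(z, s, h)))
diff_fun <- function(x) kde_pdf(x, a, ha) - kde_pdf(x, b, hb)

# Bracket sign changes of the difference on a grid, then refine with uniroot()
pad <- 4 * max(ha, hb)
grid <- seq(min(a, b) - pad, max(a, b) + pad, length.out = 2001)
d <- diff_fun(grid)
idx <- which(head(d, -1) * tail(d, -1) < 0)
cuts <- vapply(idx, function(i) {
  uniroot(diff_fun, c(grid[i], grid[i + 1]), tol = 1e-10)$root
}, numeric(1))

# Between consecutive cut points, add the area under whichever KDE is lower
breaks <- c(-Inf, cuts, Inf)
test_pts <- (c(grid[1], cuts) + c(cuts, grid[length(grid)])) / 2
overlap_exact <- 0
for (k in seq_along(test_pts)) {
  a_is_lower <- diff_fun(test_pts[k]) < 0
  s <- if (a_is_lower) a else b
  h <- if (a_is_lower) ha else hb
  overlap_exact <- overlap_exact +
    kde_cdf(breaks[k + 1], s, h) - kde_cdf(breaks[k], s, h)
}
overlap_exact
```

Because the loop runs over every interval between detected crossings, this handles more than one intersection, provided the grid is fine enough to bracket each of them.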

However, whuber's comment above should be clearly borne in mind: this is not necessarily a very meaningful thing to do.

For the sake of completeness, here is how I ended up doing this in R:

# simulate two samples
a <- rnorm(100)
b <- rnorm(100, 2)

# define limits of a common grid, adding a buffer so that tails aren't cut off
lower <- min(c(a, b)) - 1 
upper <- max(c(a, b)) + 1

# generate kernel densities
da <- density(a, from=lower, to=upper)
db <- density(b, from=lower, to=upper)
d <- data.frame(x=da$x, a=da$y, b=db$y)

# calculate intersection densities
d$w <- pmin(d$a, d$b)

# integrate areas under curves
library(sfsmisc)
total <- integrate.xy(d$x, d$a) + integrate.xy(d$x, d$b)
intersection <- integrate.xy(d$x, d$w)

# compute overlap coefficient
overlap <- 2 * intersection / total

As mentioned, there is inherent uncertainty and subjectivity involved in generating the KDEs and also in the integration.
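The code above puts both KDEs on a common grid through the from and to arguments. Where two estimates already exist on different grids, the interpolation route mentioned in the first answer might look like the sketch below. It uses only base R; treating each KDE as zero outside its own grid is an assumption made here for simplicity.

```r
set.seed(1)
a <- rnorm(100)
b <- rnorm(100, 2)
da <- density(a)  # default calls: each estimate gets its own grid
db <- density(b)

# common grid spanning both estimates
x <- seq(min(da$x, db$x), max(da$x, db$x), length.out = 1024)

# linear interpolation onto the common grid; outside its own grid
# each KDE is treated as 0 (an assumption of this sketch)
ya <- approx(da$x, da$y, xout = x, yleft = 0, yright = 0)$y
yb <- approx(db$x, db$y, xout = x, yleft = 0, yright = 0)$y

# pointwise minimum, integrated with the trapezoidal rule
w <- pmin(ya, yb)
overlap_interp <- sum(diff(x) * (head(w, -1) + tail(w, -1)) / 2)
overlap_interp
```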

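First off, I may be wrong, but I think your solution wouldn't work in cases where the kernel density estimates (KDEs) intersect at multiple points. Second, although the overlap package was created for use with timestamp data, you can still use it to estimate the area of overlap of any two KDEs. You simply have to rescale your data so that it ranges from 0 to 2π.
For example: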

# simulate two samples
 a <- rnorm(100)
 b <- rnorm(100, 2)

# To use overlapTrue() {overlap} the scale must be in radians (i.e. 0 to 2pi).
# To keep the *relative* values of a and b the same, combine a and b in the
# same data frame before rescaling. You'll need to load the 'scales' library.
# But first add a "Source" column to be able to distinguish between a and b
# after they are combined.
 a = data.frame( value = a, Source = "a" )
 b = data.frame( value = b, Source = "b" )
 d = rbind(a, b)
 library(scales) 
 d$value <- rescale( d$value, to = c(0,2*pi) )

# Now you can create the rescaled a and b vectors
 a <- d[d$Source == "a", 1]
 b <- d[d$Source == "b", 1]

# You can then calculate the area of overlap as you did previously.
# It should give almost exactly the same answers.
# Or you can use either the overlapTrue() or the overlapEst() function
# provided with the overlap package.
# Note that with these functions the KDEs are fitted using a von Mises kernel.
 library(overlap)
  # Using overlapTrue():
   # define limits of a common grid, adding a buffer so that tails aren't cut off
     lower <- min(d$value)-1 
     upper <- max(d$value)+1
   # generate kernel densities
     da <- density(a, from=lower, to=upper, adjust = 1)
     db <- density(b, from=lower, to=upper, adjust = 1)
   # Compute overlap coefficient
     overlapTrue(da$y,db$y)


  # Using overlapEst():            
    overlapEst(a, b, kmax = 3, adjust=c(0.8, 1, 4), n.grid = 500)

# You can also plot the two KDEs and the region of overlap using overlapPlot(),
# but sadly I haven't found a way of changing the x scale so that its range
# corresponds to the initial x values rather than the rescaled values.
# You can only change the maximum value of the scale using the xscale argument
# (i.e. it always ranges from 0 to n, where n is set with xscale = n).
# So if some of your data take negative values, you're probably better off with
# a different plotting method. You can change the x label with the xlab
# argument.
  overlapPlot(a, b, xscale = 10, xlab= "x metrics", rug=TRUE)

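Another approach to empirical estimation uses the ROC (receiver operating characteristic) technique. The Youden threshold gives us an empirical estimate of the main point of intersection (see https://journals.lww.com/epidem/Fulltext/2005/01000/Optimal_Cut_point_and_Its_Corresponding_Youden.11.aspx and https://math.stackexchange.com/questions/2404750/intersection-normal-distributions-and-minimal-decision-error/2435957#2435957).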

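The Youden threshold is the threshold that maximizes the sum of the test's sensitivity and specificity and, equivalently, minimizes the sum of the error rates (false positive rate and false negative rate). The overlap is equal to this minimized sum of error rates.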

library(UncertainInterval)
simple_roc2 <- function(ref, test){
  tab = table(test, ref) # head(tab)
  data.frame(threshold=paste('>=',rownames(tab)), 
             ref0 = tab[,1], 
             ref1 = tab[,2],  
             FPR = rev(cumsum(rev(tab[,1])/sum(tab[,1]))), # 1-Sp
             TPR = rev(cumsum(rev(tab[,2])/sum(tab[,2]))), # Se
             row.names=1:nrow(tab))
}
a <- rnorm(10000)
b <- rnorm(10000, 2)
test=c(a,b)
ref=c(rep(0, length(a)), rep(1, length(b)))
# table(test, ref)
res = simple_roc2(ref, test)
res$FNR = 1-res$TPR # 1-Se
pos.optimal.threshold = which.min(res$FPR+res$FNR)
optimal.threshold=as.numeric(row.names(table(test, ref))[pos.optimal.threshold]) # Youden threshold
plotMD(ref, test) # library(UncertainInterval) # includes kernel intersection estimate
abline(v=optimal.threshold, col='red')
overlap1(a, b) # KDE-based estimate; overlap1() is defined in the simulation code further below
(overlap2 = min(res$FPR+res$FNR))

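In this case, this nonparametric estimate has a slight tendency to underestimate the true value. The ROC technique handles only one (the main) point of intersection. It does not rely on any specific distribution. Make sure that distribution b has the higher values (mean(b) > mean(a)).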
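The simulation further down calls a function roc.overlap() whose definition does not appear in this post. A minimal sketch of what such a helper might look like, assuming it simply wraps the Youden calculation above; the body is a reconstruction, not the original author's code, and it uses base R only.

```r
# Hypothetical reconstruction of roc.overlap(); assumes mean(b) > mean(a)
roc.overlap <- function(a, b) {
  thresholds <- sort(unique(c(a, b)))
  # false positive rate: share of a classified as b (value >= threshold)
  fpr <- sapply(thresholds, function(t) mean(a >= t))
  # false negative rate: share of b classified as a (value < threshold)
  fnr <- sapply(thresholds, function(t) mean(b < t))
  # the minimized sum of both error rates is the overlap estimate
  min(fpr + fnr)
}

set.seed(1)
a <- rnorm(2000)
b <- rnorm(2000, 2)
roc.overlap(a, b)
```

Taking the minimum over all observed thresholds mirrors `which.min(res$FPR + res$FNR)` in the code above. Because it picks the most favourable threshold in the sample, some downward bias is to be expected.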

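Repeatedly inspecting the plot produced by plotMD shows that, for 2 * 100 cases, the sample overlap varies considerably. Most of this variation is due to sampling variability; however, depending on the distributions, every method has conditions under which it does not work well. Using Gaussian kernel densities is sensitive to peaks in the data and may lead to underestimation. Kernel density methods depend on the tuning parameters given to the density function. The ROC method has no parameters, but it assumes a single point of intersection. It may therefore overestimate the overlap when an additional intersection is present (the key point is the presence of more than one intersection, not the variances). When that secondary intersection lies in the tails of both distributions, the overestimation may be negligible.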
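To get a feel for how much the kernel estimate depends on its tuning parameters, one could recompute the overlap for a few bandwidth multipliers. The sketch below is an illustration in base R; the helper name kde_overlap, the buffer of 3 around the data, and the chosen adjust values are arbitrary assumptions.

```r
# Hypothetical helper: KDE overlap on a common grid for a given bandwidth multiplier
kde_overlap <- function(a, b, adjust = 1, buffer = 3) {
  lower <- min(a, b) - buffer
  upper <- max(a, b) + buffer
  da <- density(a, from = lower, to = upper, adjust = adjust)
  db <- density(b, from = lower, to = upper, adjust = adjust)
  w <- pmin(da$y, db$y)  # both KDEs share the same grid da$x
  sum(diff(da$x) * (head(w, -1) + tail(w, -1)) / 2)  # trapezoidal rule
}

set.seed(1)
a <- rnorm(100)
b <- rnorm(100, 2)
adjusts <- c(0.5, 1, 2)
ov <- sapply(adjusts, function(k) kde_overlap(a, b, adjust = k))
setNames(round(ov, 3), adjusts)
```

If the three numbers differ noticeably for your data, the choice of bandwidth is driving part of the result.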

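How should the different methods and suggestions be judged? Designing a test is easiest when we know the true value for two distributions. The true overlap of two normal distributions is easy to calculate. Because they have equal variances, the point of intersection is simply the average of the two means: 1. The true overlap is 0.3173105: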

(true.overlap = pnorm(1,2,1)+ 1-pnorm(1,0,1))

For a general method of calculating the points of intersection of two normal distributions, see https://stackoverflow.com/questions/16982146/point-of-intersection-2-normal-curves/45184024#45184024 .
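For reference, a sketch of one such general calculation: setting the two log densities equal gives a quadratic in x, and the overlap then follows from normal CDFs on the intervals between the roots. The function names normal_intersections and normal_overlap are made up for this illustration and are not taken from the linked answer; the two distributions are assumed to differ.

```r
# Intersection points of N(m1, s1) and N(m2, s2); assumes the two differ
normal_intersections <- function(m1, s1, m2, s2) {
  if (s1 == s2) return((m1 + m2) / 2)  # equal variances: single crossing
  # coefficients of A x^2 + B x + C = 0 from equating the log densities
  A <- 1 / (2 * s1^2) - 1 / (2 * s2^2)
  B <- m2 / s2^2 - m1 / s1^2
  C <- m1^2 / (2 * s1^2) - m2^2 / (2 * s2^2) + log(s1 / s2)
  sort((-B + c(-1, 1) * sqrt(B^2 - 4 * A * C)) / (2 * A))
}

# Overlap: area under whichever density is lower on each interval
normal_overlap <- function(m1, s1, m2, s2) {
  cuts <- normal_intersections(m1, s1, m2, s2)
  breaks <- c(-Inf, cuts, Inf)
  # one finite test point inside each interval
  pts <- c(cuts[1] - 1, head(cuts, -1) + diff(cuts) / 2,
           cuts[length(cuts)] + 1)
  total <- 0
  for (k in seq_along(pts)) {
    first_lower <- dnorm(pts[k], m1, s1) < dnorm(pts[k], m2, s2)
    m <- if (first_lower) m1 else m2
    s <- if (first_lower) s1 else s2
    total <- total + pnorm(breaks[k + 1], m, s) - pnorm(breaks[k], m, s)
  }
  total
}

normal_overlap(0, 1, 2, 1)  # equal variances, as in the example above
normal_overlap(0, 1, 0, 2)  # unequal variances: two crossings
```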

In the original question, there is a mix of a normal and a uniform distribution. In that case, the true value is:

    true.value=sum(pmin(diff(pnorm(0:3)),1/3)) 
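The one-liner above compares probability masses over the unit intervals (0,1), (1,2) and (2,3), which only approximates the integral of the pointwise minimum of the two densities and can never fall below it. A finer check, assuming the N(0,1) and U(0,3) distributions from the question, might look like this; the variable names are illustrative.

```r
# crossing point where dnorm(x) equals the uniform density 1/3
x_star <- sqrt(-2 * log(sqrt(2 * pi) / 3))

# left of x_star the uniform density is the lower one,
# right of it (up to 3) the normal density is the lower one
exact <- x_star / 3 + pnorm(3) - pnorm(x_star)

# numerical check: integrate the pointwise minimum over the uniform's support
numeric_check <- integrate(function(x) pmin(dnorm(x), dunif(x, 0, 3)), 0, 3)$value

# the interval-wise value from the post, for comparison
coarse <- sum(pmin(diff(pnorm(0:3)), 1/3))

c(exact = exact, numeric = numeric_check, coarse = coarse)
```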

Running simulations can show us which estimation method produces estimates closest to the true value:

library(sfsmisc)
overlap1 <- function(a,b){
  lower <- min(c(a, b)) - 1 
  upper <- max(c(a, b)) + 1
  
  # generate kernel densities
  da <- density(a, from=lower, to=upper)
  db <- density(b, from=lower, to=upper)
  d <- data.frame(x=da$x, a=da$y, b=db$y)
  
  # calculate intersection densities
  d$w <- pmin(d$a, d$b)
  
  # integrate areas under curves
  total <- integrate.xy(d$x, d$a) + integrate.xy(d$x, d$b)
  intersection <- integrate.xy(d$x, d$w)
  
  # compute overlap coefficient
  2 * intersection / total
}

library(overlap)
library(scales)
# For explanation of the next function see the answer of S. Venne
overlapEstimates =function(a, b){

  a = data.frame( value = a, Source = "a" )
  b = data.frame( value = b, Source = "b" )
  d = rbind(a, b)
  
  d$value <- scales::rescale( d$value, to = c(0,2*pi) )
  
  a <- d[d$Source == "a", 1]
  b <- d[d$Source == "b", 1]
  
  overlapEst(a, b, kmax = 3, adjust=c(0.8, 1, 4), n.grid = 500)
}

nsim=1000; nobs=100; m1=4; sd1=1; m2=6; sd2=1; poi=5
(true.overlap= 1-pnorm(poi, m1, sd1)+pnorm(poi,m2,sd2))
out=matrix(NA,nrow=nsim,ncol=4)
set.seed(0)
for (i in 1:nsim){
  x <- rnorm( nobs, m1, sd1 )
  y <- rnorm( nobs, m2, sd2 )
  
  out[i,1] = overlap1(x,y)
  out[i,2] = overlapping::overlap(list( x = x, y = y ))$OV
  out[i,3] = overlapEstimates(x,y)['Dhat4']
  out[i,4] = roc.overlap(x,y)
}
(true.overlap=pnorm(poi,m2,sd2)+1-pnorm(poi,m1,sd1))
colMeans(out-true.overlap) # estimation errors
apply(out, 2, sd) # sd of the estimation errors
apply(out, 2, range)-true.overlap
par(mfrow=c(2,2))
br = seq(-.33,+.33,by=0.05)
hist(out[,1]-true.overlap, breaks=br, ylim=c(0,500), 
     xlim=c(-.33,.33), main='overlap1'); 
abline(v=0, col='red')
hist(out[,2]-true.overlap, breaks=br, ylim=c(0,500), 
     xlim=c(-.33,.33), main='overlapping::overlap')
abline(v=0, col='red')
hist(out[,3]-true.overlap, breaks=br, ylim=c(0,600), 
     xlim=c(-.33,.33), main='overlap::overlapEst')
abline(v=0, col='red')
hist(out[,4]-true.overlap, breaks=br, ylim=c(0,500), 
     xlim=c(-.33,.33), main="ROC estimate"); 
abline(v=0, col='red')

Histograms of the estimation errors

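In this case, especially the function overlap::overlap has a (slight) tendency to underestimate, while overlap1 shows the smallest estimation errors. Estimates that use density functions in one way or another can give better or worse results, depending on the parameters given to the density function. The ROC method has no parameters, which can be an advantage.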

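It is always wise to look closely at the plot of the overlapping distributions and to devise a relevant test of whether the overlap estimation technique works as intended for the type of data you have. In particular, techniques that systematically produce estimates that are too low or too high are best avoided.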