经验估计的另一种方法是使用 ROC(接收器操作曲线)技术进行估计。Youden 阈值为我们提供了主要交点的经验估计(参见https://journals.lww.com/epidem/Fulltext/2005/01000/Optimal_Cut_point_and_Its_Corresponding_Youden.11.aspx和https://math.stackexchange.com/问题/2404750/intersection-normal-distributions-and-minimal-decision-error/2435957#2435957)。
Youden 阈值是使测试灵敏度和特异性之和最大化并且错误率(假阳性率和假阴性率)之和最小化的阈值。重叠等于这个错误率的最小总和。
library(UncertainInterval)
simple_roc2 <- function(ref, test){
tab = table(test, ref) # head(tab)
data.frame(threshold=paste('>=',rownames(tab)),
ref0 = tab[,1],
ref1 = tab[,2],
FPR = rev(cumsum(rev(tab[,1])/sum(tab[,1]))), # 1-Sp
TPR = rev(cumsum(rev(tab[,2])/sum(tab[,2]))), # Se
row.names=1:nrow(tab))
}
a <- rnorm(10000)
b <- rnorm(10000, 2)
test=c(a,b)
ref=c(rep(0, length(a)), rep(1, length(b)))
# table(test, ref)
res = simple_roc2(ref, test)
res$FNR = 1-res$TPR # 1-Se
pos.optimal.threshold = which.min(res$FPR+res$FNR)
optimal.threshold=row.names(table(test, ref))[pos.optimal.threshold] # Youden threshold
plotMD(ref, test) # library(UncertainInterval) # includes kernel intersection estimate
abline(v=optimal.threshold, col='red')
overlap1(a, b)
(overlap2 = min(res$FPR+res$FNR))
在这种情况下,这种非参数估计具有对真实值估计不足的轻微趋势。这种 roc-technique 只处理一个(主要)交叉点。它不依赖于任何特定的分布。确保分布 b 具有较高的值 (mean(b) > mean(a))。
反复观察 plotMD 生成的图表明,对于 2 * 100 个案例,样本重叠差异很大。大多数差异是由于样本差异造成的,但是,取决于分布,所有方法都有不能正常工作的条件。使用高斯核密度对数据中的峰值很敏感,可能会被低估。核密度方法取决于赋予密度函数的微调参数。roc 方法没有参数,但它假设一个交点。因此,当存在额外的交叉点时,它可能会高估重叠(关键点是存在多个交叉点,而不是方差)。当该次要交点位于两个分布的尾部时,这种高估可能可以忽略不计。
如何理解不同的方法和建议?当我们知道两个分布的真实值时,设计一个测试是最简单的。两个正态分布的重叠真实值很容易计算。交点只是分布的两个均值的平均值,因为它们具有相等的方差: 1. 真正的重叠是 0.3173105:
(true.overlap = pnorm(1,2,1)+ 1-pnorm(1,0,1))
有关计算两个正态分布的交点的一般方法,请参阅https://stackoverflow.com/questions/16982146/point-of-intersection-2-normal-curves/45184024#45184024 。
在原始问题中,存在正态分布和均匀分布的混合。在这种情况下,真正的价值是:
true.value=sum(pmin(diff(pnorm(0:3)),1/3))
运行模拟可以向我们展示哪种估计方法产生的估计值最接近真实值:
library(sfsmisc)
overlap1 <- function(a,b){
lower <- min(c(a, b)) - 1
upper <- max(c(a, b)) + 1
# generate kernel densities
da <- density(a, from=lower, to=upper)
db <- density(b, from=lower, to=upper)
d <- data.frame(x=da$x, a=da$y, b=db$y)
# calculate intersection densities
d$w <- pmin(d$a, d$b)
# integrate areas under curves
total <- integrate.xy(d$x, d$a) + integrate.xy(d$x, d$b)
intersection <- integrate.xy(d$x, d$w)
# compute overlap coefficient
2 * intersection / total
}
library(overlap)
library(scales)
# For explanation of the next function see the answer of S. Venne
overlapEstimates =function(a, b){
a = data.frame( value = a, Source = "a" )
b = data.frame( value = b, Source = "b" )
d = rbind(a, b)
d$value <- scales::rescale( d$value, to = c(0,2*pi) )
a <- d[d$Source == "a", 1]
b <- d[d$Source == "b", 1]
overlapEst(a, b, kmax = 3, adjust=c(0.8, 1, 4), n.grid = 500)
}
nsim=1000; nobs=100; m1=4; sd1=1; m2=6; sd2=1; poi=5
(true.overlap= 1-pnorm(poi, m1, sd1)+pnorm(poi,m2,sd2))
out=matrix(NA,nrow=nsim,ncol=4)
set.seed(0)
for (i in 1:nsim){
x <- rnorm( nobs, m1, sd1 )
y <- rnorm( nobs, m2, sd2 )
out[i,1] = overlap1(x,y)
out[i,2] = overlapping::overlap(list( x = x, y = y ))$OV
out[i,3] = overlapEstimates(x,y)['Dhat4']
out[i,4] = roc.overlap(x,y)
}
(true.overlap=pnorm(poi,m2,sd2)+1-pnorm(poi,m1,sd1))
colMeans(out-true.overlap) # estimation errors
apply(out, 2, sd) # # sd of the estimation errors
apply(out, 2, range)-true.overlap
par(mfrow=c(2,2))
br = seq(-.33,+.33,by=0.05)
hist(out[,1]-true.overlap, breaks=br, ylim=c(0,500),
xlim=c(-.33,.33), main='overlap1');
abline(v=0, col='red')
hist(out[,2]-true.overlap, breaks=br, ylim=c(0,500),
xlim=c(-.33,.33), main='overlapping::overlap')
abline(v=0, col='red')
hist(out[,3]-true.overlap, breaks=br, ylim=c(0,600),
xlim=c(-.33,.33), main='overlap::overlapEst')
abline(v=0, col='red')
hist(out[,4]-true.overlap, breaks=br, ylim=c(0,500),
xlim=c(-.33,.33), main="ROC estimate");
abline(v=0, col='red')
在这种情况下,尤其是函数overlap::overlap 有(轻微)低估的趋势,而overlap1 显示出最小的估计误差。以一种或另一种方式使用密度函数的估计可以产生更好或更差的结果,具体取决于给定密度函数的参数。roc 方法没有参数,这可能是一个优势。
仔细查看重叠分布图并设计相关的测试方法始终是明智的,重叠估计技术对于您拥有的数据类型是否按预期工作。尤其是系统地产生通常太低或太高的估计值的技术最好不要使用。