机器算法验证 - 在 2D 中集成核密度估计器 - 吾爱随笔录

在 2D 中集成核密度估计器

机器算法验证可能性最大似然内核平滑数值积分

2022-02-26 07:30:33

我来自这个问题，以防有人想跟踪。

基本上，我有一个个对象组成的，其中每个对象都附加了给定数量的测量值（在本例中为两个）： $\Omega$ $N$

Ω = o_{1} [x_{1}, y_{1}], o_{2} [x_{2}, y_{2}], . . ., o_{N} [x_{N}, y_{N}]

$\Omega = o_1[x_1, y_1], o_2[x_2, y_2], ..., o_N[x_N, y_N]$

我需要一种方法来确定属于的新对象的概率，因此在该问题中建议我通过核密度估计器获得概率密度，我相信我已经有。 $p[x_p, y_p]$ $\Omega$ $\hat{f}$

因为我的目标是获得这个新对象 ( ) 属于这个 2D 数据集的概率，我被告知要整合 pdf over "支持的值密度小于您观察到的“。“观察到的”密度是在新对象，即：。所以我需要解方程： $p[x_p, y_p]$ $\Omega$ $\hat{f}$ $\hat{f}$ $p$ $\hat{f}(x_p, y_p)$

\iint_{x, y : \hat{f} (x, y) < \hat{f} (x_{p}, y_{p})} \hat{f} (x, y) d x d y

$\iint_{x, y:\hat{f}(x, y) < \hat{f}(x_p, y_p)} \hat{f}(x,y)\,dx\,dy$

我的 2D 数据集的 PDF（通过 python 的stats.gaussian_kde模块获得）如下所示：

在此处输入图像描述

其中红点表示在我的数据集的 PDF 上绘制的新对象 $p[x_p, y_p]$

所以问题是：当pdf看起来像这样时，我如何计算极限 $x, y:\hat{f}(x, y) < \hat{f}(x_p, y_p)$

添加

我做了一些测试，看看我在其中一个评论中提到的蒙特卡洛方法的效果如何。这就是我得到的：

对于较低密度区域，这些值似乎变化更大，两个带宽显示或多或少相同的变化。表中最大的变化出现在比较 Silverman 的 2500 与 1000 样本值的点 (x,y)=(2.4,1.5) 上，其差值为0.0126或~1.3%。就我而言，这在很大程度上是可以接受的。

编辑：我刚刚注意到，根据此处给出的定义，在二维斯科特的规则等同于西尔弗曼的规则。

2个回答

一种简单的方法是栅格化积分域并计算积分的离散近似值。

有一些事情需要注意：

确保覆盖点的范围之外：您需要包括内核密度估计将具有任何可观值的所有位置。这意味着您需要将点的范围扩大到内核带宽的三到四倍（对于高斯内核）。
结果会随着栅格的分辨率而有所不同。 分辨率需要是带宽的一小部分。由于计算时间与栅格中的像元数量成正比，因此使用比预期分辨率更粗糙的分辨率执行一系列计算几乎不需要额外的时间：检查较粗分辨率的结果是否与最好的分辨率。如果不是，则可能需要更精细的分辨率。

这是 256 个点的数据集的示意图：

这些点显示为叠加在两个核密度估计上的黑点。六个大红点是评估算法的“探针”。这已针对分辨率为 1000 x 1000 个单元的四种带宽（默认为 1.8（垂直）和 3（水平）、1/2、1 和 5 个单位）完成。下面的散点图矩阵显示了结果对这六个探测点带宽的依赖程度，这些探测点覆盖了广泛的密度范围：

图 2

出现这种变化有两个原因。显然，密度估计不同，引入了一种形式的变化。更重要的是，密度估计的差异会在任何单个（“探测”）点产生很大的差异。后一种变化在点簇的中等密度“边缘”周围最大 - 正是那些可能最常使用此计算的位置。

这表明在使用和解释这些计算的结果时需要非常谨慎，因为它们可能对相对武断的决定（使用的带宽）非常敏感。

代码

该算法包含在第一个函数的半打行中，f. 为了说明它的使用，其余代码生成前面的图。

library(MASS)     # kde2d
library(spatstat) # im class
f <- function(xy, n, x, y, ...) {
  #
  # Estimate the total where the density does not exceed that at (x,y).
  #
  # `xy` is a 2 by ... array of points.
  # `n`  specifies the numbers of rows and columns to use.
  # `x` and `y` are coordinates of "probe" points.
  # `...` is passed on to `kde2d`.
  #
  # Returns a list:
  #   image:    a raster of the kernel density
  #   integral: the estimates at the probe points.
  #   density:  the estimated densities at the probe points.
  #
  xy.kde <- kde2d(xy[1,], xy[2,], n=n, ...)
  xy.im <- im(t(xy.kde$z), xcol=xy.kde$x, yrow=xy.kde$y) # Allows interpolation $
  z <- interp.im(xy.im, x, y)                            # Densities at the probe points
  c.0 <- sum(xy.kde$z)                                   # Normalization factor $
  i <- sapply(z, function(a) sum(xy.kde$z[xy.kde$z < a])) / c.0
  return(list(image=xy.im, integral=i, density=z))
}
#
# Generate data.
#
n <- 256
set.seed(17)
xy <- matrix(c(rnorm(k <- ceiling(2*n * 0.8), mean=c(6,3), sd=c(3/2, 1)), 
               rnorm(2*n-k, mean=c(2,6), sd=1/2)), nrow=2)
#
# Example of using `f`.
#
y.probe <- 1:6
x.probe <- rep(6, length(y.probe))
lims <- c(min(xy[1,])-15, max(xy[1,])+15, min(xy[2,])-15, max(xy[2,]+15))
ex <- f(xy, 200, x.probe, y.probe, lim=lims)
ex$density; ex$integral
#
# Compare the effects of raster resolution and bandwidth.
#
res <- c(8, 40, 200, 1000)
system.time(
  est.0 <- sapply(res, 
           function(i) f(xy, i, x.probe, y.probe, lims=lims)$integral))
est.0
system.time(
  est.1 <- sapply(res, 
           function(i) f(xy, i, x.probe, y.probe, h=1, lims=lims)$integral))
est.1
system.time(
  est.2 <- sapply(res, 
           function(i) f(xy, i, x.probe, y.probe, h=1/2, lims=lims)$integral))
est.2
system.time(
  est.3 <- sapply(res, 
           function(i) f(xy, i, x.probe, y.probe, h=5, lims=lims)$integral))
est.3
results <- data.frame(Default=est.0[,4], Hp5=est.2[,4], 
                      H1=est.1[,4], H5=est.3[,4])
#
# Compare the integrals at the highest resolution.
#
par(mfrow=c(1,1))
panel <- function(x, y, ...) {
  points(x, y)
  abline(c(0,1), col="Red")
}
pairs(results, lower.panel=panel)
#
# Display two of the density estimates, the data, and the probe points.
#
par(mfrow=c(1,2))
xy.im <- f(xy, 200, x.probe, y.probe, h=0.5)$image
plot(xy.im, main="Bandwidth=1/2", col=terrain.colors(256))
points(t(xy), pch=".", col="Black")
points(x.probe, y.probe, pch=19, col="Red", cex=.5)

xy.im <- f(xy, 200, x.probe, y.probe, h=5)$image
plot(xy.im, main="Bandwidth=5", col=terrain.colors(256))
points(t(xy), pch=".", col="Black")
points(x.probe, y.probe, pch=19, col="Red", cex=.5)

如果您有大量的观察，您可能根本不需要进行任何整合。假设您的新观点是。假设你有一个密度估计器 ; 将的观察次数相加并除以样本大小。这为您提供了所需概率的近似值。 ${\bf{x}}_0$ $\hat f$ ${\bf{x}}$ $\hat f({\bf{x}}) < \hat f({\bf{x}}_0)$

这假设不是“太小”，并且您的样本量足够大（并且足够分散），可以在低密度区域给出一个不错的估计。而言，20000 个案例似乎确实足够大。 $\hat f({\bf{x}}_0)$ ${\bf{x}}$

其它你可能感兴趣的问题

上一篇为什么不使用梯度下降优化 k-means？下一篇适用于时态数据的聚类技术？