机器算法验证 - 发生的概率是多少nn中的随机点dd维度是线性可分的？ - 吾爱随笔录

发生的概率是多少nn中的随机点dd维度是线性可分的？

机器算法验证可能性分类数理统计分离

2022-02-11 18:00:58

给定 $n$ 数据点，每个都有 $d$ 特征， $n/2$ 被标记为 $0$ ，另一个 $n/2$ 被标记为 $1$ . 每个特征都取一个值 $[0,1]$ 随机（均匀分布）。存在一个可以分裂这两个类的超平面的概率是多少？

让我们首先考虑最简单的情况，即 $d = 1$ .

2个回答

假设数据中不存在重复项。

如果，则概率为。 $n\leq d+1$ $\text{Pr}=1$

对于的其他组合，请参见下图： $(n,d)$

我生成了模拟 OP 中指定的输入和输出数据的图。由于Hauck-Donner 效应，线性可分性被定义为逻辑回归模型中的收敛失败。

我们可以看到概率随着的增加而降低。事实上，我们可以拟合一个将与关联起来的模型，结果如下： $n$ $n, d$ $p$

P (n, d) = \frac{1}{1 + e^{- (5.82944 - 4.58261 \times n + 1.37271 \times d - 0.0235785 \times n \times d)}}

$P(n,d)={ 1 \over {1 + e^ {-(5.82944-4.58261\times n + 1.37271 \times d -0.0235785 \times n \times d)} } }$

情节代码（在 Julia 中）：

using GLM

ds = 10; #number of dimensions to be investigated
ns = 100 #number of examples to be investigated
niter = 1000; #number of iterations per d per n
P = niter * ones(Int64, ds, ns); #starting the number of successes

for d in 1:ds
    for n in (d+1):ns
        p = 0 #0 hits
        for i in 1:niter
            println("Dimensions: $d; Samples: $n; Iteration: $i;")
            try #we will try to catch errors in the logistic glm, these are due to perfect separability
                X = hcat(rand((n,d)), ones(n)); #sampling from uniform plus intercept
                Y = sample(0:1, n)  #sampling a binary outcome
                glm(X, Y, Binomial(), LogitLink())
            catch
                p = p+1 #if we catch an error, increase the count
            end
        end
        P[d,n] = p
    end
end

using Plots

gui(heatmap(P./niter, xlabel = "Number of Samples", ylabel = "Number of Dimensions", title = "Probability of linear separability"))

与相关的模型代码（在 Julia 中）： $(n,d)$ $p$

probs = P./niter
N = transpose(repmat(1:ns, 1, ds))
D = repmat(1:ds, 1, ns)

fit = glm(hcat(log.(N[:]), D[:], N[:].*D[:], ones(ds*ns)), probs[:], Binomial(), LogitLink())
coef(fit)
#4-element Array{Float64,1}:
# -4.58261
#  1.37271
# -0.0235785
#  5.82944

gui(heatmap(reshape(predict(fit), ds, ns), xlabel = "Number of Samples", ylabel = "Number of Dimensions", title = "Fit of probability of linear separability"))

这与Cover 定理有关。

这里给出了 Emin Orhan 的一个很好的总结。

Ps：我会在评论中发布这个，但没有足够的声誉。

其它你可能感兴趣的问题

上一篇如何在嵌套交叉验证中获取超参数？下一篇交叉验证 (CV) 和广义交叉验证 (GCV) 统计