Finding a function to fit a sigmoid-like curve

Cross Validated  r  curve-fitting  splines  logistic-curve  sigmoid-curve

2022-03-30 21:38:29

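I am looking for a function that fits a sigmoid-like curve to experimental data points.

The model (function) does not matter and it does not have to be physically meaningful; I only want to be able to compute y for any x. I also don't want to just interpolate between pairs of points.

Here is an example:

Here is the corresponding raw data: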

| X             | Y              |
|---------------|----------------|
| 0             | 0              |
| 1,6366666667  | -12,2012787905 |
| 3,2733333333  | -13,7833876716 |
| 4,91          | -10,5943208589 |
| 6,5466666667  | -1,3584575518  |
| 8,1833333333  | 8,1590423167   |
| 9,82          | 13,8827937482  |
| 10,4746666667 | 18,4965880076  |
| 11,4566666667 | 42,1205206106  |
| 11,784        | 45,0528073182  |
| 12,4386666667 | 76,8150755186  |
| 13,0933333333 | 80,0883540997  |
| 14,73         | 89,7784173678  |
| 16,3666666667 | 98,8113459392  |
| 19,64         | 104,104366506  |
| 22,9133333333 | 105,9929585305 |
| 26,1866666667 | 94,0070414695  |

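Do you have any ideas? My problem is that the data drop below 0 at some points.

Edit:

Some of you were bothered by the last point. To clarify: there should be a plateau at the end of the curve. The last point is simply a little off, and I will probably remove it from the data when I start fitting.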

4 Answers

I think a smoothing spline with a small number of degrees of freedom will do. Here is an example in R:

R code:

txt <- "| 0             | 0              |
| 1.6366666667  | -12.2012787905 |
| 3.2733333333  | -13.7833876716 |
| 4.91          | -10.5943208589 |
| 6.5466666667  | -1.3584575518  |
| 8.1833333333  | 8.1590423167   |
| 9.82          | 13.8827937482  |
| 10.4746666667 | 18.4965880076  |
| 11.4566666667 | 42.1205206106  |
| 11.784        | 45.0528073182  |
| 12.4386666667 | 76.8150755186  |
| 13.0933333333 | 80.0883540997  |
| 14.73         | 89.7784173678  |
| 16.3666666667 | 98.8113459392  |
| 19.64         | 104.104366506  |
| 22.9133333333 | 105.9929585305 |
| 26.1866666667 | 94.0070414695  |"

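# Split each row on "|" and keep the two numeric columns (columns 1 and 4 are empty).
dat <- read.table(text=txt, sep="|")[,2:3]
names(dat) <- c("x", "y")
plot(dat$y~dat$x, pch = 19, xlab = "x", ylab = "y", main = "Smoothing Splines with Varying df")

# Heavily smoothed fit (3 effective degrees of freedom).
spl3 <- smooth.spline(x = dat$x, y = dat$y, df = 3)
lines(spl3, col = 2)

# More flexible fit (8 effective degrees of freedom).
spl8 <- smooth.spline(x = dat$x, y = dat$y, df = 8)
lines(spl8, col = 4)

legend("topleft", c("df = 3", "df = 8"), col = c(2,4), bty = "n", lty = 1)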
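
Since the question asks for a way to compute y from any x, here is a minimal sketch of how a fitted smoothing spline can be evaluated at new points with predict(). The data are re-entered as plain vectors so the snippet runs on its own; the evaluation points in new_x are arbitrary values chosen for illustration.

```r
# Data from the question, with decimal points instead of decimal commas.
x <- c(0, 1.6366666667, 3.2733333333, 4.91, 6.5466666667, 8.1833333333,
       9.82, 10.4746666667, 11.4566666667, 11.784, 12.4386666667,
       13.0933333333, 14.73, 16.3666666667, 19.64, 22.9133333333,
       26.1866666667)
y <- c(0, -12.2012787905, -13.7833876716, -10.5943208589, -1.3584575518,
       8.1590423167, 13.8827937482, 18.4965880076, 42.1205206106,
       45.0528073182, 76.8150755186, 80.0883540997, 89.7784173678,
       98.8113459392, 104.104366506, 105.9929585305, 94.0070414695)

# Smoothing spline with 8 effective degrees of freedom, as in the code above.
spl8 <- smooth.spline(x = x, y = y, df = 8)

# Evaluate the fitted curve at arbitrary x values inside the observed range.
new_x <- c(5, 12, 20)
pred <- predict(spl8, x = new_x)
print(data.frame(x = pred$x, y = pred$y))
```

predict() returns a list with components x and y, so pred$y holds the fitted values. Passing deriv = 1 would return the slope of the fitted curve instead.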

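To fit a sigmoid-like function non-parametrically, we can use monotone splines. These are implemented in the R package splines2 (all R packages mentioned here are on CRAN). I will borrow some R code from @Chaconne's answer and modify it for my needs.

splines2 provides the function mSpline, which implements M-splines, a spline basis that is non-negative everywhere (on the interval where it is defined), and iSpline, the integral of the M-spline basis. The latter is monotonically increasing, so we can fit an increasing function by using it as a regression spline basis and fitting a linear model whose coefficients are constrained to be non-negative. That constrained fit is implemented in a user-friendly way by the R package colf, "constrained optimization on linear functions". The fit looks like:

The R code used: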

library(splines2) # includes monotone splines,  M-splines,  I-splines.
library(colf) # constrained optimization on linear functions

txt <- "| 0             | 0              |
| 1.6366666667  | -12.2012787905 |
| 3.2733333333  | -13.7833876716 |
| 4.91          | -10.5943208589 |
| 6.5466666667  | -1.3584575518  |
| 8.1833333333  | 8.1590423167   |
| 9.82          | 13.8827937482  |
| 10.4746666667 | 18.4965880076  |
| 11.4566666667 | 42.1205206106  |
| 11.784        | 45.0528073182  |
| 12.4386666667 | 76.8150755186  |
| 13.0933333333 | 80.0883540997  |
| 14.73         | 89.7784173678  |
| 16.3666666667 | 98.8113459392  |
| 19.64         | 104.104366506  |
| 22.9133333333 | 105.9929585305 |
| 26.1866666667 | 94.0070414695  |"

dat <- read.table(text=txt, sep="|")[,2:3]
names(dat) <- c("x", "y")
plot(dat$y ~ dat$x, pch = 19, xlab = "x", ylab = "y", main = "Monotone Splines with Varying df")

Imod_df_4  <-  colf_nls(y ~ 1 + iSpline(x, df=4), data=dat, lower=c(-Inf, rep(0, 4)), control=nls.control(maxiter=1000, tol=1e-09, minFactor=1/2048) )
lines(dat$x, fitted(Imod_df_4), col="blue")

Imod_df_6  <-  colf_nls(y ~ 1 + iSpline(x, df=6), data=dat, lower=c(-Inf, rep(0, 6)), control=nls.control(maxiter=1000, tol=1e-09, minFactor=1/2048) )
lines(dat$x, fitted(Imod_df_6), col="orange")

Imod_df_8  <-  colf_nls(y ~ 1 + iSpline(x, df=8), data=dat, lower=c(-Inf, rep(0, 8)), control=nls.control(maxiter=1000, tol=1e-09, minFactor=1/2048) )
lines(dat$x, fitted(Imod_df_8), col="red") 
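
To make the "I-spline basis plus non-negative coefficients" idea concrete, here is a rough sketch that builds the basis with splines2::iSpline and fits it by bounded least squares. It assumes splines2 is installed. The bounded optim() call is a swapped-in stand-in for colf_nls, not the method used above, and df = 6 and the starting values are arbitrary illustrative choices.

```r
library(splines2)

# Data from the question, with decimal points instead of decimal commas.
x <- c(0, 1.6366666667, 3.2733333333, 4.91, 6.5466666667, 8.1833333333,
       9.82, 10.4746666667, 11.4566666667, 11.784, 12.4386666667,
       13.0933333333, 14.73, 16.3666666667, 19.64, 22.9133333333,
       26.1866666667)
y <- c(0, -12.2012787905, -13.7833876716, -10.5943208589, -1.3584575518,
       8.1590423167, 13.8827937482, 18.4965880076, 42.1205206106,
       45.0528073182, 76.8150755186, 80.0883540997, 89.7784173678,
       98.8113459392, 104.104366506, 105.9929585305, 94.0070414695)

# Design matrix: an intercept column plus the I-spline basis,
# each column of which is non-decreasing in x.
B <- iSpline(x, df = 6)
X <- cbind(1, B)
k <- ncol(B)

# Residual sum of squares and its gradient with respect to the coefficients.
sse  <- function(beta) sum((y - X %*% beta)^2)
grad <- function(beta) as.vector(-2 * t(X) %*% (y - X %*% beta))

# The intercept is unconstrained; the basis coefficients must be >= 0,
# which keeps the fitted curve non-decreasing.
opt <- optim(par = c(mean(y), rep(1, k)), fn = sse, gr = grad,
             method = "L-BFGS-B",
             lower = c(-Inf, rep(0, k)),
             control = list(maxit = 1000))

beta <- opt$par
fit  <- as.vector(X %*% beta)
print(round(beta, 3))
```

Because every basis column is non-decreasing and every basis coefficient is non-negative, the fitted values cannot decrease as x grows, whatever the data look like.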

EDIT

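Monotonicity constraints on splines are a special case of shape-constrained splines, and there is now an R package (several, actually) that implements them and simplifies their use. I will redo the example above using one of these packages. The R code is below, using the data read in above: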

library(cgam)

mod_cgam0 <- cgam(y ~ 1+s.incr(x), data=dat, family=gaussian)
summary(mod_cgam0)
Call:
cgam(formula = y ~ 1 + s.incr(x), family = gaussian, data = dat)

Coefficients:
            Estimate  StdErr t.value   p.value    
(Intercept)  43.4925  2.7748  15.674 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 102.2557)

Null deviance:  33749.25  on 16  degrees of freedom 
Residual deviance:  1636.091  on 12.5  observed degrees of freedom 

Approximate significance of smooth terms: 
          edf mixture.of.Beta   p.value    
s.incr(x)   3          0.9515 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
CIC:  7.6873

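This way, the knots (and hence the degrees of freedom) have been selected automatically. To fix the number of knots, and with it the degrees of freedom, yourself, use: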

mod_cgam1 <- cgam(y ~ 1+s.incr(x, numknots=5), data=dat, family=gaussian)

A paper introducing cgam is available on arXiv.

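The curve you show looks more like a cubic function, $ax^3+bx^2+cx+d$, because its ends turn up and down rather than flattening out and extending horizontally. Something similar can be produced with a polynomial trend line in Excel.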
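
The cubic trend line can also be reproduced in R with an ordinary least-squares fit. This is only a sketch of that idea; the prediction points, including x = 40 outside the observed range, are arbitrary values chosen to show how the cubic behaves away from the data.

```r
# Data from the question, with decimal points instead of decimal commas.
x <- c(0, 1.6366666667, 3.2733333333, 4.91, 6.5466666667, 8.1833333333,
       9.82, 10.4746666667, 11.4566666667, 11.784, 12.4386666667,
       13.0933333333, 14.73, 16.3666666667, 19.64, 22.9133333333,
       26.1866666667)
y <- c(0, -12.2012787905, -13.7833876716, -10.5943208589, -1.3584575518,
       8.1590423167, 13.8827937482, 18.4965880076, 42.1205206106,
       45.0528073182, 76.8150755186, 80.0883540997, 89.7784173678,
       98.8113459392, 104.104366506, 105.9929585305, 94.0070414695)

# Least-squares cubic; raw = TRUE gives coefficients on 1, x, x^2, x^3,
# i.e. d, c, b, a in the notation a*x^3 + b*x^2 + c*x + d.
cubic_fit <- lm(y ~ poly(x, 3, raw = TRUE))
cubic_coef <- coef(cubic_fit)
print(cubic_coef)

# Compare predictions inside the data range (12, 26) and beyond it (40).
print(predict(cubic_fit, newdata = data.frame(x = c(12, 26, 40))))
```

A cubic has no horizontal asymptotes, so it cannot reproduce a plateau; comparing the prediction at x = 40 with those inside the data range shows how the polynomial keeps bending once it leaves the data.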

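Otherwise, if you want the ends to extend out horizontally, there are many sigmoidal CDFs of probability distributions to choose from. The questions to ask yourself when choosing the most appropriate one are:

What is the underlying mechanism or rationale for the S-shaped curve?
How flexible does its shape need to be? How many degrees of freedom? This depends on how many data points you have, since you want to avoid overfitting. Also, which features vary and which stay fixed: the mean? The variance (spread)? The skewness (imbalance)? The kurtosis (tails)?
You can then search for the right shape in this Wikipedia list ( https://en.wikipedia.org/wiki/List_of_probability_distributions ), or refine your question with more detail to get a better answer.
There are also 4- and 5-parameter distributions based on the logit function with more flexible shapes, but again, these should be avoided unless you have a lot of data points.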
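
As an illustration of the parametric route, here is a rough sketch that fits a four-parameter logistic curve to the question's data by minimizing the residual sum of squares with optim(). The parametrization (lower asymptote A, upper asymptote B, midpoint xmid, log of the scale) and the starting values are assumptions made for this example; stats::nls() with the self-starting SSfpl model is the more conventional tool but can fail to converge on small, noisy data sets.

```r
# Data from the question, with decimal points instead of decimal commas.
x <- c(0, 1.6366666667, 3.2733333333, 4.91, 6.5466666667, 8.1833333333,
       9.82, 10.4746666667, 11.4566666667, 11.784, 12.4386666667,
       13.0933333333, 14.73, 16.3666666667, 19.64, 22.9133333333,
       26.1866666667)
y <- c(0, -12.2012787905, -13.7833876716, -10.5943208589, -1.3584575518,
       8.1590423167, 13.8827937482, 18.4965880076, 42.1205206106,
       45.0528073182, 76.8150755186, 80.0883540997, 89.7784173678,
       98.8113459392, 104.104366506, 105.9929585305, 94.0070414695)

# Four-parameter logistic: p = (A, B, xmid, log_scal).
# The scale enters as exp(log_scal) so that it stays positive.
fpl <- function(p, x) p[1] + (p[2] - p[1]) / (1 + exp((p[3] - x) / exp(p[4])))
sse <- function(p) sum((y - fpl(p, x))^2)

# Crude starting values read off the data.
start <- c(A = min(y), B = max(y), xmid = median(x), log_scal = 0)
fit4 <- optim(start, sse, method = "Nelder-Mead",
              control = list(maxit = 5000))
print(fit4$par)

# The fitted curve can be evaluated at any x, including beyond the data,
# where it levels off at the upper asymptote.
print(fpl(fit4$par, c(5, 12, 20, 40)))
```

With only 17 points and four parameters the caution above about overfitting applies, so the estimates should be read as a description of the curve's shape rather than as precise quantities.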

P.S. You should never selectively add or remove data points to get a better fit. BAD BOY!

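You could use the sigmoid() function from the {pracma} package in R. Note that it evaluates a sigmoid (logistic) curve on a numeric vector for given parameters rather than estimating them, so you would still need an optimizer to fit it to data. If you don't care which function fits the data, I would recommend the gam() function from the {mgcv} package in R. It fits a smooth function to the data using spline regression (thin plate splines by default, but see the documentation for other types). With gam(), as with any non-parametric fit, you cannot reliably predict y from x outside the x range of your data set, because the prediction will simply continue along the slope at the end of the curve; from your question, though, that does not seem to be a concern. Hope this helps!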
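
A minimal sketch of the gam() suggestion, assuming the {mgcv} package is available (it ships with standard R installations). The smoothing method "REML" and the prediction points are illustrative choices, not something specified above.

```r
library(mgcv)

# Data from the question, with decimal points instead of decimal commas.
dat <- data.frame(
  x = c(0, 1.6366666667, 3.2733333333, 4.91, 6.5466666667, 8.1833333333,
        9.82, 10.4746666667, 11.4566666667, 11.784, 12.4386666667,
        13.0933333333, 14.73, 16.3666666667, 19.64, 22.9133333333,
        26.1866666667),
  y = c(0, -12.2012787905, -13.7833876716, -10.5943208589, -1.3584575518,
        8.1590423167, 13.8827937482, 18.4965880076, 42.1205206106,
        45.0528073182, 76.8150755186, 80.0883540997, 89.7784173678,
        98.8113459392, 104.104366506, 105.9929585305, 94.0070414695)
)

# Smooth of x using the default thin plate regression spline basis;
# the amount of smoothing is chosen automatically.
gam_fit <- gam(y ~ s(x), data = dat, method = "REML")

# Predict y at new x values inside the observed range.
new_x <- data.frame(x = c(5, 12, 20))
gam_pred <- as.vector(predict(gam_fit, newdata = new_x))
print(cbind(new_x, y = gam_pred))
```

Unlike the monotone spline fits, this smooth is not constrained to be increasing, so it is free to follow the dip at small x and the drop at the last point.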
