如何将帕累托分布拟合到观察到的 CDF?

机器算法验证 r 配件 累积分布函数 帕累托分布
2022-03-29 18:17:11

背景:我得到了一个包含 CDF 数据的数据框。该列X包含 250 个值,并且该列包含XPp(Xx)

我粘贴下面的数据集:

X <- c(59639.83,396409.6,733179.5,1069949,1406719,1743489,2080259,2417029,2753798,3090568,3427338,3764108,4100878,4437647,4774417,5111187,5447957,5784727,6121496,6458266,6795036,7131806,7468576,7805346,8142115,8478885,8815655,9152425,9489195,9825964,10162734,10499504,10836274,11173044,11509813,11846583,12183353,12520123,12856893,13193662,13530432,13867202,14203972,14540742,14877512,15214281,15551051,15887821,16224591,16561361,16898130,17234900,17571670,17908440,18245210,18581979,18918749,19255519,19592289,19929059,20265829,20602598,20939368,21276138,21612908,21949678,22286447,22623217,22959987,23296757,23633527,23970296,24307066,24643836,24980606,25317376,25654146,25990915,26327685,26664455,27001225,27337995,27674764,28011534,28348304,28685074,29021844,29358613,29695383,30032153,30368923,30705693,31042463,31379232,31716002,32052772,32389542,32726312,33063081,33399851,33736621,34073391,34410161,34746930,35083700,35420470,35757240,36094010,36430780,36767549,37104319,37441089,37777859,38114629,38451398,38788168,39124938,39461708,39798478,40135247,40472017,40808787,41145557,41482327,41819097,42155866,42492636,42829406,43166176,43502946,43839715,44176485,44513255,44850025,45186795,45523564,45860334,46197104,46533874,46870644,47207414,47544183,47880953,48217723,48554493,48891263,49228032,49564802,49901572,50238342,50575112,50911881,51248651,51585421,51922191,52258961,52595731,52932500,53269270,53606040,53942810,54279580,54616349,54953119,55289889,55626659,55963429,56300198,56636968,56973738,57310508,57647278,57984047,58320817,58657587,58994357,59331127,59667897,60004666,60341436,60678206,61014976,61351746,61688515,62025285,62362055,62698825,63035595,63372364,63709134,64045904,64382674,64719444,65056214,65392983,65729753,66066523,66403293,66740063,67076832,67413602,67750372,68087142,68423912,68760681,69097451,69434221,69770991,70107761,70444531,70781300,71118070,71454840,71791610,72128380,72465149,72801919,73138689,73475459,73812229,74148998,74485768,74822538,75159308,75496078,75832848,76169617,76506387,76843157,77179927,77516697,77853466,78190236,78527006,78863776,79200546,79537315,79874085,80210855,80547625,80884395,81221165,81557934,81894704,82231474,82568244,82905014,83241783,83578553,83915323)

P <- c(0.9496295,0.3940323,0.1819006,0.1004568,0.06435878,0.04517915,0.03321521,0.02513521,0.01949269,0.01578545,0.01260615,0.0104641,0.008522678,0.007116022,0.006081262,0.005251359,0.004476574,0.00385598,0.00340083,0.002963359,0.002529698,0.002248921,0.001873215,0.001684442,0.001561909,0.001383793,0.001255323,0.001091163,0.001016704,0.0008913803,0.0008493767,0.0008085694,0.0007478901,0.0006541131,0.0005959151,0.0005181331,0.0004730944,0.0004529344,0.0004312237,0.0003961764,0.0003765924,0.0003047255,0.0002763022,0.0002419195,0.0002233103,0.0002040808,0.0001989854,0.0001915639,0.0001906778,0.0001787147,0.0001592858,0.00015902,0.0001587984,0.0001469683,0.0001410089,0.0001282483,0.0001235296,0.0001094619,8.651059e-05,8.602321e-05,8.159245e-05,6.947432e-05,6.896478e-05,6.32491e-05,6.32491e-05,6.304971e-05,6.304971e-05,6.231864e-05,6.121095e-05,4.927005e-05,4.915928e-05,4.683313e-05,3.768361e-05,3.768361e-05,3.768361e-05,3.768361e-05,3.768361e-05,3.768361e-05,3.768361e-05,3.768361e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.586212e-05,1.559628e-05,1.526397e-05,1.513105e-05,1.513105e-05,1.513105e-05,1.513105e-05,1.513105e-05,7.820291e-06,7.820291e-06,7.310754e-06,7.310754e-06,7.310754e-06,7.310754e-06,7.310754e-06,7.310754e-06,6.446756e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.677531e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06,3.234455e-06)

data <- data.frame(X,P)

当我以对数比例绘制它时,结果如下:

plot(data, log="xy")

数据的对数图

现在,我的老板要求我将帕累托分布拟合到这个图上,这样就会出现一条直线。帕累托函数是这样的:,它们有时有一个最小值来自于它们适合经验数据的位置。f(x)=Axαx

问题:如何将帕累托分布拟合到这些数据?

我没能做到这一点。poweRlaw使用了列作为输入的包,但我不确定这样的事情在统计上是否正确:X

pareto <- conpl$new(data$P)
est <- estimate_xmin(pareto)
pareto$setXmin(est)

结果参数是:该操作的情节如下:xmin=0.0003765924α=1.418147

plot(pareto)
lines(pareto, col=2, lwd=3)

疯狂的帕累托拟合

显然,不引用中的任何值,而是引用中存在的值。而且,如您所见,分布的形状与其他图完全相同,但颠倒了。xminXP

2个回答

首先,从技术上讲,您具有生存功能,它是 CDF 的补充。其次,您必须从上面处理某种形式的审查(保险政策限制?),因为您在 CDF 中有稳定期,并且硬上限为 45,860,334.00。因此,没有“纯”帕累托族可以很好地拟合,因为这些密度是严格递增的函数。

话虽如此,最简单的方法是忽略这些复杂性,并在特定 Pareto 和您的经验 CDF 之间拟合最小距离(例如平方误差)误差项。

在美国精算术语中,未修饰的术语“Pareto”是指 Pareto II 或 Lomax 分布,其中我们使用而不是并且定义为θλx>0

f(x)=αθα(x+θ)α+1F(X)=1(θx+θ)α

这解决了我们精算师所说的“单参数帕累托”问题,其中只有是真正的参数,而只是可能的最小观察值。αθ

以下 R 代码将根据您上面提供的 R 代码将参数拟合到 Lomax:

# Turn survival into CDF
D2 <- data.frame(data$X, 1 - data$P)

# Names
names(D2) <- c('x', 'p')

# Make CDF function separate for later use
plomax <- function(x, a, q) {1 - (q / (q + x)) ^ a}

# Error Function
lomErr <- function(par, data) {
  a <- par[[1L]]
  q <- par[[2L]]
  sum((plomax(data[, 1L], a, q) - data[, 2L]) ^ 2)
  
}

# Fit, although I prefer nloptr
PFit <- optim(c(2.1, 1e5), ParErr, data = D2, method = 'Nelder-Mead')

结果:

Pfit
$par
[1] 9.264474e+00 3.926014e+06

$value
[1] 0.008430267

$counts
function gradient 
     149       NA 

$convergence
[1] 0

$message
NULL

我不确定你打算如何从这里开始,但为了证明拟合是合理的,这里是经验 CDF 与拟合的图:

# Add to data table
D3 <- data.frame(x = D2$x, p = D2$p,
                 pp = plomax(D2$x, PFit$par[[1L]], PFit$par[[2L]]))

# Compare with empirical
plot(D3$p, D3$pp)
abline(0, 1)

在此处输入图像描述

Wikipedia 页面表明最大似然估计具有封闭形式的解决方案。您可以将这些估计值插入参数值并绘制生成的参数 CDF,以查看模型与您的数据的拟合程度。您可以使用 delta 方法构建容差限制(总体百分位数的置信限制),并使用对数链接函数反转 Wald 检验以创建围绕 CDF 的置信带。