Kolmogorov–Smirnov two-sample p-values

kolmogorov-smirnov-test
2022-03-13 15:31:31

I am using the Kolmogorov–Smirnov two-sample test to compare distributions, and I noticed that a p-value is frequently reported along with the test statistic. How is this p-value determined? I know it's the probability of obtaining a result at least as large as the one observed, but how is it calculated given that this is a nonparametric test? That is, we can't assume Gaussian fluctuations in the distribution and compute the p-value as we would for a t-test.

Thanks!

2 Answers

Under the null hypothesis, the asymptotic distribution of the (suitably scaled) two-sample Kolmogorov–Smirnov statistic is the Kolmogorov distribution, which has CDF

$$\Pr(K \le x) = \frac{\sqrt{2\pi}}{x} \sum_{i=1}^{\infty} e^{-(2i-1)^2 \pi^2 / (8x^2)}.$$

The p-values can be calculated from this CDF: the p-value is one minus this CDF evaluated at the scaled observed statistic. See Sections 2 and 4 of the Wikipedia page on the Kolmogorov–Smirnov test.
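For instance, here is a minimal sketch of that calculation in R (the helper names pkolmogorov and ks_stat are mine, not part of any package): it evaluates the series above, computes the two-sample D by hand, scales it, and compares the resulting p-value with what ks.test reports.

```r
# Asymptotic Kolmogorov CDF: Pr(K <= x) = sqrt(2*pi)/x * sum_i exp(-(2i-1)^2 pi^2 / (8 x^2))
pkolmogorov <- function(x, terms = 100) {
  i <- seq_len(terms)
  sqrt(2 * pi) / x * sum(exp(-(2 * i - 1)^2 * pi^2 / (8 * x^2)))
}

# Two-sample D: maximum distance between the two empirical CDFs
ks_stat <- function(x, y) {
  grid <- sort(c(x, y))
  max(abs(ecdf(x)(grid) - ecdf(y)(grid)))
}

set.seed(1)
x <- rnorm(100)
y <- rnorm(120)
D <- ks_stat(x, y)
n <- length(x); m <- length(y)

# p-value = 1 - Kolmogorov CDF of the scaled statistic
p_asymptotic <- 1 - pkolmogorov(sqrt(n * m / (n + m)) * D)

ks.test(x, y)     # built-in test for comparison (may use an exact method for small samples)
p_asymptotic
```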

You seem to be saying that a non-parametric test statistic shouldn't have a distribution, but that's not the case: what makes this test non-parametric is that the distribution of the test statistic does not depend on which continuous probability distribution the original data come from. Note that the KS test has this property even for finite samples, as shown by @cardinal in the comments. A quick simulation of this distribution-free property is sketched below.
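Here is that simulation sketch in R (the function name sim_D is just an ad-hoc label for illustration): the null distribution of the two-sample D statistic looks the same whether the data are normal, exponential, or uniform.

```r
# Simulate the null distribution of the two-sample D statistic for three
# different continuous parent distributions; the quantiles should agree closely.
sim_D <- function(rdist, n = 50, m = 60, reps = 2000) {
  replicate(reps, as.numeric(ks.test(rdist(n), rdist(m))$statistic))
}

set.seed(42)
qs <- c(0.5, 0.9, 0.95, 0.99)
round(rbind(normal      = quantile(sim_D(rnorm), qs),
            exponential = quantile(sim_D(rexp),  qs),
            uniform     = quantile(sim_D(runif), qs)), 3)
```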

A p-value of, say, 0.80 means that, under the null hypothesis, 80% of samples of size n drawn from the population would give a D statistic at least as large as the one obtained from the test. It is calculated from the D statistic of the KS test, which measures the maximum distance between the empirical CDF and the CDF of the distribution against which the sample is being evaluated (or, in the two-sample case, between the two empirical CDFs).

Note that only the scaled value D*sqrt(n) (for the one-sample test; D*sqrt(nm/(n+m)) for the two-sample test) has approximately a Kolmogorov distribution, not D itself. If you want to calculate a p-value by hand from a given D, you can refer to published tables of the Kolmogorov distribution available on the internet; this is also how packages like R arrive at the p-value they report.
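As a concrete illustration of that scaling, here is a small one-sample sketch in R (pkolm is just an ad-hoc helper name; the series is the asymptotic CDF from the first answer): D is computed as the maximum distance between the empirical CDF and the hypothesised CDF, sqrt(n)*D is referred to the Kolmogorov distribution exactly as a published table would be used, and the result is compared with ks.test.

```r
# Asymptotic Kolmogorov CDF (same series as in the first answer)
pkolm <- function(x, terms = 100) {
  i <- seq_len(terms)
  sqrt(2 * pi) / x * sum(exp(-(2 * i - 1)^2 * pi^2 / (8 * x^2)))
}

set.seed(7)
x <- rnorm(200)        # data drawn from the hypothesised N(0, 1), so H0 is true
n <- length(x)
s <- sort(x)

# sup |F_n(t) - F(t)| is attained at the data points; check both sides of each jump
D <- max(abs(seq_len(n) / n - pnorm(s)),
         abs((seq_len(n) - 1) / n - pnorm(s)))

p_manual <- 1 - pkolm(sqrt(n) * D)   # manual, table-style lookup of the scaled statistic

ks.test(x, "pnorm")                  # built-in test should report the same D and a very similar p-value
c(D = D, p_manual = p_manual)
```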