Confused about the Mann-Whitney U test

hypothesis-testing mathematical-statistics statistical-significance nonparametric wilcoxon-mann-whitney-test
2022-03-19 14:59:37

I am rather confused about the Mann-Whitney test. Many statements I have read say it tests for distribution equality between two populations, while some say it tests only for means/medians/central tendency. I ran some simple tests, and they suggest it only tests for central tendency, not shape. Many books state distribution equality (pdf); why? Can you please explain?

Distribution equality statements

  • Sheldon Ross' book Suppose that one is considering two different methods of production in determining whether the two methods result in statistically identical items. To attack this problem let X1,...,Xn, Y1,...,Ym denote samples of the measurable values of items by method 1 and method 2. If we let F and G, both assumed to be continuous, denote the distribution functions of the two samples, respectively, then the hypothesis we wish to test is H0:F=G. One procedure for testing H0 is the Mann-Whitney test.

  • Some Caltech notes Now suppose we have two samples. We want to know whether they could have been drawn from the same population, or from different populations, and, if the latter, whether they differ in some predicted direction. Again assume we know nothing about probability distributions, so that we need non-parametric tests. Mann-Whitney (Wilcoxon) U test. There are two samples, A (m members) and B (n members); H0 is that A and B are from the same distribution or have the same parent population.

  • Wikipedia This test can be used to investigate whether two independent samples were selected from populations having the same distribution.

  • Nonparametric Statistical Tests The null hypothesis is H0: θ = 0; that is, there is no difference at all between the distribution functions F and G.

But when I test with F=N(0,10) and G=U(-3,3), the p-value is very high. The two could hardly be more different, except that E(F)=E(G) and both are symmetric.

-----Mean/median equality statements-------

  • Article The Mann–Whitney U-test can be used when the aim is to show a difference between two groups in the value of an ordinal, interval or ratio variable. It is the non-parametric version of the t-test.
  • Test results
#octave
pkg load statistics                     # load the Octave statistics package
x  = normrnd(0, 1,  [1, 100]);          # 100 draws from N(0,1)
y1 = normrnd(0, 3,  [1, 100]);          # 100 draws from N(0,3)
y2 = normrnd(0, 20, [1, 100]);          # 100 draws from N(0,20)
y3 = unifrnd(-5, 5, [1, 100]);          # 100 draws from U(-5,5)
[p, ks] = kolmogorov_smirnov_test(y1, "norm", 0, 1);  # KS test: is y1 ~ N(0,1)?
# p = 0.000002 -> y1 ~ N(0,3) is not N(0,1)
[p, z] = u_test(x, y1);                 # Mann-Whitney: x ~ N(0,1) vs y1 ~ N(0,3)
# p = 0.52 -> null not rejected
[p, z] = u_test(x, y2);                 # Mann-Whitney: x ~ N(0,1) vs y2 ~ N(0,20)
# p = 0.32 -> null not rejected
[p, z] = u_test(x, y3);                 # Mann-Whitney: x ~ N(0,1) vs y3 ~ U(-5,5)
# p = 0.15 -> null not rejected
# Apparently, Mann-Whitney doesn't test pdf equality

-------Confusing---------

  • Nonparametric Statistical Methods, 3rd Edition I don't understand how its H0: E(Y)-E(X) = 0 (no shift) can be deduced from (4.2), which seems to suggest pdf equality (equal higher moments) apart from the shift.
  • Article The test can detect differences in shape and spread as well as just differences in medians. Differences in population medians are often accompanied by equally important differences in shape. Really? How? I am confused.

After-thoughts

It seems many notes teach MW in a duck-typing way: MW is introduced as a duck because, if we focus only on the key behaviours of a duck (quack = pdf, swim = shape), MW does look like a duck (a location-shift test). Most of the time a duck and Donald Duck do not behave very differently, so such a description of MW seems fine and is easy to understand; but when Donald Duck dominates a duck whilst still quacking like one, MW can show significance, baffling unsuspecting students. That is not the students' fault but a pedagogical mistake: claiming Donald Duck is a duck without clarifying that he can be un-duck at times.

Also, my feeling is that in parametric hypothesis testing, tests are introduced with their purpose framed in H0, leaving H1 implicit. Many authors move on to nonparametric testing without first highlighting how differently the probabilities of the test statistic are obtained (by permuting the X and Y samples under H0), so students continue to tell tests apart by looking at H0.

For example, we are taught to use the t-test for H0: μx = k or H0: μx = μy and the F-test for H0: σx² = σy², with H1: μx ≠ μy and H1: σx² ≠ σy² implicit. With tests of a permutation nature, on the other hand, we need to be explicit about what H1 tests, because H0: F = G is trivially the null for all of them. So instead of seeing H0: F = G, automatically thinking of H1: F ≠ G and concluding it is a K-S test, we should pay attention to H1 to decide what is under analysis (F ≠ G, F > G) and pick a test (KS, MW) accordingly.
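The permutation idea mentioned above can be sketched in a few lines of core Octave. This is only an illustration under stated assumptions: the sample sizes, the shift of 1.5, the number of permutations B and the helper count_u are all made up for the example, and the statistic is the pair count of x values below y values.

```octave
# Sketch: permutation null distribution of a Mann-Whitney-type count under H0: F = G.
randn("state", 4); rand("state", 4);   # fix seeds (assumed values)
x = randn(1, 30);                      # sample from N(0,1)
y = randn(1, 40) + 1.5;                # sample from N(1.5,1), so H0 is false here
count_u = @(a, b) nnz(a(:) < b(:)');   # hypothetical helper: number of pairs with a_i < b_j
m = numel(x);
u_obs = count_u(x, y);                 # observed statistic
pooled = [x, y];                       # under H0 the group labels are exchangeable
B = 2000;                              # number of random relabellings (assumption)
u_perm = zeros(1, B);
for b = 1:B
  idx = randperm(numel(pooled));       # shuffle the labels
  u_perm(b) = count_u(pooled(idx(1:m)), pooled(idx(m+1:end)));
end
center = m * numel(y) / 2;             # centre of the null distribution, mn/2
# Two-sided permutation p-value, with the usual +1 correction
p_perm = (1 + nnz(abs(u_perm - center) >= abs(u_obs - center))) / (B + 1);
printf("U = %d, permutation p = %.4f\n", u_obs, p_perm);
```

The same shuffled labels could feed a different statistic (for instance the KS distance), which is the sense in which H0 stays the same while the statistic decides which alternatives the test is sensitive to.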

2 Answers

Neither

The Mann-Whitney(-Wilcoxon) U test is typically a test of H0: P(XA > XB) = 0.5, rejected in favor of HA: P(XA > XB) ≠ 0.5. In plain language: the probability that a randomly selected observation from group A is greater than a randomly selected observation from group B is one half (i.e. even odds). This could be interpreted as a test for (0th-order) stochastic dominance (i.e. the "stochastically larger than" in the title of the seminal paper).

I write 'typically', because there are both one-sided, and negativist (i.e. there is some difference greater than δ) hypotheses for which U forms the basis of the test statistic.
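As a rough illustration of that null hypothesis, the quantity P(XA > XB) can be estimated directly as the proportion of (a, b) pairs with a > b. The sketch below uses core Octave only; the distributions, sample sizes and seeds are assumptions chosen for the example, not taken from the question.

```octave
# Sketch: estimate P(XA > XB) from two samples by comparing all pairs.
randn("state", 1); rand("state", 1);   # fix seeds (assumed values)
xa = randn(1, 500);                    # group A: N(0,1)
xb = -5 + 10 * rand(1, 500);           # group B: U(-5,5), same centre, different shape
gt = xa(:) > xb(:)';                   # 500x500 matrix of pairwise comparisons
p_hat = nnz(gt) / numel(gt);           # proportion of pairs with a > b
printf("estimated P(XA > XB) = %.3f\n", p_hat);
xc = xb + 2;                           # group C: group B shifted right by 2
gt_shift = xa(:) > xc(:)';
p_hat_shift = nnz(gt_shift) / numel(gt_shift);
printf("estimated P(XA > XC) = %.3f\n", p_hat_shift);
```

The first estimate should land near 0.5 even though the shapes differ, while the shifted group should pull it to roughly 0.3, which is the kind of departure the U test responds to.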

The (frequent) interpretation of the U test as a test for median difference, for mean difference, or for location shift (pick yer interpretation) results from two additional (stringent) assumptions:

  1. The distributions of group A and group B have identical shapes.

  2. The distributions of group A and group B have identical variances.

On a personal note, I feel that adding these requisites sharply curtails the generality of the U test's application by tying it to distributional assumptions beyond the (within group) i.i.d. assumption.
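A short derivation may help show why the shift reading needs those extra assumptions. It is a sketch only; the symbols \Delta, X_A' and H are introduced here for illustration. Suppose group B is distributed like group A shifted by a constant \Delta, i.e. X_B has the same distribution as X_A' + \Delta, where X_A' is an independent copy of X_A. Then

$$
P(X_A > X_B) = P(X_A - X_A' > \Delta) = 1 - H(\Delta),
$$

where H is the distribution function of X_A - X_A'. For continuous X_A that difference is continuous and symmetric about 0, so H(0) = 1/2. Hence P(X_A > X_B) = 1/2 when \Delta = 0 and P(X_A > X_B) \le 1/2 when \Delta > 0, which is why, under a pure shift model, the null P(X_A > X_B) = 0.5 can be read as "no shift". Without the identical-shape assumption, P(X_A > X_B) = 0.5 can hold for two quite different distributions.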



References
Mann, H. B., & Whitney, D. R. (1947). On A Test Of Whether One Of Two Random Variables Is Stochastically Larger Than The Other. Annals of Mathematical Statistics, 18, 50–60.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83.

It is informative to see exactly what the Mann-Whitney test does. For two samples X = {x1, ..., xm} and Y = {y1, ..., yn}, under the assumptions that

  • Observations in X are iid
  • Observations in Y are iid
  • The samples X and Y are mutually independent.
  • The respective populations from which X and Y were sampled are continuous.

then, the U statistic is defined as:

$$U = \sum_{i=1}^{m} \sum_{j=1}^{n} \mathrm{bool}(x_i < y_j)$$

It should be reasonably intuitive to see that if X and Y represent the same distribution (i.e. the null hypothesis), then the expected value of U would be mn/2, since you could expect values below a certain rank to occur as often for X as for Y. So you can think of the Mann-Whitney test as checking to what extent the statistic U deviates from this expected value.
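The definition and its expected value can be checked with a small core-Octave sketch. The sample sizes and seed are assumptions for the example, and the p-value uses the standard large-sample normal approximation with null variance mn(m+n+1)/12 (no ties), which may differ slightly from what a packaged routine reports.

```octave
# Sketch: U as the number of pairs with x_i < y_j, compared with mn/2 under H0.
randn("state", 2);                     # fix seed (assumed value)
m = 200; n = 300;
x = randn(1, m);                       # X ~ N(0,1)
y = randn(1, n);                       # Y ~ N(0,1): same distribution, so H0 holds
U = nnz(x(:) < y(:)');                 # the double sum over i and j
EU = m * n / 2;                        # expected value of U under H0
sdU = sqrt(m * n * (m + n + 1) / 12);  # standard deviation of U under H0 (no ties)
z = (U - EU) / sdU;                    # standardised deviation from mn/2
p = erfc(abs(z) / sqrt(2));            # two-sided p-value, normal approximation
printf("U = %d, mn/2 = %d, z = %.2f, p = %.3f\n", U, EU, z, p);
```

With both samples drawn from the same distribution, U should sit within a couple of standard deviations of mn/2 on most runs.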

If this intuition isn't clear, then think of the first rank (i.e. the leftmost, rarest value in each sample). If X and Y were drawn from the same distribution, you would have no reason to expect the rarest value in X to be less than the rarest value in Y more than 50% of the time; otherwise you would suspect that X actually has a heavier tail than Y. You can extend this logic to the 2nd rarest value, the 3rd, and so forth.

Similarly, if you drew the same number of observations, say K, you could almost think of the ranks as K "common bins" with fuzzy boundaries. If X and Y came from the same population, you might expect each rank to occupy roughly the same space, and there's no reason to think that the xk observation in that bin would be to the right of yk more than 50% of the time.

However, if xk at a particular "bin" k was to the right of yk more often than not, then this denotes that there is a systematic "shift". This is what makes Mann-Whitney a good test for detecting 'shift' in distributions that are assumed to be relatively similar except for a possible shift due to a treatment effect.

Now consider the X ~ N(0,1) vs Y ~ N(0,2) scenario. Assume K = 1000 samples in each case. You would expect that, for the most part, given the same rank, negative values in Y would tend to be to the left of X more or less all the time, whereas positive values in Y would tend to be to the right of X more or less all the time. Therefore, in this particular scenario, even though the distributions are completely different, it happens that half the time X is less likely to be larger than Y, and half the time it is more likely. You would therefore expect the U statistic to be very close to the expected value K²/2, and so unlikely to be significant.
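The scenario above can be tried out with a small core-Octave sketch. The seed, the comparison sample shifted by 0.5 and the helper z_of are assumptions added for contrast; N(0,2) is read here as standard deviation 2, and z is standardised with the null variance K²(2K+1)/12.

```octave
# Sketch: same centre but different spread, versus a genuine shift.
randn("state", 3);                     # fix seed (assumed value)
K = 1000;
x  = randn(1, K);                      # X ~ N(0,1)
y  = 2 * randn(1, K);                  # Y ~ N(0,2): same centre, wider spread
ys = randn(1, K) + 0.5;                # shifted comparison sample, N(0.5,1)
# Hypothetical helper: standardised U for two samples of size K
z_of = @(a, b) (nnz(a(:) < b(:)') - K^2 / 2) / sqrt(K^2 * (2 * K + 1) / 12);
z_scale = z_of(x, y);                  # spread difference only
z_shift = z_of(x, ys);                 # location shift
printf("z (scale only) = %.2f, z (shift) = %.2f\n", z_scale, z_shift);
```

The scale-only comparison should give a z of ordinary size, while the shifted sample should give a very large one, in line with the argument that U reacts to shift rather than to shape in general.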

In other words, it may be a reasonable test to compare two samples in a general "goodness of fit" sense in some specific circumstances, but it is important to be familiar with the situations where it would not. The example above is one such case.