Why break down the denominator in Bayes' theorem?


(I'm a statistics novice. I'm a mathematician and a programmer, and I'm trying to build something like a naive Bayes spam filter.)

I've noticed that in many places people tend to break down the denominator in the Bayes' theorem equation. So instead of this:

$$\frac{P(A|B)\,P(B)}{P(A)}$$

We are presented with this:

$$\frac{P(A|B)\,P(B)}{P(A|B)\,P(B) + P(A|\neg B)\,P(\neg B)}$$

You can see that this convention is used in this Wikipedia article and in this insightful post by Tim Peters.

I am baffled by this. Why is the denominator broken down like this? How does that help things at all? What's so complicated about calculating P(A), which in the case of spam filters would be the probability that the word "cheese" appears in an email, regardless of whether it's spam or not?

3 Answers

The short answer to your question is, "most of the time we don't know what P(cheese) is, and it is often (relatively) difficult to calculate."

The longer answer for why Bayes' Rule/Theorem is normally stated the way you wrote it is that in Bayesian problems we have - sitting in our lap - a prior distribution (the P(B) above) and a likelihood (the P(A|B) and P(A|¬B) above), and it is a relatively simple matter of multiplication to compute the posterior (the P(B|A)). Going to the trouble of re-expressing P(A) in its compact, marginal form is effort that could be spent elsewhere.
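As a minimal sketch of that multiplication, with made-up spam-filter numbers (P(B) = 0.2 for spam, P(A|B) = 0.05 and P(A|¬B) = 0.01 for the word appearing in each class; none of these figures come from the question):

```python
# Posterior P(B|A) from the pieces we already have in our lap:
# a prior P(B) and the two likelihoods P(A|B), P(A|not B).
def posterior(prior, lik_B, lik_notB):
    numerator = lik_B * prior
    denominator = lik_B * prior + lik_notB * (1 - prior)  # this *is* P(A)
    return numerator / denominator

# Hypothetical numbers: P(spam) = 0.2,
# P("cheese" | spam) = 0.05, P("cheese" | ham) = 0.01.
print(posterior(prior=0.2, lik_B=0.05, lik_notB=0.01))  # ~0.556
```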

It might not seem so complicated in the context of an email because, as you rightly noted, it's just P(cheese), right? The trouble is that with more involved on-the-battlefield Bayesian problems the denominator is an unsightly integral, which may or may not have a closed-form solution. In fact, sometimes we need sophisticated Monte Carlo methods just to approximate the integral and churning the numbers can be a real pain in the rear.
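To make the integral point concrete, here is a minimal Monte Carlo sketch with a toy model of my own choosing (a uniform prior on θ and a Bernoulli-sequence likelihood), approximating P(A) = ∫ P(A|θ) p(θ) dθ by averaging the likelihood over draws from the prior:

```python
import random

# Toy model: theta ~ Uniform(0, 1) prior, and the likelihood of our
# observed data A given theta is theta**3 * (1 - theta)**2
# (e.g. 3 "successes" and 2 "failures" from a Bernoulli model).
def likelihood(theta):
    return theta**3 * (1 - theta)**2

# Monte Carlo estimate of P(A): draw theta from the prior,
# then average the likelihood over the draws.
draws = [random.random() for _ in range(100_000)]
p_A = sum(likelihood(t) for t in draws) / len(draws)
print(p_A)  # close to the exact value B(4, 3) = 1/60 ~ 0.0167
```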

But more to the point, we usually don't even care what P(cheese) is. Bear in mind, we are trying to hone our belief regarding whether or not an email is spam, and couldn't care less about the marginal distribution of the data (the P(A), above). It is just a normalization constant, anyway, which doesn't depend on the parameter; the act of summation washes out whatever info we had about the parameter. The constant is a nuisance to calculate and is ultimately irrelevant when it comes to zeroing in on our beliefs about whether or not the email's spam. Sometimes we are obliged to calculate it, in which case the quickest way to do so is with the info we already have: the prior and likelihood.
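A small sketch of why the constant doesn't matter for classification (same made-up numbers as above): the unnormalized products already determine the decision, since dividing both by P(A) preserves their order:

```python
# The unnormalized posteriors share the same denominator P(A),
# so comparing them never requires computing P(A) at all.
prior_spam, prior_ham = 0.2, 0.8
lik_spam, lik_ham = 0.05, 0.01  # hypothetical P("cheese" | class)

unnorm_spam = lik_spam * prior_spam  # proportional to P(spam | "cheese")
unnorm_ham = lik_ham * prior_ham     # proportional to P(ham  | "cheese")

label = "spam" if unnorm_spam > unnorm_ham else "ham"
print(label)  # "spam": 0.010 > 0.008, no normalization needed
```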

One reason for using the total probability rule is that we often deal with the component probabilities in that expression, and it's straightforward to find the marginal probability by simply plugging in the values. For an illustration, see the worked example in the Wikipedia article on the law of total probability.
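That example isn't reproduced here, but a generic plug-in with made-up numbers (P(A|B) = 0.9, P(B) = 0.3, P(A|¬B) = 0.2) looks like:

$$P(A) = P(A|B)\,P(B) + P(A|\neg B)\,P(\neg B) = 0.9 \times 0.3 + 0.2 \times 0.7 = 0.27 + 0.14 = 0.41.$$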

Another reason is recognizing equivalent forms of Bayes' Rule by manipulating that expression. For example:

$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A|B)\,P(B) + P(A|\neg B)\,P(\neg B)}$$

Divide the numerator and denominator of the RHS through by the numerator:

$$P(B|A) = \frac{1}{1 + \frac{P(A|\neg B)}{P(A|B)}\,\frac{P(\neg B)}{P(B)}}$$

This is a nice equivalent form of Bayes' Rule, made even handier by taking the reciprocal of both sides and subtracting 1 (since 1/P(B|A) − 1 = P(¬B|A)/P(B|A)) to obtain:

$$\frac{P(\neg B|A)}{P(B|A)} = \frac{P(A|\neg B)}{P(A|B)}\,\frac{P(\neg B)}{P(B)}$$

This is Bayes' Rule stated in terms of odds, i.e. posterior odds against B = Bayes factor against B times the prior odds against B. (Or you could invert it to get an expression in terms of odds for B.) The Bayes factor is the ratio of the likelihoods of your models. Given that we're uncertain about the underlying data-generating mechanism, we observe data and update our beliefs.
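As a quick numeric sanity check of the odds form, reusing the made-up numbers from the total-probability example above:

```python
# Verify: posterior odds against B = Bayes factor against B * prior odds against B.
p_B = 0.3                   # prior P(B)
lik_B, lik_notB = 0.9, 0.2  # P(A|B) and P(A|not B)

p_A = lik_B * p_B + lik_notB * (1 - p_B)   # total probability: 0.41
post_B = lik_B * p_B / p_A                 # P(B|A) via Bayes' Rule
post_odds_against = (1 - post_B) / post_B  # P(not B|A) / P(B|A)

bayes_factor = lik_notB / lik_B            # P(A|not B) / P(A|B)
prior_odds = (1 - p_B) / p_B               # P(not B) / P(B)

print(post_odds_against)          # ~0.5185
print(bayes_factor * prior_odds)  # same value: the two forms agree
```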

I'm not sure if you find this useful, but hopefully it's not baffling; you should obviously work with the expression that works best for your scenario. Maybe someone else can pipe in with even better reasons.

The previous replies are detailed enough, but here is an intuitive way of looking at why P(A) (i.e. the denominator in Bayes' theorem) is broken into two cases.

It is hard to say anything about P(A) without any knowledge of whether the email is ham or spam. You are correct that "cheese" appears in spam as well as in ham, but if you look at the probability that "cheese" appears given that the email is ham (P(A|B), with B standing for ham), you can definitely say a lot about it. At least in my case, the word mostly shows up in legitimate mail, so P(A|B) will be high (say 90%); similarly, I don't receive a lot of spam containing "cheese", so P(A|¬B) will be low. Basically, we look at the occurrence of the event of interest (here A) partitioned into the two disjoint events B and ¬B. Once A is partitioned this way, we can say much more about the conditional probabilities P(A|B) and P(A|¬B). To get the total probability, we also need to weight these conditional probabilities by the probabilities of the events we condition on, i.e. P(B) and P(¬B). Therefore we arrive at the final expression:

$$P(A) = P(A|B)\,P(B) + P(A|\neg B)\,P(\neg B)$$
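Plugging in this answer's illustrative figure P(A|B) = 0.9, together with assumed values P(A|¬B) = 0.05 and P(B) = 0.7 (the answer does not give these two), a sketch of the computation:

```python
# Total probability of seeing "cheese" (event A), partitioned over
# ham (B) and spam (not B).
p_cheese_given_ham = 0.90   # P(A|B), the answer's illustrative figure
p_cheese_given_spam = 0.05  # P(A|not B), assumed for illustration
p_ham = 0.70                # P(B), assumed for illustration

p_cheese = (p_cheese_given_ham * p_ham
            + p_cheese_given_spam * (1 - p_ham))
print(p_cheese)  # 0.63 + 0.015 = 0.645
```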