
We now have:

\[
\begin{aligned}
p(x_1, \ldots, x_{50000} \mid y) &= p(x_1 \mid y)\, p(x_2 \mid y, x_1)\, p(x_3 \mid y, x_1, x_2) \cdots p(x_{50000} \mid y, x_1, \ldots, x_{49999}) \\
&= p(x_1 \mid y)\, p(x_2 \mid y)\, p(x_3 \mid y) \cdots p(x_{50000} \mid y) \\
&= \prod_{i=1}^{n} p(x_i \mid y)
\end{aligned}
\]

The first equality simply follows from the usual properties of probabilities (the chain rule), and the second equality used the NB assumption. We note that even though the Naive Bayes assumption is an extremely strong assumption, the resulting algorithm works well on many problems.
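
To make the factorization concrete, here is a minimal Python sketch using a hypothetical 5-word vocabulary and made-up probabilities standing in for $p(x_i = 1 \mid y)$:

```python
import numpy as np

# Hypothetical per-word probabilities phi[i] = p(x_i = 1 | y) for one fixed class y,
# with a tiny 5-word vocabulary instead of 50,000 words.
phi = np.array([0.20, 0.05, 0.90, 0.40, 0.01])

# Feature vector x with x_i = 1 if word i appears in the email.
x = np.array([1, 0, 1, 0, 0])

# Under the NB assumption, p(x_1, ..., x_n | y) = prod_i p(x_i | y),
# where each factor is a Bernoulli probability.
p_x_given_y = np.prod(np.where(x == 1, phi, 1.0 - phi))
print(p_x_given_y)
```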

Our model is parameterized by $\phi_{i|y=1} = p(x_i = 1 \mid y = 1)$, $\phi_{i|y=0} = p(x_i = 1 \mid y = 0)$, and $\phi_y = p(y = 1)$. As usual, given a training set $\{(x^{(i)}, y^{(i)});\, i = 1, \ldots, m\}$, we can write down the joint likelihood of the data:

\[
\mathcal{L}(\phi_y, \phi_{j|y=0}, \phi_{j|y=1}) = \prod_{i=1}^{m} p(x^{(i)}, y^{(i)}).
\]

Maximizing this with respect to $\phi_y$, $\phi_{i|y=0}$ and $\phi_{i|y=1}$ gives the maximum likelihood estimates:

\[
\begin{aligned}
\phi_{j|y=1} &= \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}} \\
\phi_{j|y=0} &= \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}} \\
\phi_y &= \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}
\end{aligned}
\]

In the equations above, the "$\wedge$" symbol means "and." The parameters have a very natural interpretation. For instance, $\phi_{j|y=1}$ is just the fraction of the spam ($y = 1$) emails in which word $j$ does appear.
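
Since the estimates above are just per-class frequency counts, the fitting step is straightforward. Below is a short Python sketch of it; the function and variable names are illustrative, and no Laplace smoothing (discussed later) is applied:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Maximum likelihood estimates for Bernoulli Naive Bayes.

    X is an (m, n) binary matrix with X[i, j] = 1 if word j appears in
    email i, and y is a length-m vector of 0/1 labels. This is a sketch
    of the estimates above, without Laplace smoothing.
    """
    phi_y = np.mean(y == 1)
    # Fraction of spam (y = 1) emails in which word j appears,
    # and likewise for non-spam (y = 0) emails.
    phi_j_given_y1 = X[y == 1].sum(axis=0) / np.sum(y == 1)
    phi_j_given_y0 = X[y == 0].sum(axis=0) / np.sum(y == 0)
    return phi_y, phi_j_given_y0, phi_j_given_y1
```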

Having fit all these parameters, to make a prediction on a new example with features x , we then simply calculate

\[
p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x)}
= \frac{\left(\prod_{i=1}^{n} p(x_i \mid y = 1)\right) p(y = 1)}{\left(\prod_{i=1}^{n} p(x_i \mid y = 1)\right) p(y = 1) + \left(\prod_{i=1}^{n} p(x_i \mid y = 0)\right) p(y = 0)},
\]

and pick whichever class has the higher posterior probability.
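
Here is a corresponding sketch of the prediction step, assuming parameters fit as above; note that in practice one would sum log-probabilities rather than multiply, to avoid numerical underflow when $n$ is large:

```python
import numpy as np

def predict(x, phi_y, phi_j_given_y0, phi_j_given_y1):
    """Posterior p(y = 1 | x) for a binary feature vector x.

    A sketch of the prediction rule above; multiplying many small
    probabilities directly can underflow for large vocabularies.
    """
    p_x_y1 = np.prod(np.where(x == 1, phi_j_given_y1, 1.0 - phi_j_given_y1)) * phi_y
    p_x_y0 = np.prod(np.where(x == 1, phi_j_given_y0, 1.0 - phi_j_given_y0)) * (1.0 - phi_y)
    posterior = p_x_y1 / (p_x_y1 + p_x_y0)
    # Pick the class with the higher posterior probability.
    return posterior, int(posterior > 0.5)
```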

Lastly, we note that while we have developed the Naive Bayes algorithm mainly for the case of problems where the features $x_i$ are binary-valued, the generalization to where $x_i$ can take values in $\{1, 2, \ldots, k_i\}$ is straightforward. Here, we would simply model $p(x_i \mid y)$ as multinomial rather than as Bernoulli. Indeed, even if some original input attribute (say, the living area of a house, as in our earlier example) were continuous-valued, it is quite common to discretize it (that is, turn it into a small set of discrete values) and apply Naive Bayes. For instance, if we use some feature $x_i$ to represent living area, we might discretize the continuous values as follows:

| Living area (sq. feet) | < 400 | 400-800 | 800-1200 | 1200-1600 | > 1600 |
|------------------------|-------|---------|----------|-----------|--------|
| $x_i$                  | 1     | 2       | 3        | 4         | 5      |

Thus, for a house with living area 890 square feet, we would set the value of the corresponding feature $x_i$ to 3. We can then apply the Naive Bayes algorithm, and model $p(x_i \mid y)$ with a multinomial distribution, as described previously. When the original, continuous-valued attributes are not well-modeled by a multivariate normal distribution, discretizing the features and using Naive Bayes (instead of GDA) will often result in a better classifier.
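
As a small illustration of this discretization step, the following sketch uses NumPy's digitize to map living areas to the five bins in the table (the specific areas are made up):

```python
import numpy as np

# Bin edges corresponding to the table above: < 400, 400-800, 800-1200,
# 1200-1600, > 1600, giving feature values 1 through 5.
edges = np.array([400, 800, 1200, 1600])
living_area = np.array([350.0, 890.0, 2100.0])
x_i = np.digitize(living_area, edges) + 1
print(x_i)  # [1 3 5]; e.g. 890 sq. feet falls in the 800-1200 bin, so x_i = 3
```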

Laplace smoothing

The Naive Bayes algorithm as we have described it will work fairly well for many problems, but there is a simple change that makes it work much better, especially for text classification. Let's briefly discuss a problem with the algorithm in its current form, and then talk about how we can fix it.

Consider spam/email classification, and let's suppose that, after completing CS229 and having done excellent work on the project, you decide around June 2003 to submit the work you did to the NIPS conference for publication. (NIPS is one of the top machine learning conferences, and the deadline for submitting a paper is typically in late June or early July.) Because you end up discussing the conference in your emails, you also start getting messages with the word "nips" in them. But this is your first NIPS paper, and until this time, you had not previously seen any emails containing the word "nips"; in particular, "nips" never appeared in your training set of spam/non-spam emails. Assuming that "nips" was the 35000th word in the dictionary, your Naive Bayes spam filter therefore had picked its maximum likelihood estimates of the parameters $\phi_{35000|y}$ to be zero: $\phi_{35000|y=1} = 0$ and $\phi_{35000|y=0} = 0$, because no training email of either class contained the word.
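
To see the problem numerically, here is a tiny sketch with made-up counts; because "nips" never appears in the training set, both maximum likelihood estimates come out to exactly zero:

```python
# Hypothetical training-set counts, for illustration only.
count_word_and_spam = 0       # no training spam email contained "nips"
count_word_and_nonspam = 0    # no training non-spam email contained "nips"
num_spam = 500                # assumed number of spam training emails
num_nonspam = 1500            # assumed number of non-spam training emails

phi_35000_y1 = count_word_and_spam / num_spam
phi_35000_y0 = count_word_and_nonspam / num_nonspam
print(phi_35000_y1, phi_35000_y0)  # 0.0 0.0: both class-conditional products then contain a zero factor
```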

