
Naive Bayes

In GDA, the feature vectors $x$ were continuous, real-valued vectors. Let's now talk about a different learning algorithm in which the $x_i$'s are discrete-valued.

For our motivating example, consider building an email spam filter using machine learning. Here, we wish to classify messages according to whether they are unsolicited commercial (spam) email or non-spam email. After learning to do this, we can then have our mail reader automatically filter out the spam messages and perhaps place them in a separate mail folder. Classifying emails is one example of a broader set of problems called text classification.

Let's say we have a training set (a set of emails labeled as spam or non-spam). We'll begin our construction of our spam filter by specifying the features $x_i$ used to represent an email.

We will represent an email via a feature vector whose length is equal to the number of words in the dictionary. Specifically, if an email contains the $i$-th word of the dictionary, then we will set $x_i = 1$; otherwise, we let $x_i = 0$. For instance, the vector

$$
x = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}
\begin{matrix} \text{a} \\ \text{aardvark} \\ \text{aardwolf} \\ \vdots \\ \text{buy} \\ \vdots \\ \text{zygmurgy} \end{matrix}
$$

is used to represent an email that contains the words “a” and “buy,” but not “aardvark,” “aardwolf,” or “zygmurgy.” Actually, rather than looking through an English dictionary for the list of all English words, in practice it is more common to look through our training set and encode in our feature vector only the words that occur at least once there. Apart from reducing the number of words modeled and hence reducing our computational and space requirements, this also has the advantage of allowing us to model/include as a feature many words that may appear in your email (such as “cs229”) but that you won't find in a dictionary. Sometimes (as in the homework), we also exclude the very high frequency words (words like “the,” “of,” and “and”; these high frequency, “content free” words are called stop words), since they occur in so many documents and do little to indicate whether an email is spam or non-spam. The set of words encoded into the feature vector is called the vocabulary, so the dimension of $x$ is equal to the size of the vocabulary.
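As a concrete illustration, here is a minimal sketch of this encoding in Python. The tokenization (lowercasing and splitting on whitespace), the tiny stop-word list, and the helper names are assumptions made for the example, not part of the notes.

```python
from typing import Dict, List

# A tiny assumed stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "and"}

def build_vocabulary(training_emails: List[str]) -> Dict[str, int]:
    """Collect every word that occurs at least once in the training set
    (excluding stop words) and assign it an index in the feature vector."""
    vocab: Dict[str, int] = {}
    for email in training_emails:
        for word in email.lower().split():
            if word not in STOP_WORDS and word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def email_to_features(email: str, vocab: Dict[str, int]) -> List[int]:
    """Binary feature vector: x_i = 1 iff the i-th vocabulary word appears."""
    x = [0] * len(vocab)
    for word in email.lower().split():
        if word in vocab:
            x[vocab[word]] = 1
    return x

emails = ["buy cheap meds now", "meeting notes for cs229"]
vocab = build_vocabulary(emails)
print(email_to_features("buy meds", vocab))  # 1's at the "buy" and "meds" indices
```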

Having chosen our feature vector, we now want to build a generative model. So, we have to model $p(x|y)$. But if we have, say, a vocabulary of 50000 words, then $x \in \{0,1\}^{50000}$ ($x$ is a 50000-dimensional vector of 0's and 1's), and if we were to model $x$ explicitly with a multinomial distribution over the $2^{50000}$ possible outcomes, then we'd end up with a $(2^{50000}-1)$-dimensional parameter vector. This is clearly too many parameters.
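To spell out where the $(2^{50000}-1)$ comes from: a fully general distribution needs one probability per possible vector, and the sum-to-one constraint removes exactly one degree of freedom.

```latex
% One probability per outcome x in {0,1}^n, minus one constraint:
\[
  \sum_{x \in \{0,1\}^{n}} p(x \mid y) = 1
  \quad\Longrightarrow\quad
  2^{n} - 1 \text{ free parameters per class } y,
\]
% so n = 50000 gives 2^{50000} - 1 parameters, which is intractable.
```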

To model $p(x|y)$, we will therefore make a very strong assumption. We will assume that the $x_i$'s are conditionally independent given $y$. This assumption is called the Naive Bayes (NB) assumption, and the resulting algorithm is called the Naive Bayes classifier. For instance, if $y = 1$ means spam email; “buy” is word 2087 and “price” is word 39831; then we are assuming that if I tell you $y = 1$ (that a particular piece of email is spam), then knowledge of $x_{2087}$ (knowledge of whether “buy” appears in the message) will have no effect on your beliefs about the value of $x_{39831}$ (whether “price” appears). More formally, this can be written $p(x_{2087}|y) = p(x_{2087}|y, x_{39831})$. (Note that this is not the same as saying that $x_{2087}$ and $x_{39831}$ are independent, which would have been written “$p(x_{2087}) = p(x_{2087}|x_{39831})$”; rather, we are only assuming that $x_{2087}$ and $x_{39831}$ are conditionally independent given $y$.)
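Under the NB assumption, $p(x|y)$ factors into a product of per-word conditionals, $p(x|y) = \prod_i p(x_i|y)$, so each class needs only one parameter per vocabulary word rather than one per outcome. Below is a minimal sketch of evaluating that product; the probability table `phi` holds made-up illustrative values, since estimating these parameters from data comes later.

```python
import math
from typing import List

def log_p_x_given_y(x: List[int], phi: List[float]) -> float:
    """Log of p(x|y) under the Naive Bayes assumption:
    p(x|y) = prod_i p(x_i|y), where phi[i] = p(x_i = 1 | y).
    Summing logs avoids numerical underflow when the vocabulary is large."""
    total = 0.0
    for xi, p in zip(x, phi):
        total += math.log(p if xi == 1 else 1.0 - p)
    return total

# Hypothetical per-word probabilities for a 4-word vocabulary, class y = 1 (spam):
phi_spam = [0.8, 0.05, 0.3, 0.01]  # made-up values for illustration
x = [1, 0, 1, 0]                   # email contains words 0 and 2
print(math.exp(log_p_x_given_y(x, phi_spam)))  # 0.8 * 0.95 * 0.3 * 0.99
```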





Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4
