And so having estimated all these parameters, when you're given a new piece of email that you want to classify, you can then compute P(y | x) using Bayes' rule, right? Same as before, because together these parameters give you a model for P(x | y) and for P(y), and by using Bayes' rule, given these two terms, you can compute P(y | x), and there's your spam classifier, okay? Turns out we need one more elaboration to this idea, but let me check if there are questions about this so far.
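The Bayes' rule step described above can be sketched in a few lines. This is a minimal illustration, assuming binary (0/1) word features and the Naive Bayes factorization of P(x | y); the function and parameter names are my own, not from the lecture.

```python
import numpy as np

def naive_bayes_posterior(x, phi_given_y1, phi_given_y0, phi_y):
    """Compute P(y = 1 | x) via Bayes' rule for a binary feature vector x.

    phi_given_y1[j] = P(x_j = 1 | y = 1), estimated from the spam emails
    phi_given_y0[j] = P(x_j = 1 | y = 0), estimated from the non-spam emails
    phi_y           = P(y = 1), the prior probability that an email is spam
    """
    x = np.asarray(x)
    # Naive Bayes assumption: P(x | y) is a product over individual features.
    p_x_given_y1 = np.prod(np.where(x == 1, phi_given_y1, 1 - np.asarray(phi_given_y1)))
    p_x_given_y0 = np.prod(np.where(x == 1, phi_given_y0, 1 - np.asarray(phi_given_y0)))
    # Bayes' rule: P(y=1 | x) = P(x | y=1) P(y=1) / P(x).
    numerator = p_x_given_y1 * phi_y
    return numerator / (numerator + p_x_given_y0 * (1 - phi_y))
```

For long word lists you would sum log-probabilities instead of multiplying, to avoid numerical underflow, but the structure is the same.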
Student: So does this model depend on the number of inputs?
Instructor (Andrew Ng): What do you mean, number of inputs, the number of features?
Student: No, number of samples.
Instructor (Andrew Ng): Well, M is the number of training examples, so given M training examples, this is the formula for the maximum likelihood estimate of the parameters, right? So, other questions, does it make sense? Or, M is the number of training examples, so when you have M training examples, you plug them into this formula, and that's how you compute the maximum likelihood estimates.
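The "plug the M training examples into the formula" step can be sketched concretely. The maximum likelihood estimates for this model are just frequency counts: the fraction of examples that are spam, and, within each class, the fraction of emails containing each word. This is a minimal sketch assuming binary features; the function name is my own.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Maximum likelihood estimates of the Naive Bayes parameters.

    X: (M, n) binary matrix; X[i, j] = 1 if word j appears in email i
    y: (M,) labels; 1 = spam, 0 = non-spam
    """
    X, y = np.asarray(X), np.asarray(y)
    phi_y = y.mean()                          # P(y = 1): fraction of spam emails
    phi_given_y1 = X[y == 1].mean(axis=0)     # P(x_j = 1 | y = 1)
    phi_given_y0 = X[y == 0].mean(axis=0)     # P(x_j = 1 | y = 0)
    return phi_given_y1, phi_given_y0, phi_y
```

(The "one more elaboration" the lecture alludes to is presumably smoothing these counts, since a word never seen in one class would otherwise get probability exactly zero.)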
Student: By training examples, do you mean M is the number of emails?
Instructor (Andrew Ng): Yeah, right. So, right. So it's, kind of, your training set. I would go through all the email I've gotten in the last two months and label them as spam or not spam, and so you have – I don't know, like, a few hundred emails labeled as spam or not spam, and that will comprise your training set (x1, y1) through (xM, yM), where x is one of those vectors representing which words appeared in the email and y is 0 or 1 depending on whether that email is spam or not spam, okay?
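The mapping from a raw email to one of those binary vectors x can be sketched as follows. This is a simplified illustration with naive whitespace tokenization; a real system would do more careful text processing, and the names here are my own, not from the lecture.

```python
def email_to_feature_vector(email_text, vocabulary):
    """Map an email to a binary vector x, where x[j] = 1 iff vocabulary[j]
    appears anywhere in the email (presence, not count)."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]
```

For example, with a tiny vocabulary ["buy", "hello", "pills"], the email "Buy cheap pills now" maps to the vector [1, 0, 1].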
Student: So you're saying that this model depends on the number of examples, but the last model doesn't, yet your phi is the same for either one?
Instructor (Andrew Ng): They're different things, right? There's the model, which is – the modeling assumptions I'm making. I'm making the Naive Bayes assumption. So the probabilistic model is an assumption on the joint distribution of x and y. That's what the model is, and then I'm given a fixed number of training examples. I'm given M training examples, and then, after I'm given the training set, I'll then go in and write down the maximum likelihood estimate of the parameters, right? So that's, sort of – maybe we should take that offline. Yeah, ask a question?
Student: Then how would you do this, like, if this [inaudible] didn’t work?
Instructor (Andrew Ng): Say that again.
Student: How would you do it, say, like the 50,000 words –
Instructor (Andrew Ng): Oh, okay. How to do this with the 50,000 words, yeah. So it turns out this is, sort of, a very practical question, really: how do I come up with this list of words? In practice, one common way to come up with the list of words is to just take all the words that appear in your training set.
That’s one fairly common way to do it, or if that turns out to be too many words, you can take all words that appear at least three times in your training set. So words that you didn’t even see three times in the emails you got in the last two months, you discard. So those are – I was talking about going through a dictionary, which is a nice way of thinking about it, but in practice, you might go through your training set and then just take the union of all the words that appear in it.
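The vocabulary-building heuristic described above can be sketched in a few lines. This is a minimal version, assuming "appears at least three times" means three total occurrences across the training emails (the lecture could also mean three distinct emails); the function name is my own.

```python
from collections import Counter

def build_vocabulary(training_emails, min_count=3):
    """Take the union of all words in the training set, discarding words
    seen fewer than min_count times, as suggested in the lecture."""
    counts = Counter()
    for email in training_emails:
        counts.update(email.lower().split())
    return sorted(w for w, c in counts.items() if c >= min_count)
```

With min_count=1 this is the plain union of all words; raising the threshold trades vocabulary size against coverage of rare words.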