$$\mathrm{MI}(x_i, y) = \sum_{x_i \in \{0,1\}} \sum_{y \in \{0,1\}} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\,p(y)}.$$

(The equation above assumes that $x_i$ and $y$ are binary-valued; more generally, the summations would be over the domains of the variables.) The probabilities $p(x_i, y)$, $p(x_i)$, and $p(y)$ can all be estimated from their empirical distributions on the training set.

To gain intuition about what this score does, note that the mutual information can also be expressed as a Kullback-Leibler (KL) divergence:

$$\mathrm{MI}(x_i, y) = \mathrm{KL}\left(p(x_i, y) \,\middle\|\, p(x_i)\,p(y)\right).$$

You'll get to play more with KL divergence in Problem Set #3, but informally, it gives a measure of how different the probability distributions $p(x_i, y)$ and $p(x_i)\,p(y)$ are. If $x_i$ and $y$ are independent random variables, then we would have $p(x_i, y) = p(x_i)\,p(y)$, and the KL divergence between the two distributions would be zero. This is consistent with the idea that if $x_i$ and $y$ are independent, then $x_i$ is clearly very “non-informative” about $y$, and thus the score $S(i)$ should be small. Conversely, if $x_i$ is very “informative” about $y$, then their mutual information $\mathrm{MI}(x_i, y)$ would be large.
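To make this concrete, here is a minimal Python sketch (not part of the original notes) that estimates the probabilities empirically and computes the score $S(i) = \mathrm{MI}(x_i, y)$ for every binary feature; the names `X`, `y`, and `mi_score` are illustrative choices, not anything the notes define.

import numpy as np

def mi_score(X, y, eps=1e-12):
    """Empirical mutual information MI(x_i, y) for each binary feature.

    X is an (m, n) array of 0/1 features; y is an (m,) array of 0/1 labels.
    Returns an (n,) array of scores S(i).
    """
    m, n = X.shape
    scores = np.zeros(n)
    for i in range(n):
        s = 0.0
        for xv in (0, 1):
            for yv in (0, 1):
                # Empirical joint and marginal probabilities from the training set.
                p_xy = np.mean((X[:, i] == xv) & (y == yv))
                p_x = np.mean(X[:, i] == xv)
                p_y = np.mean(y == yv)
                if p_xy > 0:
                    s += p_xy * np.log(p_xy / (p_x * p_y + eps))
        scores[i] = s
    return scores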

One final detail: now that you've ranked the features according to their scores $S(i)$, how do you decide how many features $k$ to choose? One standard way to do so is to use cross validation to select among the possible values of $k$. For example, when applying naive Bayes to text classification (a problem where $n$, the vocabulary size, is usually very large), using this method to select a feature subset often results in increased classifier accuracy.
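As a hedged illustration of that cross-validation step, the sketch below reuses `mi_score` from above and leans on scikit-learn's `BernoulliNB` and `cross_val_score` (my choice of tools; the notes do not name a library):

import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

def select_k(X, y, candidate_ks=(10, 50, 100, 500)):
    """Choose the number of top-ranked features k by 5-fold cross validation."""
    ranking = np.argsort(mi_score(X, y))[::-1]  # highest-scoring features first
    best_k, best_acc = None, -np.inf
    for k in candidate_ks:
        cols = ranking[:k]
        acc = cross_val_score(BernoulliNB(), X[:, cols], y, cv=5).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k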

Bayesian statistics and regularization

In this section, we will talk about one more tool in our arsenal against overfitting.

At the beginning of the quarter, we talked about parameter fitting using maximum likelihood (ML), and chose our parameters according to

$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta).$$

Throughout our subsequent discussions, we viewed $\theta$ as an unknown parameter of the world. This view of $\theta$ as being constant-valued but unknown is taken in frequentist statistics. In this view of the world, $\theta$ is not random (it just happens to be unknown), and it's our job to come up with statistical procedures (such as maximum likelihood) to try to estimate this parameter.

An alternative way to approach our parameter estimation problems is to take the Bayesian view of the world, and think of $\theta$ as being a random variable whose value is unknown. In this approach, we would specify a prior distribution $p(\theta)$ on $\theta$ that expresses our “prior beliefs” about the parameters. Given a training set $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, when we are asked to make a prediction on a new value of $x$, we can then compute the posterior distribution on the parameters

$$p(\theta \mid S) = \frac{p(S \mid \theta)\,p(\theta)}{p(S)} = \frac{\left(\prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)\right) p(\theta)}{\int_{\theta} \left(\prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)\right) p(\theta)\, d\theta}.$$

In the equation above, $p(y^{(i)} \mid x^{(i)}, \theta)$ comes from whatever model you're using for your learning problem. For example, if you are using Bayesian logistic regression, then you might choose $p(y^{(i)} \mid x^{(i)}, \theta) = h_\theta(x^{(i)})^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{(1 - y^{(i)})}$, where $h_\theta(x^{(i)}) = 1/(1 + \exp(-\theta^T x^{(i)}))$. Since we are now viewing $\theta$ as a random variable, it is okay to condition on its value, and write “$p(y \mid x, \theta)$” instead of “$p(y \mid x; \theta)$.”
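For concreteness, here is a small sketch of that logistic likelihood in Python (the function names are mine, not the notes'); it will be reused in the sketches below:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """Sum over i of log p(y^(i) | x^(i), theta) under the logistic model."""
    h = sigmoid(X @ theta)                      # h_theta(x^(i)) for every example
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))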

When we are given a new test example $x$ and asked to make a prediction on it, we can compute our posterior distribution on the class label using the posterior distribution on $\theta$:

$$p(y \mid x, S) = \int_{\theta} p(y \mid x, \theta)\, p(\theta \mid S)\, d\theta.$$

In the equation above, $p(\theta \mid S)$ comes from the posterior distribution computed earlier. Thus, for example, if the goal is to predict the expected value of $y$ given $x$, then we would output the following. (The integral below would be replaced by a summation if $y$ is discrete-valued.)

$$\mathbb{E}[y \mid x, S] = \int_{y} y\, p(y \mid x, S)\, dy.$$

The procedure that we've outlined here can be thought of as doing “fully Bayesian” prediction, where our prediction is computed by taking an average with respect to the posterior $p(\theta \mid S)$ over $\theta$. Unfortunately, in general it is computationally very difficult to compute this posterior distribution. This is because it requires taking integrals over the (usually high-dimensional) $\theta$, as in the equations above, and this typically cannot be done in closed form.
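One common workaround, stated here as an aside rather than anything the notes develop, is Monte Carlo averaging: given samples from the posterior (obtained, say, by MCMC, which is beyond this sketch), the integral becomes a simple average. This reuses `sigmoid` and `np` from the earlier sketch.

def predict_fully_bayesian(x, theta_samples):
    """Approximate p(y = 1 | x, S) by averaging over posterior samples.

    theta_samples: (num_samples, n) array of draws from p(theta | S).
    """
    return np.mean(sigmoid(theta_samples @ x))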

Thus, in practice we will instead approximate the posterior distribution for $\theta$. One common approximation is to replace our posterior distribution for $\theta$ (as computed above) with a single point estimate. The MAP (maximum a posteriori) estimate for $\theta$ is given by

$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} \left(\prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)\right) p(\theta).$$

Note that this is the same formula as for the ML (maximum likelihood) estimate for $\theta$, except for the additional prior $p(\theta)$ term at the end.
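Once $\theta_{\mathrm{MAP}}$ is in hand, the prediction integral is approximated by plugging in the point estimate; this plug-in step is implicit in the notes, so it is worth writing out:

$$p(y \mid x, S) \approx p(y \mid x, \theta_{\mathrm{MAP}}).$$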

In practical applications, a common choice for the prior $p(\theta)$ is to assume that $\theta \sim \mathcal{N}(0, \tau^2 I)$. Using this choice of prior, the fitted parameters $\theta_{\mathrm{MAP}}$ will have smaller norm than those selected by maximum likelihood. (See Problem Set #3.) In practice, this causes the Bayesian MAP estimate to be less susceptible to overfitting than the ML estimate of the parameters. For example, Bayesian logistic regression turns out to be an effective algorithm for text classification, even though in text classification we usually have $n \gg m$.
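As a final sketch (assumptions: the logistic model and `sigmoid` from above, plain gradient ascent with a fixed step size): with the prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, maximizing the log posterior is equivalent to maximizing the log likelihood minus the $\ell_2$ penalty $\|\theta\|^2 / (2\tau^2)$, which is why the MAP estimate shrinks the norm of $\theta$.

import numpy as np

def fit_map(X, y, tau=1.0, lr=0.1, iters=1000):
    """MAP estimate for Bayesian logistic regression with a N(0, tau^2 I) prior.

    Equivalent to L2-regularized maximum likelihood: the log prior contributes
    the term -theta / tau**2 to the gradient of the log posterior.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h) - theta / tau**2   # gradient of log lik + log prior
        theta += lr * grad
    return theta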
