
Logistic regression

We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for h_\theta(x) to take values larger than 1 or smaller than 0 when we know that y \in \{0, 1\}.

To fix this, let's change the form for our hypotheses h_\theta(x). We will choose

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}},

where

g(z) = \frac{1}{1 + e^{-z}}

is called the logistic function or the sigmoid function. Here is a plot showing g(z):

[Figure: plot of the logistic function g(z)]

Notice that g(z) tends towards 1 as z \to \infty, and g(z) tends towards 0 as z \to -\infty. Moreover, g(z), and hence also h(x), is always bounded between 0 and 1. As before, we are keeping the convention of letting x_0 = 1, so that \theta^T x = \theta_0 + \sum_{j=1}^{n} \theta_j x_j.

For now, let's take the choice of g as given. Other functions that smoothly increase from 0 to 1 can also be used, but for a couple of reasons that we'll see later (when we talk about GLMs, and when we talk about generative learning algorithms), the choice of the logistic function is a fairly natural one. Before moving on, here's a useful property of the derivative of the sigmoid function, which we write as g':

g'(z) = \frac{d}{dz} \frac{1}{1 + e^{-z}} = \frac{1}{(1 + e^{-z})^2} \left( e^{-z} \right) = \frac{1}{1 + e^{-z}} \cdot \left( 1 - \frac{1}{1 + e^{-z}} \right) = g(z) \left( 1 - g(z) \right).
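
As a quick numerical sanity check (not part of the original notes), the following short Python sketch compares this identity against a centered finite-difference approximation of g'(z); the function name sigmoid is our own choice:

import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # finite-difference estimate of g'(z)
analytic = sigmoid(z) * (1.0 - sigmoid(z))                    # g(z) * (1 - g(z))
print(np.max(np.abs(numeric - analytic)))                     # agrees to roughly 1e-10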

So, given the logistic regression model, how do we fit \theta for it? Following how we saw least squares regression could be derived as the maximum likelihood estimator under a set of assumptions, let's endow our classification model with a set of probabilistic assumptions, and then fit the parameters via maximum likelihood.

Let us assume that

P(y = 1 \mid x; \theta) = h_\theta(x)
P(y = 0 \mid x; \theta) = 1 - h_\theta(x)

Note that this can be written more compactly as

p(y \mid x; \theta) = \left( h_\theta(x) \right)^{y} \left( 1 - h_\theta(x) \right)^{1 - y}

Assuming that the m training examples were generated independently, we can then write down the likelihood of the parameters as

L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \left( h_\theta(x^{(i)}) \right)^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1 - y^{(i)}}

As before, it will be easier to maximize the log likelihood:

\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)} \log h(x^{(i)}) + \left( 1 - y^{(i)} \right) \log \left( 1 - h(x^{(i)}) \right)
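
As an illustration (a minimal sketch, not from the original notes), the log likelihood above can be computed in vectorized form. Here we assume X is an m x (n+1) design matrix whose first column is all ones (the x_0 = 1 convention) and y is a vector of 0/1 labels:

import numpy as np

def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ y_i * log h_theta(x_i) + (1 - y_i) * log(1 - h_theta(x_i)) ]
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for every training example
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))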

How do we maximize the likelihood? Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by \theta := \theta + \alpha \nabla_\theta \ell(\theta). (Note the positive rather than negative sign in the update formula, since we're maximizing, rather than minimizing, a function now.) Let's start by working with just one training example (x, y), and take derivatives to derive the stochastic gradient ascent rule:

\frac{\partial}{\partial \theta_j} \ell(\theta) = \left( y \frac{1}{g(\theta^T x)} - (1 - y) \frac{1}{1 - g(\theta^T x)} \right) \frac{\partial}{\partial \theta_j} g(\theta^T x)
= \left( y \frac{1}{g(\theta^T x)} - (1 - y) \frac{1}{1 - g(\theta^T x)} \right) g(\theta^T x) \left( 1 - g(\theta^T x) \right) \frac{\partial}{\partial \theta_j} \theta^T x
= \left( y \left( 1 - g(\theta^T x) \right) - (1 - y) g(\theta^T x) \right) x_j
= \left( y - h_\theta(x) \right) x_j

Above, we used the fact that g'(z) = g(z)(1 - g(z)). This therefore gives us the stochastic gradient ascent rule

\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}

If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because h_\theta(x^{(i)}) is now defined as a non-linear function of \theta^T x^{(i)}. Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem. Is this a coincidence, or is there a deeper reason behind this? We'll answer this when we get to GLM models. (See also the extra credit problem on Q3 of problem set 1.)
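
To make the rule concrete, here is one possible implementation of the stochastic gradient ascent update in Python (an illustrative sketch; the function name, learning rate, and number of passes are arbitrary choices, not part of the notes). It again assumes X is an m x (n+1) matrix with a leading column of ones and y is a 0/1 label vector:

import numpy as np

def fit_logistic_sga(X, y, alpha=0.1, n_passes=100):
    # Stochastic gradient ascent: theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij,
    # applied one randomly ordered training example at a time.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_passes):
        for i in np.random.permutation(m):
            h = 1.0 / (1.0 + np.exp(-X[i] @ theta))   # h_theta(x^(i))
            theta += alpha * (y[i] - h) * X[i]        # ascent step on l(theta)
    return theta

Each pass sweeps the training set in a random order; unlike the LMS case, these updates maximize the log likelihood \ell(\theta) rather than minimize a squared-error cost.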

Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4
