The Kullback-Leibler (KL) divergence is

$$K(P_1\|P_0) = \int \log\frac{p_1(Z)}{p_0(Z)}\, p_1(Z)\, d\nu(Z) = \int \log\frac{p_1}{p_0}\, p_1\,.$$

The following Lemma relates total variation, affinity and KL divergence.

Lemma

$$1 - V(P_0,P_1) \;\ge\; \tfrac12\, A^2(P_0,P_1) \;\ge\; \tfrac12 \exp\big(-K(P_1\|P_0)\big)$$

For the first inequality,

$$
\begin{aligned}
A^2(P_0,P_1) &= \left(\int \sqrt{p_0\,p_1}\right)^2
= \left(\int \sqrt{\min(p_0,p_1)\,\max(p_0,p_1)}\right)^2\\
&= \left(\int \sqrt{\min(p_0,p_1)}\,\sqrt{\max(p_0,p_1)}\right)^2\\
&\le \int \min(p_0,p_1)\int \max(p_0,p_1) \qquad\text{by the Cauchy-Schwarz inequality}\\
&= \int \min(p_0,p_1)\,\Big(2 - \int \min(p_0,p_1)\Big)
\qquad\text{since } \int \min(p_0,p_1) + \int \max(p_0,p_1) = \int (p_0 + p_1) = 2\\
&\le 2\int \min(p_0,p_1) = 2\big(1 - V(P_0,P_1)\big)\,.
\end{aligned}
$$

For the second inequality,

$$
\begin{aligned}
A^2(P_0,P_1) &= \left(\int \sqrt{p_0\,p_1}\right)^2
= \left(\exp\log\int \sqrt{p_0\,p_1}\right)^2
= \exp\left(2\log\int \sqrt{p_0\,p_1}\right)\\
&= \exp\left(2\log\int \sqrt{\frac{p_0}{p_1}}\;p_1\right)
\ \ge\ \exp\left(2\int \log\sqrt{\frac{p_0}{p_1}}\;p_1\right)
\qquad\text{by Jensen's inequality}\\
&= \exp\left(-\int \log\frac{p_1}{p_0}\,p_1\right)
= \exp\big(-K(P_1\|P_0)\big)\,.
\end{aligned}
$$
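As a quick numerical sanity check (an illustration only, not part of the argument), the chain of inequalities in the Lemma can be verified for a pair of simple discrete distributions, e.g. two Bernoulli laws:

```python
import numpy as np

# Two Bernoulli distributions P0 = Ber(0.3) and P1 = Ber(0.6),
# represented by their probability mass functions on {0, 1}.
p0 = np.array([0.7, 0.3])
p1 = np.array([0.4, 0.6])

V = 0.5 * np.sum(np.abs(p0 - p1))      # total variation V(P0, P1)
A2 = np.sum(np.sqrt(p0 * p1)) ** 2     # squared affinity A^2(P0, P1)
K = np.sum(p1 * np.log(p1 / p0))       # KL divergence K(P1 || P0)

# The Lemma asserts 1 - V >= A^2 / 2 >= exp(-K) / 2.
print(1 - V, A2 / 2, np.exp(-K) / 2)   # 0.7 >= 0.4545... >= 0.4126...
assert 1 - V >= A2 / 2 >= np.exp(-K) / 2
```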

Putting everything together, we now have the following Theorem:

Theorem

Let $\mathcal{F}$ be a class of models, and suppose we have observations $Z$ distributed according to $P_f$, $f \in \mathcal{F}$. Let $d(\hat f_n, f)$ be the performance measure of the estimator $\hat f_n(Z)$ relative to the true model $f$. Assume also that $d(\cdot,\cdot)$ is a semi-distance. Let $f_0, f_1 \in \mathcal{F}$ be s.t. $d(f_0, f_1) \ge 2 s_n$. Then

$$
\inf_{\hat f_n}\ \sup_{f\in\mathcal F} P_f\big(d(\hat f_n, f)\ge s_n\big)
\ \ge\ \inf_{\hat f_n}\ \max_{j\in\{0,1\}} P_{f_j}\big(d(\hat f_n, f_j)\ge s_n\big)
\ \ge\ \tfrac14 \exp\big(-K(P_{f_1}\|P_{f_0})\big)
$$

How do we use this theorem?

Choose $f_0, f_1$ such that $K(P_1\|P_0) \le \alpha$; then $P_{e,1}$ is bounded away from $0$ and we get a bound

$$\inf_{\hat f_n}\ \sup_{f\in\mathcal F} P_f\big(d(\hat f_n, f)\ge s_n\big) \ \ge\ c \ >\ 0$$

or, after applying Markov's inequality (using $E_f[d(\hat f_n, f)] \ge s_n\, P_f(d(\hat f_n, f)\ge s_n)$),

$$\inf_{\hat f_n}\ \sup_{f\in\mathcal F} E_f\big[d(\hat f_n, f)\big] \ \ge\ c\, s_n$$

To apply the theorem, we need to design $f_0, f_1$ s.t. $d(f_0, f_1) \ge 2 s_n$ and $\exp(-K(P_{f_1}\|P_{f_0}))$ is bounded away from $0$. To reiterate, the design of $f_0, f_1$ requires careful construction so as to balance the tradeoff between the first condition, which requires $f_0, f_1$ to be far apart, and the second condition, which requires $f_0, f_1$ to be close to each other.
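As a small illustration of this recipe (a sketch only; the helper name and interface below are ours, not from the notes), the two quantities we must control translate directly into the bounds of the Theorem:

```python
import math

def two_point_lower_bounds(separation, kl):
    """Bounds implied by the two-point argument above (illustrative helper).

    separation -- d(f0, f1); the theorem applies with s_n = separation / 2
    kl         -- K(P_{f1} || P_{f0}) between the corresponding observation laws
    """
    s_n = separation / 2.0
    prob_bound = 0.25 * math.exp(-kl)   # inf sup P_f(d >= s_n) >= (1/4) e^{-K}
    exp_bound = s_n * prob_bound        # Markov: E_f[d] >= s_n * P_f(d >= s_n)
    return s_n, prob_bound, exp_bound
```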

Let's use this theorem in a problem we are familiar with. Let $X \in [0,1]$ and $Y \mid X = x \sim \text{Bernoulli}(\eta(x))$, where $\eta(x) = P(Y = 1 \mid X = x)$.

Suppose $G^* = [t^*, 1]$. We proved that under these assumptions and an upper bound on the density of $X$, the Chernoff bounding technique yielded an expected error rate for ERM

$$E\big[R(\hat G_n) - R^*\big] = O\!\left(\sqrt{\frac{\log n}{n}}\right)$$

Is this the best possible rate?

Construct two models in the above class (denote it by $\mathcal P$), $P_{XY}^{(0)}$ and $P_{XY}^{(1)}$. For both take $P_X \sim \text{Uniform}([0,1])$ and $\eta^{(0)} \equiv 1/2 - a$, $\eta^{(1)} \equiv 1/2 + a$ ($a > 0$), so $G_0^* = \emptyset$, $G_1^* = [0,1]$.

We are interested in controlling the excess risk

$$R(\hat G_n) - R(G^*) = \int_{\hat G_n \Delta G^*} |2\eta(x) - 1|\; dP_X(x)$$

Note that if the true underlying model is either $P_{XY}^{(0)}$ or $P_{XY}^{(1)}$, we have:

$$R_j(\hat G_n) - R_j(G_j^*) = \int_{\hat G_n \Delta G_j^*} |2\eta^{(j)}(x) - 1|\; dx = 2a \int_{\hat G_n \Delta G_j^*} dx = 2a\, d_\Delta(\hat G_n, G_j^*)\,,$$

where $d_\Delta(G, G')$ denotes the Lebesgue measure of the symmetric difference $G \Delta G'$.

Proposition 1

$d_\Delta(\cdot,\cdot)$ is a semi-distance.

It suffices to show that $d(G_1, G_2) = d(G_2, G_1) \ge 0$, $d(G, G) = 0$ for all $G$, and $d(G_1, G_2) \le d(G_1, G_3) + d(G_3, G_2)$. The first two statements are obvious. The last one (the triangle inequality) follows from the fact that $G_1 \Delta G_2 \subseteq (G_1 \Delta G_3) \cup (G_3 \Delta G_2)$.

Suppose this were not the case; then there exists $x \in G_1 \Delta G_2$ s.t. $x \notin G_1 \Delta G_3$ and $x \notin G_2 \Delta G_3$. In other words,

$$x \in (G_1 \Delta G_2) \cap (G_1 \Delta G_3)^c \cap (G_2 \Delta G_3)^c$$

Since $S \Delta T = (S \cap T^c) \cup (S^c \cap T)$, we have:

$$
\begin{aligned}
x &\in \big[(G_1 \cap G_2^c) \cup (G_1^c \cap G_2)\big]
\cap \big[(G_1^c \cup G_3) \cap (G_1 \cup G_3^c)\big]
\cap \big[(G_2^c \cup G_3) \cap (G_2 \cup G_3^c)\big]\\
&\subseteq \big[G_1 \cap (G_1^c \cup G_3) \cap G_2^c \cap (G_2 \cup G_3^c)\big]
\cup \big[G_1^c \cap (G_1 \cup G_3^c) \cap G_2 \cap (G_2^c \cup G_3)\big]\\
&= \big[G_1 \cap G_3 \cap G_2^c \cap G_3^c\big]
\cup \big[G_1^c \cap G_3^c \cap G_2 \cap G_3\big] = \emptyset\,,
\end{aligned}
$$

a contradiction.
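Although it plays no role in the proof, the triangle inequality for $d_\Delta$ is also easy to check numerically on random subsets of a discretized $[0,1]$ (a sketch; the grid size and the use of a normalized counting measure are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Subsets of a discretized [0, 1] are represented as boolean masks; the mean of
# the XOR is the normalized counting measure of the symmetric difference.
def d_delta(A, B):
    return np.mean(A ^ B)

for _ in range(1000):
    G1, G2, G3 = (rng.random(200) < 0.5 for _ in range(3))
    # Triangle inequality: d(G1, G2) <= d(G1, G3) + d(G3, G2)
    assert d_delta(G1, G2) <= d_delta(G1, G3) + d_delta(G3, G2) + 1e-12
```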

Let's look at the first reduction step:

$$
\inf_{\hat G_n}\ \sup_{p\in\mathcal P} P\big(R(\hat G_n) - R(G^*) \ge s_n\big)
\ \ge\ \inf_{\hat G_n}\ \max_{j\in\{0,1\}} P_j\big(R_j(\hat G_n) - R_j(G_j^*) \ge s_n\big)
\ =\ \inf_{\hat G_n}\ \max_{j\in\{0,1\}} P_j\big(d_\Delta(\hat G_n, G_j^*) \ge s_n/(2a)\big)
$$

So we can work out a bound on $d_\Delta$ and then translate it to excess risk.

Let's apply Theorem 1. Note that $d_\Delta(G_0^*, G_1^*) = 1$ and let $P_0 \triangleq P^{(0)}_{X_1,Y_1,\dots,X_n,Y_n}$ and $P_1 \triangleq P^{(1)}_{X_1,Y_1,\dots,X_n,Y_n}$.

$$
\begin{aligned}
K(P_1\|P_0)
&= E_1\!\left[\log \frac{p^{(1)}_{X_1,Y_1,\dots,X_n,Y_n}(X_1,Y_1,\dots,X_n,Y_n)}{p^{(0)}_{X_1,Y_1,\dots,X_n,Y_n}(X_1,Y_1,\dots,X_n,Y_n)}\right]
= E_1\!\left[\log \frac{p^{(1)}_{X_1,Y_1}(X_1,Y_1)\cdots p^{(1)}_{X_n,Y_n}(X_n,Y_n)}{p^{(0)}_{X_1,Y_1}(X_1,Y_1)\cdots p^{(0)}_{X_n,Y_n}(X_n,Y_n)}\right]\\
&= \sum_{i=1}^n E_1\!\left[\log \frac{p^{(1)}_{X_i,Y_i}(X_i,Y_i)}{p^{(0)}_{X_i,Y_i}(X_i,Y_i)}\right]
= n\, E_1\!\left[\log \frac{p^{(1)}_{Y|X}(Y_1|X_1)}{p^{(0)}_{Y|X}(Y_1|X_1)}\right]
\end{aligned}
$$
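Evaluating the remaining single-sample expectation for the construction above is a direct computation (the marginal of $X$ is uniform under both models and cancels in the ratio):

$$
E_1\!\left[\log \frac{p^{(1)}_{Y|X}(Y_1|X_1)}{p^{(0)}_{Y|X}(Y_1|X_1)}\right]
= \left(\tfrac12 + a\right)\log\frac{\tfrac12 + a}{\tfrac12 - a}
+ \left(\tfrac12 - a\right)\log\frac{\tfrac12 - a}{\tfrac12 + a}
= 2a \log\frac{1 + 2a}{1 - 2a}\,,
$$

so that $K(P_1\|P_0) = 2na \log\frac{1+2a}{1-2a}$.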

Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3