<< Chapter < Page Chapter >> Page >

Take

Z = | R n ( f ) ^ - R ( f ) | and t = ϵ
P ( | R n ^ ( f ) - R ( f ) | ϵ ) E [ | R n ( f ) ^ - R ( f ) | 2 ] ϵ 2 var ( R ^ n ( f ) ) ϵ 2 = i = 1 n var ( L i n ) ϵ 2 = var ( ( X ) , Y ) n ϵ 2 = σ L 2 n ϵ 2 .

So, the probability goes to zero at a rate of at least n - 1 . However, it turns out that this is an extremely loose bound. Accordingto the Central Limit Theorem

R n ^ ( f ) = 1 n i = 1 n L i N R ( f ) , σ L 2 n as n

in distribution. This suggests that for large values of n,

P ( | R n ^ ( f ) - R ( f ) | ϵ ) O e - n ϵ 2 2 σ L 2 .

That is, the Gaussian tail probability is tending to zero exponentially fast.

Chernoff's bound

Note that for any nonnegative random variable Z and t > 0 ,

P ( Z t ) = P ( e s Z e s t ) E [ e s Z ] e s t , s > 0 by Markov's inequality .

Chernoff's bound is based on finding the value of s that minimizes the upper bound. If Z is a sum of independent random variables. For example, say

Z = i = 1 n ( f ( X i ) , Y i ) - R ( f ) = n R ^ n ( f ) - R ( f )

then the bound becomes

P i = 1 n ( L i - E [ L i ] ) t e - s t E [ e s i = 1 n ( L i - E [ L i ] ) ] e - s t i = 1 n E [ e s ( L i - E [ L i ] ) ] , from independence.

Thus, the problem of finding a tight bound boils down to finding a good bound for E [ s s ( L i - E [ L i ] ) ] . Chernoff ('52), first studied this situation for binary random variables. Then,Hoeffding ('63) derived a more general result for arbitrary bounded random variables.

Hoeffding's indequality

Theorem

Hoeffding's inequality

Let Z 1 , Z 2 , . . . , Z n be independent bounded random variables such that Z i [ a i , b i ] with probability 1. Let S n = i = 1 n Z i . Then for any t > 0 , we have

P ( | S n - E [ S n ] | t ) 2 e - 2 t 2 i = 1 n ( b i - a i ) 2 .

The key to proving Hoeffding's inequality is the following upper bound: if Z is a random variable with E [ Z ] = 0 and a Z b , then

E [ e s Z ] e s 2 ( b - a ) 2 8 .

This upper bound is derived as follows. By the convexity of theexponential function,

e s z z - a b - a e s b + b - z b - a e s a , for a z b .
Convexity of exponential function.

Thus,

E [ e s Z ] E Z - a b - a e s b + E b - Z b - a e s a = b b - a e s a - a b - a e s b , since E [ Z ] = 0 = ( 1 - θ + θ e s ( b - a ) ) e - θ s ( b - a ) , where θ = - a b - a .

Now let

u = s ( b - a ) and define φ ( u ) - θ u + log ( 1 - θ + θ e u ) .

Then we have

E [ e s Z ] ( 1 - θ + θ e s ( b - a ) ) e - θ s ( b - a ) = e φ ( u ) .

To minimize the upper bound let's express φ ( u ) in a Taylor's series with remainder :

φ ( u ) = φ ( 0 ) + u φ ' ( 0 ) + u 2 2 φ ' ' ( v ) for some v [ 0 , u ]
φ ' ( u ) = - θ + θ e u 1 - θ + θ e u φ ' ( u ) = 0 φ ' ' ( u ) = θ e u 1 - θ + θ e u - ( θ e u ) 2 ( 1 - θ + θ e u ) 2 = θ e u 1 - θ + θ e u ( 1 - θ e u 1 - θ + θ e u ) = ρ ( 1 - ρ ) .

Now, φ ' ' ( u ) is maximized by

ρ = θ e u 1 - θ + θ e u = 1 2 φ ' ' ( u ) 1 4 .

So,

φ ( u ) u 2 8 = s 2 ( b - a ) 2 8
E [ e s Z ] e s 2 ( b - a ) 2 8 .

Now, we can apply this upper bound to derive Hoeffding's inequality.

P ( S n - E [ S n ] t ) e - s t i = 1 n E [ e s ( L i - E [ L i ] ) ] e - s t i = 1 n e s 2 ( b i - a i ) 2 8 = e - s t e s 2 i = 1 n ( b i - a i ) 2 8 = e - 2 t 2 i = 1 n ( b i - a i ) 2 by choosing s = 4 t i = 1 n ( b i - a i ) 2

Similarly, P ( E [ S n ] - S n t ) e - 2 t 2 i = 1 n ( b i - a i ) 2 . This completes the proof of the Hoeffding's theorem.

Application

Let Z i = 1 f ( X i ) Y i - R ( f ) , as in the classification problem. Then for a fixed f, it follows fromHoeffding's inequality (i.e., Chernoff's bound in this special case) that

P ( | R n ^ ( f ) - R ( f ) | ϵ ) = P 1 n | S n - E [ S n ] | ϵ = P ( | S n - E [ S n ] | n ϵ ) 2 e - 2 ( n ϵ ) 2 n = 2 e - 2 n ϵ 2 .

Now, we want a bound like this to hold uniformly for all f F . Assume that F is a finite collection of models and let | F | denote its cardinality. We would like to bound the probability that max f F | R n ^ ( f ) - R ( f ) | ϵ . Note that the event

max f F | R n ^ ( f ) - R ( f ) | ϵ f F | R n ^ ( f ) - R ( f ) | ϵ .

Therefore

P max f F | R n ^ ( f ) - R ( f ) | ϵ = P f F | R n ^ ( f ) - R ( f ) | ϵ f F P ( | R n ^ ( f ) - R ( f ) | ϵ ) , the `` union of events '' bound 2 | F | e - 2 n ϵ 2 , by Hoeffding's inequality.

Thus, we have shown that with probability at least 1 - 2 | F | e - 2 n ϵ 2 , f F

| R n ^ ( f ) - R ( f ) | < ϵ .

And accordingly, we can be reasonably confident in selecting f from F based on the empirical risk function R ^ n .

Get Jobilize Job Search Mobile App in your pocket Now!

Get it on Google Play Download on the App Store Now




Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'Statistical learning theory' conversation and receive update notifications?

Ask