0.9 Error bounds in countably infinite spaces


Introduction

In the last lecture, we studied bounds of the following form: for any $δ > 0$, with probability at least $1 - δ$,

R(f) \leq \hat{R}_{n}(f) + \sqrt{\frac{\log|F| + \log(\frac{1}{δ})}{2n}}, \quad \forall f \in F

which led to upper bounds on the estimation error of the form

E[R(\hat{f}_{n})] - \min_{f \in F} R(f) \leq \sqrt{\frac{\log|F| + \log(n) + 2}{n}} .

The key assumptions made in deriving the error bounds were:

(i) bounded loss function
(ii) finite collection of candidate functions

The bounds are valid for every $P_{X Y}$ and are called distribution-free.

Deriving bounds for countably infinite spaces

In this lecture we will generalize the previous results in a powerful way by developing bounds applicable to possibly infinite collections of candidates. To start, let us suppose that $F$ is a countable, possibly infinite, collection of candidate functions. Assign a positive number $c(f)$ to each $f \in F$, such that

\sum_{f \in F} e^{- c (f)} < \infty .

The numbers $c(f)$ can be interpreted as

(i) measures of complexity
(ii) -log of prior probabilities
(iii) codelengths

In particular, if $p(f)$ is the prior probability of $f$, then

e^{-(-\log p(f))} = p(f)

so $c(f) \equiv -\log p(f)$ produces

\sum_{f \in F} e^{- c (f)} = \sum_{f \in F} p (f) = 1 .
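As a small numerical illustration of this correspondence, the sketch below uses a hypothetical geometric prior $p(f_k) = 2^{-k}$ over candidates indexed $k = 1, 2, \ldots$ (the prior, the truncation level `K`, and the function name are assumptions made for the example, not part of the lecture). It converts the prior to complexities $c(f) = -\log p(f)$ and checks that the partial sums of $e^{-c(f)}$ stay below 1.

```python
import math

def complexity_from_prior(p):
    """Return c(f) = -log p(f) for a prior probability p in (0, 1]."""
    return -math.log(p)

# Hypothetical prior over candidates f_1, f_2, ...: p(f_k) = 2^{-k}.
K = 50  # truncation level, chosen only for illustration
complexities = [complexity_from_prior(2.0 ** -k) for k in range(1, K + 1)]

# Partial sum of e^{-c(f)}; it equals 1 - 2^{-K} and approaches 1 as K grows.
partial_sum = sum(math.exp(-c) for c in complexities)
```

Candidates that the prior considers unlikely receive large values of $c(f)$, which is why $c(f)$ can be read as a complexity.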

Now recall Hoeffding's inequality. For each fixed $f$ and every $ϵ > 0$,

P(R(f) - \hat{R}_{n}(f) \geq ϵ) \leq e^{-2nϵ^{2}}

or, equivalently, setting $δ = e^{-2nϵ^{2}}$, for every $δ > 0$

P(R(f) - \hat{R}_{n}(f) \geq \sqrt{\frac{\log(\frac{1}{δ})}{2n}}) \leq δ .
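The two forms of Hoeffding's inequality are related by solving $e^{-2nϵ^{2}} = δ$ for $ϵ$. A minimal sketch of that conversion follows; the function names and the sample values of `n` and `delta` are illustrative assumptions.

```python
import math

def hoeffding_tail(n, eps):
    """Right-hand side of Hoeffding's inequality: exp(-2 n eps^2)."""
    return math.exp(-2.0 * n * eps ** 2)

def hoeffding_epsilon(n, delta):
    """Deviation eps solving exp(-2 n eps^2) = delta, i.e. sqrt(log(1/delta) / (2n))."""
    if n <= 0 or not (0.0 < delta <= 1.0):
        raise ValueError("need n > 0 and 0 < delta <= 1")
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# Example values, assumed for illustration only.
eps = hoeffding_epsilon(n=1000, delta=0.05)
```

Plugging `eps` back into `hoeffding_tail` recovers `delta`, which confirms that the two statements are the same inequality written in two ways.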

Suppose $δ > 0$ is specified. Using the values $c(f)$ for $f \in F$, define

δ(f) = e^{-c(f)} δ .

Then we have

P(R(f) - \hat{R}_{n}(f) \geq \sqrt{\frac{\log(\frac{1}{δ(f)})}{2n}}) \leq δ(f) .

Furthermore, we can apply the union bound as follows:

\begin{aligned} P\left(\sup_{f \in F} \left\{ R(f) - \hat{R}_{n}(f) - \sqrt{\frac{\log(1/δ(f))}{2n}} \right\} \geq 0\right) & \leq P\left(\bigcup_{f \in F} \left\{ R(f) - \hat{R}_{n}(f) \geq \sqrt{\frac{\log(1/δ(f))}{2n}} \right\}\right) \\ & \leq \sum_{f \in F} P\left(R(f) - \hat{R}_{n}(f) \geq \sqrt{\frac{\log(1/δ(f))}{2n}}\right) \\ & \leq \sum_{f \in F} δ(f) = δ \sum_{f \in F} e^{-c(f)} \leq δ . \end{aligned}

The last step uses $\sum_{f \in F} e^{-c(f)} \leq 1$. If the sum is finite but exceeds 1, adding its logarithm to every $c(f)$ restores this normalization.

So for any $δ > 0$, with probability at least $1 - δ$, we have that $\forall f \in F$

\begin{aligned} R(f) & \leq \hat{R}_{n}(f) + \sqrt{\frac{\log(\frac{1}{δ(f)})}{2n}} \\ & = \hat{R}_{n}(f) + \sqrt{\frac{c(f) + \log(\frac{1}{δ})}{2n}} . \end{aligned}
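To get a feel for how the complexity term enters this bound, the sketch below simply evaluates its right-hand side. The candidate names, empirical risks, complexities, sample size, and $δ$ are made-up values, and `penalized_risk_bound` is a hypothetical helper name.

```python
import math

def penalized_risk_bound(empirical_risk, c, n, delta):
    """Evaluate R_hat_n(f) + sqrt((c(f) + log(1/delta)) / (2n))."""
    return empirical_risk + math.sqrt((c + math.log(1.0 / delta)) / (2.0 * n))

# Hypothetical candidates: (name, empirical risk, complexity c(f)).
candidates = [
    ("simple", 0.20, 1.0),
    ("medium", 0.15, 5.0),
    ("complex", 0.14, 40.0),
]
n, delta = 500, 0.05  # assumed sample size and confidence parameter

bounds = {name: penalized_risk_bound(r, c, n, delta) for name, r, c in candidates}
```

With these made-up numbers, the candidate with the lowest empirical risk has the largest bound, because its complexity term dominates.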

Special case

Suppose $F$ is finite and $c(f) = \log|F|$ for all $f \in F$. Then

\sum_{f \in F} e^{-c(f)} = \sum_{f \in F} e^{-\log|F|} = \sum_{f \in F} \frac{1}{|F|} = 1

and

δ (f) = \frac{δ}{| F |}

which implies that for any $δ > 0$, with probability at least $1 - δ$, we have

R(f) \leq \hat{R}_{n}(f) + \sqrt{\frac{\log|F| + \log(\frac{1}{δ})}{2n}}, \quad \forall f \in F .

Note that this is precisely the bound we derived in the last lecture.
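The substitution behind this special case can be written out in one line. It only restates the definitions above, using $δ(f) = δ / |F|$:

```latex
\log\frac{1}{δ(f)} = \log\frac{|F|}{δ} = \log|F| + \log\frac{1}{δ}
\quad\Longrightarrow\quad
\sqrt{\frac{\log(1/δ(f))}{2n}} = \sqrt{\frac{\log|F| + \log(1/δ)}{2n}} .
```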

Choosing $c(f)$

The generalized bounds allow us to handle countably infinite collections of candidate functions, but we require that

\sum_{f \in F} e^{- c (f)} < \infty .

Of course, if $c(f) = -\log p(f)$, where $p(f)$ is a proper prior probability distribution, then we have

\sum_{f \in F} e^{- c (f)} = 1 .

However, it may be difficult to design a probability distribution over an infinite class of candidates. The coding perspective provides a very practical means to this end.

Assume that we have assigned a uniquely decodable binary code to each $f \in F$, and let $c(f)$ denote the codelength for $f$. That is, the code for $f$ is $c(f)$ bits long. A very useful class of uniquely decodable codes is the class of prefix codes.

Prefix Code

A code is called a prefix code if no codeword is a prefix of any other codeword.
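The definition can be checked directly for a finite set of codewords. The sketch below represents codewords as strings of `0` and `1` and compares every pair; the function name and the two example codes are illustrative assumptions.

```python
def is_prefix_code(codewords):
    """Return True if no codeword is a prefix of another codeword in the list."""
    for i, a in enumerate(codewords):
        for j, b in enumerate(codewords):
            # Compare distinct positions only; a repeated codeword also fails.
            if i != j and b.startswith(a):
                return False
    return True

# "0" is a prefix of "01", so the second code is not a prefix code.
good_code = ["0", "10", "110", "111"]
bad_code = ["0", "01", "11"]
```

The pairwise check is quadratic in the number of codewords, which is fine for small illustrative codes.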

The Kraft inequality

For any binary prefix code, the codeword lengths $c_{1}$ , $c_{2}$ , ... satisfy

\sum_{i = 1}^{\infty} 2^{- c_{i}} \leq 1 .

Conversely, given any $c_{1}$, $c_{2}$, ... satisfying the inequality above, we can construct a prefix code with these codeword lengths. We will prove this result a bit later, but now let's see how this is useful in our learning problem.
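The converse direction can be made concrete for a finite list of lengths. The sketch below computes the Kraft sum exactly and, when it is at most 1, builds codewords by a standard "canonical" assignment: sort the lengths, then count upward in binary, shifting left whenever the length increases. This is one common construction and not necessarily the argument used in the proof given later; the function names are illustrative.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Exact value of sum_i 2^{-c_i} for a finite list of positive integer lengths."""
    return sum(Fraction(1, 2 ** c) for c in lengths)

def prefix_code_from_lengths(lengths):
    """Build a binary prefix code whose i-th codeword has length lengths[i]."""
    if any(c < 1 for c in lengths):
        raise ValueError("codeword lengths must be positive integers")
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    # Process indices from shortest to longest codeword.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    value, prev_len = 0, 0
    for i in order:
        c = lengths[i]
        value <<= c - prev_len  # extend the running counter to c bits
        codewords[i] = format(value, "0{}b".format(c))
        value += 1              # next codeword of this length
        prev_len = c
    return codewords

lengths = [3, 1, 3, 2]
code = prefix_code_from_lengths(lengths)
```

For the lengths above the Kraft sum is exactly 1, so the code is complete: no further codeword could be added without breaking the prefix property.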

Assume that we have assigned a binary prefix codeword to each $f \in F$, and let $c(f)$ denote the bit-length of the codeword for $f$. Set $δ(f) = 2^{-c(f)} δ$. Then, by the union bound, Hoeffding's inequality, and the Kraft inequality,

\begin{aligned} P\left(\bigcup_{f \in F} \left\{ R(f) - \hat{R}_{n}(f) \geq \sqrt{\frac{\log(1/δ(f))}{2n}} \right\}\right) & \leq \sum_{f \in F} P\left(R(f) - \hat{R}_{n}(f) \geq \sqrt{\frac{\log(1/δ(f))}{2n}}\right) \\ & \leq \sum_{f \in F} δ(f) = δ \sum_{f \in F} 2^{-c(f)} \leq δ . \end{aligned}
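Expanding $\log(1/δ(f))$ for this choice of $δ(f)$ shows how the codelength in bits enters the bound. The step below follows from the definition $δ(f) = 2^{-c(f)} δ$ and the complement of the event bounded above:

```latex
\log\frac{1}{δ(f)} = c(f)\log 2 + \log\frac{1}{δ}
\quad\Longrightarrow\quad
R(f) \leq \hat{R}_{n}(f) + \sqrt{\frac{c(f)\log 2 + \log(1/δ)}{2n}}, \quad \forall f \in F,
\ \text{with probability at least } 1 - δ .
```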


Source: OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3
