The Kullback-Leibler (KL) divergence is

$$K(P_1\|P_0) = \int \log\frac{p_1(Z)}{p_0(Z)}\, p_1(Z)\, d\nu(Z) = \int \log\frac{p_1}{p_0}\, p_1\,.$$

The following Lemma relates total variation, affinity and KL divergence.

Lemma

$$1 - V(P_0,P_1) \;\ge\; \tfrac12\, A^2(P_0,P_1) \;\ge\; \tfrac12 \exp\big(-K(P_1\|P_0)\big)$$

For the first inequality,

$$
\begin{aligned}
A^2(P_0,P_1) &= \left(\int \sqrt{p_0\,p_1}\right)^2
= \left(\int \sqrt{\min(p_0,p_1)\,\max(p_0,p_1)}\right)^2\\
&= \left(\int \sqrt{\min(p_0,p_1)}\,\sqrt{\max(p_0,p_1)}\right)^2\\
&\le \int \min(p_0,p_1)\int \max(p_0,p_1) \qquad\text{by the Cauchy-Schwarz inequality}\\
&= \int \min(p_0,p_1)\,\Big(2 - \int \min(p_0,p_1)\Big)
\qquad\text{since } \int \min(p_0,p_1) + \int \max(p_0,p_1) = \int (p_0 + p_1) = 2\\
&\le 2\int \min(p_0,p_1) = 2\big(1 - V(P_0,P_1)\big)\,.
\end{aligned}
$$

For the second inequality,

$$
\begin{aligned}
A^2(P_0,P_1) &= \left(\int \sqrt{p_0\,p_1}\right)^2
= \left(\exp\log\int \sqrt{p_0\,p_1}\right)^2
= \exp\left(2\log\int \sqrt{p_0\,p_1}\right)\\
&= \exp\left(2\log\int \sqrt{\frac{p_0}{p_1}}\;p_1\right)
\ \ge\ \exp\left(2\int \log\sqrt{\frac{p_0}{p_1}}\;p_1\right)
\qquad\text{by Jensen's inequality}\\
&= \exp\left(-\int \log\frac{p_1}{p_0}\,p_1\right)
= \exp\big(-K(P_1\|P_0)\big)\,.
\end{aligned}
$$
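As a quick numerical sanity check (an illustration only, not part of the argument), the chain of inequalities in the Lemma can be verified for a pair of simple discrete distributions, e.g. two Bernoulli laws:

```python
import numpy as np

# Two Bernoulli distributions P0 = Ber(0.3) and P1 = Ber(0.6),
# represented by their probability mass functions on {0, 1}.
p0 = np.array([0.7, 0.3])
p1 = np.array([0.4, 0.6])

V = 0.5 * np.sum(np.abs(p0 - p1))      # total variation V(P0, P1)
A2 = np.sum(np.sqrt(p0 * p1)) ** 2     # squared affinity A^2(P0, P1)
K = np.sum(p1 * np.log(p1 / p0))       # KL divergence K(P1 || P0)

# The Lemma asserts 1 - V >= A^2 / 2 >= exp(-K) / 2.
print(1 - V, A2 / 2, np.exp(-K) / 2)   # 0.7 >= 0.4545... >= 0.4126...
assert 1 - V >= A2 / 2 >= np.exp(-K) / 2
```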

Putting everything together, we now have the following Theorem:

Theorem

Let $\mathcal{F}$ be a class of models, and suppose we have observations $Z$ distributed according to $P_f$, $f \in \mathcal{F}$. Let $d(\hat f_n, f)$ be the performance measure of the estimator $\hat f_n(Z)$ relative to the true model $f$. Assume also that $d(\cdot,\cdot)$ is a semi-distance. Let $f_0, f_1 \in \mathcal{F}$ be s.t. $d(f_0, f_1) \ge 2 s_n$. Then

$$
\inf_{\hat f_n}\ \sup_{f\in\mathcal F} P_f\big(d(\hat f_n, f)\ge s_n\big)
\ \ge\ \inf_{\hat f_n}\ \max_{j\in\{0,1\}} P_{f_j}\big(d(\hat f_n, f_j)\ge s_n\big)
\ \ge\ \tfrac14 \exp\big(-K(P_{f_1}\|P_{f_0})\big)
$$

How do we use this theorem?

Choose $f_0, f_1$ such that $K(P_1\|P_0) \le \alpha$; then $P_{e,1}$ is bounded away from $0$ and we get a bound

$$\inf_{\hat f_n}\ \sup_{f\in\mathcal F} P_f\big(d(\hat f_n, f)\ge s_n\big) \ \ge\ c \ >\ 0$$

or, after applying Markov's inequality (using $E_f[d(\hat f_n, f)] \ge s_n\, P_f(d(\hat f_n, f)\ge s_n)$),

$$\inf_{\hat f_n}\ \sup_{f\in\mathcal F} E_f\big[d(\hat f_n, f)\big] \ \ge\ c\, s_n$$

To apply the theorem, we need to design $f_0, f_1$ s.t. $d(f_0, f_1) \ge 2 s_n$ and $\exp(-K(P_{f_1}\|P_{f_0}))$ is bounded away from $0$. To reiterate, the design of $f_0, f_1$ requires careful construction so as to balance the tradeoff between the first condition, which requires $f_0, f_1$ to be far apart, and the second condition, which requires $f_0, f_1$ to be close to each other.
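As a small illustration of this recipe (a sketch only; the helper name and interface below are ours, not from the notes), the two quantities we must control translate directly into the bounds of the Theorem:

```python
import math

def two_point_lower_bounds(separation, kl):
    """Bounds implied by the two-point argument above (illustrative helper).

    separation -- d(f0, f1); the theorem applies with s_n = separation / 2
    kl         -- K(P_{f1} || P_{f0}) between the corresponding observation laws
    """
    s_n = separation / 2.0
    prob_bound = 0.25 * math.exp(-kl)   # inf sup P_f(d >= s_n) >= (1/4) e^{-K}
    exp_bound = s_n * prob_bound        # Markov: E_f[d] >= s_n * P_f(d >= s_n)
    return s_n, prob_bound, exp_bound
```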

Let's use this theorem in a problem we are familiar with. Let $X \in [0,1]$ and $Y \mid X = x \sim \text{Bernoulli}(\eta(x))$, where $\eta(x) = P(Y = 1 \mid X = x)$.

Suppose $G^* = [t^*, 1]$. We proved that under these assumptions and an upper bound on the density of $X$, the Chernoff bounding technique yielded an expected error rate for ERM

$$E\big[R(\hat G_n) - R^*\big] = O\!\left(\sqrt{\frac{\log n}{n}}\right)$$

Is this the best possible rate?

Construct two models in the above class (denote it by $\mathcal P$), $P_{XY}^{(0)}$ and $P_{XY}^{(1)}$. For both take $P_X \sim \text{Uniform}([0,1])$ and $\eta^{(0)} \equiv 1/2 - a$, $\eta^{(1)} \equiv 1/2 + a$ ($a > 0$), so $G_0^* = \emptyset$, $G_1^* = [0,1]$.

We are interested in controlling the excess risk

$$R(\hat G_n) - R(G^*) = \int_{\hat G_n \Delta G^*} |2\eta(x) - 1|\; dP_X(x)$$

Note that if the true underlying model is either $P_{XY}^{(0)}$ or $P_{XY}^{(1)}$, we have:

$$R_j(\hat G_n) - R_j(G_j^*) = \int_{\hat G_n \Delta G_j^*} |2\eta^{(j)}(x) - 1|\; dx = 2a \int_{\hat G_n \Delta G_j^*} dx = 2a\, d_\Delta(\hat G_n, G_j^*)\,,$$

where $d_\Delta(G, G')$ denotes the Lebesgue measure of the symmetric difference $G \Delta G'$.

Proposition 1

$d_\Delta(\cdot,\cdot)$ is a semi-distance.

It suffices to show that $d(G_1, G_2) = d(G_2, G_1) \ge 0$, $d(G, G) = 0$ for all $G$, and $d(G_1, G_2) \le d(G_1, G_3) + d(G_3, G_2)$. The first two statements are obvious. The last one (the triangle inequality) follows from the fact that $G_1 \Delta G_2 \subseteq (G_1 \Delta G_3) \cup (G_3 \Delta G_2)$.

Suppose this were not the case; then there exists $x \in G_1 \Delta G_2$ s.t. $x \notin G_1 \Delta G_3$ and $x \notin G_2 \Delta G_3$. In other words,

$$x \in (G_1 \Delta G_2) \cap (G_1 \Delta G_3)^c \cap (G_2 \Delta G_3)^c$$

Since $S \Delta T = (S \cap T^c) \cup (S^c \cap T)$, we have:

$$
\begin{aligned}
x &\in \big[(G_1 \cap G_2^c) \cup (G_1^c \cap G_2)\big]
\cap \big[(G_1^c \cup G_3) \cap (G_1 \cup G_3^c)\big]
\cap \big[(G_2^c \cup G_3) \cap (G_2 \cup G_3^c)\big]\\
&\subseteq \big[G_1 \cap (G_1^c \cup G_3) \cap G_2^c \cap (G_2 \cup G_3^c)\big]
\cup \big[G_1^c \cap (G_1 \cup G_3^c) \cap G_2 \cap (G_2^c \cup G_3)\big]\\
&= \big[G_1 \cap G_3 \cap G_2^c \cap G_3^c\big]
\cup \big[G_1^c \cap G_3^c \cap G_2 \cap G_3\big] = \emptyset\,,
\end{aligned}
$$

a contradiction.
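Although it plays no role in the proof, the triangle inequality for $d_\Delta$ is also easy to check numerically on random subsets of a discretized $[0,1]$ (a sketch; the grid size and the use of a normalized counting measure are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Subsets of a discretized [0, 1] are represented as boolean masks; the mean of
# the XOR is the normalized counting measure of the symmetric difference.
def d_delta(A, B):
    return np.mean(A ^ B)

for _ in range(1000):
    G1, G2, G3 = (rng.random(200) < 0.5 for _ in range(3))
    # Triangle inequality: d(G1, G2) <= d(G1, G3) + d(G3, G2)
    assert d_delta(G1, G2) <= d_delta(G1, G3) + d_delta(G3, G2) + 1e-12
```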

Let's look at the first reduction step:

$$
\inf_{\hat G_n}\ \sup_{p\in\mathcal P} P\big(R(\hat G_n) - R(G^*) \ge s_n\big)
\ \ge\ \inf_{\hat G_n}\ \max_{j\in\{0,1\}} P_j\big(R_j(\hat G_n) - R_j(G_j^*) \ge s_n\big)
\ =\ \inf_{\hat G_n}\ \max_{j\in\{0,1\}} P_j\big(d_\Delta(\hat G_n, G_j^*) \ge s_n/(2a)\big)
$$

So we can work out a bound on $d_\Delta$ and then translate it to excess risk.

Let's apply Theorem 1. Note that $d_\Delta(G_0^*, G_1^*) = 1$ and let $P_0 \triangleq P^{(0)}_{X_1,Y_1,\dots,X_n,Y_n}$ and $P_1 \triangleq P^{(1)}_{X_1,Y_1,\dots,X_n,Y_n}$.

$$
\begin{aligned}
K(P_1\|P_0)
&= E_1\!\left[\log \frac{p^{(1)}_{X_1,Y_1,\dots,X_n,Y_n}(X_1,Y_1,\dots,X_n,Y_n)}{p^{(0)}_{X_1,Y_1,\dots,X_n,Y_n}(X_1,Y_1,\dots,X_n,Y_n)}\right]
= E_1\!\left[\log \frac{p^{(1)}_{X_1,Y_1}(X_1,Y_1)\cdots p^{(1)}_{X_n,Y_n}(X_n,Y_n)}{p^{(0)}_{X_1,Y_1}(X_1,Y_1)\cdots p^{(0)}_{X_n,Y_n}(X_n,Y_n)}\right]\\
&= \sum_{i=1}^n E_1\!\left[\log \frac{p^{(1)}_{X_i,Y_i}(X_i,Y_i)}{p^{(0)}_{X_i,Y_i}(X_i,Y_i)}\right]
= n\, E_1\!\left[\log \frac{p^{(1)}_{Y|X}(Y_1|X_1)}{p^{(0)}_{Y|X}(Y_1|X_1)}\right]
\end{aligned}
$$
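Evaluating the remaining single-sample expectation for the construction above is a direct computation (the marginal of $X$ is uniform under both models and cancels in the ratio):

$$
E_1\!\left[\log \frac{p^{(1)}_{Y|X}(Y_1|X_1)}{p^{(0)}_{Y|X}(Y_1|X_1)}\right]
= \left(\tfrac12 + a\right)\log\frac{\tfrac12 + a}{\tfrac12 - a}
+ \left(\tfrac12 - a\right)\log\frac{\tfrac12 - a}{\tfrac12 + a}
= 2a \log\frac{1 + 2a}{1 - 2a}\,,
$$

so that $K(P_1\|P_0) = 2na \log\frac{1+2a}{1-2a}$.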

Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3