0.19 Applications of VC bounds


Linear classifiers

Suppose $F$ = {linear classifiers in $R^{d}$}. Then we have

V_{F} = d + 1, {\hat{f}}_{n} = arg min_{f \in F} {\hat{R}}_{n} (f)

E [R ({\hat{f}}_{n})] - inf_{f \in F} R (f) \leq 4 \sqrt{\frac{(d + 1) log (n + 1) + log 2}{n}} .
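As a rough numerical illustration of the bound above, the following Python sketch evaluates its right-hand side for a given feature dimension and sample size. The function names are hypothetical, and `log` is assumed to be the natural logarithm.

```python
import math


def vc_excess_risk_bound(vc_dim: int, n: int) -> float:
    """Evaluate 4 * sqrt((V * log(n + 1) + log 2) / n) for VC dimension V."""
    if vc_dim < 1 or n < 1:
        raise ValueError("vc_dim and n must be positive integers")
    return 4.0 * math.sqrt((vc_dim * math.log(n + 1) + math.log(2)) / n)


def linear_classifier_bound(d: int, n: int) -> float:
    """Bound for linear classifiers in R^d, using V_F = d + 1."""
    return vc_excess_risk_bound(d + 1, n)


if __name__ == "__main__":
    # The bound shrinks as the sample size grows and grows with the dimension.
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, round(linear_classifier_bound(2, n), 4))
```

Note that for small $n$ the value exceeds 1, in which case the bound is vacuous, since the excess risk never exceeds 1.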

Generalized linear classifiers

Normally, we have a feature vector $X \in R^{d}$ . A hyperplane in $R^{d}$ provides a linear classifier in $R^{d}$ . Nonlinear classifiers can be obtained by a straightforward generalization.

Let $φ_{1}, \dots, φ_{d^{'}}$, with $d^{'} \geq d$, be a collection of functions mapping $R^{d} \to R$. These functions, applied to a feature vector $X \in R^{d}$, produce a generalized set of features, $φ = {(φ_{1} (X), φ_{2} (X), \dots, φ_{d^{'}} (X))}^{T}$. For example, if $X = {(x_{1}, x_{2})}^{T}$, then we could consider $d^{'} = 5$ and $φ = {(x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2})}^{T} \in R^{5}$. We can then construct a linear classifier in the higher dimensional generalized feature space $R^{d^{'}}$.

The VC bounds immediately extend to this case, and we have for $F^{'}$ = {generalized linear classifiers based on maps $φ : R^{d} \to R^{d^{'}}$},

E [R ({\hat{f}}_{n})] - inf_{f \in F^{'}} R (f) \leq 4 \sqrt{\frac{(d^{'} + 1) log (n + 1) + log 2}{n}} .
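To make the generalized feature construction concrete, here is a minimal Python sketch of the quadratic feature map from the example above, followed by a linear rule in the five-dimensional feature space. The function names, the weight vector, and the offset are illustrative assumptions, not part of the original notes.

```python
def quadratic_features(x):
    """Map x = (x1, x2) to (x1, x2, x1*x2, x1^2, x2^2), the example map with d' = 5."""
    x1, x2 = x
    return [x1, x2, x1 * x2, x1 ** 2, x2 ** 2]


def generalized_linear_classify(x, w, b):
    """Return 1 if w . phi(x) + b >= 0 and 0 otherwise (linear in phi, nonlinear in x)."""
    phi = quadratic_features(x)
    score = sum(wi * pi for wi, pi in zip(w, phi)) + b
    return 1 if score >= 0 else 0


if __name__ == "__main__":
    # Hypothetical weights: the rule x1^2 + x2^2 - 1 >= 0 labels points
    # outside the unit circle as 1, a boundary no hyperplane in R^2 can give.
    w = [0.0, 0.0, 0.0, 1.0, 1.0]
    b = -1.0
    print(generalized_linear_classify((0.0, 0.0), w, b))
    print(generalized_linear_classify((2.0, 0.0), w, b))
```

The classifier is linear in $φ$, so the bound with $d^{'} + 1$ in place of $d + 1$ applies even though the decision boundary in the original space is a circle.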

Half-space classifiers

Theorem

Steele '75, Dudley '78

Let $G$ be a finite-dimensional vector space of real-valued functions on $R^{d}$. The class of sets $A = \{\{x : g (x) \geq 0\} : g \in G\}$ has VC dimension $\leq \dim (G)$.

It is sufficient to show that no set of $n = \dim (G) + 1$ points can be shattered by $A$. Take any $n$ points $x_{1}, \dots, x_{n}$ and for each $g \in G$, define the vector $V_{g} = (g (x_{1}), \dots, g (x_{n}))$.

The set $\{V_{g} : g \in G\}$ is a linear subspace of $R^{n}$ of dimension $\leq \dim (G) = n - 1$. Therefore, there exists a non-zero vector $α = (α_{1}, \dots, α_{n}) \in R^{n}$ orthogonal to this subspace, that is, $\sum_{i = 1}^{n} α_{i} g (x_{i}) = 0$ for every $g \in G$. We can assume that at least one of the $α_{i}$ is negative (if none is, replace $α$ by $- α$). We can then rearrange this expression as $\sum_{i : α_{i} \geq 0} α_{i} g (x_{i}) = \sum_{i : α_{i} < 0} - α_{i} g (x_{i})$.

Now suppose that there exists a $g \in G$ such that the set $\{x : g (x) \geq 0\}$ selects precisely the $x_{i}$ on the left-hand side above, that is, those with $α_{i} \geq 0$. Then all terms on the left are non-negative, while every term on the right is strictly negative, because $- α_{i} > 0$ and $g (x_{i}) < 0$ for each $i$ with $α_{i} < 0$. Since at least one $α_{i}$ is negative, the right-hand side is strictly negative, which is a contradiction. Therefore, $x_{1}, \dots, x_{n}$ cannot be shattered by the sets $\{x : g (x) \geq 0\}$, $g \in G$.

Consider half-spaces in $R^{d}$ of the form $A = \{x \in R^{d} : x_{i} \geq b\}$ for some $i \in \{1, \dots, d\}$ and $b \in R$. Each half-space can be described by

g (x) = [0, \dots, 0, 1, 0, \dots, 0] [\begin{matrix} x_{1} \\ ⋮ \\ x_{d} \end{matrix}] - b

\Rightarrow \dim (G) = d + 1, V_{A} \leq d + 1 .
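The shattering condition for these axis-aligned half-spaces can be checked by brute force on small point sets. The Python sketch below is only an illustration under stated assumptions: the helper names are hypothetical, and the candidate thresholds $b$ are restricted to the observed coordinate values plus one value above the maximum, which is enough to produce every distinct subset of the form $\{x : x_{i} \geq b\}$.

```python
def picked_subsets(points):
    """Collect every subset of `points` (as index sets) cut out by some {x : x[i] >= b}."""
    d = len(points[0])
    subsets = set()
    for i in range(d):
        coords = sorted({p[i] for p in points})
        # One threshold per observed value, plus one above the maximum (empty set).
        thresholds = coords + [coords[-1] + 1.0]
        for b in thresholds:
            subsets.add(frozenset(j for j, p in enumerate(points) if p[i] >= b))
    return subsets


def is_shattered(points):
    """True if all 2^m subsets of the m points are picked out by the class."""
    return len(picked_subsets(points)) == 2 ** len(points)


if __name__ == "__main__":
    print(is_shattered([(1.0, 0.0), (0.0, 1.0)]))              # two points in R^2
    print(is_shattered([(0.0, 0.0), (1.0, 2.0), (2.0, 1.0)]))  # three points in R^2
```

In these small cases the number of points that can be shattered stays below $d + 1$, which is consistent with the upper bound $V_{A} \leq d + 1$.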

Tree classifiers

Let

T_{k} = \{\text{recursive rectangular partitions of } R^{d} \text{ with } k + 1 \text{ cells}\}

Let $T \in T_{k}$. Each cell of $T$ results from splitting a rectangular region into two smaller rectangles parallel to one of the coordinate axes.

(Figure: an example partition $T \in T_{3}$ for $d = 2$.)

Each additional split is analogous to a half-space set. Therefore, each additional split can potentially shatter $d + 1$ points. This implies that

V_{T_{k}} \leq (d + 1) k .

Example with $d = 1$: $k = 1$ split shatters two points, and $k = 2$ splits shatter three points, fewer than the bound $(d + 1) k = 4$.
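The one-dimensional counts can be verified by enumeration. In $d = 1$, $k$ splits cut the line into $k + 1$ intervals, each with its own label, so a labeling of distinct sorted points is realizable exactly when it changes label at most $k$ times. The Python sketch below relies on that observation; the function names and the search limit are assumptions made for illustration.

```python
from itertools import product


def label_changes(labels):
    """Count positions where consecutive labels differ."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def splits_shatter(num_points, k):
    """True if k splits on the line realize every labeling of num_points sorted points."""
    return all(
        label_changes(labels) <= k
        for labels in product((0, 1), repeat=num_points)
    )


def max_shattered(k, limit=12):
    """Largest number of points (up to `limit`) shattered by k splits in d = 1."""
    m = 0
    while m < limit and splits_shatter(m + 1, k):
        m += 1
    return m


if __name__ == "__main__":
    for k in (1, 2, 3):
        # Compare the exact count with the bound (d + 1) * k = 2 * k.
        print(k, max_shattered(k), 2 * k)
```

For $k \geq 2$ the exact count $k + 1$ is strictly smaller than $(d + 1) k = 2 k$, so the bound $V_{T_{k}} \leq (d + 1) k$ is not tight in this case.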

VC bound for tree classifiers

F_{k} = \{\text{tree classifiers with } k + 1 \text{ leaves on } R^{d}\}

E [R ({\hat{f}}_{n})] - inf_{f \in F_{k}} R (f) \leq 4 \sqrt{\frac{(d + 1) k log n + log 2}{n}} .

How can we decide what dimension to choose for a generalized linear classifier?

How many leaves should be used for a classification tree?

Complexity Regularization using VC bounds!

Structural risk minimization (SRM)

SRM is simply complexity regularization using VC type bounds in place of Chernoff's bound or other concentration inequalities.

The basic idea is to consider a sequence of sets of classifiers $F_{1}, F_{2}, ...,$ of increasing VC dimensions $V_{F_{1}} \leq V_{F_{2}} \leq ...$ . Then for each $k = 1, 2, ...$ we find the minimum empirical risk classifier

{\hat{f}}_{n}^{(k)} = arg min_{f \in F_{k}} {\hat{R}}_{n} (f)

and then select the final classifier according to

\hat{k} = arg min_{k \geq 1} \{{\hat{R}}_{n} ({\hat{f}}_{n}^{(k)}) + \sqrt{\frac{32 V_{F_{k}} (log n + 1)}{n}}\}

and ${\hat{f}}_{n} \equiv {\hat{f}}_{n}^{(\hat{k})}$ is the final choice.
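The selection step can be sketched in a few lines of Python. This is a minimal illustration, not the notes' own implementation: it assumes the minimum empirical risks for each class have already been computed, reads the penalty factor as $\log (n) + 1$ with the natural logarithm, and uses hypothetical function names and made-up input values.

```python
import math


def srm_penalty(vc_dim, n):
    """Penalty sqrt(32 * V * (log n + 1) / n), reading (log n + 1) as log(n) + 1."""
    return math.sqrt(32.0 * vc_dim * (math.log(n) + 1.0) / n)


def srm_select(empirical_risks, vc_dims, n):
    """Return (k_hat, scores), where k_hat (1-based) minimizes risk + penalty.

    empirical_risks[k - 1] is the minimum empirical risk over class F_k and
    vc_dims[k - 1] is the VC dimension (or an upper bound on it) of F_k.
    """
    scores = [r + srm_penalty(v, n) for r, v in zip(empirical_risks, vc_dims)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores


if __name__ == "__main__":
    # Made-up numbers: richer classes fit better but pay a larger penalty.
    risks = [0.40, 0.10, 0.09]
    dims = [2, 4, 40]
    k_hat, scores = srm_select(risks, dims, n=10**6)
    print(k_hat, [round(s, 4) for s in scores])
```

The small gain in empirical risk from the richest class is outweighed by its penalty, so the middle class is selected in this example.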

The basic rationale is that we know

R_{n} ({\hat{f}}_{n}^{(k)}) - inf_{f \in F_{k}} R (f) \leq C^{'} \sqrt{\frac{V_{F_{k}} log n}{n}}

where $C^{'}$ is a constant.

The end result is that

E [R ({\hat{f}}_{n})] \leq min_{k \geq 1} \{min_{f \in F_{k}} R (f) + 16 \sqrt{\frac{V_{F_{k}} log n + 4}{2 n}}\}

analogous to our previous complexity regularization results, except that codelengths are replaced by VC dimensions.

In order to prove the result we use the VC probability concentration bound and assume that $\Delta = \sum_{k \geq 1} e^{- V_{F_{k}}} < \infty$. This enables a union bounding argument and leads to a risk bound of the form given above.

Key point of VC theory

Complexity of classes depends on richness (shattering capability) relative to a set of $n$ arbitrary points. This allows us to effectively "quantize" collections of functions in a slightly data-dependent manner.

Application to trees

Let

F_{k} = \{k\text{-leaf decision trees in } R^{d}\}, V_{F_{k}} \leq (d + 1) (k - 1)

{\hat{f}}_{n}^{(k)} = arg min_{f \in F_{k}} {\hat{R}}_{n} (f)

\hat{k} = arg min_{k \geq 1} (min_{f \in F_{k}} {\hat{R}}_{n} (f) + \sqrt{\frac{32 (d + 1) (k - 1) (log n + 1)}{n}})

Then

{\hat{f}}_{n} = {\hat{f}}_{n}^{(\hat{k})}

satisfies

E [R ({\hat{f}}_{n})] \leq min_{k \geq 1} (min_{f \in F_{k}} R (f) + 16 \sqrt{\frac{(d + 1) (k - 1) log n + 4}{2 n}})

compare with

E [R ({\hat{f}}_{n})] \leq min_{k \geq 1} (min_{f \in \text{dyadic } k\text{-leaf trees}} R (f) + \sqrt{\frac{(3 k - 1) log 2 + \frac{1}{2} log n}{2 n}})

from Lecture 11.
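To get a feel for how the two complexity terms compare, the following Python sketch evaluates both for the same $k$ and $n$. The function names are hypothetical, the logarithms are assumed to be natural, and the chosen values of $d$, $k$, and $n$ are arbitrary. Keep in mind that the two bounds minimize over different classes (general $k$-leaf trees versus dyadic $k$-leaf trees), so the penalty terms alone do not settle which bound is smaller overall.

```python
import math


def vc_tree_penalty(d, k, n):
    """Second term of the SRM bound: 16 * sqrt(((d + 1)(k - 1) log n + 4) / (2n))."""
    return 16.0 * math.sqrt(((d + 1) * (k - 1) * math.log(n) + 4.0) / (2.0 * n))


def dyadic_tree_penalty(k, n):
    """Second term of the dyadic bound: sqrt(((3k - 1) log 2 + 0.5 log n) / (2n))."""
    return math.sqrt(((3 * k - 1) * math.log(2) + 0.5 * math.log(n)) / (2.0 * n))


if __name__ == "__main__":
    d, k = 2, 4
    for n in (10**4, 10**6, 10**8):
        print(n, round(vc_tree_penalty(d, k, n), 5), round(dyadic_tree_penalty(k, n), 5))
```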


Source: OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3
