<< Chapter < Page | Chapter >> Page > |
Suppose $\mathcal{F}$ = {linear classifiers in ${\mathbf{R}}^{d}$ }, then we have
Normally, we have a feature vector $X\in {\mathbf{R}}^{d}$ . A hyperplane in ${\mathbf{R}}^{d}$ provides a linear classifier in ${\mathbf{R}}^{d}$ . Nonlinear classifiers can be obtained by a straightforward generalization.
Let ${\phi}_{1},\cdots ,{\phi}_{{d}^{{}^{\text{'}}}}$ , ${d}^{{}^{\text{'}}}\ge d$ be a collection of functions mapping ${\mathbf{R}}^{d}\to \mathbf{R}$ . These functions, applied to a feature $X\in {\mathbf{R}}^{d}$ , produce a generalized set of features, $\phi ={({\phi}_{1}\left(X\right),{\phi}_{2}\left(X\right),\cdots ,{\phi}_{{d}^{\text{'}}}\left(X\right))}^{\text{'}}$ . For example, if $X={({x}_{1},{x}_{2})}^{\text{'}}$ , then we could consider ${d}^{\text{'}}=S$ and $\phi ={({x}_{1},{x}_{2},{x}_{1}{x}_{2},{x}_{1}^{2},{x}_{2}^{2})}^{\text{'}}\in {\mathbf{R}}^{5}$ . We can then construct a linear classifier in the higher dimensional generalized feature space ${\mathbf{R}}^{{d}^{\text{'}}}$ .
The VC bounds immediately extend to this case, and we have for $\mathcal{F}$ ' = { generalized linear classifiers based on maps $\phi :{\mathbf{R}}^{d}\to {\mathbf{R}}^{{d}^{\text{'}}}$ },
Let $\mathcal{G}$ be a finite-dimensional vector space of real-valued functions on ${\mathbf{R}}^{d}$ . The class of sets $\mathcal{A}=\left\{\right\{x:g\left(x\right)\ge 0\}:g\in \mathcal{G}\}$ has VC dimension $\ge $ dim( $\mathcal{G}$ ).
It is sufficient to show that no set of $n=dim\left(\mathcal{G}\right)+1$ points can be shattered by $\mathcal{A}$ . Take any $n$ points and for each $g\in \mathcal{G}$ , define the vector ${V}_{g}=(g\left({x}_{1}\right),\cdots ,g\left({x}_{n}\right))$ .
The set $\{{V}_{g}:g\in \mathcal{G}\}$ is a linear subspace of ${\mathbf{R}}^{n}$ of dimension $\le $ dim ( $\mathcal{G}$ ) = $n-1$ . Therefore, there exists a non-zero vector $\alpha =({\alpha}_{1},\cdots ,{\alpha}_{n})\in {\mathbf{R}}^{n}$ such that ${\sum}_{i=1}^{n}{\alpha}_{i}g\left({x}_{i}\right)=0$ . We can assume that at least one of these ${\alpha}_{i}^{S}$ is negative (if all are positive, just negate the sum). We can then re-arrange thisexpression as ${\sum}_{i:{\alpha}_{i}\ge 0}{\alpha}_{i}g\left({x}_{i}\right)={\sum}_{i:{\alpha}_{i}<0}-{\alpha}_{i}g\left({x}_{i}\right)$ .
Now suppose that there exists a $g\in \mathcal{G}$ such that the set $\{x:g(x)\ge 0\}$ selects precisely the ${x}_{i}^{S}$ on the left-hand side above. Then all terms on the left are non-negative and allthe terms on the right are non-positive. Since $\alpha $ is non-zero, this is a contradiction. Therefore, ${x}_{1},\cdots ,{x}_{n}$ cannot be shattered by sets in $\{x:g(x)\ge 0\}$ , $g\in \mathcal{G}$ . 6.375pt0.0pt6.375pt
Consider half-spaces in ${\mathbf{R}}^{d}$ of the form $\mathcal{A}=\{x\in {\mathbf{R}}^{d}:{x}_{i}\ge b,i\in \{1,\cdots ,d\},b\in R\}$ . Each half-space can be described by
Let
Let $T\in {\mathcal{T}}_{k}$ . Each cell of $T$ results from splitting a rectangular region into two smaller rectangles parallel to one ofthe coordinate axes.
$T\in {\mathcal{T}}_{3}$ , $d=2$ .
Each additional split is analogous to a half-space set. Therefore, each additional split can potentially shatter $d+1$ points. This implies that
$d=1$ .
$k=1$ split shatters two points.
$k=2$ splits shatters three points $<4$ .
How can we decide what dimension to choose for a generalized linear classifier?
How many leafs should be used for a classification tree?
Complexity Regularization using VC bounds!
SRM is simply complexity regularization using VC type bounds in place of Chernoff's bound or other concentration inequalities.
The basic idea is to consider a sequence of sets of classifiers ${\mathcal{F}}_{1},{\mathcal{F}}_{2},...,$ of increasing VC dimensions ${V}_{{\mathcal{F}}_{1}}\le {V}_{{\mathcal{F}}_{2}}\le ...$ . Then for each $k=1,2,...$ we find the minimum empirical risk classifier
and then select the final classifier according to
and ${\widehat{f}}_{n}\equiv {\widehat{f}}_{n}^{\left(\widehat{k}\right)}$ is the final choice.
The basic rational is that we know
where ${C}^{\text{'}}$ is a constant.
The end result is that
analogous to our pervious complexity regularization results, except thatcodelengths are replaced by VC dimensions.
In order to prove the result we use the VC probability concentration bound and assume that $\u25b3={\sum}_{k\ge 1}{V}_{{\mathcal{F}}_{k}}<\infty $ . This enables a union bounding argument and leads to a risk bound of the form given above.
Complexity of classes depends on richness (shattering capability) relative to a set of $n$ arbitrary points. This allows us to effectively “quantize" collections of functions in a slightlydata-dependent manner.
Let
Then
satisfies
compare with
from Lecture 11 .
Notification Switch
Would you like to follow the 'Statistical learning theory' conversation and receive update notifications?