The VC inequality is a powerful generalization of the bounds we obtained for the hyperplane classifier in the previous lecture. The basic idea of the proof is quite similar. Before stating the inequality, we need to introduce the concepts of shatter coefficients and VC dimension.
Let $\mathcal{A}$ be a collection of subsets of ${\mathcal{R}}^{d}$.

Definition: The ${n}^{th}$ shatter coefficient of $\mathcal{A}$ is defined by
$${\mathcal{S}}_{\mathcal{A}}(n) = \max_{x_1,\dots,x_n \in {\mathcal{R}}^{d}} \left|\left\{\{x_1,\dots,x_n\} \cap A : A \in \mathcal{A}\right\}\right|.$$
The shatter coefficients are a measure of the richness of the collection $\mathcal{A}$ . ${\mathcal{S}}_{\mathcal{A}}\left(n\right)$ is the largest number of different subsets of a set of $n$ points that can be generated by intersecting the set with elements of $\mathcal{A}$ .
In 1-d, let $\mathcal{A}=\{(-\infty, t] : t \in \mathcal{R}\}$. Assuming $x_1 < x_2 < \cdots < x_n$, the possible subsets of $\{x_1,\dots,x_n\}$ generated by intersecting with sets of the form $(-\infty, t]$ are $\{x_1,\dots,x_n\},\ \{x_1,\dots,x_{n-1}\},\ \dots,\ \{x_1\},\ \phi$. Hence ${\mathcal{S}}_{\mathcal{A}}(n)=n+1$.
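A quick brute-force check of this count (the specific points below are arbitrary choices for illustration):

```python
# Brute-force check that half-lines (-inf, t] generate exactly n + 1
# distinct subsets of n distinct points on the real line.

def half_line_subsets(points):
    """Return the distinct subsets {x in points : x <= t} over all t."""
    subsets = set()
    # Only thresholds at the points themselves (plus one below the minimum,
    # giving the empty set) can change the intersection, so these suffice.
    thresholds = [min(points) - 1.0] + sorted(points)
    for t in thresholds:
        subsets.add(frozenset(x for x in points if x <= t))
    return subsets

points = [0.3, 1.7, 2.2, 5.0]          # n = 4 distinct points
print(len(half_line_subsets(points)))  # n + 1 = 5
```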
In 2-d, let $\mathcal{A} = \{$ all rectangles in ${\mathcal{R}}^{2}$ $\}$.
Consider a set $\{x_1,\ x_2,\ x_3,\ x_4\}$ of training points arranged at the corners of a diamond. It is easy to see that for any subset of the four points we can find a rectangle in ${\mathcal{R}}^{2}$ that contains exactly that subset, i.e. ${\mathcal{S}}_{\mathcal{A}}(4)={2}^{4}=16$.
Clearly, ${\mathcal{S}}_{\mathcal{A}}\left(n\right)={2}^{n},n=1,\phantom{\rule{0.166667em}{0ex}}2,\phantom{\rule{0.166667em}{0ex}}3$ as well.
However, for $n=5$, ${\mathcal{S}}_{\mathcal{A}}(5)<{2}^{5}$. This is because among any five points we can always select four of them such that the smallest rectangle containing those four also contains the fifth. Consequently, we cannot find a rectangle classifier which contains the four outer points and does not contain the inner point.
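Both claims can be verified by brute force, using the fact that a subset is achievable by a rectangle if and only if its bounding box picks out no extra point (the diamond coordinates below are one illustrative arrangement):

```python
from itertools import combinations

def rect_shatters(points):
    """Check whether axis-aligned rectangles pick out every subset of points."""
    achieved = set()
    n = len(points)
    for r in range(n + 1):
        for subset in combinations(points, r):
            if subset:
                # Any rectangle achieving `subset` must contain its bounding
                # box, so the bounding box's intersection with `points` is
                # the only candidate subset that rectangle can produce.
                xs = [p[0] for p in subset]
                ys = [p[1] for p in subset]
                got = frozenset(p for p in points
                                if min(xs) <= p[0] <= max(xs)
                                and min(ys) <= p[1] <= max(ys))
            else:
                got = frozenset()
            achieved.add(got)
    return len(achieved) == 2 ** n

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # four corners of a diamond
print(rect_shatters(diamond))                   # True: all 16 subsets occur
print(rect_shatters(diamond + [(0, 0)]))        # False: a fifth point breaks it
```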
Note that ${\mathcal{S}}_{\mathcal{A}}(n)\le {2}^{n}$.

If $\left|\left\{\{x_1,\dots,x_n\} \cap A : A \in \mathcal{A}\right\}\right|={2}^{n}$, then we say that $\mathcal{A}$ shatters $x_1,\ \dots,\ x_n$. The VC dimension ${V}_{\mathcal{A}}$ of $\mathcal{A}$ is the largest integer $n$ for which there exists a set of $n$ points shattered by $\mathcal{A}$.
Lemma (Sauer): Let $\mathcal{A}$ be a collection of sets with VC dimension ${V}_{\mathcal{A}}<\infty $. Then $\forall n,{\mathcal{S}}_{\mathcal{A}}\left(n\right)\le {\sum}_{i=0}^{{V}_{\mathcal{A}}}\left(\begin{array}{c}n\\ i\end{array}\right)$, and also ${\mathcal{S}}_{\mathcal{A}}\left(n\right)\le {(n+1)}^{{V}_{\mathcal{A}}},\forall n$.
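A quick numerical sanity check of the two bounds (the values of $n$ and ${V}_{\mathcal{A}}$ below are arbitrary):

```python
from math import comb

def sauer_bound(n, v):
    """Sum_{i=0}^{v} C(n, i): Sauer's bound on the n-th shatter coefficient."""
    return sum(comb(n, i) for i in range(v + 1))

# The polynomial bound (n+1)^V is looser but simpler to work with.
for v in [1, 3, 5]:
    for n in [1, 10, 100]:
        assert sauer_bound(n, v) <= (n + 1) ** v
        assert sauer_bound(n, v) <= 2 ** n   # trivial bound also holds

print(sauer_bound(10, 3))   # 1 + 10 + 45 + 120 = 176
print(11 ** 3)              # 1331, the looser (n+1)^V bound
```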
Let $\mathcal{F}$ be a collection of classifiers of the form $f:{\mathcal{R}}^{d}\to \{0,1\}$. Define $\mathcal{A}=\left\{\{x:f(x)=1\}\times \{0\}\cup \{x:f(x)=0\}\times \{1\} : f \in \mathcal{F}\right\}$. In words, this is the collection of subsets of $\mathcal{X}\times \mathcal{Y}$ on which an $f \in \mathcal{F}$ maps the features $x$ to a label opposite to $y$. The size of $\mathcal{A}$ expresses the richness of $\mathcal{F}$: the larger $\mathcal{A}$ is, the more likely it is that there exists an $f \in \mathcal{F}$ for which $R(f)=P(f(X)\ne Y)$ is close to the Bayes risk ${R}^{*}=P({f}^{*}(X)\ne Y)$, where ${f}^{*}$ is the Bayes classifier. The ${n}^{th}$ shatter coefficient of $\mathcal{F}$ is defined as ${\mathcal{S}}_{\mathcal{F}}(n)={\mathcal{S}}_{\mathcal{A}}(n)$, and the VC dimension of $\mathcal{F}$ is defined as ${V}_{\mathcal{F}}={V}_{\mathcal{A}}$.
Example: linear (hyperplane) classifiers in ${\mathcal{R}}^{d}$.
Consider $d=2$ and let $\mathcal{A}$ be as above. When $n=1$, it is easy to see that using linear classifiers in ${\mathcal{R}}^{2}$ we can assign 1 to each of the possible subsets $\{\left\{x_1\right\},\phi \}$ and 0 to its complement. Hence ${\mathcal{S}}_{\mathcal{F}}(1)=2$.
When $n=2$ , we can also assign 1 to all possible subsets $\{\{{x}_{1},{x}_{2}\},\phantom{\rule{0.166667em}{0ex}}\left\{{x}_{1}\right\},\phantom{\rule{0.166667em}{0ex}}\left\{{x}_{2}\right\},\phantom{\rule{0.166667em}{0ex}}\phi \}$ and 0 to their complements, and vice versa. Hence ${\mathcal{S}}_{\mathcal{F}}\left(2\right)=4={2}^{2}$ .
When $n=3$, we can arrange the points $x_1,\ x_2,\ x_3$ (non-collinear) so that the set of linear classifiers shatters the three points; hence ${\mathcal{S}}_{\mathcal{F}}(3)=8={2}^{3}$.
When $n=4$, $\mathcal{A}$ does not shatter the points, no matter where $x_1,\ x_2,\ x_3,\ x_4$ are placed and which binary values $y_1,\ y_2,\ y_3,\ y_4$ are assigned. If any three of the points are collinear, linear classifiers clearly cannot shatter them. Otherwise, either one point lies inside the triangle formed by the other three, in which case no linear classifier can separate the inner point from the outer three; or the four points form a convex quadrilateral, in which case the two pairs of points lying on the two diagonals form two groups, and no linear classifier can assign different values to the two groups. Hence ${\mathcal{S}}_{\mathcal{F}}(4)<16={2}^{4}$ and ${V}_{\mathcal{F}}=3$.
We state here without proving it that in general the class of linear classifiers in ${\mathcal{R}}^{d}$ has ${V}_{\mathcal{F}}=d+1$ .
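One way to check the $d=2$ counts above numerically is Cover's function-counting theorem (an outside result, not proved in these notes): for $n$ points in general position in ${\mathcal{R}}^{d}$, affine hyperplane classifiers realize exactly $2\sum_{i=0}^{d}\binom{n-1}{i}$ dichotomies.

```python
from math import comb

def linear_dichotomies(n, d):
    """Cover's count of dichotomies of n points in general position in R^d
    realizable by affine hyperplane classifiers."""
    # Affine hyperplanes in R^d correspond to homogeneous ones in R^(d+1),
    # hence the sum runs up to i = d.
    return 2 * sum(comb(n - 1, i) for i in range(d + 1))

d = 2
for n in range(1, 6):
    print(n, linear_dichotomies(n, d), 2 ** n)
# The largest n with linear_dichotomies(n, 2) == 2**n is n = 3 = d + 1,
# matching the VC dimension d + 1 stated above.
```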
Let $X_1,\ \dots,\ X_n$ be i.i.d. ${\mathcal{R}}^{d}$-valued random variables. Denote the common distribution of the $X_i$, $1\le i\le n$, by $\mu(A)=P(X_1 \in A)$ for any subset $A\subset {\mathcal{R}}^{d}$. Similarly, define the empirical distribution ${\mu}_{n}(A)=\frac{1}{n}\sum_{i=1}^{n}{1}_{\{X_i \in A\}}$.
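A small Monte Carlo illustration of ${\mu}_{n}(A)$ approaching $\mu(A)$ (the uniform distribution and half-line set used below are assumptions for the example, not from the notes):

```python
import random

# For X_i uniform on [0, 1] and A = [0, t], the empirical measure
# mu_n(A) approaches mu(A) = t as n grows.
random.seed(0)

def mu_n(samples, t):
    """Empirical measure of the set A = [0, t]."""
    return sum(1 for x in samples if x <= t) / len(samples)

t = 0.3                                  # so mu(A) = 0.3
for n in [100, 10000]:
    samples = [random.random() for _ in range(n)]
    print(n, abs(mu_n(samples, t) - t))  # deviation shrinks as n grows
```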
Theorem: For any probability measure $\mu $ and collection of subsets $\mathcal{A}$, and for any $\epsilon >0$,
$$P\left(\sup_{A \in \mathcal{A}}\left|\mu_n(A)-\mu(A)\right| > \epsilon\right) \le 8\,\mathcal{S}_{\mathcal{A}}(n)\,e^{-n\epsilon^2/32},$$
and
$$E\left[\sup_{A \in \mathcal{A}}\left|\mu_n(A)-\mu(A)\right|\right] \le 2\sqrt{\frac{\log \mathcal{S}_{\mathcal{A}}(n)+\log 2}{n}}.$$
Before proving the theorem, we present a corollary.
Corollary: Let $\mathcal{F}$ be a collection of classifiers of the form $f:{\mathcal{R}}^{d}\to \{0,1\}$ with VC dimension ${V}_{\mathcal{F}}<\infty $. Let $R(f)=P(f(X)\ne Y)$ and ${\widehat{R}}_{n}(f)=\frac{1}{n}\sum_{i=1}^{n}{1}_{\{f(X_i)\ne Y_i\}}$, where $(X_i,Y_i)$, $1\le i\le n$, are i.i.d. with joint distribution ${P}_{XY}$.
Define
$${\widehat{f}}_{n}=\arg\min_{f \in \mathcal{F}}{\widehat{R}}_{n}(f).$$
Then
$$E[R({\widehat{f}}_{n})]-\min_{f \in \mathcal{F}}R(f) \le 4\sqrt{\frac{V_{\mathcal{F}}\log(n+1)+\log 2}{n}}.$$
Proof: Let $\mathcal{A}=\left\{\{x:f(x)=1\}\times \{0\}\cup \{x:f(x)=0\}\times \{1\} : f \in \mathcal{F}\right\}$.
Note that
$$R(f)=P(f(X)\ne Y)=P\left((X,Y) \in A\right)=\mu(A),$$
where $A=\{x:f(x)=1\}\times \{0\}\cup \{x:f(x)=0\}\times \{1\}$.
Similarly,
$${\widehat{R}}_{n}(f)=\frac{1}{n}\sum_{i=1}^{n}{1}_{\{f(X_i)\ne Y_i\}}=\mu_n(A).$$
Therefore, according to the VC theorem,
$$E\left[\sup_{f \in \mathcal{F}}\left|{\widehat{R}}_{n}(f)-R(f)\right|\right]=E\left[\sup_{A \in \mathcal{A}}\left|\mu_n(A)-\mu(A)\right|\right] \le 2\sqrt{\frac{\log \mathcal{S}_{\mathcal{F}}(n)+\log 2}{n}}.$$
Since ${V}_{\mathcal{F}}<\infty $, ${\mathcal{S}}_{\mathcal{F}}(n)\le {(n+1)}^{{V}_{\mathcal{F}}}$ and
$$E\left[\sup_{f \in \mathcal{F}}\left|{\widehat{R}}_{n}(f)-R(f)\right|\right] \le 2\sqrt{\frac{V_{\mathcal{F}}\log(n+1)+\log 2}{n}}.$$
Next, note that since ${\widehat{R}}_{n}({\widehat{f}}_{n})\le {\widehat{R}}_{n}(f)$ for all $f \in \mathcal{F}$,
$$R({\widehat{f}}_{n})-\min_{f \in \mathcal{F}}R(f) \le 2\sup_{f \in \mathcal{F}}\left|{\widehat{R}}_{n}(f)-R(f)\right|.$$
Therefore,
$$E[R({\widehat{f}}_{n})]-\min_{f \in \mathcal{F}}R(f) \le 4\sqrt{\frac{V_{\mathcal{F}}\log(n+1)+\log 2}{n}}.$$
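As an illustration of the corollary's setting, the following simulation (the threshold class, the uniform distribution, and the labeling rule are assumptions for the example, not from the notes) estimates the worst-case deviation $\sup_{f \in \mathcal{F}}|{\widehat{R}}_{n}(f)-R(f)|$ for a simple class and shows it shrinking with $n$:

```python
import random

# Take F to be the 1-d threshold classifiers f_t(x) = 1{x <= t}, with
# X uniform on [0, 1] and Y = 1{X <= 1/2}.
random.seed(1)

def true_risk(t):
    """R(f_t) = P(f_t(X) != Y) = |t - 1/2| for t in [0, 1]."""
    return abs(t - 0.5)

def sup_deviation(xs, ys):
    """Approximate sup over t of |R_hat_n(f_t) - R(f_t)|.

    The empirical risk only changes at sample points, so checking
    thresholds at the samples (plus t = 0, the all-zero classifier)
    approximates the supremum well.
    """
    n = len(xs)
    cands = [0.0] + list(xs)
    return max(abs(sum(((x <= t) != y) for x, y in zip(xs, ys)) / n
                   - true_risk(t))
               for t in cands)

for n in [100, 10000]:
    xs = [random.random() for _ in range(n)]
    ys = [x <= 0.5 for x in xs]
    print(n, sup_deviation(xs, ys))   # the worst-case gap shrinks with n
```

This class has ${\mathcal{S}}_{\mathcal{F}}(n)$ growing only linearly in $n$ (compare the half-line example earlier in the lecture), so the uniform deviation decays at the rate the corollary predicts.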