Thus, $\widehat{\epsilon}\left({h}_{i}\right)$ is exactly the mean of the $m$ random variables ${Z}_{j}$ that are drawn iid from a Bernoulli distribution with mean $\epsilon \left({h}_{i}\right)$ . Hence, we can apply the Hoeffding inequality, and obtain

$$P\left(|\epsilon \left({h}_{i}\right)-\widehat{\epsilon}\left({h}_{i}\right)|>\gamma \right)\le 2\exp \left(-2{\gamma}^{2}m\right).$$
This shows that, for our particular ${h}_{i}$ , training error will be close to generalization error with high probability, assuming $m$ is large. But we don't just want to guarantee that $\epsilon \left({h}_{i}\right)$ will be close to $\widehat{\epsilon}\left({h}_{i}\right)$ (with high probability) for just one particular ${h}_{i}$ . We want to prove that this will be true simultaneously for all $h\in \mathcal{H}$ . To do so, let ${A}_{i}$ denote the event that $|\epsilon \left({h}_{i}\right)-\widehat{\epsilon}\left({h}_{i}\right)|>\gamma $ . We've already shown that, for any particular ${A}_{i}$ , it holds true that $P\left({A}_{i}\right)\le 2\exp \left(-2{\gamma}^{2}m\right)$ . Thus, using the union bound, we have that

$$\begin{aligned} P\left(\exists h\in \mathcal{H}.\ |\epsilon \left(h\right)-\widehat{\epsilon}\left(h\right)|>\gamma \right) &= P\left({A}_{1}\cup \cdots \cup {A}_{k}\right)\\ &\le \sum_{i=1}^{k}P\left({A}_{i}\right)\\ &\le \sum_{i=1}^{k}2\exp \left(-2{\gamma}^{2}m\right)\\ &= 2k\exp \left(-2{\gamma}^{2}m\right). \end{aligned}$$
If we subtract both sides from 1, we find that

$$P\left(\neg \exists h\in \mathcal{H}.\ |\epsilon \left(h\right)-\widehat{\epsilon}\left(h\right)|>\gamma \right)=P\left(\forall h\in \mathcal{H}.\ |\epsilon \left(h\right)-\widehat{\epsilon}\left(h\right)|\le \gamma \right)\ge 1-2k\exp \left(-2{\gamma}^{2}m\right).$$

(The “ $\neg $ ” symbol means “not.”) So, with probability at least $1-2k\exp \left(-2{\gamma}^{2}m\right)$ , we have that $\epsilon \left(h\right)$ will be within $\gamma $ of $\widehat{\epsilon}\left(h\right)$ for all $h\in \mathcal{H}$ . This is called a uniform convergence result, because it is a bound that holds simultaneously for all (as opposed to just one) $h\in \mathcal{H}$ .
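As a sanity check, this bound can be compared against a small Monte Carlo simulation. The sketch below is illustrative only: the hypothesis class is abstracted away (each hypothesis is represented just by an assumed true error rate), and the values of $k$ , $m$ , and $\gamma $ are made up for the example.

```python
import math
import random

# Illustrative Monte Carlo check of the uniform-convergence bound.
# Each hypothesis h_i is represented only by an assumed true error
# rate epsilon(h_i); the rates below are invented for illustration.
random.seed(0)
k, m, gamma = 10, 500, 0.05
true_errors = [0.1 + 0.3 * i / k for i in range(k)]  # assumed epsilon(h_i)

trials = 2000
violations = 0
for _ in range(trials):
    # epsilon_hat(h_i) is the mean of m iid Bernoulli(epsilon(h_i)) draws
    if any(
        abs(sum(random.random() < eps for _ in range(m)) / m - eps) > gamma
        for eps in true_errors
    ):
        violations += 1

empirical = violations / trials
union_bound = 2 * k * math.exp(-2 * gamma**2 * m)
print(f"empirical P(some |eps - eps_hat| > gamma): {empirical:.4f}")
print(f"union bound 2k*exp(-2*gamma^2*m):          {union_bound:.4f}")
```

With these particular values the union bound is quite loose (it even exceeds 1), which is typical: the union bound ignores any relationship between the hypotheses. Its value lies in how it scales, which is the subject of the discussion that follows.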
In the discussion above, what we did was, for particular values of $m$ and $\gamma $ , give a bound on the probability that for some $h\in \mathcal{H}$ , $|\epsilon \left(h\right)-\widehat{\epsilon}\left(h\right)|>\gamma $ . There are three quantities of interest here: $m$ , $\gamma $ , and the probability of error; we can bound any one of them in terms of the other two.
For instance, we can ask the following question: Given $\gamma $ and some $\delta >0$ , how large must $m$ be before we can guarantee that, with probability at least $1-\delta $ , training error will be within $\gamma $ of generalization error? By setting $\delta =2k\exp \left(-2{\gamma}^{2}m\right)$ and solving for $m$ , [you should convince yourself this is the right thing to do!], we find that if

$$m\ge \frac{1}{2{\gamma}^{2}}\log \frac{2k}{\delta},$$
then with probability at least $1-\delta $ , we have that $|\epsilon \left(h\right)-\widehat{\epsilon}\left(h\right)|\le \gamma $ for all $h\in \mathcal{H}$ . (Equivalently, this shows that the probability that $|\epsilon \left(h\right)-\widehat{\epsilon}\left(h\right)|>\gamma $ for some $h\in \mathcal{H}$ is at most $\delta $ .) This bound tells us how many training examples we need in order to make a guarantee. The training set size $m$ that a certain method or algorithm requires in order to achieve a certain level of performance is also called the algorithm's sample complexity .
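As a concrete illustration, the small helper below computes the smallest integer $m$ satisfying the inequality above; the particular values of $k$ , $\gamma $ , and $\delta $ are chosen arbitrarily for the example.

```python
import math

def sample_complexity(k: int, gamma: float, delta: float) -> int:
    """Smallest integer m with m >= (1 / (2*gamma^2)) * log(2k / delta)."""
    return math.ceil(math.log(2 * k / delta) / (2 * gamma**2))

# Illustrative values: gamma = 0.05, delta = 0.05.
print(sample_complexity(1000, 0.05, 0.05))   # k = 10^3  ->  2120
print(sample_complexity(10**6, 0.05, 0.05))  # k = 10^6  ->  3501
```

Note that growing $k$ from ${10}^{3}$ to ${10}^{6}$ raises the required $m$ only from 2120 to 3501, a direct consequence of the logarithmic dependence on $k$ discussed next.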
The key property of the bound above is that the number of training examples needed to make this guarantee is only logarithmic in $k$ , the number of hypotheses in $\mathcal{H}$ . This will be important later.
Similarly, we can also hold $m$ and $\delta $ fixed and solve for $\gamma $ in the previous equation, and show [again, convince yourself that this is right!] that with probability $1-\delta $ , we have that for all $h\in \mathcal{H}$ ,

$$|\widehat{\epsilon}\left(h\right)-\epsilon \left(h\right)|\le \sqrt{\frac{1}{2m}\log \frac{2k}{\delta}}.$$
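Solving for $\gamma $ this way can be sketched just as easily; again, the numeric values below are arbitrary illustrations.

```python
import math

def error_gap(k: int, m: int, delta: float) -> float:
    """gamma = sqrt((1 / (2m)) * log(2k / delta)), from delta = 2k*exp(-2*gamma^2*m)."""
    return math.sqrt(math.log(2 * k / delta) / (2 * m))

# Illustrative: k = 1000 hypotheses, m = 10000 examples, delta = 0.05.
print(round(error_gap(1000, 10000, 0.05), 4))  # 0.023
```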
Now, let's assume that uniform convergence holds, i.e., that $|\epsilon \left(h\right)-\widehat{\epsilon}\left(h\right)|\le \gamma $ for all $h\in \mathcal{H}$ . What can we prove about the generalization of our learning algorithm that picked $\widehat{h}=arg{min}_{h\in \mathcal{H}}\widehat{\epsilon}\left(h\right)$ ?
Define ${h}^{*}=arg{min}_{h\in \mathcal{H}}\epsilon \left(h\right)$ to be the best possible hypothesis in $\mathcal{H}$ . Note that ${h}^{*}$ is the best that we could possibly do given that we are using $\mathcal{H}$ , so it makes sense to compare our performance to that of ${h}^{*}$ . We have:

$$\begin{aligned} \epsilon \left(\widehat{h}\right) &\le \widehat{\epsilon}\left(\widehat{h}\right)+\gamma \\ &\le \widehat{\epsilon}\left({h}^{*}\right)+\gamma \\ &\le \epsilon \left({h}^{*}\right)+2\gamma. \end{aligned}$$
The first line used the fact that $|\epsilon \left(\widehat{h}\right)-\widehat{\epsilon}\left(\widehat{h}\right)|\le \gamma $ (by our uniform convergence assumption). The second used the fact that $\widehat{h}$ was chosen to minimize $\widehat{\epsilon}\left(h\right)$ , and hence $\widehat{\epsilon}\left(\widehat{h}\right)\le \widehat{\epsilon}\left(h\right)$ for all $h$ , and in particular $\widehat{\epsilon}\left(\widehat{h}\right)\le \widehat{\epsilon}\left({h}^{*}\right)$ . The third line used the uniform convergence assumption again, to show that $\widehat{\epsilon}\left({h}^{*}\right)\le \epsilon \left({h}^{*}\right)+\gamma $ . So, what we've shown is the following: If uniform convergence occurs, then the generalization error of $\widehat{h}$ is at most $2\gamma $ worse than the best possible hypothesis in $\mathcal{H}$ !
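Putting the two results together gives a numeric sketch of the full guarantee: with probability at least $1-\delta $ , $\epsilon \left(\widehat{h}\right)\le \epsilon \left({h}^{*}\right)+2\sqrt{\frac{1}{2m}\log \frac{2k}{\delta}}$ . In the example below, the error rate assumed for ${h}^{*}$ and all other numbers are invented purely for illustration.

```python
import math

def generalization_bound(best_error: float, k: int, m: int, delta: float) -> float:
    """With prob >= 1 - delta: eps(h_hat) <= best_error + 2*sqrt(log(2k/delta)/(2m))."""
    gamma = math.sqrt(math.log(2 * k / delta) / (2 * m))
    return best_error + 2 * gamma

# Illustrative: if the best hypothesis in H has 10% generalization error,
# with k = 1000, m = 10000, delta = 0.05, the empirically chosen h_hat is
# guaranteed (w.p. >= 0.95) at most about 14.6% error.
print(round(generalization_bound(0.10, 1000, 10000, 0.05), 4))  # 0.146
```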