The number of unique labellings of the training data that can be achieved with linear classifiers is, in fact, finite. A line can be defined by picking any pair of training points, as illustrated in [link]. Two classifiers can be defined from each such line: one that outputs the label “1” for everything on or above the line, and another that outputs “0” for everything on or above it.
There exist $\left(\genfrac{}{}{0pt}{}{n}{2}\right)$ such pairs of training points, and these define all possible unique labellings of the training data. Therefore, there are at most $2\left(\genfrac{}{}{0pt}{}{n}{2}\right)$ unique linear classifiers for any random set of $n$ 2-dimensional features (the factor of 2 is due to the fact that for each line there are 2 possible assignments of the labelling).
Thus, instead of infinitely many linear classifiers, we realize that as far as a random sample of $n$ training data points is concerned, there are at most $2\left(\genfrac{}{}{0pt}{}{n}{2}\right)$ unique linear classifiers. That is, using linear classification rules, there are at most $2\left(\genfrac{}{}{0pt}{}{n}{2}\right)=n(n-1)\approx {n}^{2}$ unique label assignments for $n$ data points. If we like, we can encode each possibility with ${\log}_{2}n(n-1)\approx 2{\log}_{2}n$ bits. In
$d$ dimensions there are $2\left(\genfrac{}{}{0pt}{}{n}{d}\right)$ hyperplane classification rules, which can be encoded in roughly $d{\log}_{2}n$ bits. Roughly speaking, the number of bits required for encoding each model is the VC dimension. The remarkable aspect of the VC dimension is that it is often finite even when $\mathcal{F}$ is infinite (as in this example).
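As a quick arithmetic check of the counts above (a standalone sketch, not part of the original notes; the function name `max_linear_labellings` is ours):

```python
from math import comb, log2

def max_linear_labellings(n: int, d: int = 2) -> int:
    """Upper bound from the text: 2 * C(n, d) distinct labellings of
    n points by hyperplane classifiers in d dimensions."""
    return 2 * comb(n, d)

n = 100
bound_2d = max_linear_labellings(n)  # 2 * C(100, 2) = 9900
bits_2d = log2(bound_2d)             # ~13.27 bits, close to 2*log2(100) ~ 13.29
print(bound_2d, round(bits_2d, 2), round(2 * log2(n), 2))
```

This illustrates why the crude bit count $\approx 2{\log}_{2}n$ is a good approximation to ${\log}_{2}n(n-1)$ for even moderately large $n$.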
If $\mathcal{X}$ has $d$ dimensions in total, we might consider linear classifiers based on $1,2,\cdots ,d$ features at a time. Lower dimensional hyperplanes are less complex than higher dimensional ones. Suppose we let ${\mathcal{F}}_{k}$ denote the set of linear classifiers based on $k$ of the features, $k=1,\dots ,d$. These spaces have increasing VC dimensions, and we can try to balance the empirical risk and a cost function depending on the VC dimension. Such procedures are often referred to as Structural Risk Minimization. This gives you a glimpse of what the VC dimension is all about. In future lectures we will revisit this topic in greater detail.
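A minimal sketch of the selection step in Structural Risk Minimization (our own illustration; the penalty used below is one standard VC-type complexity term, and its exact form and constants are not specified in these notes):

```python
import math

def srm_select(empirical_risks, vc_dims, n, delta=0.05):
    """Pick the model class minimizing empirical risk plus a VC-type
    complexity penalty (one common form; constants vary by bound)."""
    def penalty(h):
        return math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / delta)) / n)
    scores = [r + penalty(h) for r, h in zip(empirical_risks, vc_dims)]
    return min(range(len(scores)), key=scores.__getitem__)

# Hypothetical empirical risks for nested classes of VC dimension 2, 3, 4:
print(srm_select([0.30, 0.10, 0.09], [2, 3, 4], n=200))
```

Note how the penalty grows with the VC dimension $h$, so a small further drop in empirical risk (0.10 to 0.09) does not justify moving to the more complex class.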
Hold-out methods
The basic idea of “hold-out” methods is to split the
$n$ samples
$D\equiv {\{{X}_{i},{Y}_{i}\}}_{i=1}^{n}$ into a training set,
${D}_{T}$ ,
and a test set,
${D}_{V}$ .
Now, suppose we have a collection of different model spaces
$\left\{{\mathcal{F}}_{\lambda}\right\}$ indexed by
$\lambda \in \Lambda $ (e.g.,
${\mathcal{F}}_{\lambda}$ is the set of polynomials of degree
$d$ , with
$\lambda =d$ ), or
suppose that we have a collection of complexity penalization criteria
${L}_{\lambda}\left(f\right)$ indexed by
$\lambda $ (
e.g., let
${L}_{\lambda}\left(f\right)=\widehat{R}\left(f\right)+\lambda c\left(f\right)$ , with
$\lambda \in {\mathbf{R}}^{+}$ ). We can
obtain candidate solutions using the training set as follows. Define ${\widehat{f}}_{\lambda}=\arg {min}_{f\in {\mathcal{F}}_{\lambda}}\phantom{\rule{0.2em}{0ex}}{\widehat{R}}_{{D}_{T}}\left(f\right)$, where ${\widehat{R}}_{{D}_{T}}$ denotes the empirical risk computed on the training set ${D}_{T}$ (or, for penalized criteria, ${\widehat{f}}_{\lambda}=\arg {min}_{f\in \mathcal{F}}\phantom{\rule{0.2em}{0ex}}{L}_{\lambda}\left(f\right)$). This provides us with a set of candidate solutions $\left\{{\widehat{f}}_{\lambda}\right\}$. Then we can define the hold-out error estimate using the test set: ${\widehat{R}}_{{D}_{V}}\left({\widehat{f}}_{\lambda}\right)=\frac{1}{|{D}_{V}|}{\sum}_{\left({X}_{i},{Y}_{i}\right)\in {D}_{V}}\ell \left({\widehat{f}}_{\lambda}\left({X}_{i}\right),{Y}_{i}\right)$, and select the $\lambda $ minimizing it.
This type of procedure has many nice theoretical guarantees, provided
both the training and test set grow with
$n$ .
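The whole hold-out procedure can be sketched as follows (our own example, not from the lecture; here ${\mathcal{F}}_{\lambda}$ is the set of polynomials of degree $\lambda$, and the data, split sizes, and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.1, n)  # quadratic truth + noise

# Split the n samples into a training set D_T and a test set D_V.
x_tr, y_tr = x[:60], y[:60]
x_va, y_va = x[60:], y[60:]

# For each lambda (= polynomial degree), fit on D_T only,
# then estimate the risk of each candidate on D_V.
holdout_err = {}
for d in range(1, 9):
    coeffs = np.polyfit(x_tr, y_tr, deg=d)  # empirical risk minimizer in F_d
    holdout_err[d] = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)

best = min(holdout_err, key=holdout_err.get)
print(best, holdout_err[best])
```

The degree-1 class underfits the quadratic truth, so the hold-out estimate steers the selection toward a degree of at least 2.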
Leaving-one-out cross-validation
A very popular hold-out method is the so-called “leaving-one-out cross-validation” studied in depth by Grace Wahba (UW-Madison, Statistics). For each $\lambda $ we compute $CV\left(\lambda \right)=\frac{1}{n}{\sum}_{i=1}^{n}\ell \left({\widehat{f}}_{\lambda}^{\left(i\right)}\left({X}_{i}\right),{Y}_{i}\right)$, where ${\widehat{f}}_{\lambda}^{\left(i\right)}$ is the candidate solution computed using all of the data except the $i$th sample, and we then select the $\lambda $ minimizing $CV\left(\lambda \right)$.
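A minimal sketch of leaving-one-out cross-validation (our own illustration; Wahba's work concerns regression, so we assume squared-error loss and use ridge regression with penalty parameter $\lambda$ as the family of candidate rules):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 2.0]) + rng.normal(0, 0.3, n)

def loocv_score(lam):
    """CV(lambda): average squared error over the n leave-one-out fits."""
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        Xi, yi = X[mask], y[mask]
        # Ridge solution on all data except sample i.
        w = np.linalg.solve(Xi.T @ Xi + lam * np.eye(2), Xi.T @ yi)
        errs.append((X[i] @ w - y[i]) ** 2)
    return float(np.mean(errs))

lams = [0.0, 0.1, 1.0, 10.0]
scores = {lam: loocv_score(lam) for lam in lams}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Each value of $\lambda$ requires $n$ refits here; for many estimators (including ridge) closed-form leave-one-out shortcuts exist, which is part of why the method became practical.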
To summarize, this lecture gave a brief and incomplete survey of different methods for dealing with the issues of overfitting and model selection. Given a set of training data, ${D}_{n}={\{{X}_{i},{Y}_{i}\}}_{i=1}^{n}$, our overall goal is to find a function $f$ with small risk $R\left(f\right)$, chosen from some collection of functions, $\mathcal{F}$. Because we do not know the true distribution ${P}_{XY}$ underlying the data points ${D}_{n}$, it is difficult to get an exact handle on the risk, $R\left(f\right)$. If we only focus on minimizing the empirical risk $\widehat{R}\left(f\right)$, we end up overfitting to the training data. Two general approaches were presented.
In the first approach we consider an indexed collection of spaces ${\left\{{\mathcal{F}}_{\lambda}\right\}}_{\lambda \in \Lambda}$ such that the complexity of ${\mathcal{F}}_{\lambda}$ increases as $\lambda $ increases, and we select ${\widehat{f}}_{{\lambda}^{*}}=\arg {min}_{f\in {\mathcal{F}}_{{\lambda}^{*}}}\phantom{\rule{0.2em}{0ex}}\widehat{R}\left(f\right)$, where either ${\lambda}^{*}=\lambda \left(n\right)$, a function growing with the number of data samples $n$, or ${\lambda}^{*}$ is chosen by hold-out validation.
The alternative approach is to incorporate a penalty term into the risk minimization problem formulation. Here we consider an indexed collection of penalties ${\left\{{C}_{\lambda}\right\}}_{\lambda \in \Lambda}$ satisfying the following properties:
${C}_{\lambda}:\mathcal{F}\to {\mathbf{R}}^{+}$ ;
For each
$f\in \mathcal{F}$ and
${\lambda}_{1}<{\lambda}_{2}$ we have
${C}_{{\lambda}_{1}}\left(f\right)\le {C}_{{\lambda}_{2}}\left(f\right)$ ;
There exists
${\lambda}_{0}\in \Lambda $ such that
${C}_{{\lambda}_{0}}\left(f\right)=0$ for all
$f\in \mathcal{F}$ .
We then select ${\widehat{f}}_{{\lambda}^{*}}=\arg {min}_{f\in \mathcal{F}}\left\{\widehat{R}\left(f\right)+{C}_{{\lambda}^{*}}\left(f\right)\right\}$, where either ${\lambda}^{*}=\lambda \left(n\right)$, a function growing with the number of data samples $n$, or ${\lambda}^{*}$ is selected by hold-out validation.
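A concrete penalty family satisfying all three properties is the familiar squared-norm penalty for linear rules (our own example, not named in the notes):

```python
import numpy as np

def C(lam: float, w: np.ndarray) -> float:
    """Penalty family C_lambda(f) = lam * ||w||^2 for a linear f
    parameterized by weight vector w."""
    return lam * float(np.dot(w, w))

w = np.array([1.0, -2.0])
# Property 1: C_lambda maps into R^+ (nonnegative).
# Property 2: lambda_1 < lambda_2 implies C_{lambda_1}(f) <= C_{lambda_2}(f).
# Property 3: lambda_0 = 0 gives C_{lambda_0}(f) = 0 for every f.
print(C(0.5, w), C(1.5, w), C(0.0, w))
```

Property 3 guarantees that plain empirical risk minimization is recovered as a member of the family, so the penalized formulation strictly generalizes it.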
Consistency
If an estimator or classifier ${\widehat{f}}_{{\lambda}^{*}}$ satisfies $R\left({\widehat{f}}_{{\lambda}^{*}}\right)\to {inf}_{f\in \mathcal{F}}R\left(f\right)$ as $n\to \infty $ (in probability or in expectation), then we say that ${\widehat{f}}_{{\lambda}^{*}}$ is $\mathcal{F}$-consistent with respect to the risk $R$. When the context is clear, we will simply say that $\widehat{f}$ is consistent.