The number of unique labellings of the training data that can be
achieved with linear classifiers is, in fact, finite. A line can be defined by picking
any pair of training points, as illustrated
in
[link] . Two classifiers can be defined from each
such line: one that outputs a label “1” for everything on or above the line, and another that outputs “0” for everything on or above the line.
There exist
$\binom{n}{2}$ such pairs of training points, and these
define all possible unique labellings of the training data. Therefore, there are at most
$2\binom{n}{2}$ unique linear
classifiers for any random set of
$n$ 2-dimensional features (the
factor of 2 is due to the fact that for each linear classifier there are 2 possible assignments of the labelling).
Thus, instead of infinitely many linear classifiers, we realize that
as far as a random sample of
$n$ training data is concerned, there are
at most
$2\binom{n}{2}=n(n-1)$ unique linear classifiers. That is, using linear classification
rules, there are at most
$n(n-1)\approx {n}^{2}$ unique label assignments
for
$n$ data points. If we like, we can encode each possibility with
$\log_{2}(n(n-1))\approx 2\log_{2}n$ bits. In
$d$ dimensions there are
$2\binom{n}{d}$ hyperplane classification rules, which can be
encoded in roughly
$d\log_{2}n$ bits. Roughly speaking, the number of
bits required for encoding each model is the VC dimension. The remarkable aspect of the VC dimension is that it is often finite even
when
$\mathcal{F}$ is infinite (as in this example).
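As a sanity check on this counting argument, the sketch below (hypothetical helper names, not from the lecture) enumerates the labellings of a random 2-D sample that are induced by lines through pairs of training points, and confirms the set has at most $n(n-1)$ elements.

```python
import itertools
import random

def unique_linear_labellings(points):
    """Labellings of `points` induced by lines through each pair of points.

    Each line yields two classifiers: one labels "1" everything on or
    above the line, the other labels "0" everything on or above it.
    """
    labellings = set()
    for p, q in itertools.combinations(points, 2):
        # The sign of the cross product (q - p) x (r - p) tells which
        # side of the line through p and q the point r falls on.
        side = tuple(
            1 if (q[0] - p[0]) * (r[1] - p[1])
               - (q[1] - p[1]) * (r[0] - p[0]) >= 0 else 0
            for r in points
        )
        labellings.add(side)
        labellings.add(tuple(1 - s for s in side))  # flipped assignment
    return labellings

random.seed(0)
n = 8
pts = [(random.random(), random.random()) for _ in range(n)]
count = len(unique_linear_labellings(pts))
print(count, "unique labellings; bound n(n-1) =", n * (n - 1))
```

Since the loop visits $\binom{n}{2}$ pairs and adds at most two labellings per pair, the set size can never exceed $n(n-1)$, matching the bound in the text.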
If
$\mathcal{X}$ has
$d$ dimensions in total, we might consider linear
classifiers based on
$1,2,\cdots ,d$ features at a time. Lower
dimensional hyperplanes are less complex than higher dimensional ones. Suppose we let
${\mathcal{F}}_{k}$ denote the set of linear classifiers based on $k$ of the features, for $k=1,2,\cdots ,d$.
These spaces have increasing VC dimensions, and we can try to balance
the empirical risk and a cost function depending on the VC dimension. Such procedures are often referred to as
Structural Risk
Minimization . This gives you a glimpse of what the VC dimension is
all about. In future lectures we will revisit this topic in greater detail.
Hold-out methods
The basic idea of “hold-out” methods is to split the
$n$ samples
$D\equiv {\{{X}_{i},{Y}_{i}\}}_{i=1}^{n}$ into a training set,
${D}_{T}$ ,
and a test set,
${D}_{V}$ .
Now, suppose we have a collection of different model spaces
$\left\{{\mathcal{F}}_{\lambda}\right\}$ indexed by
$\lambda \in \Lambda $ (e.g.,
${\mathcal{F}}_{\lambda}$ is the set of polynomials of degree
$d$ , with
$\lambda =d$ ), or
suppose that we have a collection of complexity penalization criteria
${L}_{\lambda}\left(f\right)$ indexed by
$\lambda $ (
e.g., let
${L}_{\lambda}\left(f\right)=\widehat{R}\left(f\right)+\lambda c\left(f\right)$ , with
$\lambda \in {\mathbf{R}}^{+}$ ). We can
obtain candidate solutions using the training set as follows. Define
${\widehat{f}}_{\lambda}=\arg\min_{f\in {\mathcal{F}}_{\lambda}}{\widehat{R}}_{T}\left(f\right)$ (or, in the penalized setting, ${\widehat{f}}_{\lambda}=\arg\min_{f\in \mathcal{F}}{L}_{\lambda}\left(f\right)$), where ${\widehat{R}}_{T}$ denotes the empirical risk computed over the training set ${D}_{T}$.
This provides us with a set of candidate solutions
$\left\{{\widehat{f}}_{\lambda}\right\}$ . Then we can define the hold-out error estimate using
the test set: ${\widehat{R}}_{V}\left({\widehat{f}}_{\lambda}\right)=\frac{1}{|{D}_{V}|}\sum_{\left({X}_{i},{Y}_{i}\right)\in {D}_{V}}\ell\left({\widehat{f}}_{\lambda}\left({X}_{i}\right),{Y}_{i}\right)$, and select the $\lambda$ minimizing this estimate.
This type of procedure has many nice theoretical guarantees, provided
both the training and test set grow with
$n$ .
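The hold-out procedure can be sketched in a few lines. The model family below (a $k$-nearest-neighbor regression rule, with $\lambda = k$) is an illustrative stand-in, not from the lecture; the point is only the split of the data into ${D}_{T}$ and ${D}_{V}$ and the selection of $\lambda$ by validation risk.

```python
import random

def knn_predict(train, x, k):
    """Average the y-values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def holdout_select(data, ks, train_frac=0.7):
    """Split data into D_T and D_V; pick the k minimizing validation risk."""
    random.shuffle(data)
    cut = int(train_frac * len(data))
    d_t, d_v = data[:cut], data[cut:]

    def val_risk(k):
        # Empirical (squared-error) risk of the candidate on the test set.
        return sum((knn_predict(d_t, x, k) - y) ** 2 for x, y in d_v) / len(d_v)

    return min(ks, key=val_risk)

random.seed(1)
data = []
for _ in range(60):
    x = random.uniform(0, 1)
    data.append((x, x * x + random.gauss(0, 0.05)))  # noisy quadratic

best_k = holdout_select(data, ks=[1, 2, 4, 8, 16])
print("selected k:", best_k)
```

Each candidate is fit only on ${D}_{T}$, so the validation risk on ${D}_{V}$ is an unbiased estimate of its true risk, which is what justifies selecting the minimizer.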
Leaving-one-out cross-validation
A very popular hold-out method is the so-called “leaving-one-out
cross-validation,” studied in depth by Grace Wahba (UW-Madison, Statistics). For each
$\lambda $ we compute ${\widehat{R}}_{\lambda}=\frac{1}{n}\sum_{i=1}^{n}\ell\left({\widehat{f}}_{\lambda}^{\left(i\right)}\left({X}_{i}\right),{Y}_{i}\right)$, where ${\widehat{f}}_{\lambda}^{\left(i\right)}$ is computed as before, but using all the training data except the $i$-th sample; we then select the $\lambda$ minimizing ${\widehat{R}}_{\lambda}$.
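A minimal sketch of the leave-one-out loop follows. The estimator family (a sample mean shrunk toward zero by a factor $\lambda$) and the helper names are illustrative assumptions, not from the lecture.

```python
import random

def loocv_risk(data, fit, predict):
    """Leave-one-out risk: train on all but one sample, test on the held-out one."""
    total = 0.0
    for i, (x_i, y_i) in enumerate(data):
        rest = data[:i] + data[i + 1:]          # leave the i-th sample out
        model = fit(rest)
        total += (predict(model, x_i) - y_i) ** 2
    return total / len(data)

def make_fit(lam):
    # Candidate model indexed by lambda: the mean of y, shrunk toward 0.
    return lambda d: lam * sum(y for _, y in d) / len(d)

random.seed(2)
data = [(random.random(), 3.0 + random.gauss(0, 0.1)) for _ in range(20)]
scores = {lam: loocv_risk(data, make_fit(lam), lambda m, x: m)
          for lam in (0.0, 0.5, 1.0)}
best = min(scores, key=scores.get)
print("best lambda:", best)
```

Since the data are centered near 3, shrinking the mean toward zero only hurts, and the leave-one-out risk correctly favors $\lambda = 1$.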
To summarize, this lecture gave a brief and incomplete survey of
different methods for dealing with the issues of overfitting and model selection. Given a set of training data,
${D}_{n}={\{{X}_{i},{Y}_{i}\}}_{i=1}^{n}$ ,
our overall goal is to find a function $\widehat{f}$ with small risk, chosen
from some collection of functions,
$\mathcal{F}$ . Because we do not know the true distribution
${P}_{XY}$ underlying the data points
${D}_{n}$ , it is difficult to get an exact handle on the
risk,
$R\left(f\right)$ . If we only focus on minimizing the empirical risk
$\widehat{R}\left(f\right)$ we end up overfitting to the training data. Two
general approaches were presented.
In the first approach we consider an indexed collection of
spaces
${\left\{{\mathcal{F}}_{\lambda}\right\}}_{\lambda \in \Lambda}$ such that the
complexity of
${\mathcal{F}}_{\lambda}$ increases as
$\lambda $ increases, and we take ${\widehat{f}}_{{\lambda}^{*}}=\arg\min_{f\in {\mathcal{F}}_{{\lambda}^{*}}}\widehat{R}\left(f\right)$, where either ${\lambda}^{*}=\lambda\left(n\right)$, a function growing with the number of data samples $n$,
or
${\lambda}^{*}$ is chosen by hold-out validation.
The alternative approach is to incorporate a penalty term
into the risk minimization problem formulation. Here we consider an indexed collection of penalties
${\left\{{C}_{\lambda}\right\}}_{\lambda \in \Lambda}$ satisfying the following properties:
${C}_{\lambda}:\mathcal{F}\to {\mathbf{R}}^{+}$ ;
For each
$f\in \mathcal{F}$ and
${\lambda}_{1}<{\lambda}_{2}$ we have
${C}_{{\lambda}_{1}}\left(f\right)\le {C}_{{\lambda}_{2}}\left(f\right)$ ;
There exists
${\lambda}_{0}\in \Lambda $ such that
${C}_{{\lambda}_{0}}\left(f\right)=0$ for all
$f\in \mathcal{F}$ .
We then take ${\widehat{f}}_{{\lambda}^{*}}=\arg\min_{f\in \mathcal{F}}\left\{\widehat{R}\left(f\right)+{C}_{{\lambda}^{*}}\left(f\right)\right\}$, where either
${\lambda}^{*}=\lambda \left(n\right)$ , a function growing with the
number of data samples
$n$ , or
${\lambda}^{*}$ is selected by hold-out
validation.
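The sketch below illustrates a penalized selection rule. The penalty ${C}_{\lambda}\left(f\right)=\lambda \cdot c\left(f\right)$, with $c\left(f\right)$ the number of bins of a histogram estimator, is an assumed example family (not from the lecture); note it satisfies the three properties above: it is nonnegative, it is nondecreasing in $\lambda$, and ${\lambda}_{0}=0$ makes it identically zero.

```python
import random

def hist_fit(data, m):
    """Piecewise-constant fit on [0, 1) with m equal-width bins."""
    sums, counts = [0.0] * m, [0] * m
    for x, y in data:
        b = min(int(x * m), m - 1)
        sums[b] += y
        counts[b] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * m), m - 1)]

def penalized_risk(data, m, lam):
    """Empirical risk plus penalty C_lambda(f) = lam * c(f), c(f) = m bins.

    lam = 0 gives a zero penalty (the lambda_0 of property 3); the penalty
    grows with lam (property 2) and is always nonnegative (property 1).
    """
    f = hist_fit(data, m)
    emp = sum((f(x) - y) ** 2 for x, y in data) / len(data)
    return emp + lam * m

random.seed(3)
# Pure-noise data: the empirical risk alone would favor many bins
# (overfitting); the penalty pushes the selection back toward few bins.
data = [(random.random(), random.gauss(0, 1)) for _ in range(100)]
best_m = min(range(1, 21), key=lambda m: penalized_risk(data, m, lam=0.05))
print("selected bins:", best_m)
```

Minimizing $\widehat{R}\left(f\right)+{C}_{{\lambda}^{*}}\left(f\right)$ rather than $\widehat{R}\left(f\right)$ alone is what prevents the selection from always choosing the most complex model.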
Consistency
If an estimator or classifier
${\widehat{f}}_{{\lambda}^{*}}$ satisfies $\lim_{n\to \infty}E\left[R\left({\widehat{f}}_{{\lambda}^{*}}\right)\right]=\inf_{f\in \mathcal{F}}R\left(f\right)$,
then we say that
${\widehat{f}}_{{\lambda}^{*}}$ is
$\mathcal{F}$ -consistent with respect to the risk
$R$ . When the context is clear, we will simply say that
$\widehat{f}$ is consistent.