
We define our estimator as

$$\hat{f}_n^H = \hat{f}_n^{(\hat{k})},$$

where

$$\hat{f}_n^{(k)} = \arg\min_{f \in \mathcal{F}_k^H} \hat{R}_n(f),$$

and

$$\hat{k} = \arg\min_{k \geq 1} \left\{ \hat{R}_n\big(\hat{f}_n^{(k)}\big) + \sqrt{\frac{(k + k^2)\log 2 + \frac{1}{2}\log n}{2n}} \right\}.$$
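To make the selection rule concrete, here is a minimal Python sketch of this complexity-penalized selection (an illustration; the function names and the brute-force search over $k$ are ours, not part of the original notes):

```python
import numpy as np

def fit_histogram(X, y, k):
    # ERM over F_k^H: majority vote of the labels falling in each of the
    # k x k bins (ties broken toward label 1). X is an (n, 2) array with
    # entries in [0, 1]; y is an (n,) array of {0, 1} labels.
    counts = np.zeros((k, k, 2))
    idx = np.minimum((X * k).astype(int), k - 1)   # bin index of each sample
    for (i, j), label in zip(idx, y):
        counts[i, j, label] += 1
    return counts[..., 1] >= counts[..., 0]        # predicted label per bin

def empirical_risk(f, X, y):
    # R_n(f): fraction of training points misclassified by the rule f.
    k = f.shape[0]
    idx = np.minimum((X * k).astype(int), k - 1)
    return np.mean(f[idx[:, 0], idx[:, 1]].astype(int) != y)

def penalty(bits, n):
    # sqrt((c(f) log 2 + (1/2) log n) / (2 n)) for a classifier whose
    # prefix code uses `bits` bits.
    return np.sqrt((bits * np.log(2) + 0.5 * np.log(n)) / (2 * n))

def select_k(X, y, k_max):
    # k_hat: minimize the penalized empirical risk, with c(f) = k + k^2 bits.
    n = len(y)
    scores = [empirical_risk(fit_histogram(X, y, k), X, y) + penalty(k + k * k, n)
              for k in range(1, k_max + 1)]
    return int(np.argmin(scores)) + 1
```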

Therefore $\hat{f}_n^H$ minimizes

$$\hat{R}_n(f) + \sqrt{\frac{c^H(f)\log 2 + \frac{1}{2}\log n}{2n}},$$

over all $f \in \mathcal{F}^H$. We showed before that

$$E\big[R(\hat{f}_n^H)\big] - R^* \leq \min_{f \in \mathcal{F}^H}\left\{ R(f) - R^* + \sqrt{\frac{c^H(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}.$$

To proceed with our analysis we need to make some assumptions about the intrinsic difficulty of the problem. We will assume that the Bayes decision boundary is a “well-behaved” 1-dimensional set, in the sense that it has box-counting dimension one (see Appendix "Box Counting Dimension"). This implies that, for a histogram with $k^2$ bins, the Bayes decision boundary intersects fewer than $Ck$ bins, where $C$ is a constant that does not depend on $k$. Furthermore, we assume that the marginal distribution of $X$ satisfies $P_X(A) \leq K|A|$ for any measurable subset $A \subseteq [0,1]^2$, where $|A|$ denotes the volume of $A$. This means that the samples collected do not accumulate anywhere in the unit square.
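As a quick numerical sanity check (our illustration, using a hypothetical quarter-circle boundary, not part of the original notes), one can count how many bins of a $k \times k$ partition a smooth one-dimensional curve passes through and watch the count grow linearly in $k$:

```python
import numpy as np

def bins_hit_by_circle(k, r=0.7, samples=100_000):
    # Walk densely along the quarter circle x^2 + y^2 = r^2 inside the
    # unit square and record every k x k bin it passes through.
    t = np.linspace(0, np.pi / 2, samples)
    x, y = r * np.cos(t), r * np.sin(t)
    i = np.minimum((x * k).astype(int), k - 1)
    j = np.minimum((y * k).astype(int), k - 1)
    return len(set(zip(i, j)))

for k in (10, 20, 40, 80):
    hits = bins_hit_by_circle(k)
    print(k, hits, hits / k)
# The ratio hits / k settles near a constant C, i.e. the boundary
# intersects on the order of C k of the k^2 bins.
```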

Under the above assumptions we can conclude the following: the bins that do not intersect the boundary contribute nothing to the excess risk, and each of the at most $Ck$ bins that do intersect it has probability at most $K/k^2$ (since each bin has volume $1/k^2$). Hence

$$\min_{f \in \mathcal{F}_k^H} R(f) - R^* \leq \frac{K}{k^2} \cdot Ck = \frac{CK}{k}.$$

Therefore

$$E\big[R(\hat{f}_n^H)\big] - R^* \leq \frac{CK}{k} + \sqrt{\frac{(k + k^2)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}.$$

We can balance the terms on the right-hand side of the above expression using $k = n^{1/4}$ (for $n$ large); therefore

$$E\big[R(\hat{f}_n^H)\big] - R^* = O\big(n^{-1/4}\big), \quad \text{as } n \to \infty.$$
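To see where this choice of $k$ comes from: ignoring constants and logarithmic factors, the approximation term scales as $1/k$ and the penalty term as $\sqrt{k^2/(2n)} \sim k/\sqrt{n}$. Equating the two,

$$\frac{1}{k} \asymp \frac{k}{\sqrt{n}} \quad\Longleftrightarrow\quad k \asymp n^{1/4},$$

and with this choice each term is of order $n^{-1/4}$.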

Dyadic decision trees

Now let's consider dyadic decision trees, under the assumptions above, and contrast them with the histogram classifier. Let

$$\mathcal{F}_k^T = \{\text{tree classifiers with } k \text{ leaves}\}.$$

Let $\mathcal{F}^T = \bigcup_{k \geq 1} \mathcal{F}_k^T$. We can prefix-encode each element $f$ of $\mathcal{F}^T$ with $c^T(f) = 3k - 1$ bits, as described before.
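The count $c^T(f) = 3k - 1$ can be seen directly: a binary tree with $k$ leaves has $2k - 1$ nodes, so a preorder traversal emitting one bit per node (internal vs. leaf) plus one label bit per leaf costs $(2k - 1) + k = 3k - 1$ bits. Here is a small Python sketch of such an encoder (our illustration; it assumes, as in the earlier construction, that the split made at each internal node is determined by its depth, so the splits themselves need no extra bits):

```python
class Node:
    # A dyadic tree classifier: internal nodes split their cell in half
    # (the split direction is fixed by depth), leaves carry a {0, 1} label.
    def __init__(self, left=None, right=None, label=None):
        self.left, self.right, self.label = left, right, label

def encode(node):
    # Preorder prefix code: '0' marks an internal node, '1' + label bit
    # marks a leaf. A tree with k leaves encodes to exactly 3k - 1 bits.
    if node.label is not None:
        return "1" + str(node.label)
    return "0" + encode(node.left) + encode(node.right)

# Example: a tree with k = 3 leaves encodes to 3 * 3 - 1 = 8 bits.
tree = Node(left=Node(label=0),
            right=Node(left=Node(label=1), right=Node(label=0)))
assert len(encode(tree)) == 8
```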

Let

$$\hat{f}_n^T = \hat{f}_n^{(\hat{k})},$$

where

$$\hat{f}_n^{(k)} = \arg\min_{f \in \mathcal{F}_k^T} \hat{R}_n(f),$$

and

$$\hat{k} = \arg\min_{k \geq 1} \left\{ \hat{R}_n\big(\hat{f}_n^{(k)}\big) + \sqrt{\frac{(3k - 1)\log 2 + \frac{1}{2}\log n}{2n}} \right\}.$$

Hence $\hat{f}_n^T$ minimizes

$$\hat{R}_n(f) + \sqrt{\frac{c^T(f)\log 2 + \frac{1}{2}\log n}{2n}},$$

over all $f \in \mathcal{F}^T$. Moreover

$$E\big[R(\hat{f}_n^T)\big] - R^* \leq \min_{f \in \mathcal{F}^T}\left\{ R(f) - R^* + \sqrt{\frac{c^T(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}.$$

If the Bayes decision boundary is a 1-dimensional set, as in "Histogram Risk Bound", there exists a tree with at most $8Ck$ leaves such that the boundary is contained in at most $Ck$ squares, each of volume $1/k^2$. To see this, start with a tree yielding the histogram partition with $k^2$ boxes (i.e., the tree partitioning the unit square into $k^2$ equal-sized squares). Now prune all the nodes that do not intersect the boundary. The figure below illustrates the procedure. If you carefully bound the number of leaves needed at each level, you can show that in total fewer than $8Ck$ leaves suffice. We conclude that there exists a tree with at most $8Ck$ leaves that has the same risk as a histogram with $O(k^2)$ bins. Therefore, using the oracle bound above, we have
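Here is a small Python sketch of this pruning argument (our illustration, again using a hypothetical quarter-circle boundary): build the recursive dyadic partition down to the $k \times k$ histogram resolution, but stop refining any cell the boundary does not cross, and count the resulting leaves:

```python
def boundary_crosses(x0, x1, y0, y1, r=0.7):
    # The quarter circle x^2 + y^2 = r^2 crosses the cell [x0,x1] x [y0,y1]
    # iff r^2 lies between the squared distances of the nearest and farthest
    # corners (distance to the origin is monotone in both coordinates here).
    d2 = [x * x + y * y for x in (x0, x1) for y in (y0, y1)]
    return min(d2) <= r * r <= max(d2)

def pruned_leaves(x0, x1, y0, y1, depth):
    # Split cells alternately in x and y, as in a recursive dyadic
    # partition; any cell the boundary does not cross becomes a leaf.
    if depth == 0 or not boundary_crosses(x0, x1, y0, y1):
        return 1
    if depth % 2 == 0:                       # split in x
        xm = (x0 + x1) / 2
        return (pruned_leaves(x0, xm, y0, y1, depth - 1)
                + pruned_leaves(xm, x1, y0, y1, depth - 1))
    ym = (y0 + y1) / 2                       # split in y
    return (pruned_leaves(x0, x1, y0, ym, depth - 1)
            + pruned_leaves(x0, x1, ym, y1, depth - 1))

for j in (2, 3, 4, 5, 6):                    # k = 2^j; full depth 2j => k^2 cells
    k = 2 ** j
    print(k, k * k, pruned_leaves(0.0, 1.0, 0.0, 1.0, 2 * j))
# The pruned leaf count grows roughly linearly in k, far below the k^2
# bins of the corresponding histogram.
```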

$$E\big[R(\hat{f}_n^T)\big] - R^* \leq \frac{CK}{k} + \sqrt{\frac{(3(8Ck) - 1)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}.$$

We can balance the terms on the right-hand side of the above expression using $k = n^{1/3}$ (for $n$ large); therefore

$$E\big[R(\hat{f}_n^T)\big] - R^* = O\big(n^{-1/3}\big), \quad \text{as } n \to \infty.$$
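The improvement comes from the penalty: ignoring constants and logarithmic factors, the tree penalty scales as $\sqrt{k/n}$ rather than $k/\sqrt{n}$, so balancing against the $1/k$ approximation term gives

$$\frac{1}{k} \asymp \sqrt{\frac{k}{n}} \quad\Longleftrightarrow\quad k^3 \asymp n \quad\Longleftrightarrow\quad k \asymp n^{1/3},$$

and each term is then of order $n^{-1/3}$, beating the histogram rate $n^{-1/4}$.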
Figure: Illustration of the tree pruning procedure. (a) Histogram classification rule for a partition with 16 bins, and corresponding binary tree representation (with 16 leaves). (b) Pruned version of the histogram tree, yielding exactly the same classification rule, but now requiring only 6 leaves. (Note: the trees were constructed using the pruning procedure described above.)

Final comments

Trees generally work much better than histogram classifiers. This is essentially because they provide much more efficient ways of approximating the Bayes decision boundary (as we saw in our example, under reasonable assumptions on the Bayes boundary, a tree encoded with $O(k)$ bits can describe the same classifier as a histogram that requires $O(k^2)$ bits).

Source: OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009. Download for free at http://cnx.org/content/col10532/1.3