
Linear classifiers

Suppose $F = \{\text{linear classifiers in } \mathbb{R}^d\}$. Then $V_F = d+1$, and for the empirical risk minimizer

$$\hat{f}_n = \arg\min_{f \in F} \hat{R}_n(f)$$

the VC bound gives

$$\mathbb{E}[R(\hat{f}_n)] - \inf_{f \in F} R(f) \le 4\sqrt{\frac{(d+1)\log(n+1) + \log 2}{n}}.$$

Generalized linear classifiers

Normally, we have a feature vector $X \in \mathbb{R}^d$. A hyperplane in $\mathbb{R}^d$ provides a linear classifier in $\mathbb{R}^d$. Nonlinear classifiers can be obtained by a straightforward generalization.

Let $\varphi_1, \ldots, \varphi_{d'}$, $d' \ge d$, be a collection of functions mapping $\mathbb{R}^d \to \mathbb{R}$. These functions, applied to a feature $X \in \mathbb{R}^d$, produce a generalized set of features, $\varphi = (\varphi_1(X), \varphi_2(X), \ldots, \varphi_{d'}(X))'$. For example, if $X = (x_1, x_2)'$, then we could take $d' = 5$ and $\varphi = (x_1, x_2, x_1 x_2, x_1^2, x_2^2)' \in \mathbb{R}^5$. We can then construct a linear classifier in the higher-dimensional generalized feature space $\mathbb{R}^{d'}$.
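As a quick illustration, here is a minimal Python sketch of this lifting idea for $d = 2$, $d' = 5$. The helper names (phi, fit_linear_classifier) and the least-squares fit on $\pm 1$ labels are illustrative choices, not prescribed by the notes; any routine that fits a hyperplane in the lifted space would do.

import numpy as np

def phi(X):
    # Map raw features (x1, x2) to the generalized features
    # (x1, x2, x1*x2, x1^2, x2^2); d = 2, d' = 5.
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1**2, x2**2])

def fit_linear_classifier(Z, y):
    # Least-squares fit of a hyperplane in the lifted space R^{d'};
    # append a constant column so the hyperplane need not pass through 0.
    Za = np.column_stack([Z, np.ones(len(Z))])
    w, *_ = np.linalg.lstsq(Za, y, rcond=None)
    return w

def predict(w, Z):
    Za = np.column_stack([Z, np.ones(len(Z))])
    return np.sign(Za @ w)

# Toy data: labels depend on x1^2 + x2^2, so they are not linearly
# separable in R^2 but are linearly separable in the lifted space.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.0, 1.0, -1.0)

w = fit_linear_classifier(phi(X), y)
train_err = np.mean(predict(w, phi(X)) != y)
print("training error:", train_err)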

The VC bounds immediately extend to this case, and for $F' = \{\text{generalized linear classifiers based on maps } \varphi : \mathbb{R}^d \to \mathbb{R}^{d'}\}$ we have

$$\mathbb{E}[R(\hat{f}_n)] - \inf_{f \in F'} R(f) \le 4\sqrt{\frac{(d'+1)\log(n+1) + \log 2}{n}}.$$

Half-space classifiers

Theorem (Steele '75, Dudley '78)

Let $G$ be a finite-dimensional vector space of real-valued functions on $\mathbb{R}^d$. The class of sets $A = \{\{x : g(x) \ge 0\} : g \in G\}$ has VC dimension at most $\dim(G)$.

Proof. It suffices to show that no set of $n = \dim(G) + 1$ points can be shattered by $A$. Take any $n$ points $x_1, \ldots, x_n$ and, for each $g \in G$, define the vector $V_g = (g(x_1), \ldots, g(x_n))$.

The set $\{V_g : g \in G\}$ is a linear subspace of $\mathbb{R}^n$ of dimension at most $\dim(G) = n - 1$. Therefore, there exists a non-zero vector $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$ such that $\sum_{i=1}^n \alpha_i g(x_i) = 0$ for every $g \in G$. We can assume that at least one of the $\alpha_i$'s is negative (if not, replace $\alpha$ by $-\alpha$). We can then rearrange this expression as

$$\sum_{i : \alpha_i \ge 0} \alpha_i g(x_i) = \sum_{i : \alpha_i < 0} (-\alpha_i)\, g(x_i).$$

Now suppose that there exists a $g \in G$ such that the set $\{x : g(x) \ge 0\}$ selects precisely the $x_i$'s on the left-hand side above. Then every term on the left is non-negative, while every term on the right is strictly negative (and there is at least one such term, since at least one $\alpha_i$ is negative). This is a contradiction. Therefore, $x_1, \ldots, x_n$ cannot be shattered by the sets $\{x : g(x) \ge 0\}$, $g \in G$. $\Box$

Consider half-spaces in $\mathbb{R}^d$ of the form $A = \{x \in \mathbb{R}^d : x_i \ge b\}$, for some coordinate $i \in \{1, \ldots, d\}$ and threshold $b \in \mathbb{R}$. Each such half-space can be described by a function of the form

$$g(x) = (0, \ldots, 0, 1, 0, \ldots, 0)\,(x_1, \ldots, x_d)' - b = x_i - b,$$

which lies in the space $G$ spanned by the coordinate functions and the constant function. Since $\dim(G) = d + 1$, the theorem gives $V_A \le d + 1$.
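To see the shattering bound in action, the following Python sketch brute-forces whether a point set in $\mathbb{R}^2$ can be shattered by general half-spaces $\{x : w'x + b \ge 0\}$ (here $G = \mathrm{span}\{1, x_1, x_2\}$, so the theorem gives VC dimension at most 3). This is an illustration only: scipy's linear-programming routine is used as a strict-separability test, which is enough for the generic point configurations below.

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    # Feasibility LP: find (w, b) with labels[i] * (w @ x_i + b) >= 1 for all i.
    # Strict linear separation, which realizes the labeling with the closed
    # half-space {x : w @ x + b >= 0}.
    n, d = points.shape
    A_ub = -labels[:, None] * np.hstack([points, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0  # status 0 means a feasible (w, b) was found

def shattered(points):
    # A point set is shattered if every +/-1 labeling is achievable.
    n = len(points)
    return all(separable(points, np.array(lab))
               for lab in itertools.product([-1.0, 1.0], repeat=n))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print("3 points shattered by half-spaces in R^2:", shattered(three))  # True
print("4 points shattered by half-spaces in R^2:", shattered(four))   # False (XOR labeling fails)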

Tree classifiers

Let

$$T_k = \{\text{recursive rectangular partitions of } \mathbb{R}^d \text{ with } k+1 \text{ cells}\}.$$

Let $T \in T_k$. Each cell of $T$ results from splitting a rectangular region into two smaller rectangles by a split parallel to one of the coordinate axes.

[Figure: an example partition $T \in T_3$ in dimension $d = 2$.]

Each additional split is analogous to a half-space set. Therefore, each additional split can potentially shatter $d+1$ points. This implies that

$$V_{T_k} \le (d+1)\,k.$$

Example ($d = 1$): $k = 1$ split shatters two points, while $k = 2$ splits shatter only three points $< 4 = (d+1)k$, so the bound need not be tight.
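The $d = 1$ case can also be checked by brute force: a tree with $k$ splits partitions the line into $k+1$ intervals with one label per interval, so a labeling of points sorted along the line is achievable exactly when it changes sign at most $k$ times. The short sketch below is an illustration of this counting argument, not code from the notes.

import itertools

def achievable(labels, k):
    # In d = 1, a tree with k splits assigns constant labels on k+1 intervals,
    # so a labeling of points sorted along the line is achievable iff it has
    # at most k sign changes.
    changes = sum(a != b for a, b in zip(labels, labels[1:]))
    return changes <= k

def shatters(n, k):
    # Can k splits realize every +/-1 labeling of n distinct points on the line?
    return all(achievable(lab, k)
               for lab in itertools.product([-1, 1], repeat=n))

print(shatters(2, 1))  # True:  one split shatters two points
print(shatters(3, 2))  # True:  two splits shatter three points
print(shatters(4, 2))  # False: two splits cannot shatter four points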

VC bound for tree classifiers

$$F_k = \{\text{tree classifiers with } k+1 \text{ leaves on } \mathbb{R}^d\}$$

$$\mathbb{E}[R(\hat{f}_n)] - \inf_{f \in F_k} R(f) \le 4\sqrt{\frac{(d+1)\,k \log n + \log 2}{n}}.$$

How can we decide what dimension to choose for a generalized linear classifier?

How many leaves should be used for a classification tree?

Complexity Regularization using VC bounds!

Structural risk minimization (SRM)

SRM is simply complexity regularization using VC type bounds in place of Chernoff's bound or other concentration inequalities.

The basic idea is to consider a sequence of sets of classifiers $F_1, F_2, \ldots$, of increasing VC dimensions $V_{F_1} \le V_{F_2} \le \cdots$. Then for each $k = 1, 2, \ldots$ we find the minimum empirical risk classifier

$$\hat{f}_n^{(k)} = \arg\min_{f \in F_k} \hat{R}_n(f)$$

and then select the final classifier according to

$$\hat{k} = \arg\min_{k \ge 1} \left\{ \hat{R}_n(\hat{f}_n^{(k)}) + \sqrt{\frac{32\, V_{F_k} (\log n + 1)}{n}} \right\}$$

and $\hat{f}_n \equiv \hat{f}_n^{(\hat{k})}$ is the final choice.
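In code, SRM amounts to "empirical risk plus VC penalty, minimized over $k$". The Python sketch below implements the selection rule above; the empirical risks and VC dimensions are made-up numbers standing in for the output of whatever empirical risk minimization routine is available for each $F_k$.

import numpy as np

def srm_select(empirical_risks, vc_dims, n):
    # empirical_risks[k] = R_hat_n(f_hat_n^(k)), vc_dims[k] = V_{F_k}.
    # Returns the index k_hat minimizing the penalized empirical risk
    #   R_hat_n(f_hat_n^(k)) + sqrt(32 * V_{F_k} * (log n + 1) / n).
    empirical_risks = np.asarray(empirical_risks, dtype=float)
    vc_dims = np.asarray(vc_dims, dtype=float)
    penalties = np.sqrt(32.0 * vc_dims * (np.log(n) + 1.0) / n)
    return int(np.argmin(empirical_risks + penalties))

# Illustrative numbers only: empirical risk decreases with k while the
# penalty grows, and SRM balances the two.
n = 10**6
emp_risks = [0.30, 0.22, 0.20, 0.19, 0.19]
vc_dims   = [3, 6, 9, 12, 15]   # e.g. (d+1)(k-1) for k-leaf trees with d = 2
k_hat = srm_select(emp_risks, vc_dims, n)
print("selected model index:", k_hat)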

The basic rationale is that we know

$$\mathbb{E}[R(\hat{f}_n^{(k)})] - \inf_{f \in F_k} R(f) \le C' \sqrt{\frac{V_{F_k} \log n}{n}}$$

where C ' is a constant.

The end result is that

$$\mathbb{E}[R(\hat{f}_n)] \le \min_{k \ge 1} \left\{ \min_{f \in F_k} R(f) + \sqrt{\frac{16\, V_{F_k} \log n + 4}{2n}} \right\}$$

analogous to our previous complexity regularization results, except that codelengths are replaced by VC dimensions.

In order to prove the result we use the VC probability concentration bound and assume that $\sum_{k \ge 1} e^{-V_{F_k}} < \infty$. This enables a union bounding argument and leads to a risk bound of the form given above.
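As a rough reconstruction of that union bounding argument (using the standard VC deviation inequality $P(\sup_{f \in F_k} |\hat{R}_n(f) - R(f)| > \epsilon) \le 8\, S(F_k, n)\, e^{-n\epsilon^2/32}$; the precise constants in the notes may differ):

$$P\Big(\exists\, k \ge 1 : \sup_{f \in F_k} |\hat{R}_n(f) - R(f)| > \epsilon_k(t)\Big) \le \sum_{k \ge 1} 8\, S(F_k, n)\, e^{-n \epsilon_k(t)^2 / 32} = 8\, e^{-t} \sum_{k \ge 1} e^{-V_{F_k}},$$

where $\epsilon_k(t)^2 = \frac{32}{n}\big(\log S(F_k, n) + V_{F_k} + t\big)$. By Sauer's lemma $\log S(F_k, n) \le V_{F_k} \log(n+1)$, so this choice of $\epsilon_k(t)$ is of the same order as the penalty in the selection rule, and the assumed summability of $e^{-V_{F_k}}$ keeps the right-hand side finite uniformly in $t$.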

Key point of VC theory

Complexity of classes depends on richness (shattering capability) relative to a set of n arbitrary points. This allows us to effectively "quantize" collections of functions in a slightly data-dependent manner.

Application to trees

Let

$$F_k = \{k\text{-leaf decision trees in } \mathbb{R}^d\}, \qquad V_{F_k} \le (d+1)(k-1)$$

$$\hat{f}_n^{(k)} = \arg\min_{f \in F_k} \hat{R}_n(f)$$

$$\hat{k} = \arg\min_{k \ge 1} \left\{ \hat{R}_n(\hat{f}_n^{(k)}) + \sqrt{\frac{32\,(d+1)(k-1)(\log n + 1)}{n}} \right\}$$

Then

$$\hat{f}_n = \hat{f}_n^{(\hat{k})}$$

satisfies

$$\mathbb{E}[R(\hat{f}_n)] \le \min_{k \ge 1} \left\{ \min_{f \in F_k} R(f) + \sqrt{\frac{16\,(d+1)(k-1)\log n + 4}{2n}} \right\}$$

compare with

$$\mathbb{E}[R(\hat{f}_n)] \le \min_{k \ge 1} \left\{ \min_{f \in \text{dyadic } k\text{-leaf trees}} R(f) + \sqrt{\frac{(3k-1)\log 2 + \tfrac{1}{2}\log n}{2n}} \right\}$$

from Lecture 11 .
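For a feel of how the two penalty terms compare numerically (keeping in mind that the classes differ: all recursive rectangular $k$-leaf trees versus dyadic $k$-leaf trees), the snippet below evaluates both as functions of $k$; the values $d = 2$ and $n = 10^6$ are arbitrary illustrative choices.

import numpy as np

d, n = 2, 10**6
k = np.arange(2, 11)

# VC-based penalty for k-leaf trees in R^d.
vc_pen = np.sqrt((16 * (d + 1) * (k - 1) * np.log(n) + 4) / (2 * n))

# Codelength-based penalty for dyadic k-leaf trees (Lecture 11).
code_pen = np.sqrt(((3 * k - 1) * np.log(2) + 0.5 * np.log(n)) / (2 * n))

for kk, v, c in zip(k, vc_pen, code_pen):
    print(f"k = {kk:2d}   VC penalty = {v:.4f}   codelength penalty = {c:.4f}")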





Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3
