$$
\begin{aligned}
\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) &= 0, \quad i = 1, \ldots, n \\
\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) &= 0, \quad i = 1, \ldots, l \\
\alpha_i^* \, g_i(w^*) &= 0, \quad i = 1, \ldots, k \\
g_i(w^*) &\le 0, \quad i = 1, \ldots, k \\
\alpha_i^* &\ge 0, \quad i = 1, \ldots, k
\end{aligned}
$$

Moreover, if some $w^*, \alpha^*, \beta^*$ satisfy the KKT conditions, then they are also a solution to the primal and dual problems.

We draw attention to the third equation above, which is called the KKT dual complementarity condition. Specifically, it implies that if $\alpha_i^* > 0$, then $g_i(w^*) = 0$. (I.e., the "$g_i(w) \le 0$" constraint is active, meaning it holds with equality rather than with inequality.) Later on, this will be key for showing that the SVM has only a small number of "support vectors"; the KKT dual complementarity condition will also give us our convergence test when we talk about the SMO algorithm.
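To make these conditions concrete, here is a minimal numerical sketch (an illustration added here, not part of the original notes) that checks all four groups of conditions on a one-dimensional toy problem with two inequality constraints; the toy problem and all variable names are assumptions made for the example.

```python
import numpy as np

# Toy problem: minimize (1/2) w^2
#   s.t. g1(w) = 1 - w <= 0   (i.e. w >= 1, active at the optimum)
#        g2(w) = w - 5 <= 0   (i.e. w <= 5, inactive at the optimum)
# The optimum is w* = 1 with multipliers alpha1* = 1, alpha2* = 0.

w_star = 1.0
alpha_star = np.array([1.0, 0.0])
g_star = np.array([1.0 - w_star, w_star - 5.0])

# Stationarity: d/dw [ (1/2) w^2 + alpha1 (1 - w) + alpha2 (w - 5) ] = 0
stationarity = w_star - alpha_star[0] + alpha_star[1]

print("stationarity      :", np.isclose(stationarity, 0.0))          # True
print("primal feasibility:", np.all(g_star <= 1e-9))                 # True
print("dual feasibility  :", np.all(alpha_star >= 0))                # True
print("complementarity   :", np.allclose(alpha_star * g_star, 0.0))  # True
# Note alpha2* = 0 exactly because g2(w*) < 0 (the constraint is inactive).
```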

Optimal margin classifiers

Previously, we posed the following (primal) optimization problem for finding the optimal margin classifier:

$$
\begin{aligned}
\min_{\gamma, w, b} \quad & \frac{1}{2} \|w\|^2 \\
\text{s.t.} \quad & y^{(i)} (w^T x^{(i)} + b) \ge 1, \quad i = 1, \ldots, m
\end{aligned}
$$

We can write the constraints as

$$g_i(w) = -y^{(i)} (w^T x^{(i)} + b) + 1 \le 0.$$
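For concreteness, the primal problem above is just a quadratic program, so it can be handed to a generic QP solver. The sketch below is illustrative rather than the method developed later in these notes; it assumes the cvxopt package and a tiny made-up dataset, and it optimizes over the stacked variable $z = (w, b)$.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False

# Tiny linearly separable 2-D dataset (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

# Decision variable z = (w, b) in R^(d+1).
# Objective: (1/2)||w||^2  ->  (1/2) z^T P z with P = diag(1, ..., 1, 0).
P = np.zeros((d + 1, d + 1))
P[:d, :d] = np.eye(d)
q = np.zeros(d + 1)

# Constraints y_i (w^T x_i + b) >= 1, rewritten as G z <= h:
#   -y_i [x_i, 1] z <= -1
G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
h = -np.ones(m)

sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol["x"]).ravel()
w, b = z[:d], z[d]
print("w =", w, " b =", b)
print("functional margins:", y * (X @ w + b))  # all >= 1, up to solver tolerance
```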

We have one such constraint for each training example. Note that from the KKT dual complementarity condition, we will have $\alpha_i > 0$ only for the training examples that have functional margin exactly equal to one (i.e., the ones corresponding to constraints that hold with equality, $g_i(w) = 0$). Consider the figure below, in which a maximum margin separating hyperplane is shown by the solid line.

[Figure: the training set with the maximum margin separating hyperplane drawn as a solid line and the margin boundaries drawn as dashed lines parallel to it.]

The points with the smallest margins are exactly the ones closest to the decision boundary; here, these are the three points (one negative and two positive examples) that lie on the dashed lines parallel to the decision boundary. Thus, only three of the $\alpha_i$'s (namely, the ones corresponding to these three training examples) will be non-zero at the optimal solution to our optimization problem. These three points are called the support vectors in this problem. The fact that the number of support vectors can be much smaller than the size of the training set will be useful later.
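As an illustration (not part of the original notes), the following sketch fits a linear SVM with scikit-learn on a toy separable dataset; a very large C approximates the hard-margin problem posed above, and the fitted model reports exactly which training points ended up as support vectors, i.e., the ones with nonzero $\alpha_i$.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs (illustrative data).
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

# A very large C makes the soft-margin SVC behave like the hard-margin classifier
# on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("training set size:", len(X))
print("support vectors  :", clf.support_.tolist())   # indices with alpha_i > 0
print("alpha_i * y_i    :", clf.dual_coef_.ravel())  # the nonzero dual coefficients
# Typically only a handful of points (those lying on the margin) appear here.
```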

Let's move on. Looking ahead, as we develop the dual form of the problem, one key idea to watch out for is that we'll try to write our algorithm in terms of only the inner product $\langle x^{(i)}, x^{(j)} \rangle$ (think of this as $(x^{(i)})^T x^{(j)}$) between points in the input feature space. The fact that we can express our algorithm in terms of these inner products will be key when we apply the kernel trick.
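Since everything will be phrased in terms of these inner products, it can help to note that all $m^2$ of them fit into a single Gram matrix. A small numpy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))  # 5 training examples, 3 features (illustrative)

# Gram matrix of inner products: K[i, j] = <x^(i), x^(j)> = (x^(i))^T x^(j).
K = X @ X.T

print(K.shape)                                        # (5, 5)
print(np.allclose(K, K.T))                            # True: the Gram matrix is symmetric
print(np.allclose(np.diag(K), (X ** 2).sum(axis=1)))  # diagonal entries are squared norms
# The kernel trick will later replace K[i, j] with k(x^(i), x^(j)) for a kernel k.
```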

When we construct the Lagrangian for our optimization problem we have:

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} (w^T x^{(i)} + b) - 1 \right].$$

Note that there are only "$\alpha_i$" but no "$\beta_i$" Lagrange multipliers, since the problem has only inequality constraints.
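To keep the notation concrete, here is a short sketch (illustrative, with made-up values rather than anything from the notes) that evaluates this Lagrangian directly for given $w$, $b$, and $\alpha$.

```python
import numpy as np

def lagrangian(w, b, alpha, X, y):
    """L(w, b, alpha) = (1/2)||w||^2 - sum_i alpha_i [y_i (w^T x_i + b) - 1]."""
    margins = y * (X @ w + b)              # functional margins y_i (w^T x_i + b)
    return 0.5 * w @ w - alpha @ (margins - 1.0)

# Toy data and arbitrary (not optimal) values, just to evaluate L once.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([0.5, 0.5])
b = -0.1
alpha = np.array([0.3, 0.0, 0.7])          # must be >= 0 (inequality multipliers)

print(lagrangian(w, b, alpha, X, y))
```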

Let's find the dual form of the problem. To do so, we need to first minimize $L(w, b, \alpha)$ with respect to $w$ and $b$ (for fixed $\alpha$), to get $\theta_{\mathcal{D}}$, which we'll do by setting the derivatives of $L$ with respect to $w$ and $b$ to zero. We have:

$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0$$

This implies that

$$w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}.$$

As for the derivative with respect to b , we obtain

$$\frac{\partial}{\partial b} L(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0.$$
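Both stationarity conditions can be sanity-checked numerically: for any fixed $\alpha$, the gradient of $L$ with respect to $w$ should equal $w - \sum_i \alpha_i y^{(i)} x^{(i)}$, and the derivative with respect to $b$ should equal $-\sum_i \alpha_i y^{(i)}$. A finite-difference sketch with made-up data (an illustration, not part of the original notes):

```python
import numpy as np

def lagrangian(w, b, alpha, X, y):
    # L(w, b, alpha) = (1/2)||w||^2 - sum_i alpha_i [y_i (w^T x_i + b) - 1]
    return 0.5 * w @ w - alpha @ (y * (X @ w + b) - 1.0)

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
y = rng.choice([-1.0, 1.0], size=6)
alpha = rng.uniform(0.0, 1.0, size=6)
w = rng.normal(size=3)
b = 0.3
eps = 1e-6

# Analytic derivatives from the derivation above.
grad_w = w - (alpha * y) @ X          # nabla_w L = w - sum_i alpha_i y_i x_i
grad_b = -np.sum(alpha * y)           # dL/db = -sum_i alpha_i y_i

# Finite-difference approximations.
fd_w = np.array([
    (lagrangian(w + eps * e, b, alpha, X, y)
     - lagrangian(w - eps * e, b, alpha, X, y)) / (2 * eps)
    for e in np.eye(3)
])
fd_b = (lagrangian(w, b + eps, alpha, X, y)
        - lagrangian(w, b - eps, alpha, X, y)) / (2 * eps)

print(np.allclose(grad_w, fd_w, atol=1e-5))  # True
print(np.isclose(grad_b, fd_b, atol=1e-5))   # True
```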

If we take the definition of $w$ in the equation above and plug that back into the Lagrangian, and simplify, we get

$$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)} - b \sum_{i=1}^{m} \alpha_i y^{(i)}.$$
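This simplification can also be verified numerically: for any $\alpha \ge 0$ and any $b$, plugging $w = \sum_i \alpha_i y^{(i)} x^{(i)}$ into the Lagrangian reproduces the expression above. A small sketch with made-up data (an illustration, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 3))            # 6 toy examples, 3 features (illustrative)
y = rng.choice([-1.0, 1.0], size=6)
alpha = rng.uniform(0.0, 1.0, size=6)  # any alpha >= 0
b = 1.7                                # any b

# Left-hand side: the Lagrangian with w = sum_i alpha_i y_i x_i plugged in.
w = (alpha * y) @ X
lhs = 0.5 * w @ w - alpha @ (y * (X @ w + b) - 1.0)

# Right-hand side: sum_i alpha_i - (1/2) sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>
#                  - b sum_i alpha_i y_i
K = X @ X.T
rhs = alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y) - b * np.sum(alpha * y)

print(np.isclose(lhs, rhs))            # True for any alpha and b
```

Setting $\sum_i \alpha_i y^{(i)} = 0$, as required by the derivative with respect to $b$, makes the last term vanish; that is how the dual objective will be obtained.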

Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4