For example, a feature mapping Φ might take a scalar input attribute x to its first few powers:

$$\Phi(x) = \begin{bmatrix} x \\ x^2 \\ x^3 \end{bmatrix}.$$

Rather than applying SVMs using the original input attributes x , we may instead want to learn using some features Φ ( x ) . To do so, we simply need to go over our previous algorithm, and replace x everywhere in it with Φ ( x ) .

Since the algorithm can be written entirely in terms of the inner products ⟨x, z⟩, this means that we would replace all those inner products with ⟨Φ(x), Φ(z)⟩. Specifically, given a feature mapping Φ, we define the corresponding kernel to be

$$K(x, z) = \Phi(x)^T \Phi(z).$$

Then, everywhere we previously had ⟨x, z⟩ in our algorithm, we could simply replace it with K(x, z), and our algorithm would now be learning using the features Φ.

Now, given Φ, we could easily compute K(x, z) by finding Φ(x) and Φ(z) and taking their inner product. But what's more interesting is that often, K(x, z) may be very inexpensive to calculate, even though Φ(x) itself may be very expensive to calculate (perhaps because it is an extremely high dimensional vector). In such settings, by using an efficient way to calculate K(x, z) in our algorithm, we can get SVMs to learn in the high-dimensional feature space given by Φ, but without ever having to explicitly find or represent vectors Φ(x).
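To make the replacement concrete, here is a minimal sketch (our own illustration, not code from these notes), assuming the algorithm's predictions can be written in the dual form Σᵢ αᵢ y⁽ⁱ⁾ ⟨x⁽ⁱ⁾, x⟩ + b; the helper names decision_value, linear_kernel, and quadratic_kernel are ours. Swapping the plain inner product for a kernel call is the only change needed:

```python
import numpy as np

def decision_value(alphas, ys, X_train, b, x, kernel):
    # Dual-form decision value: sum_i alpha_i * y_i * K(x_i, x) + b.
    # Replacing the plain inner product <x_i, x> with kernel(x_i, x) is the
    # only change required to learn with the features Phi implicitly.
    return sum(a * y * kernel(x_i, x) for a, y, x_i in zip(alphas, ys, X_train)) + b

# With the linear kernel we recover the original algorithm; with the
# quadratic kernel below we are implicitly learning in the Phi feature space.
def linear_kernel(x, z):
    return float(np.dot(x, z))

def quadratic_kernel(x, z):
    return float(np.dot(x, z)) ** 2
```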

Let's see an example. Suppose x, z ∈ ℝⁿ, and consider

$$K(x, z) = (x^T z)^2.$$

We can also write this as

$$K(x, z) = \left( \sum_{i=1}^{n} x_i z_i \right) \left( \sum_{j=1}^{n} x_j z_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j z_i z_j = \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j).$$

Thus, we see that K(x, z) = Φ(x)ᵀΦ(z), where the feature mapping Φ is given (shown here for the case of n = 3) by

$$\Phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \end{bmatrix}.$$

Note that whereas calculating the high-dimensional Φ(x) requires O(n²) time, finding K(x, z) takes only O(n) time, linear in the dimension of the input attributes.
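As a quick numerical check of this claim (a sketch we are adding, not part of the notes), the code below forms the n²-dimensional Φ(x) explicitly and confirms that Φ(x)ᵀΦ(z) agrees with the O(n) computation (xᵀz)²:

```python
import numpy as np

def phi(x):
    # Explicit feature map: all n^2 products x_i * x_j (O(n^2) to form).
    return np.outer(x, x).ravel()

def k_quadratic(x, z):
    # Kernel form: (x^T z)^2, computed in O(n) time.
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

assert np.isclose(phi(x) @ phi(z), k_quadratic(x, z))  # both give the same value
```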

For a related kernel, also consider

$$K(x, z) = (x^T z + c)^2 = \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j) + \sum_{i=1}^{n} \left( \sqrt{2c}\, x_i \right) \left( \sqrt{2c}\, z_i \right) + c^2.$$

(Check this yourself.) This corresponds to the feature mapping (again shown for n = 3 )

$$\Phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ \sqrt{2c}\, x_3 \\ c \end{bmatrix},$$

and the parameter c controls the relative weighting between the xᵢ (first order) and the xᵢxⱼ (second order) terms.
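Here is a similar check (again our own sketch) that the 13-entry feature vector above, with its √(2c)·xᵢ entries and constant entry c, reproduces (xᵀz + c)²:

```python
import numpy as np

def phi_c(x, c):
    # Second-order products, first-order terms scaled by sqrt(2c), and the constant c.
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2 * c) * x, [c]])

def k_poly2(x, z, c):
    # Kernel form: (x^T z + c)^2, still O(n) to evaluate.
    return (float(np.dot(x, z)) + c) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
c = 2.0

assert np.isclose(phi_c(x, c) @ phi_c(z, c), k_poly2(x, z, c))
```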

More broadly, the kernel K(x, z) = (xᵀz + c)ᵈ corresponds to a feature mapping to a $\binom{n+d}{d}$-dimensional feature space, consisting of all monomials of the form $x_{i_1} x_{i_2} \cdots x_{i_k}$ that are up to order d. However, despite working in this O(nᵈ)-dimensional space, computing K(x, z) still takes only O(n) time, and hence we never need to explicitly represent feature vectors in this very high-dimensional feature space.
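To get a feel for the sizes involved (a small illustration of ours, not from the notes), math.comb counts the $\binom{n+d}{d}$ distinct monomials, while the kernel itself remains a single inner product raised to a power:

```python
import math
import numpy as np

def k_poly(x, z, c, d):
    # Evaluating the kernel (x^T z + c)^d is O(n), whatever the degree d.
    return (float(np.dot(x, z)) + c) ** d

def feature_dim(n, d):
    # Number of distinct monomials of degree at most d in n variables.
    return math.comb(n + d, d)

print(feature_dim(3, 2))    # 10: the distinct monomials behind the 13-entry vector above
print(feature_dim(100, 5))  # far too many coordinates to ever write down explicitly
print(k_poly(np.arange(100.0), np.ones(100), c=1.0, d=5))  # yet this is still cheap
```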

Now, let's talk about a slightly different view of kernels. Intuitively (and there are things wrong with this intuition, but never mind), if Φ(x) and Φ(z) are close together, then we might expect K(x, z) = Φ(x)ᵀΦ(z) to be large. Conversely, if Φ(x) and Φ(z) are far apart, say nearly orthogonal to each other, then K(x, z) = Φ(x)ᵀΦ(z) will be small. So, we can think of K(x, z) as some measurement of how similar Φ(x) and Φ(z) are, or of how similar x and z are.

Given this intuition, suppose that for some learning problem that you're working on, you've come up with some function K ( x , z ) that you think might be a reasonable measure of how similar x and z are. For instance, perhaps you chose

$$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right).$$
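This choice is commonly known as the Gaussian (or RBF) kernel. A minimal sketch of evaluating it, assuming σ is a bandwidth parameter you would pick yourself:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
    diff = np.asarray(x) - np.asarray(z)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

print(gaussian_kernel([1.0, 2.0], [1.0, 2.1]))    # close to 1 for nearby points
print(gaussian_kernel([1.0, 2.0], [10.0, -5.0]))  # close to 0 for distant points
```

Consistent with the similarity view above, the value is near 1 when x and z are close together and decays toward 0 as they move apart.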
