Rather than applying SVMs using the original input attributes $x$, we may instead want to learn using some features $\phi(x)$. To do so, we simply need to go over our previous algorithm, and replace $x$ everywhere in it with $\phi(x)$.
Since the algorithm can be written entirely in terms of the inner products $\langle x, z \rangle$, this means that we would replace all those inner products with $\langle \phi(x), \phi(z) \rangle$. Specifically, given a feature mapping $\phi$, we define the corresponding kernel to be

$$K(x, z) = \phi(x)^T \phi(z).$$
Then, everywhere we previously had $\langle x, z \rangle$ in our algorithm, we could simply replace it with $K(x, z)$, and our algorithm would now be learning using the features $\phi$.
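To make the substitution concrete, here is a minimal NumPy sketch (the helper name gram_matrix and the toy data are illustrative assumptions, not part of any particular SVM package): an algorithm that touches the data only through inner products can instead be handed a matrix of kernel values.

```python
import numpy as np

def gram_matrix(X, kernel):
    """All pairwise kernel values K[i, j] = K(x_i, x_j) for the rows x_i of X."""
    m = X.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.random.default_rng(0).standard_normal((5, 3))   # toy data: 5 points in R^3
K_linear = gram_matrix(X, lambda x, z: x @ z)           # original inner products <x, z>
K_quad = gram_matrix(X, lambda x, z: (x @ z) ** 2)      # same algorithm, now learning with features phi
```

The second call uses the quadratic kernel discussed in the example below; nothing else about the algorithm has to change.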
Now, given $\phi$, we could easily compute $K(x, z)$ by finding $\phi(x)$ and $\phi(z)$ and taking their inner product. But what's more interesting is that often, $K(x, z)$ may be very inexpensive to calculate, even though $\phi(x)$ itself may be very expensive to calculate (perhaps because it is an extremely high dimensional vector). In such settings, by using in our algorithm an efficient way to calculate $K(x, z)$, we can get SVMs to learn in the high dimensional feature space given by $\phi$, but without ever having to explicitly find or represent vectors $\phi(x)$.
Let's see an example. Suppose $x, z \in \mathbb{R}^n$, and consider

$$K(x, z) = (x^T z)^2.$$
We can also write this as

$$K(x, z) = \left( \sum_{i=1}^n x_i z_i \right) \left( \sum_{j=1}^n x_j z_j \right) = \sum_{i=1}^n \sum_{j=1}^n x_i x_j z_i z_j = \sum_{i,j=1}^n (x_i x_j)(z_i z_j).$$
Thus, we see that $K(x, z) = \phi(x)^T \phi(z)$, where the feature mapping $\phi$ is given (shown here for the case of $n = 3$) by

$$\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \end{bmatrix}.$$
Note that whereas calculating the high-dimensional $\phi(x)$ requires $O(n^2)$ time, finding $K(x, z)$ takes only $O(n)$ time, linear in the dimension of the input attributes.
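This identity is easy to check numerically; the following small sketch (with illustrative function names) compares the explicit $O(n^2)$ feature map against the $O(n)$ kernel computation.

```python
import numpy as np

def phi(x):
    """Explicit quadratic features: all n^2 products x_i * x_j (O(n^2) to build)."""
    return np.outer(x, x).ravel()

def quad_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed directly in O(n) time."""
    return (x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.standard_normal(3), rng.standard_normal(3)
assert np.isclose(phi(x) @ phi(z), quad_kernel(x, z))   # same value, very different cost
```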
For a related kernel, also consider

$$K(x, z) = (x^T z + c)^2 = \sum_{i,j=1}^n (x_i x_j)(z_i z_j) + \sum_{i=1}^n (\sqrt{2c}\, x_i)(\sqrt{2c}\, z_i) + c^2.$$
(Check this yourself.) This corresponds to the feature mapping (again shown for $n = 3$)

$$\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ \sqrt{2c}\, x_3 \\ c \end{bmatrix},$$
and the parameter $c$ controls the relative weighting between the $x_i$ (first order) and the $x_i x_j$ (second order) terms.
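The same kind of numerical check works here; the sketch below (illustrative names again, with an arbitrary choice of $c$) builds the explicit feature vector, including the $\sqrt{2c}\, x_i$ terms and the constant $c$, and compares it with the direct kernel evaluation.

```python
import numpy as np

def phi_c(x, c):
    """Features for K(x, z) = (x^T z + c)^2: the x_i x_j terms, the sqrt(2c) x_i terms, and c."""
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2 * c) * x, [c]])

def kernel_c(x, z, c):
    return (x @ z + c) ** 2

rng = np.random.default_rng(1)
x, z, c = rng.standard_normal(3), rng.standard_normal(3), 2.0
assert np.isclose(phi_c(x, c) @ phi_c(z, c), kernel_c(x, z, c))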
More broadly, the kernel $K(x, z) = (x^T z + c)^d$ corresponds to a feature mapping to an $\binom{n+d}{d}$-dimensional feature space, consisting of all monomials of the form $x_{i_1} x_{i_2} \cdots x_{i_k}$ that are up to order $d$. However, despite working in this $O(n^d)$-dimensional space, computing $K(x, z)$ still takes only $O(n)$ time, and hence we never need to explicitly represent feature vectors in this very high dimensional feature space.
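As a rough illustration of that gap (a sketch under assumed values of $n$, $c$, and $d$, using Python's math.comb for the binomial coefficient):

```python
from math import comb

def poly_kernel(x, z, c, d):
    """K(x, z) = (x^T z + c)^d: an O(n) computation for any degree d."""
    return (sum(xi * zi for xi, zi in zip(x, z)) + c) ** d

# The implicit feature space has one coordinate per monomial of order <= d:
n, d = 100, 5
print(comb(n + d, d))   # roughly 96 million features, never represented explicitly
```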
Now, let's talk about a slightly different view of kernels. Intuitively (and there are things wrong with this intuition, but never mind), if $\phi(x)$ and $\phi(z)$ are close together, then we might expect $K(x, z) = \phi(x)^T \phi(z)$ to be large. Conversely, if $\phi(x)$ and $\phi(z)$ are far apart, say nearly orthogonal to each other, then $K(x, z) = \phi(x)^T \phi(z)$ will be small. So, we can think of $K(x, z)$ as some measure of how similar $\phi(x)$ and $\phi(z)$ are, or of how similar $x$ and $z$ are.
Given this intuition, suppose that for some learning problem that you're working on, you've come up with some function $K(x, z)$ that you think might be a reasonable measure of how similar $x$ and $z$ are. For instance, perhaps you chose

$$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right).$$
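As a small illustrative sketch (the function name and the default value of $\sigma$ are assumptions made here), this similarity measure can be computed directly from $x$ and $z$, with no feature vector in sight:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)): near 1 when x and z are close, near 0 when far apart."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
```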