Rather than applying SVMs using the original input attributes $x$, we may instead want to learn using some features $\phi(x)$. To do so, we simply need to go over our previous algorithm, and replace $x$ everywhere in it with $\phi(x)$.
Since the algorithm can be written entirely in terms of the inner products $\langle x, z \rangle$, this means that we would replace all those inner products with $\langle \phi(x), \phi(z) \rangle$. Specifically, given a feature mapping $\phi$, we define the corresponding kernel to be

$$K(x, z) = \phi(x)^T \phi(z).$$
Then, everywhere we previously had $\langle x, z \rangle$ in our algorithm, we could simply replace it with $K(x, z)$, and our algorithm would now be learning using the features $\phi$.
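To make the substitution concrete, here is a minimal NumPy sketch (the helper name gram_matrix and the toy data are illustrative assumptions, not part of any particular SVM package): an algorithm that touches the data only through inner products can instead be handed a matrix of kernel values.

```python
import numpy as np

def gram_matrix(X, kernel):
    """All pairwise kernel values K[i, j] = K(x_i, x_j) for the rows x_i of X."""
    m = X.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.random.default_rng(0).standard_normal((5, 3))   # toy data: 5 points in R^3
K_linear = gram_matrix(X, lambda x, z: x @ z)           # original inner products <x, z>
K_quad = gram_matrix(X, lambda x, z: (x @ z) ** 2)      # same algorithm, now learning with features phi
```

The second call uses the quadratic kernel discussed in the example below; nothing else about the algorithm has to change.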
Now, given $\phi$, we could easily compute $K(x, z)$ by finding $\phi(x)$ and $\phi(z)$ and taking their inner product. But what's more interesting is that often, $K(x, z)$ may be very inexpensive to calculate, even though $\phi(x)$ itself may be very expensive to calculate (perhaps because it is an extremely high dimensional vector). In such settings, by using in our algorithm an efficient way to calculate $K(x, z)$, we can get SVMs to learn in the high dimensional feature space given by $\phi$, but without ever having to explicitly find or represent vectors $\phi(x)$.
Let's see an example. Suppose $x, z \in \mathbb{R}^n$, and consider

$$K(x, z) = (x^T z)^2.$$
We can also write this as

$$K(x, z) = \left( \sum_{i=1}^n x_i z_i \right) \left( \sum_{j=1}^n x_j z_j \right) = \sum_{i=1}^n \sum_{j=1}^n x_i x_j z_i z_j = \sum_{i,j=1}^n (x_i x_j)(z_i z_j).$$
Thus, we see that $K(x, z) = \phi(x)^T \phi(z)$, where the feature mapping $\phi$ is given (shown here for the case of $n = 3$) by

$$\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \end{bmatrix}.$$
Note that whereas calculating the high-dimensional $\phi(x)$ requires $O(n^2)$ time, finding $K(x, z)$ takes only $O(n)$ time, linear in the dimension of the input attributes.
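This identity is easy to check numerically; the following small sketch (with illustrative function names) compares the explicit $O(n^2)$ feature map against the $O(n)$ kernel computation.

```python
import numpy as np

def phi(x):
    """Explicit quadratic features: all n^2 products x_i * x_j (O(n^2) to build)."""
    return np.outer(x, x).ravel()

def quad_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed directly in O(n) time."""
    return (x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.standard_normal(3), rng.standard_normal(3)
assert np.isclose(phi(x) @ phi(z), quad_kernel(x, z))   # same value, very different cost
```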
For a related kernel, also consider

$$K(x, z) = (x^T z + c)^2 = \sum_{i,j=1}^n (x_i x_j)(z_i z_j) + \sum_{i=1}^n (\sqrt{2c}\, x_i)(\sqrt{2c}\, z_i) + c^2.$$
(Check this yourself.) This corresponds to the feature mapping (again shown for $n = 3$)

$$\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ \sqrt{2c}\, x_3 \\ c \end{bmatrix},$$
and the parameter $c$ controls the relative weighting between the $x_i$ (first order) and the $x_i x_j$ (second order) terms.
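The same kind of numerical check works here; the sketch below (illustrative names again, with an arbitrary choice of $c$) builds the explicit feature vector, including the $\sqrt{2c}\, x_i$ terms and the constant $c$, and compares it with the direct kernel evaluation.

```python
import numpy as np

def phi_c(x, c):
    """Features for K(x, z) = (x^T z + c)^2: the x_i x_j terms, the sqrt(2c) x_i terms, and c."""
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2 * c) * x, [c]])

def kernel_c(x, z, c):
    return (x @ z + c) ** 2

rng = np.random.default_rng(1)
x, z, c = rng.standard_normal(3), rng.standard_normal(3), 2.0
assert np.isclose(phi_c(x, c) @ phi_c(z, c), kernel_c(x, z, c))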
More broadly, the kernel $K(x, z) = (x^T z + c)^d$ corresponds to a feature mapping to an $\binom{n+d}{d}$-dimensional feature space, consisting of all monomials of the form $x_{i_1} x_{i_2} \cdots x_{i_k}$ that are up to order $d$. However, despite working in this $O(n^d)$-dimensional space, computing $K(x, z)$ still takes only $O(n)$ time, and hence we never need to explicitly represent feature vectors in this very high dimensional feature space.
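As a rough illustration of that gap (a sketch under assumed values of $n$, $c$, and $d$, using Python's math.comb for the binomial coefficient):

```python
from math import comb

def poly_kernel(x, z, c, d):
    """K(x, z) = (x^T z + c)^d: an O(n) computation for any degree d."""
    return (sum(xi * zi for xi, zi in zip(x, z)) + c) ** d

# The implicit feature space has one coordinate per monomial of order <= d:
n, d = 100, 5
print(comb(n + d, d))   # roughly 96 million features, never represented explicitly
```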
Now, let's talk about a slightly different view of kernels. Intuitively (and there are things wrong with this intuition, but never mind), if $\phi(x)$ and $\phi(z)$ are close together, then we might expect $K(x, z) = \phi(x)^T \phi(z)$ to be large. Conversely, if $\phi(x)$ and $\phi(z)$ are far apart, say nearly orthogonal to each other, then $K(x, z) = \phi(x)^T \phi(z)$ will be small. So, we can think of $K(x, z)$ as some measure of how similar $\phi(x)$ and $\phi(z)$ are, or of how similar $x$ and $z$ are.
Given this intuition, suppose that for some learning problem that you're working on, you've come up with some function $K(x, z)$ that you think might be a reasonable measure of how similar $x$ and $z$ are. For instance, perhaps you chose

$$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right).$$
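As a small illustrative sketch (the function name and the default value of $\sigma$ are assumptions made here), this similarity measure can be computed directly from $x$ and $z$, with no feature vector in sight:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)): near 1 when x and z are close, near 0 when far apart."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
```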