1. Introduction

If you’ve learnt or used support vector machines before, then you no doubt would have heard the terms “kernel trick” or “kernel methods”. Most machine learning resources I’ve seen sort of just tell you that the kernel trick is a thing, and for good reason, because explaining it requires a long digression from machine learning into functional analysis, as you will soon see. And so if you’re like me, you’re left on your own to wade through mathematical texts to piece together an exposition that convinces you that the kernel trick really does work. I wrote this blogpost to share how I came to understand the so-called reproducing kernel Hilbert space — the mathematical concept that is responsible for making the kernel trick work.

I assume that you are somewhat familiar with the basic concepts of linear algebra, because we’re going to build on them. Also, I should caveat that here we’re going to sacrifice (a lot of) rigour to develop a high-level understanding of reproducing kernel Hilbert spaces. The goal here is not to “prove” our way through, but to draw a roadmap of the relevant mathematical tools that one needs to understand why the kernel trick works. I will be immensely happy if, at any point along the way, you find a particular concept interesting enough to pursue more rigorous texts on your own.


2. Basic definitions

Our goal here is to start with plain old vector spaces and build up just enough structure in these spaces to arrive at a reproducing kernel Hilbert space. It’s worth noting that all of the definitions here apply to vector spaces over both the field of real numbers ($\mathbb{R}$) and the field of complex numbers ($\mathbb{C}$). Because machine learning deals with real-valued attributes, to make our lives simpler, we limit the rest of this blogpost to vector spaces over $\mathbb{R}$, and ignore $\mathbb{C}$ completely.

Our first definition has to do with forming some notion of length for the elements of a vector space.


Definition 2.1.   Let $V$ be a vector space over $\mathbb{R}$. A norm on $V$ is a function $\|\cdot\|: V \rightarrow \mathbb{R}$ such that for any two vectors $v, w \in V$ and scalar $\alpha\in\mathbb{R}$,

  1. $\|v\| \geq 0$ with $\|v\| = 0$ iff $v = \vec{0}$,
  2. $\|\alpha v\| = |\alpha|\|v\|$, and
  3. $\|v + w\| \leq \|v\| + \|w\|$.

The first requirement formalizes our intuition that length cannot be negative, and that an object has zero length if and only if it is the zero vector. The second requirement states that scaling a vector by a constant scales its length by the absolute value of that constant, and the third is simply the triangle inequality.

A norm that you probably have seen time and time again (but never called a norm) is the usual absolute value $|\cdot|$ on the vector space $V = \mathbb{R}$. Here, $|\cdot|$ simply gives the distance between an element and the origin on the real number line, or more simply, length. For the general case where $V = \mathbb{R}^n$, the Euclidean norm

$$\|v\| = \sqrt{\sum_{i=1}^n v_i^2}$$

gives the same notion of the length of $v$. In the case of more abstract vector spaces whose elements are, say, functions (as in $L^p$ spaces, if you happen to know them), a norm could involve taking a definite integral to produce a real number. Although we won’t be looking at norms of functions, we will be using vector spaces of functions later on, so it’s helpful to start thinking of functions as vectors early on.
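
To make this concrete, here is a quick numerical check (an illustrative sketch of my own, not part of the exposition) that the Euclidean norm, as computed by numpy, behaves the way Definition 2.1 demands:

import numpy as np

v, w, alpha = np.array([3.0, 4.0]), np.array([-1.0, 2.0]), -2.5

print(np.linalg.norm(v))                                                      # length of v (here 5.0)
print(np.isclose(np.linalg.norm(alpha * v), abs(alpha) * np.linalg.norm(v)))  # property 2: scaling
print(np.linalg.norm(v + w) <= np.linalg.norm(v) + np.linalg.norm(w))         # property 3: triangle inequality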

The next definition has to do with whether a vector space that is equipped with a norm has the usual limiting properties that make the techniques of calculus possible within the space.


Definition 2.2.   Let $V$ be a vector space equipped with a norm $\|\cdot\|$. We say that $V$ is complete with respect to $\|\cdot\|$ if every Cauchy sequence in $V$ converges to a vector $v\in V$.


Without going too deep into mathematical analysis (and fussing over making this definition more precise), let’s try to understand the intuition behind what makes a vector space complete. This is perhaps easier if we try to imagine what a vector space that isn’t complete might look like, say, the set of rational numbers $\mathbb{Q}$. We can construct the following infinite sequence

\begin{array}{cccccc} 1, &1.4, &1.41, &1.414, &1.4142, &… \end{array}

whose elements are all in $\mathbb{Q}$, because they all have finite decimal expansions. This sequence converges to $\sqrt{2}$ and so, using without proof the fact that every convergent sequence is also a Cauchy sequence, this sequence is also Cauchy. However, $\sqrt{2}\not\in\mathbb{Q}$, and so we have found one sequence comprising elements of $\mathbb{Q}$ that converges to some element that is not in $\mathbb{Q}$. One such sequence is enough for us to assert that $\mathbb{Q}$ is not complete.
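
Here is a small numerical illustration of that sequence (my own aside, using floating point as a stand-in for exact rationals): every truncation has finitely many decimal places, yet the gap to $\sqrt{2}$ shrinks towards zero.

import math

for k in range(6):
    q = math.floor(math.sqrt(2) * 10**k) / 10**k   # truncate sqrt(2) to k decimal places -- a rational number
    print(q, abs(math.sqrt(2) - q))                # 1.0, 1.4, 1.41, 1.414, ... with a shrinking gap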

Now, think about what would make $\mathbb{Q}$ complete. Essentially, there are these “holes” — the irrational numbers — that some sequence converges to, but which are not members of the set $\mathbb{Q}$. We could fill in these “holes” by adding to $\mathbb{Q}$ all the irrational numbers. But that gives us $\mathbb{R}$, which, as you probably guessed by now, is a complete vector space.

As far as our end goal is concerned, whether or not a vector space is complete is a technical subtlety that we won’t emphasize too much. It contributes to one of our later definitions and so I’ve included it for completeness’ sake (pun not intended). A vector space that is complete with respect to its norm is known as a Banach space. Comparing a Banach space with the plain old vector space we started with, the former is equipped with both a norm and certain limiting operations that allow the use of calculus — quite a fair bit of additional structure, but we can do better!


Definition 2.3.   Let $V$ be a vector space over $\mathbb{R}$. A (real) inner product on $V$ is a function $\langle\cdot,\cdot\rangle: V\times V \rightarrow \mathbb{R}$ such that for any vectors $u,v,w\in V$ and scalars $\alpha,\beta\in \mathbb{R}$, the following properties hold:

\begin{array}{rll} \text{i.} & \langle u,v \rangle = \langle v,u \rangle & \text{(symmetry)} \cr \text{ii.} & \langle \alpha u + \beta v, w\rangle = \alpha\langle u,w\rangle + \beta\langle v,w\rangle & \text{(linearity in the first argument)} \cr \text{iii.} & \langle v,v\rangle \geq 0 \text{ with } \langle v,v\rangle = 0 \text{ iff } v = \vec{0}. & \text{(positive semi-definiteness)} \end{array}

A vector space that is equipped with an inner product is called an inner product space.


Unlike the previous two definitions, we will be using inner products over and over again, so it’s worth spending some time to grasp this concept well. To spell out Definition 2.3: an inner product takes as input two vectors that live in the space $V$, and outputs a scalar — in this case a real number. Although a little abstract, it pays to think of an inner product as a map between two spaces, as the figure below illustrates. One technical subtlety: while we speak of an inner product on a space $V$, the inner product is really a function on the space $V\times V$, which is obtained by taking the Cartesian product of $V$ with itself.

An example of an inner product on $V = \mathbb{R}^n$ is the usual dot product $u\cdot v = \sum_{i=1}^n u_iv_i$, which we should be familiar with. To see why, we check each of the above three properties.

\begin{array}{rll} \text{i.} & \text{(symmetry)} & u\cdot v = \sum_{i=1}^n u_iv_i = \sum_{i=1}^n v_iu_i = v\cdot u \cr \text{ii.} & \text{(linearity)} & \begin{aligned}(\alpha u + \beta v) \cdot w &= \sum_{i=1}^n (\alpha u_i + \beta v_i)w_i \cr &= \alpha\sum_{i=1}^n u_iw_i + \beta\sum_{i=1}^n v_iw_i = \alpha (u\cdot w) + \beta (v\cdot w)\end{aligned} \cr \text{iii.} & \text{(positive semi-} & v\cdot v = \sum_{i=1}^n v_i^2 \geq 0, \text{ and } \cr & \text{definiteness)} & v\cdot v = 0 \iff \sum_{i=1}^n v_i^2 = 0 \iff v_i = 0 \text{ for all } i \iff v = \vec{0}. \end{array}

The above arguments show that the dot product is a valid inner product for $\mathbb{R}^n$, which makes $\mathbb{R}^n$ an inner product space. If you have knowledge of $L^p$ spaces, then you probably know that for the space

$$L^2[a,b] = \left\{ f: [a,b]\rightarrow\mathbb{R} \;:\; \int_a^b f(x)^2\,dx < \infty \right\},$$

we can define the inner product of two functions $f,g \in L^2[a,b]$ by

$$\langle f,g\rangle = \int_a^b f(x)g(x)\,dx.$$

While verifying the three requirements above is straightforward using the properties of Lebesgue integration, some care must be taken to ensure that $\int_a^b fg$ is finite, for otherwise $\langle f,g\rangle$ would map to an element outside of $\mathbb{R}$. Hölder’s Inequality, if it is available to you, guarantees this, and we can conclude that the space of all $L^2$ functions defined on an interval $[a,b]$ is also an inner product space. If none of this talk on $L^p$ spaces is making sense, don’t worry: it’s just an aside, so you’re fine.
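
As a quick sanity check (a sketch, not a proof), we can also verify the three properties of the dot product numerically on random vectors:

import numpy as np

rng = np.random.default_rng(0)
u, v, w = rng.standard_normal((3, 5))        # three random vectors in R^5
a, b = 2.0, -3.0

print(np.isclose(u @ v, v @ u))                                   # i.   symmetry
print(np.isclose((a*u + b*v) @ w, a*(u @ w) + b*(v @ w)))         # ii.  linearity in the first argument
print(v @ v >= 0)                                                 # iii. <v, v> is never negative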

You may be familiar with the following equation that relates the inner product of two vectors to the angle ($\theta$) between them:

$$\langle u,v\rangle = \|u\|\,\|v\|\cos\theta.$$

Whereas the norm gave us notions of length and distance, the inner product introduces a way to measure angles between vectors — yet another geometric structure that we can endow vector spaces with. The above equation also describes a nice relationship between the inner product and the norm on the same space. It turns out that for inner product spaces, we can define a norm in a more natural way.


Definition 2.4.   Let $V$ be a vector space equipped with an inner product $\langle\cdot,\cdot\rangle: V\times V \rightarrow \mathbb{R}$. For $v\in V$, define the norm induced by the inner product as

$$\|v\| = \sqrt{\langle v,v\rangle}.$$

The Euclidean norm, introduced earlier, is one such norm on $\mathbb{R}^n$ that follows naturally from the dot product: $\|v\| = \sqrt{v\cdot v}$. Notice that I used the words one such norm. Indeed, a vector space can have more than one valid norm and in fact, one can define an infinite family of norms on $\mathbb{R}^n$ by

$$\|v\|_p = \left(\sum_{i=1}^n |v_i|^p\right)^{1/p}, \qquad p \geq 1.$$

This is known as the $p$-norm, of which the Euclidean norm is the special case $p=2$. The $p$-norm lets us measure distances in all sorts of fascinating ways that show up everywhere in machine learning, but we won’t delve into it here.
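
If you’d like to play with these (a short aside), numpy exposes the $p$-norm through the ord argument of np.linalg.norm:

import numpy as np

v = np.array([3.0, -4.0])
for p in [1, 2, 3, np.inf]:
    print(p, np.linalg.norm(v, ord=p))   # p = 2 recovers the Euclidean norm (here 5.0)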

Finally, after incorporating more richness in our spaces layer after layer, we are able to define a Hilbert space — the latter half of the reproducing kernel Hilbert space which we set forth to explore in this blogpost.


Definition 2.5.   A Hilbert space is an inner product space that is complete with respect to the norm induced by the inner product. We denote a Hilbert space by $\mathcal{H}$.


You can see why we had to go through four definitions just to get here. A Hilbert space incorporates ideas from all of our previous definitions and so, in a sense, is thus far the space with the richest structure. In Hilbert spaces we find the geometric notions of length, distance, and angle. The techniques of calculus also apply thanks to completeness, and with slight extensions, one can even define probability measures on Hilbert spaces. At the same time, there are no restrictions on what the vectors in the Hilbert space have to look like — they could be three-dimensional vectors, or functions, or infinite sequences. For these reasons, Hilbert spaces are said to generalize the Euclidean spaces $\mathbb{R}^n$.

Recall that a Banach space is a vector space that is complete with respect to its norm. Some texts distinguish Banach spaces from Hilbert spaces by whether the norm comes from an inner product. The following Venn diagram summarises the various types of spaces that we’ve learnt so far. Spaces that are closer to the center are in a sense more “rich”. Reproducing kernel Hilbert spaces, as the name suggests, are a special type of Hilbert space, and are the topic of the next section.


3. Reproducing kernel Hilbert space

At long last, we are now in a position to understand reproducing kernel Hilbert spaces. There are many ways to go about defining a reproducing kernel Hilbert space. Here, I give the exposition which I personally find the most straightforward, but encourage you to dive deeper into other perspectives for a fuller understanding.


Definition 3.1.   Let $\mathcal{X}$ be an arbitrary set, and $\mathcal{H}$ a Hilbert space of functions $f: \mathcal{X}\rightarrow\mathbb{R}$. For each element $x\in\mathcal{X}$, the evaluation functional $\mathcal{L}_x: \mathcal{H}\rightarrow\mathbb{R}$ is a linear functional that evaluates each $f\in\mathcal{H}$ at the point $x$, written

$$\mathcal{L}_x(f) = f(x).$$
We say that $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) if, for all $x\in\mathcal{X}$, $\mathcal{L}_x$ is continuous at every $f\in\mathcal{H}$.


That was a highly intimidating and abstract definition, so let’s break it down and start relating it to machine learning. The following illustration may help.

First, Definition 3.1 places no restriction on the set $\mathcal{X}$. In machine learning problems, we think of each element of $\mathcal{X}$ as a vector of attributes, which may be real numbers, strings, graphs, etc. For example, given $M$ training points and $N+1$ attributes (including the bias), the attributes of the $i$-th training point may be written

$$x^{(i)} = \left(x^{(i)}_0,\, x^{(i)}_1,\, \ldots,\, x^{(i)}_N\right), \qquad i = 1, 2, \ldots, M.$$
As the diagram makes explicit, the functions $f$ map each $x\in\mathcal{X}$ onto a real number. I should point out a subtle but important notational difference: $f$, written without an input, refers to functions that live in the Hilbert space $\mathcal{H}$, whereas $f(x)$, written with the input, refers to the evaluation of the function $f$ at a point $x$ in its domain, which produces a real number.

Although it may be tempting to do so, do not confuse $f$ with the hypothesis function $h$ typically used in supervised machine learning; hypothesis functions relate the observed attributes $x\in\mathcal{X}$ to the true output values $y\in \mathcal{Y}$, whereas the functions $f$ as defined here bear no obvious relevance to machine learning problems (please correct me if I’m wrong).

The evaluation functional $\mathcal{L}_x$ is just a fancy name for the following idea: you hand me an $x\in\mathcal{X}$ and given that $x$, I can take any function $f\in\mathcal{H}$ and evaluate $f$ at that $x$ to produce a real number $f(x)$. Notice again how $f$ and $f(x)$ mean different things.
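
If it helps, here is the same idea as a toy piece of Python (purely my own illustration): treating functions as first-class objects, $\mathcal{L}_x$ is just “evaluation at $x$”, waiting to be fed a function.

import numpy as np

def evaluation_functional(x):
    """Returns L_x: a map that takes a function f and produces the real number f(x)."""
    return lambda f: f(x)

L_2 = evaluation_functional(2.0)
print(L_2(np.sin), L_2(np.exp))   # two different f's, both evaluated at the same x = 2.0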

According to Definition 3.1, $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) if for all $x\in\mathcal{X}$, the linear functional $\mathcal{L}_x$ is continuous at every $f\in\mathcal{H}$. This might leave you a bit confused. What does it mean for $\mathcal{L}_x$ to be continuous at a function $f$? More importantly, what is this “reproducing kernel”, and why does the definition make no mention of kernels to begin with? It turns out that by applying something called the Riesz representation theorem to our continuous linear functionals $\mathcal{L}_x$, we can obtain the following, somewhat more intuitive result.


Corollary 3.2.   Let $\mathcal{X}, f, \mathcal{H}$ and $\mathcal{L}_x$ be defined as in Definition 3.1. If every $\mathcal{L}_x$ is continuous at every $f\in\mathcal{H}$, then for each $\mathcal{L}_x$ there is a unique function $K_x\in\mathcal{H}$ such that for every $f\in\mathcal{H}$,

$$\mathcal{L}_x(f) = \langle f, K_x\rangle_\mathcal{H} = f(x).$$
This equation is known as the reproducing property.


Notice that I’ve subscripted $\langle \cdot,\cdot \rangle$ with $\mathcal{H}$ to make it clear that we are taking an inner product on $\mathcal{H}$. Another diagram is probably in order.

Essentially, Corollary 3.2 says that given an evaluation functional $\mathcal{L}_x$, we can find a unique function $K_x\in\mathcal{H}$ such that we can represent the action of $\mathcal{L}_x$ on any function $f$ (blue path) using an inner product between $f$ and $K_x$ (red path). Notice that if we wanted to represent a different evaluation functional $\mathcal{L}_y$, we would need to find another unique function $K_y$ for the representation to work.
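
A tiny finite-dimensional example (my own toy case, not part of the general theory) may make the reproducing property feel less abstract: take $\mathcal{X}=\{0,1,2\}$ and let $\mathcal{H}=\mathbb{R}^3$ with the dot product, where each $f\in\mathcal{H}$ is stored as its vector of values $(f(0), f(1), f(2))$. Then $K_x$ is simply the standard basis vector $e_x$, and $\langle f, K_x\rangle_\mathcal{H} = f(x)$ is the familiar fact that dotting with $e_x$ picks out the $x$-th coordinate.

import numpy as np

f = np.array([2.0, -1.0, 5.0])   # a "function" on X = {0, 1, 2}, stored by its values
K = np.eye(3)                    # row x is K_x = e_x, the representer of evaluation at x
x = 1
print(f[x], f @ K[x])            # reproducing property: f(x) = <f, K_x>, both print -1.0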

Now comes a little bit of sophistry. First, we note that the reproducing property given in Corollary 3.2 holds for any $x\in\mathcal{X}$. In particular, we can replace $x$ with $z$ to get

$$\mathcal{L}_z(f) = \langle f, K_z\rangle_\mathcal{H} = f(z)$$

for all $f\in\mathcal{H}$. Now, because the $K_x$ that we found earlier also lives in $\mathcal{H}$, the above equation should also work for $f=K_x$, giving

$$\mathcal{L}_z(K_x) = \langle K_x, K_z\rangle_\mathcal{H} = K_x(z).$$

Henceforth, we can ignore the leftmost expression $\mathcal{L}_z(K_x)$ because we’ve represented the evaluation functional with the inner product on $\mathcal{H}$. To understand the rest of the expression $\langle K_x, K_z\rangle_\mathcal{H} = K_x(z)$, we can replace $f$ with $K_x$ and $x$ with $z$ in the previous diagram to get the following:

Thus far, $x$ has been fixed. But since the above diagram works for any $x\in\mathcal{X}$, we can let $x$ vary and rewrite $K_x(z)$ as the bivariate function

$$K(x,z) = \langle K_x, K_z\rangle_\mathcal{H}.$$

This is the reproducing kernel that we’ve been looking for, so named because we constructed it by applying the reproducing property twice.


Definition 3.3.   Let $\mathcal{X}$ be an arbitrary set, and $\mathcal{H}$ a Hilbert space of functions $f:\mathcal{X}\rightarrow\mathbb{R}$. If, for all $x\in\mathcal{X}$, the linear evaluation functional $\mathcal{L}_x:\mathcal{H}\rightarrow\mathbb{R}$ is continuous at every $f\in\mathcal{H}$, we can construct the reproducing kernel, which is a bivariate function $K:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ defined by

$$K(x,z) = \langle K_x, K_z\rangle_\mathcal{H},$$

and the Hilbert space $\mathcal{H}$ is called a reproducing kernel Hilbert space (RKHS).


A reproducing kernel Hilbert space is therefore a Hilbert space with an additional structure associated with it — the reproducing kernel. In the next section, we see how this additional structure allows us to exploit the kernel trick in machine learning problems.


4. The kernel trick

To get to the kernel trick, we need to introduce one more function (this is the last!). Given the RKHS $\mathcal{H}$, we can find a function $\varphi:\mathcal{X}\rightarrow\mathcal{H}$ that “links” up $\mathcal{X}$ and $\mathcal{H}$ in the diagram above. A straightforward way to do this is to define

$$\varphi(x) = K_x.$$

We are guaranteed that we can do this for every $x\in\mathcal{X}$ thanks to Corollary 3.2. In machine learning terms, $\varphi$ is frequently referred to as the feature map, which maps from the space of attributes $\mathcal{X}$ to the feature space $\mathcal{H}$. Using the definition of a reproducing kernel from Definition 3.3 and replacing $K_x$ with $\varphi(x)$, and $K_z$ with $\varphi(z)$, we get the familiar kernel trick

$$K(x,z) = \langle \varphi(x), \varphi(z)\rangle_\mathcal{H}.$$
To fully understand the statement that the kernel trick makes, let’s visualize all of the functions we’ve defined so far on a single diagram.

The above diagram shows intuitively that computing the kernel of two vectors of attributes $x$ and $z$ (blue route) is equivalent to first mapping $x$ and $z$ into the feature space $\mathcal{H}$, and then taking the inner product on this feature space between $\varphi(x)$ and $\varphi(z)$. And this should align very well with your understanding of what the kernel trick accomplishes: using a reproducing kernel in place of an inner product between vectors of a high (or potentially infinite) dimensional feature space.

An example should hopefully make the kernel trick more concrete. For dimension $N$ and degree $d$, define the polynomial kernel $K:\mathcal{X}\times\mathcal{X}\rightarrow \mathbb{R}$ by

$$K(x,z) = \left(x^Tz + c\right)^d,$$

where $c\geq 0$. Let $N=3$ and $d=2$ for ease of computation. I claim that the resulting quadratic kernel is a reproducing kernel for some corresponding Hilbert space (more on why later). Then for any two $x,z\in\mathbb{R}^3$,

$$\begin{aligned} K(x,z) &= (x_1z_1 + x_2z_2 + x_3z_3 + c)^2 \cr &= \sum_{i=1}^3 x_i^2z_i^2 + \sum_{i\neq j} x_ix_jz_iz_j + 2c\sum_{i=1}^3 x_iz_i + c^2 \cr &= \langle \varphi(x), \varphi(z)\rangle, \quad\text{where}\quad \varphi(x) = \left(x_1^2,\, x_2^2,\, x_3^2,\, \sqrt{2}x_1x_2,\, \sqrt{2}x_1x_3,\, \sqrt{2}x_2x_3,\, \sqrt{2c}x_1,\, \sqrt{2c}x_2,\, \sqrt{2c}x_3,\, c\right), \end{aligned}$$

which shows that computing the quadratic kernel on $x$ and $z$ is equivalent to first transforming the three-dimensional attributes $x$ and $z$ into the ten-dimensional features $\varphi(x)$ and $\varphi(z)$, followed by an inner product between $\varphi(x)$ and $\varphi(z)$ in the feature space $\mathbb{R}^{10}$. It turns out that in general, computing a polynomial kernel as defined above corresponds to taking an inner product in a feature space of dimension $\binom{N+d}{d}$.
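
We can also confirm this numerically (a small sketch; the helper phi below is just my own implementation of the ten-dimensional map from the derivation above):

import numpy as np

def phi(x, c=1.0):
    """Explicit feature map for the quadratic kernel (x^T z + c)^2 with N = 3."""
    x1, x2, x3 = x
    s2, s2c = np.sqrt(2), np.sqrt(2 * c)
    return np.array([x1*x1, x2*x2, x3*x3,
                     s2*x1*x2, s2*x1*x3, s2*x2*x3,
                     s2c*x1, s2c*x2, s2c*x3, c])

x, z, c = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0]), 1.0
print((x @ z + c)**2, phi(x, c) @ phi(z, c))   # both evaluate to 49.0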

To see the whole point of the kernel trick, let’s perform some basic complexity analysis for the quadratic case where $d=2$. To compute the feature vector

$$\varphi(x) = \left(x_1^2, \ldots, x_N^2,\ \sqrt{2}x_1x_2, \ldots, \sqrt{2}x_{N-1}x_N,\ \sqrt{2c}x_1, \ldots, \sqrt{2c}x_N,\ c\right)$$
for any $x\in\mathbb{R}^N$, one might use an algorithm like the following:

from math import sqrt

# given x (a sequence of length N) and the constant c >= 0 from the kernel
phi = []
for i in range(N):
    phi.append(x[i]**2)                  # squared terms
    phi.append(sqrt(2*c) * x[i])         # linear terms, scaled by sqrt(2c)
    for j in range(i + 1, N):
        phi.append(sqrt(2) * x[i]*x[j])  # cross terms, scaled by sqrt(2)
phi.append(c)                            # the constant term

If we count each append operation as an elementary step, the nested for loops imply the following number of steps:

$$\sum_{i=1}^{N}\big(2 + (N-i)\big) = 2N + \frac{N(N-1)}{2}.$$

Taking into account the final append operation, the time complexity of computing the feature vector $\varphi(x)$ is therefore

$$2N + \frac{N(N-1)}{2} + 1 = \mathcal{O}(N^2),$$

where $N$ is the dimension of the attribute space. Taking the inner product of two feature vectors $\varphi(x)$ and $\varphi(z)$ is clearly linear in the dimension of the feature space (recall that this is $\binom{N+d}{d}$, which equals $\frac{(N+2)(N+1)}{2}$ when $d=2$), and so the corresponding time complexity is

$$\mathcal{O}\left(\binom{N+2}{2}\right) = \mathcal{O}(N^2).$$

Hence, the overall time complexity of applying the feature map followed by the inner product is

$$\mathcal{O}(N^2) + \mathcal{O}(N^2) = \mathcal{O}(N^2).$$

What about the time complexity of computing $K(x,z)$ directly? Although the calculation above seemed long, notice that to compute

$$K(x,z) = \left(x^Tz + c\right)^2,$$

we would need to take the inner product between $x$ and $z$, an $\mathcal{O}(N)$ operation, and then add a constant and square the resulting value, both of which are $\mathcal{O}(1)$ operations. Hence, the overall time complexity of computing $K(x,z)$ is

$$\mathcal{O}(N) + \mathcal{O}(1) = \mathcal{O}(N),$$
which grows slower than the $\mathcal{O}(N^2)$ needed to compute $\langle\varphi(x),\varphi(z)\rangle$.
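
If you want to see the gap empirically (a rough sketch of my own; exact timings depend on your machine), compare the two routes for a moderately large $N$:

import time
import numpy as np

def phi_quadratic(x, c=1.0):
    """Explicit feature map for the quadratic kernel -- O(N^2) components."""
    N = len(x)
    cross = np.sqrt(2) * np.array([x[i] * x[j] for i in range(N) for j in range(i + 1, N)])
    return np.concatenate([x**2, cross, np.sqrt(2 * c) * x, [c]])

N, c = 1000, 1.0
x, z = np.random.randn(N), np.random.randn(N)

t0 = time.perf_counter(); k_direct = (x @ z + c)**2; t1 = time.perf_counter()
k_explicit = phi_quadratic(x, c) @ phi_quadratic(z, c); t2 = time.perf_counter()
print(np.isclose(k_direct, k_explicit), t1 - t0, t2 - t1)   # same value, very different cost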


5. Positive semi-definite kernels

In the previous section, I claimed that the quadratic kernel is a reproducing kernel for some corresponding RKHS. This is related to a property present in some kernels.


Definition 5.1.   Let $\mathcal{X}$ be an arbitrary non-empty set. A kernel $K:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ is positive semi-definite if

  1. $K$ is symmetric, i.e. $K(x_i,x_j)=K(x_j,x_i)$ for all $x_i,x_j\in\mathcal{X}$, and
  2. for every finite collection of points $x_1, x_2, \ldots, x_m\in\mathcal{X}$, the square matrix $\kappa$ defined by $\kappa_{ij} = K(x_i,x_j)$ for $i,j=1,2,\ldots,m$ is positive semi-definite, i.e. $v^T \kappa v \geq 0$ for all $v\in\mathbb{R}^m$.
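
Numerically, condition 2 is easy to spot-check (a sketch only; checking finitely many random points can suggest, but never prove, positive semi-definiteness). For the polynomial kernel from the previous section:

import numpy as np

rng = np.random.default_rng(0)
X_pts = rng.standard_normal((6, 3))                 # m = 6 points in R^3
c, d = 1.0, 2
kappa = (X_pts @ X_pts.T + c)**d                    # Gram matrix kappa_ij = K(x_i, x_j)
print(np.allclose(kappa, kappa.T))                  # condition 1: symmetry
print(np.linalg.eigvalsh(kappa).min() >= -1e-9)     # condition 2: no (numerically) negative eigenvalues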

To familiarize ourselves with this definition, let us now show that the polynomial kernel defined by $K(x,z) = (x^Tz + c)^d$, where $c\geq 0$, is positive semi-definite. That the quadratic kernel is positive semi-definite will then follow from the general case. To show symmetry, we can use the symmetry of the inner product to assert that

$$K(x_i,x_j) = \left(x_i^Tx_j + c\right)^d = \left(x_j^Tx_i + c\right)^d = K(x_j,x_i).$$

To show the second condition in Definition 5.1, we first consider the case where $c=0$. For all $i,j$, we can write

$$\kappa_{ij} = \left(x_i^Tx_j\right)^d = \underbrace{\left(x_i^Tx_j\right)\left(x_i^Tx_j\right)\cdots\left(x_i^Tx_j\right)}_{d\text{ times}},$$

which allows us to “factorize” the matrix $\kappa$ as

$$\kappa = \underbrace{A\circ A\circ\cdots\circ A}_{d\text{ times}}, \qquad A_{ij} = x_i^Tx_j,$$

where $\circ$ represents the entry-wise product between two matrices. Let $A$ represent one of the identical matrices in the product above. Our goal for now is to show that $A$ is positive semi-definite, so we let $v\in\mathbb{R}^m$ be given. Then,

$$v^TAv = \sum_{i=1}^m\sum_{j=1}^m v_iv_j\, x_i^Tx_j = \left(\sum_{i=1}^m v_ix_i\right)^T\!\left(\sum_{j=1}^m v_jx_j\right) = \left\|\sum_{i=1}^m v_ix_i\right\|^2 \geq 0,$$

and because $v$ was arbitrary, we can assert that $A$ is positive semi-definite. Now, using the fact without proof that the entry-wise product of two positive semi-definite matrices is also positive semi-definite (called the Schur product theorem), we see that

$$\kappa = \underbrace{A\circ A\circ\cdots\circ A}_{d\text{ times}}$$

is positive semi-definite, which completes the case for $c=0$. For the case where $c > 0$, we can use a similar approach with

$$\kappa_{ij} = \left(x_i^Tx_j + c\right)^d = \underbrace{\left(x_i^Tx_j + c\right)\cdots\left(x_i^Tx_j + c\right)}_{d\text{ times}},$$

which gives the “factorization”

$$\kappa = \underbrace{B\circ B\circ\cdots\circ B}_{d\text{ times}}, \qquad B = A + cJ,$$

where $J$ is the $m\times m$ matrix of all ones. Because the matrix $cJ$ is positive semi-definite (you can easily prove this: $v^T(cJ)v = c\left(\sum_{i=1}^m v_i\right)^2 \geq 0$), and using the fact without proof that the sum of two positive semi-definite matrices is also positive semi-definite, $B$ is positive semi-definite; applying the Schur product theorem once more, we can assert that $\kappa$ is positive semi-definite in this case where $c>0$ as well. With both cases covered, we have satisfied the second condition in Definition 5.1, and so we conclude that the polynomial kernel is positive semi-definite.

So what is the significance of a kernel being positive semi-definite? This is answered by the Moore-Aronszajn theorem, which states that

Every positive semi-definite kernel is a reproducing kernel for some corresponding reproducing kernel Hilbert space.

And now you know why I could apply the kernel trick with the quadratic kernel in the previous section: because the quadratic kernel is positive semi-definite, we are guaranteed that a corresponding Hilbert space exists to serve as the feature space, which is exactly what the kernel trick requires.


6. Application to support vector machines

Recall that a support vector machine (SVM) is a supervised machine learning algorithm that classifies data by fitting a separating hyperplane. To understand what reproducing kernels can do for SVMs, let’s first recap the idea of non-linearly-separable data. Consider the following (dummy) data that form two concentric circles on a plot.

import math
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm          # used for the SVM fits below

N = 200                           # 200 points per class 
X = np.zeros(shape=(N*2, 3))      # dummy dataset 
Y = np.repeat([0, 1], repeats=N)  # true class labels 

radius0 = np.random.normal(loc=1, scale=0.15, size=N)    # radius & angle for inner points 
theta0 = np.linspace(start=0, stop=360, num=N)
radius1 = np.random.normal(loc=2, scale=0.25, size=N)    # radius & angle for outer points 
theta1 = np.linspace(start=0, stop=360, num=N)
radius, theta = np.concatenate([radius0, radius1]), np.concatenate([theta0, theta1])

X[:, 0] = radius * np.cos(theta)   # x1-coordinate 
X[:, 1] = radius * np.sin(theta)   # x2-coordinate 

plt.plot(X[Y==0, 0], X[Y==0, 1], 'bo', markersize=3)
plt.plot(X[Y==1, 0], X[Y==1, 1], 'ro', markersize=3)
plt.xlabel('x1'); plt.ylabel('x2')
plt.show()

It’s easy to see how naively fitting a hyperplane would not do so well at separating the data. Instead, let’s consider transforming our two-dimensional attributes into three-dimensional features using the feature map $\varphi:\mathbb{R}^2 \rightarrow \mathbb{R}^3$ defined by

$$\varphi(x_1, x_2) = \left(x_1,\ x_2,\ x_1^2 + x_2^2\right),$$
and fitting our SVM on the dataset

X[:, 2] = np.square(X[:, 0]) + np.square(X[:, 1])    # create x3-coordinate 
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)        # fit SVM and store model weights 
coef, intercept = clf.coef_[0], clf.intercept_

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[Y==0, 0], X[Y==0, 1], X[Y==0, 2], c='b')
ax.scatter(X[Y==1, 0], X[Y==1, 1], X[Y==1, 2], c='r')
xx, yy = np.meshgrid(range(-3, 4), range(-3, 4))
z = (-coef[0]*xx - coef[1]*yy - intercept)/coef[2]
ax.plot_surface(xx, yy, z, color='green', alpha=0.3)

ax.view_init(10, 30)
ax.set_xlabel('x1'); ax.set_ylabel('x2'); ax.set_zlabel('x3')
plt.show()


As you can see in the above diagram, by adding a third coordinate, the feature map has made the data linearly-separable in the feature space (well, almost — the training accuracy was 99.5%).

Why does this work? The answer lies in the formulation of the dual optimization problem for the SVM:

$$\begin{aligned} \max_{\alpha}\quad & \sum_{i=1}^M \alpha_i - \frac{1}{2}\sum_{i=1}^M\sum_{j=1}^M y^{(i)}y^{(j)}\alpha_i\alpha_j \left\langle x^{(i)}, x^{(j)} \right\rangle_\mathcal{X} \cr \text{subject to}\quad & \alpha_i \geq 0 \ \text{ for } i = 1,\ldots,M, \qquad \sum_{i=1}^M \alpha_i y^{(i)} = 0, \end{aligned}$$

where the labels are encoded as $y^{(i)}\in\{-1,+1\}$. Don’t worry if you haven’t seen these equations before; the key is to focus on the inner product $\langle x^{(i)},x^{(j)} \rangle_\mathcal{X}$. Notice how the terms that denote the training examples (the $x$’s) appear only in this inner product. In other words, by replacing $\langle x^{(i)},x^{(j)}\rangle_\mathcal{X}$ with $\langle \varphi(x^{(i)}),\varphi(x^{(j)}) \rangle_{\mathcal{F}}$ for some suitable feature map $\varphi$, we can train our SVM on the features $\varphi(x)\in\mathcal{F}$, instead of the raw attributes $x\in\mathcal{X}$. And that is precisely what we have done above to classify the concentric-looking data. To summarize,

By transforming our raw attributes into higher-dimensional features, we can make the data more linearly-separable, and therefore improve the performance of the support vector machine.

At this point you’re probably wondering what the reproducing kernel has to do with any of this, given that we were still taking inner products in the feature space for the previous example. Well, the previous example was easy. Here’s a much more difficult classification problem:

N = 200                            # 200 points per class 
X = np.zeros(shape=(N*2, 2))       # dummy dataset 
Y = np.repeat([0, 1], repeats=N)   # true class labels 

radius0 = np.linspace(0, 2, N)     # radius and angle for blue spiral 
theta0 = np.linspace(0, 3*math.pi, N) + np.linspace(0, 0.7, N)*np.random.randn(N)
radius1 = np.linspace(0, 2, N)     # radius and angle for red spiral 
theta1 = np.linspace(math.pi, 4*math.pi, N) + np.linspace(0, 0.7, N)*np.random.randn(N)
radius, theta = np.concatenate([radius0, radius1]), np.concatenate([theta0, theta1])

X[:, 0] = radius * np.cos(theta)   # x1-coordinate 
X[:, 1] = radius * np.sin(theta)   # x2-coordinate 

plt.figure(figsize=(6,4))
plt.plot(X[Y==0, 0], X[Y==0, 1], 'bo', markersize=3)
plt.plot(X[Y==1, 0], X[Y==1, 1], 'ro', markersize=3)
plt.xlabel('x1'); plt.ylabel('x2')
plt.show()

As far as I know, there is no obvious feature map $\varphi$ that can transform the raw data into a linearly-separable version. The classic way to train an SVM on such spiral data is to replace the inner product $\langle x,z \rangle_\mathcal{X}$ with the Gaussian kernel (also called the radial basis function kernel)

$$K(x,z) = \exp\!\left(-\frac{\|x-z\|^2}{2\sigma^2}\right),$$

where $\|\cdot\|$ is the Euclidean norm and $\sigma$ is a parameter that governs the spread. You can verify that this Gaussian kernel is indeed a valid reproducing kernel by showing that it is positive semi-definite (Definition 5.1).
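
For reference, here is the kernel written out in numpy (a small sketch of my own); note that scikit-learn’s rbf kernel uses the parametrization $\exp(-\gamma\|x-z\|^2)$, so the gamma passed to SVC below corresponds to $1/(2\sigma^2)$.

import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel between two attribute vectors."""
    return np.exp(-np.sum((x - z)**2) / (2 * sigma**2))

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(gaussian_kernel(x, z))   # 0.3678..., i.e. exp(-1) for sigma = 1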

It turns out that the Hilbert space associated with the Gaussian kernel is infinite-dimensional. In other words, the problem is way more severe than being unable to efficiently compute inner products in a high-dimensional feature space; we wouldn’t even be able to compute the feature vectors in the first place! And despite this, we can still easily compute $K(x,z)$ in linear time. To summarize:

By using a reproducing kernel, we can efficiently compute implicit inner products in a high (or potentially infinite) dimensional feature space, without having to compute the feature vectors themselves.

Of course, the Gaussian kernel represents an extreme case, and not every reproducing kernel corresponds to infinite-dimensional features. Nonetheless, this should give you a sampling of the power of kernels.

And now, to complete the demonstration by fitting our SVM to the spiral data using a Gaussian kernel.

clf2 = svm.SVC(kernel='rbf', gamma=1.0, C=5.0)
clf2.fit(X, Y)
print(clf2.score(X, Y))

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
xnew = np.c_[xx.ravel(), yy.ravel()]
ynew = clf2.predict(xnew).reshape(xx.shape)

plt.figure(figsize=(6,4))
plt.plot(X[Y==0, 0], X[Y==0, 1], 'bo', markersize=3)
plt.plot(X[Y==1, 0], X[Y==1, 1], 'ro', markersize=3)
plt.xlabel('x1'); plt.ylabel('x2')
plt.pcolormesh(xx, yy, ynew, cmap='Paired', alpha=0.2)   # translucent decision regions
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap='Paired')
plt.show()

As you can see in the above diagram, the SVM has managed to fit a highly non-linear decision boundary pretty well (training accuracy was 92%). Thanks to the kernel trick, it has managed to efficiently learn a decision boundary in an infinite dimensional feature space with only 200 training examples per class!

Although this section discussed SVMs, the kernel trick can be applied to any algorithm in which the training examples $x^{(i)}$ appear only in an inner product, and nowhere else. In these so-called kernel methods, finding the right reproducing kernel to use for the particular algorithm and dataset at hand is key to successful classification or clustering.