Introduction to Gaussian Processes
Gaussian processes (GPs) are a cornerstone of modern machine learning. They are often used for non-parametric regression and classification, and extend the theory behind Gaussian distributions and Gaussian mixture models (GMMs), with strong and interesting theoretical ties to kernels and neural networks. While a Gaussian mixture model represents a distribution over values, a Gaussian process is a distribution over functions. This is easy enough to say, but what does it really mean? Let's take a look.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.RandomState(1999)
n_samples = 1000
X = rng.rand(n_samples)
y = np.sin(20 * X) + .05 * rng.randn(X.shape[0])
X_t = np.linspace(0, 1, 100)
y_t = np.sin(20 * X_t)
plt.scatter(X, y, color='steelblue', label='measured y')
plt.plot(X_t, y_t, linestyle='-', color='darkred', label='true y')
plt.title('Noisy Example Function')
plt.legend(loc='lower left')
Here we have a set of values, \(X\), and another set of values \(y\). The values of \(X\) are related to \(y\) by a function \(f(x)\), described by the equation \(y = \sin(C X) + \varepsilon\), where \(\varepsilon\) is some noise (in this case Gaussian noise with a standard deviation of .05) and \(C\) is some constant (in this case 20, which increases the frequency of the oscillations so things look nice).
This means that for any value \(x\) we put into the function, we get out some new value \(y\). If we did not know \(f(x)\), and were only given \(X\) and \(y\), we would be very interested in learning \(f(x)\) from the data. If we learned it perfectly, then we would be able to accurately predict any new \(y\) given an input \(x\).
This may not seem exciting, because this particular \(f(x)\) is boring. But imagine our \(f(x)\) is something complicated, like a price on the stock market, or energy demand, or the probability of being struck by lightning at a given location... it becomes a lot more interesting to learn \(f(x)\) from data! This is the general motivation behind many machine learning tasks, but this definition of learning the "most likely generating function" has a special importance for the Gaussian process.
In the plot above, the blue points represent data that has been measured, while the red line indicates the true generating function. We can see that the red line traces the mean of this particular function, while the errors around the red line (where the blue points fall) represent the covariance of this particular function.
rng = np.random.RandomState(1999)
n_samples = 1000
X = rng.rand(n_samples)
y = np.sin(20 * X) + .05 * rng.randn(X.shape[0])
plt.scatter(X, y, color='steelblue')
plt.title('Noisy Data')
Now imagine a case like the above, where the red line values are unknown. We have points \(X\), and measurements from those points \(y\). We can also look at the graph and approximate the red line from the previous graph running through the center of the blue points. If we do this procedure in a mathematical way, we are learning \(f(x)\) from the data!
This is basically how estimating the mean function with a Gaussian process works - given a set of existing points, we have mathematical tools for estimating the mean and covariance function for this particular set of data. We are also able to use our prior information (things like: this function repeats, \(X\) values near each other generate \(y\) values near each other, etc.) by picking certain formulas for the covariance function during the estimation process.
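Just to make "running a line through the center in a mathematical way" concrete, here is a quick throwaway sketch - a plain rolling average, not the GP machinery we will build below - that recovers a rough estimate of the red curve from the noisy measurements.
# Rough non-GP baseline: estimate the mean function with a rolling average.
# This is only an illustration - the rest of the post does this properly with a GP.
order = np.argsort(X)
window = 50
smoothed = np.convolve(y[order], np.ones(window) / window, mode='same')
plt.scatter(X, y, color='steelblue', alpha=.3, label='measured y')
plt.plot(X[order], smoothed, color='darkred', label='rolling average')
plt.title('Rolling Average Estimate of the Mean Function')
plt.legend(loc='lower left')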
However, there is a problem - if our measurements are very noisy it may be very difficult (or impossible!) to figure out \(f(x)\).
rng = np.random.RandomState(1999)
n_samples = 1000
X = rng.rand(n_samples)
y = np.sin(20 * X) + .95 * rng.randn(n_samples)
plt.scatter(X, y, color='steelblue')
plt.title('Really Noisy Data')
Looking at the above plot, it is easy to see that drawing the "red line" like before would be much more difficult, even though the generating function \(\sin\) is the same. In a sense, you could say that the distribution of possible functions that could generate those \(y\) values from \(X\) is very wide, and it is hard to find the "best guess" for \(f(x)\).
Well Isn't That Special
This is exactly what is meant by saying Gaussian processes are distributions over functions. Like a regular Gaussian distribution (or multivariate Gaussian), which is fully defined by its mean and covariance, a Gaussian process is fully defined by its mean function \(m(x)\) and covariance function \(K(x, x')\).
This covariance function (also called a kernel or correlation function in a bunch of other literature!) gives the pairwise distance between all combinations of points. I will use the name covariance function from here on, but it is important to know that covariance function, correlation function, and kernel function are used semi-interchangeably in the existing papers and examples! My thought is that a covariance function uses a kernel function to compute the variance in some kernel space - so you will see a function named covariance that takes a kernel argument later in the code. A great link on this (courtesy of mcoursen) is here. Let's walk through a simple example, modified from Christopher Fonnesbeck's code for Bios366.
We will need to start with some "initial guess" for both the mean function and the covariance function. The simplest guess is a mean of 0, with covariance given by evaluating our kernel function at \(0\). Though there are many different kernel functions, the exponential kernel is usually one of the first to try. I will be covering kernels in more detail in both this post and a follow-up, but to keep things simple we will gloss over the details.
Ultimately, a kernel is simply a function that takes in two matrices (or vectors) and compares every pair of samples in some space. In a linear space, this is as simple as np.dot(x, x.T) if x is a rows-as-samples matrix. The exponential kernel measures distances in a non-linear space, defined by a Gaussian transformation. This sounds pretty complicated, but thinking of these kernels as black box functions that return distances between samples is good enough to get through this post.
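As a quick throwaway sketch of that black box view, here are both ideas on three hand-picked 1-D points: the two points near each other end up with a similarity near 1 under the exponential kernel, while the distant point ends up near 0.
# Toy comparison of a linear kernel and an exponential kernel on three points.
pts = np.array([[0.], [0.1], [2.]])
linear_sim = np.dot(pts, pts.T)        # inner products in the original (linear) space
sq_dists = (pts - pts.T) ** 2          # every pairwise squared distance
exponential_sim = np.exp(-sq_dists)    # nearby points -> ~1, distant points -> ~0
print(exponential_sim)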
Our initial mean function is simply \(0\), and the covariance function gives an initial condition by calculating covariance(kernel, 0, 0). Using this as a starting place, we can visualize our initial function estimate, given no information besides a choice for the kernel.
# from mrmartin.ner/?p=223
def exponential_kernel(x1, x2):
    # Broadcasting tricks to get every pairwise distance.
    return np.exp(-(x1[np.newaxis, :, :] - x2[:, np.newaxis, :])[:, :, 0] ** 2).T

# Covariance calculation for a given kernel
def covariance(kernel, x1, x2):
    return kernel(x1, x2)
rng = np.random.RandomState(1999)
# Initial guess
kernel = exponential_kernel
init = np.zeros((1, 1))
sigma = covariance(kernel, init, init)
xpts = np.arange(-3, 3, step=0.01).reshape((-1, 1))
plt.errorbar(xpts.squeeze(), np.zeros(len(xpts)), yerr=sigma.squeeze(), capsize=0, color='steelblue')
plt.ylim(-3, 3)
plt.title("Initial guess")
Now that we have initialized the GP, we want to estimate a new \(y\) given a new input \(x\). Without any prior knowledge our guess will not be very good, which is represented by the wide blue band across the plot (our confidence bounds). Luckily, we have a set of \(x\) values that are paired with \(y\) values, called our training set, which we can use to learn a possible model. To make these updates, we will need a new tool: the conditional distribution.
The conditional formula is fairly straightforward mathematically, and is seen in many other works. For a full derivation, see the slides here or the tutorial here. I will simply state the key mathematics, and show code to compute it.
Conditionals of My Parole
One of the key formulas for the Gaussian process is the conditional function for multivariate Gaussian distributions. This is quite a mouthful, but the idea boils down to "Given my old x, and the y values for those x, what do I expect a new y (y*) to be for a new x (x*)?".
If we have no data, we have no idea what y* can be.
x* = 3, what is y*?
I have no idea, and this is a terrible question
We are then given the following information:
When x = 1, y = 2
When x = 2, y = 4
When x = 4, y = 8
When x = 5, y = 10
With a lot of prior data for x and y, we start to have a pretty strong intuition about a new y (y*) when given a new x (x*). Now, if asked again, what is your best guess for y*?
x* = 3, what is y*?
My best guess would be y* = 6
Technically, y* could be anything but judging by the past results, y* = 6 seems to be a reasonable guess.
The mathematical formula for this conditional distribution, with some Gaussian assumptions (this is assumed to be a Gaussian process, after all), is shown below. There are more complicated derivations, but since we typically assume \(\mu_x\) and \(\mu_y\) are both \(0\), this equation is pretty straightforward.
\(p(y_* | x_*, x, y) = \mathcal{N}(\Sigma_{xx_*}\Sigma_x^{-1}y, \Sigma_{x_*}-\Sigma_{xx_*}\Sigma_x^{-1}\Sigma_{xx_*}^T)\)
The conditional function below is the coded representation of this. Let's use it to make some plots of Gaussian process learning.
def conditional(x_new, x, y, kernel):
    cov_xxn = covariance(kernel, x_new, x)      # Sigma_{x x*}: new points vs. training points
    cov_x = covariance(kernel, x, x)            # Sigma_x: training points vs. themselves
    cov_xn = covariance(kernel, x_new, x_new)   # Sigma_{x*}: new points vs. themselves
    mean = cov_xxn.dot(np.linalg.pinv(cov_x)).dot(y)
    variance = cov_xn - cov_xxn.dot(np.linalg.pinv(cov_x)).dot(cov_xxn.T)
    return mean, variance
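As a quick sanity check on the toy numbers from the parole example: with a zero prior mean and this fairly short kernel length scale, a query point a full unit away from the nearest data gets pulled back toward \(0\), so the GP's guess at \(x_* = 3\) comes out noticeably below the "obvious" answer of 6. That shrinkage toward the prior is a feature of the model, not a bug in the code.
# Sanity check: query the conditional at x* = 3 using the toy data from above.
x_toy = np.array([[1.], [2.], [4.], [5.]])
y_toy = np.array([2., 4., 8., 10.])
mu_toy, var_toy = conditional(np.atleast_2d(3.), x_toy, y_toy, kernel=kernel)
print(mu_toy, np.diag(var_toy))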
# First point estimate
x_new = np.atleast_2d(1.)
# No conditional, this is the first value!
y_new = np.atleast_2d(0 + rng.randn())
x = x_new
y = y_new
# Plotting
y_pred, sigma_pred = conditional(xpts, x, y, kernel=kernel)
plt.errorbar(xpts.squeeze(), y_pred.squeeze(), yerr=np.diag(sigma_pred), capsize=0, color='steelblue')
plt.plot(x, y, color='darkred', marker='o', linestyle='')
plt.xlim(-3, 3)
plt.ylim(-3, 3)
plt.figure()
# Second point estimate
x_new = np.atleast_2d(-0.7)
mu, s = conditional(x_new, x, y, kernel=kernel)
y_new = np.atleast_2d(mu + np.diag(s) * rng.randn())