Gaussian Processes

  • In a Gaussian process, the function value at every point (training and test) is treated as a random variable
  • The resulting multivariate Gaussian distribution has one dimension per random variable
  • The posterior can be derived from the prior and the observations (see the conditional formulas below), starting from the joint distribution $$\begin{pmatrix} f \\ f_{*} \end{pmatrix} \sim N \left( \begin{pmatrix} \mu \\ \mu_{*} \end{pmatrix}, \begin{pmatrix} K & K_{*} \\ K_{*}^{T} & K_{**} \end{pmatrix} \right)$$

$f$, function values at the training points; $f_{*}$, function values at the test points

$\mu$, mean at the training points; $\mu_{*}$, mean at the test points

$K$, covariance of the training points with themselves; each entry is a kernel value measuring the similarity of two points

$K_{*}$, covariance between the training points and the test points

$K_{**}$, covariance of the test points with themselves
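
Conditioning the joint Gaussian on the observed training values gives the posterior over the test outputs (the standard GP conditional formulas, which the code below evaluates via a Cholesky factorization):

$$f_{*} \mid f \sim N \left( \mu_{*} + K_{*}^{T} K^{-1} (f - \mu),\; K_{**} - K_{*}^{T} K^{-1} K_{*} \right)$$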

To draw samples from a Gaussian with covariance $K$, Cholesky-decompose the covariance and transform standard normal draws: $$LL^{T} = K \qquad f_{*} = \mu + Lz, \quad z \sim N(0, I)$$
  • Pros
    • Non-parametric approach
    • Predictions can interpolate observations
    • Predictions are probabilistic so that one can compute confidence intervals
    • Flexibility: you can use different kernels
  • Cons
    • They lose efficiency in high-dimensional spaces, namely when the number of features exceeds a few dozen
In [134]:
import numpy as np
import matplotlib.pyplot as pl
In [135]:
def kernel(a, b, param):
    """Squared-exponential (RBF) kernel; param is the squared length-scale."""
    # Pairwise squared Euclidean distances between the rows of a and b
    sqdist = np.sum(a**2, 1).reshape(-1, 1) + np.sum(b**2, 1) - 2*np.dot(a, b.T)
    return np.exp(-.5 * (1/param) * sqdist)
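
As a quick sanity check (not part of the original notebook): a point's similarity with itself is 1, and values decay toward 0 as points move apart.

a = np.array([[0.0], [1.0], [3.0]])
K_demo = kernel(a, a, param=0.1)
print(np.diag(K_demo))  # [1. 1. 1.]; identical points have similarity 1
print(K_demo[0, 1])     # exp(-0.5 * (1/0.1) * 1) = exp(-5), nearly 0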
In [136]:
# Training data
Xtrain = np.array([-4, -3, -2, -1, 1]).reshape(5,1)
ytrain = np.sin(Xtrain) + 10

# Test data
n = 50
Xtest = np.linspace(-5, 5, n).reshape(-1,1) # 50*1
In [144]:
# Calculate Covariance Matrices
param = 0.1
K = kernel(Xtrain, Xtrain, param) # 5*5
K_s = kernel(Xtrain, Xtest, param) # 5*50
K_ss = kernel(Xtest, Xtest, param) # 50*50
In [145]:
# Calculate the posterior mean and standard deviation at the test points
L = np.linalg.cholesky(K + 0.00005*np.eye(len(Xtrain))) # 5*5; small jitter for numerical stability
Lk = np.linalg.solve(L, K_s) # 5*50; L^{-1} K_*
mu = np.dot(Lk.T, np.linalg.solve(L, ytrain)).reshape((n,)) # 50; posterior mean at the test points

stdv = np.sqrt(np.diag(K_ss) - np.sum(Lk**2, axis=0)) # 50; posterior std at the test points
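
Why this works: with $K = LL^{T}$ (and a zero prior mean assumed here, so $\mu = 0$ and $f = y_{train}$), the two triangular solves reproduce the conditional formulas without ever forming $K^{-1}$:

$$K_{*}^{T} K^{-1} y = (L^{-1} K_{*})^{T} (L^{-1} y), \qquad K_{*}^{T} K^{-1} K_{*} = (L^{-1} K_{*})^{T} (L^{-1} K_{*})$$

The diagonal of the second expression is the column-wise sum of squares of Lk, which is exactly what np.sum(Lk**2, axis=0) computes.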
In [146]:
# Sample from the posterior f(X_test | X_train, y_train)
L = np.linalg.cholesky(K_ss + 1e-6*np.eye(n) - np.dot(Lk.T, Lk)) # 50*50; Cholesky of the posterior covariance
f_post = mu.reshape(-1,1) + np.dot(L, np.random.normal(size=(n,3))) # 50*3; three posterior samples
In [140]:
pl.plot(Xtrain, ytrain, 'bs', ms=8) # training points
pl.plot(Xtest, f_post) # posterior samples
pl.gca().fill_between(Xtest.flat, mu-2*stdv, mu+2*stdv, color="#dddddd") # ~95% confidence band
pl.plot(Xtest, mu, 'r--', lw=2) # posterior mean
pl.axis([-5, 5, -5, 15])
pl.title('Three samples from the GP posterior')
pl.show()
In [149]:
# Sample from the prior f(X_test)
L = np.linalg.cholesky(K_ss + 1e-15*np.eye(n)) # 50*50; jitter keeps the factorization stable
f_prior = np.dot(L, np.random.normal(size=(n,3))) # 50*3; three prior samples, prior mean is 0
In [150]:
pl.plot(Xtest, f_prior)
pl.axis([-5, 5, -3, 3])
pl.title('Three samples from the GP prior')
pl.show()

Gaussian Process Regression (GPR)

In [153]:
from sklearn.datasets import make_friedman2
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel
In [154]:
X, y = make_friedman2(n_samples=500, noise=0, random_state=0)
In [159]:
kernel = DotProduct() + WhiteKernel()
model = GaussianProcessRegressor(kernel=kernel, random_state=0)

model.fit(X, y)
Out[159]:
GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=DotProduct(sigma_0=1) + WhiteKernel(noise_level=1),
                         n_restarts_optimizer=0, normalize_y=False,
                         optimizer='fmin_l_bfgs_b', random_state=0)
In [160]:
model.predict(X[:2,:], return_std=True)
Out[160]:
(array([653.08792288, 592.16905327]), array([316.68016218, 316.65121679]))
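
Beyond point predictions with uncertainties, the fitted model can also draw entire functions from the posterior with sample_y (a quick sketch; output not shown):

y_samples = model.sample_y(X[:2, :], n_samples=3, random_state=0) # shape (2, 3)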

Gaussian Process Classification (GPC)

In [164]:
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
In [166]:
X, y = load_iris(return_X_y=True)
In [168]:
kernel = 1.0 * RBF(1.0)
model = GaussianProcessClassifier(kernel=kernel, random_state=0)

model.fit(X, y)
Out[168]:
GaussianProcessClassifier(copy_X_train=True, kernel=1**2 * RBF(length_scale=1),
                          max_iter_predict=100, multi_class='one_vs_rest',
                          n_jobs=None, n_restarts_optimizer=0,
                          optimizer='fmin_l_bfgs_b', random_state=0,
                          warm_start=False)
In [171]:
model.predict_proba(X[:2,:])
Out[171]:
array([[0.83548752, 0.03228706, 0.13222543],
       [0.79064206, 0.06525643, 0.14410151]])
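
The hard class labels are simply the argmax of these probabilities, which model.predict returns directly (a quick sketch):

model.predict(X[:2, :]) # array([0, 0]); class 0 has the highest probability in both rows above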