Deep Gaussian processes with Latent Variables#

In this notebook, we explore the use of deep Gaussian processes [DL13] combined with latent variables to model a dataset with heteroscedastic noise. The model can be seen as a deep GP extension of [DSHD18], or as performing variational inference in the models of [SDHD19]. We start by fitting a single-layer GP model to show that it does not capture this noise satisfactorily.

This notebook is inspired by Prof. Neil Lawrence's talk on deep Gaussian processes, which we highly recommend watching.

[1]:
import tensorflow as tf
import gpflow
import gpflux
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

import tensorflow_probability as tfp
from sklearn.neighbors import KernelDensity

Load data#

The data comes from a simulated motorcycle accident [1] and shows some interesting behaviour, in particular the heteroscedastic nature of the noise: the spread of the measurements varies considerably across the input range.

[2]:
def motorcycle_data():
    """ Return inputs and outputs for the motorcycle dataset. We normalise the outputs. """
    import pandas as pd
    df = pd.read_csv("./data/motor.csv", index_col=0)
    X, Y = df["times"].values.reshape(-1, 1), df["accel"].values.reshape(-1, 1)
    Y = (Y - Y.mean()) / Y.std()
    X /= X.max()
    return X, Y
[3]:
X, Y = motorcycle_data()
num_data, d_xim = X.shape

X_MARGIN, Y_MARGIN = 0.1, 0.5
fig, ax = plt.subplots()
ax.scatter(X, Y, marker='x', color='k');
ax.set_ylim(Y.min() - Y_MARGIN, Y.max() + Y_MARGIN);
ax.set_xlim(X.min() - X_MARGIN, X.max() + X_MARGIN);
../_images/notebooks_deep_cde_4_0.png
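The varying spread is already visible in the scatter plot above. As a rough, optional check we can bin the inputs and compute the empirical standard deviation of the outputs per bin; the number of bins below is an arbitrary choice made purely for illustration.

# Rough check of the heteroscedastic noise: empirical output std per input bin.
bin_edges = np.linspace(X.min(), X.max(), 9)  # 8 bins, an arbitrary choice
bin_ids = np.digitize(X.flatten(), bin_edges[1:-1])
for b in range(len(bin_edges) - 1):
    y_bin = Y.flatten()[bin_ids == b]
    if len(y_bin) > 1:
        print(f"bin {b}: {len(y_bin):3d} points, std = {y_bin.std():.2f}")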

Standard single layer Sparse Variational GP#

We first show that a single layer SVGP performs quite poorly on this dataset. In the following code block we define the kernel, inducing variable, GP layer and likelihood of the shallow GP:

[4]:
NUM_INDUCING = 20

kernel = gpflow.kernels.SquaredExponential()
inducing_variable = gpflow.inducing_variables.InducingPoints(
    np.linspace(X.min(), X.max(), NUM_INDUCING).reshape(-1, 1)
)
gp_layer = gpflux.layers.GPLayer(
    kernel, inducing_variable, num_data=num_data, num_latent_gps=1
)
likelihood_layer = gpflux.layers.LikelihoodLayer(gpflow.likelihoods.Gaussian(0.1))


We can now wrap gp_layer and likelihood_layer in a GPflux DeepGP model, compile it with the Adam optimiser, fit it, and plot the training loss:

[5]:

single_layer_dgp = gpflux.models.DeepGP([gp_layer], likelihood_layer)
model = single_layer_dgp.as_training_model()
model.compile(tf.optimizers.Adam(0.01))
history = model.fit({"inputs": X, "targets": Y}, epochs=int(1e3), verbose=0)

fig, ax = plt.subplots()
ax.plot(history.history["loss"])
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
[5]:
Text(0, 0.5, 'Loss')
../_images/notebooks_deep_cde_8_2.png
[6]:
fig, ax = plt.subplots()
num_data_test = 200
X_test = np.linspace(X.min() - X_MARGIN, X.max() + X_MARGIN, num_data_test).reshape(-1, 1)
model = single_layer_dgp.as_prediction_model()
out = model(X_test)

mu = out.y_mean.numpy().squeeze()
var = out.y_var.numpy().squeeze()
X_test = X_test.squeeze()

for i in [1, 2]:
    lower = mu - i * np.sqrt(var)
    upper = mu + i * np.sqrt(var)
    ax.fill_between(X_test, lower, upper, color="C1", alpha=0.3)

ax.set_ylim(Y.min() - Y_MARGIN, Y.max() + Y_MARGIN)
ax.set_xlim(X.min() - X_MARGIN, X.max() + X_MARGIN)
ax.plot(X, Y, "kx", alpha=0.5)
ax.plot(X_test, mu, "C1")
ax.set_xlabel('time')
ax.set_ylabel('acc')

[6]:
Text(0, 0.5, 'acc')
../_images/notebooks_deep_cde_9_1.png

The error bars of the single-layer model are not satisfactory: the Gaussian likelihood has a single, constant noise variance, so the noise is overestimated on the left and right of the input range, where the data is almost noise-free.
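We can make this concrete by inspecting the noise variance the model has learned, which is a compromise between the noisy centre and the quiet ends of the input range. A quick check, assuming the LikelihoodLayer exposes its GPflow likelihood as .likelihood:

# One constant noise variance is shared across the whole input range.
print(likelihood_layer.likelihood.variance.numpy())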

Latent Variable Layer#

This layer concatenates the inputs with a latent variable. See Dutordoir, Salimbeni et al., Gaussian Process Conditional Density Estimation (2018) [DSHD18] for full details. We choose a one-dimensional latent variable and directly parameterise its variational posterior, with one mean and standard deviation per datapoint. This means that we do not need to train a recognition network, which simplifies fitting, but it is only feasible for small datasets such as this one.
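Concretely, for each datapoint \(x_n\) the layer introduces a latent variable \(w_n\) with prior \(w_n \sim \mathcal{N}(0, 1)\) and variational posterior \(q(w_n) = \mathcal{N}(m_n, s_n^2)\), and passes the concatenation \([x_n, w_n]\) on to the next layer; the per-datapoint \(m_n\) and \(s_n\) are the directly parameterised quantities trained below.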

[7]:
w_dim = 1
prior_means = np.zeros(w_dim)
prior_std = np.ones(w_dim)
encoder = gpflux.encoders.DirectlyParameterizedNormalDiag(num_data, w_dim)
prior = tfp.distributions.MultivariateNormalDiag(prior_means, prior_std)
lv = gpflux.layers.LatentVariableLayer(prior, encoder)
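Each input is thus augmented with a latent sample before it reaches the first GP layer, so a one-dimensional input becomes two-dimensional. A rough numpy illustration of the shapes (not using the layer itself):

# Purely illustrative: the layer concatenates each input with a latent sample.
w_sample = np.random.randn(num_data, w_dim)  # a draw from the N(0, 1) prior
print(np.concatenate([X, w_sample], axis=-1).shape)  # (94, 2), i.e. [x_n, w_n]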

First GP layer#

This GP layer has a two-dimensional input because it acts on the concatenation of the original input and the one-dimensional latent variable. We use a Squared Exponential kernel, a zero mean function, and inducing points whose pseudo-input locations cover the input range in the first dimension and the latent prior in the second.

[8]:

kernel = gpflow.kernels.SquaredExponential(lengthscales=[0.05, 0.2], variance=1.0)
inducing_variable = gpflow.inducing_variables.InducingPoints(
    np.concatenate(
        [
            np.linspace(X.min(), X.max(), NUM_INDUCING).reshape(-1, 1),  # input dimension
            np.random.randn(NUM_INDUCING, 1),  # latent variable dimension
        ],
        axis=1,
    )
)
gp_layer = gpflux.layers.GPLayer(
    kernel,
    inducing_variable,
    num_data=num_data,
    num_latent_gps=1,
    mean_function=gpflow.mean_functions.Zero(),
)

Second GP layer#

The final GP layer uses a Squared Exponential kernel and an Identity mean function. We initialise its variational covariance (q_sqrt) to a small value so that, at the start of training, the layer stays close to its mean function.

[9]:

kernel = gpflow.kernels.SquaredExponential()
inducing_variable = gpflow.inducing_variables.InducingPoints(
    np.random.randn(NUM_INDUCING, 1),
)
gp_layer2 = gpflux.layers.GPLayer(
    kernel,
    inducing_variable,
    num_data=num_data,
    num_latent_gps=1,
    mean_function=gpflow.mean_functions.Identity(),
)
# Small initial variational covariance: the layer starts out close to its mean function.
gp_layer2.q_sqrt.assign(gp_layer.q_sqrt * 1e-5);
[10]:

likelihood_layer = gpflux.layers.LikelihoodLayer(gpflow.likelihoods.Gaussian(0.01))
gpflow.set_trainable(likelihood_layer, False)  # keep the observation noise fixed

dgp = gpflux.models.DeepGP([lv, gp_layer, gp_layer2], likelihood_layer)
gpflow.utilities.print_summary(dgp, fmt="notebook")
name                                         class      transform         prior  trainable  shape        dtype    value
DeepGP.f_layers[0].encoder.means             Parameter  Identity                 True       (94, 1)      float64  [[2.01673752e-02...
DeepGP.f_layers[0].encoder.stds              Parameter  Softplus                 True       (94, 1)      float64  [[1.e-05...
DeepGP.f_layers[1].kernel.variance           Parameter  Softplus                 True       ()           float64  1.0
DeepGP.f_layers[1].kernel.lengthscales       Parameter  Softplus                 True       (2,)         float64  [0.05 0.2 ]
DeepGP.f_layers[1].inducing_variable.Z       Parameter  Identity                 True       (20, 2)      float64  [[0.04166667, 0.66191201...
DeepGP.f_layers[1].q_mu                      Parameter  Identity                 True       (20, 1)      float64  [[0....
DeepGP.f_layers[1].q_sqrt                    Parameter  FillTriangular           True       (1, 20, 20)  float64  [[[1., 0., 0....
DeepGP.f_layers[2].kernel.variance           Parameter  Softplus                 True       ()           float64  1.0
DeepGP.f_layers[2].kernel.lengthscales       Parameter  Softplus                 True       ()           float64  1.0
DeepGP.f_layers[2].inducing_variable.Z       Parameter  Identity                 True       (20, 1)      float64  [[2.48779388...
DeepGP.f_layers[2].q_mu                      Parameter  Identity                 True       (20, 1)      float64  [[0....
DeepGP.f_layers[2].q_sqrt                    Parameter  FillTriangular           True       (1, 20, 20)  float64  [[[1.e-05, 0.e+00, 0.e+00...
DeepGP.likelihood_layer.likelihood.variance  Parameter  Softplus + Shift         False      ()           float64  0.009999999999999998

Fit#

We can now fit the model. Because the latent variables are directly parameterised, one per datapoint (the DirectlyParameterizedNormalDiag encoder), it is important to set the batch size to the full number of datapoints and to turn off shuffling, so that each datapoint is always paired with its own latent variable. With an amortised encoder (recognition network) this would not be necessary.

[17]:
model = dgp.as_training_model()
model.compile(tf.optimizers.Adam(0.005))
history = model.fit({"inputs": X, "targets": Y}, epochs=int(20e3), verbose=0, batch_size=num_data, shuffle=False)
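As with the single-layer model, we can optionally plot the training loss returned by model.fit to check that the optimisation has converged:

fig, ax = plt.subplots()
ax.plot(history.history["loss"])
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')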
[18]:
gpflow.utilities.print_summary(dgp, fmt="notebook")
name                                          class             transform         prior  trainable  shape        dtype    value
DeepGP.f_layers[0]._metrics[0].total          ResourceVariable                            False      ()           float64  1.0780249257681602
DeepGP.f_layers[0]._metrics[0].count          ResourceVariable                            False      ()           float64  1.0
DeepGP.f_layers[0].encoder.means              Parameter         Identity                  True       (94, 1)      float64  [[0.03641934...
DeepGP.f_layers[0].encoder.stds               Parameter         Softplus                  True       (94, 1)      float64  [[0.8204074...
DeepGP.f_layers[1]._metrics[0].total          ResourceVariable                            False      ()           float64  0.4804222860958632
DeepGP.f_layers[1]._metrics[0].count          ResourceVariable                            False      ()           float64  1.0
DeepGP.f_layers[1].kernel.variance            Parameter         Softplus                  True       ()           float64  0.8957803239319818
DeepGP.f_layers[1].kernel.lengthscales        Parameter         Softplus                  True       (2,)         float64  [0.13181667 4.24491404]
DeepGP.f_layers[1].inducing_variable.Z        Parameter         Identity                  True       (20, 2)      float64  [[0.10892382, -0.15883231...
DeepGP.f_layers[1].q_mu                       Parameter         Identity                  True       (20, 1)      float64  [[3.44379760e-02...
DeepGP.f_layers[1].q_sqrt                     Parameter         FillTriangular            True       (1, 20, 20)  float64  [[[8.69748551e-02, 0.00000000e+00, 0.00000000e+00...
DeepGP.f_layers[2]._metrics[0].total          ResourceVariable                            False      ()           float64  0.14355680909786334
DeepGP.f_layers[2]._metrics[0].count          ResourceVariable                            False      ()           float64  1.0
DeepGP.f_layers[2].kernel.variance            Parameter         Softplus                  True       ()           float64  0.10763915659754926
DeepGP.f_layers[2].kernel.lengthscales        Parameter         Softplus                  True       ()           float64  0.7436233151733747
DeepGP.f_layers[2].inducing_variable.Z        Parameter         Identity                  True       (20, 1)      float64  [[3.28383233e+00...
DeepGP.f_layers[2].q_mu                       Parameter         Identity                  True       (20, 1)      float64  [[-0.25372763...
DeepGP.f_layers[2].q_sqrt                     Parameter         FillTriangular            True       (1, 20, 20)  float64  [[[9.55995241e-01, 0.00000000e+00, 0.00000000e+00...
DeepGP.likelihood_layer.likelihood.variance   Parameter         Softplus + Shift          False      ()           float64  0.009999999999999998

Prediction and plotting code#

[19]:
Xs = np.linspace(X.min() - X_MARGIN, X.max() + X_MARGIN, num_data_test).reshape(-1, 1)

def predict_y_samples(prediction_model, Xs, num_samples=25):
    samples = []
    for i in tqdm(range(num_samples)):
        # Each forward pass samples the latent variables and the inner layers anew,
        # so repeated calls give samples from the full predictive distribution.
        out = prediction_model(Xs)
        s = out.y_mean + out.y_var ** 0.5 * tf.random.normal(tf.shape(out.y_mean), dtype=out.y_mean.dtype)
        samples.append(s)
    return tf.concat(samples, axis=1)


def plot_samples(ax, N_samples=25):
    samples = predict_y_samples(dgp.as_prediction_model(), Xs, N_samples).numpy().T
    Xs_tiled = np.tile(Xs, [N_samples, 1])
    ax.scatter(Xs_tiled.flatten(), samples.flatten(), marker='.', alpha=0.2, color='C0')
    ax.set_ylim(-2.5, 2.5)
    ax.set_xlim(Xs.min(), Xs.max())
    ax.scatter(X, Y, marker='.', color='C1')


def plot_latent_variables(ax):
    for l in dgp.f_layers:
        if isinstance(l, gpflux.layers.LatentVariableLayer):
            m = l.encoder.means.numpy()
            s = l.encoder.stds.numpy()
            ax.errorbar(X.flatten(), m.flatten(), yerr=s.flatten(), fmt='o')
            return

[20]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
plot_samples(ax1)
plot_latent_variables(ax2)
100%|██████████| 25/25 [00:01<00:00, 24.88it/s]
../_images/notebooks_deep_cde_25_1.png

On the left we show the dataset and posterior samples of \(y\). On the right we plot the mean and standard deviation of the latent variable associated with each datapoint.

[23]:
def plot_mean_and_var(ax, samples=None, N_samples=5_000):
    if samples is None:
        samples = predict_y_samples(dgp.as_prediction_model(), Xs, N_samples).numpy().T

    m = np.mean(samples, 0).flatten()
    v = np.var(samples, 0).flatten()

    ax.plot(Xs.flatten(), m, "C1")
    for i in [1, 2]:
        lower = m - i * np.sqrt(v)
        upper = m + i * np.sqrt(v)
        ax.fill_between(Xs.flatten(), lower, upper, color="C1", alpha=0.3)
    ax.plot(X, Y, "kx", alpha=0.5)
    ax.set_ylim(Y.min() - Y_MARGIN, Y.max() + Y_MARGIN)
    ax.set_xlabel("time")
    ax.set_ylabel("acceleration")
    return samples

[24]:
fig, ax = plt.subplots()
plot_mean_and_var(ax);
100%|██████████| 5000/5000 [03:24<00:00, 24.44it/s]
../_images/notebooks_deep_cde_28_1.png

The deep GP model captures the heteroscedastic noise in the dataset, as well as the fairly sharp transition around \(0.3\).

Conclusion#

In this notebook we built a two-layer deep Gaussian process with a latent variable layer at its base to model a heteroscedastic dataset using GPflux.

[1] Silverman, B. W. (1985). "Some aspects of the spline smoothing approach to non-parametric regression curve fitting". Journal of the Royal Statistical Society, Series B, 47(1), 1–52.