Deep Gaussian processes with Latent Variables#

In this notebook, we explore the use of deep Gaussian processes [DL13] with latent variables to model a dataset with heteroscedastic noise. The model can be seen as a deep GP version of [DSHD18], or as performing variational inference in the models of [SDHD19]. We start by fitting a single-layer GP to show that it does not give a satisfactory fit to this noise.

This notebook is inspired by Prof. Neil Lawrence's deep Gaussian process talk, which we highly recommend watching.

[1]:
import tensorflow as tf
import gpflow
import gpflux
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

import tensorflow_probability as tfp
from sklearn.neighbors import KernelDensity

Load data#

The data comes from a motorcycle accident simulation [1] and shows some interesting behaviour, in particular the heteroscedastic nature of the noise.

[2]:
def motorcycle_data():
    """ Return inputs and outputs for the motorcycle dataset. We normalise the outputs. """
    import pandas as pd
    df = pd.read_csv("./data/motor.csv", index_col=0)
    X, Y = df["times"].values.reshape(-1, 1), df["accel"].values.reshape(-1, 1)
    Y = (Y - Y.mean()) / Y.std()
    X /= X.max()
    return X, Y
[3]:
X, Y = motorcycle_data()
num_data, input_dim = X.shape

X_MARGIN, Y_MARGIN = 0.1, 0.5
fig, ax = plt.subplots()
ax.scatter(X, Y, marker='x', color='k');
ax.set_ylim(Y.min() - Y_MARGIN, Y.max() + Y_MARGIN);
ax.set_xlim(X.min() - X_MARGIN, X.max() + X_MARGIN);
../_images/notebooks_deep_cde_4_0.png

Standard single layer Sparse Variational GP#

We first show that a single-layer SVGP performs poorly on this dataset. In the following code block we define the kernel, inducing variable, GP layer, and likelihood of this shallow GP:

[4]:
NUM_INDUCING = 20

kernel = gpflow.kernels.SquaredExponential()
inducing_variable = gpflow.inducing_variables.InducingPoints(
    np.linspace(X.min(), X.max(), NUM_INDUCING).reshape(-1, 1)
)
gp_layer = gpflux.layers.GPLayer(
    kernel, inducing_variable, num_data=num_data, num_latent_gps=1
)
likelihood_layer = gpflux.layers.LikelihoodLayer(gpflow.likelihoods.Gaussian(0.1))


We can now encapsulate gp_layer and likelihood_layer in a GPflux DeepGP model, train it, and plot the training loss:

[5]:

single_layer_dgp = gpflux.models.DeepGP([gp_layer], likelihood_layer)
model = single_layer_dgp.as_training_model()
model.compile(gpflow.keras.tf_keras.optimizers.Adam(0.01))

history = model.fit({"inputs": X, "targets": Y}, epochs=int(1e3), verbose=0)

fig, ax = plt.subplots()
ax.plot(history.history["loss"])
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
[5]:
Text(0, 0.5, 'Loss')
../_images/notebooks_deep_cde_8_2.png
[6]:
fig, ax = plt.subplots()
num_data_test = 200
X_test = np.linspace(X.min() - X_MARGIN, X.max() + X_MARGIN, num_data_test).reshape(-1, 1)
model = single_layer_dgp.as_prediction_model()
out = model(X_test)

mu = out.y_mean.numpy().squeeze()
var = out.y_var.numpy().squeeze()
X_test = X_test.squeeze()

for i in [1, 2]:
    lower = mu - i * np.sqrt(var)
    upper = mu + i * np.sqrt(var)
    ax.fill_between(X_test, lower, upper, color="C1", alpha=0.3)

ax.set_ylim(Y.min() - Y_MARGIN, Y.max() + Y_MARGIN)
ax.set_xlim(X.min() - X_MARGIN, X.max() + X_MARGIN)
ax.plot(X, Y, "kx", alpha=0.5)
ax.plot(X_test, mu, "C1")
ax.set_xlabel('time')
ax.set_ylabel('acc')

[6]:
Text(0, 0.5, 'acc')
../_images/notebooks_deep_cde_9_1.png

The error bars of the single-layer model are unsatisfactory: the noise variance is overestimated on the left and right of the input range, where the data is much less noisy than in the middle.
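We can quantify this with a quick check. The following is a minimal sketch (an extra cell, not part of the original notebook) that compares the SVGP's predicted variance with the empirical squared residuals near the edges of the input range; the 0.2 and 0.8 cut-offs are illustrative choices.

# Hedged sketch: compare predicted noise level with empirical residuals in the
# low-noise regions at the edges of the (rescaled) input range.
pred_model = single_layer_dgp.as_prediction_model()
out_train = pred_model(X)
sq_residuals = (Y - out_train.y_mean.numpy()) ** 2
pred_var = out_train.y_var.numpy()

edges = (X < 0.2) | (X > 0.8)  # assumed cut-offs for the "quiet" regions
print("mean predicted variance (edges):", pred_var[edges].mean())
print("mean squared residual   (edges):", sq_residuals[edges].mean())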

Latent Variable Layer#

This layer concatenates the inputs with a latent variable. See Dutordoir, Salimbeni et al., Gaussian Process Conditional Density Estimation (2018) [DSHD18] for full details. We choose a one-dimensional latent variable and parameterise its variational posterior directly, with one distribution per datapoint. This means that we do not need to train a recognition (encoder) network, which simplifies fitting but is only feasible for small datasets, as is the case here.

[7]:
w_dim = 1
prior_means = np.zeros(w_dim)
prior_std = np.ones(w_dim)
encoder = gpflux.encoders.DirectlyParameterizedNormalDiag(num_data, w_dim)
prior = tfp.distributions.MultivariateNormalDiag(prior_means, prior_std)
lv = gpflux.layers.LatentVariableLayer(prior, encoder)
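As a quick sanity check (an extra cell, not in the original notebook), evaluating the layer outside of training samples the latent variable from the prior and concatenates it to the inputs, so the output gains one extra column:

# Hedged sketch: outside of training, the layer draws w ~ N(0, 1) from the prior
# and concatenates it to the inputs along the last axis.
augmented = lv(X)
print(augmented.shape)  # expected: (num_data, 1 + w_dim), i.e. (94, 2)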

First GP layer#

This GP layer has a two-dimensional input, because it acts on the concatenation of the original input and the one-dimensional latent variable. We use a Squared Exponential kernel, a zero mean function, and inducing points whose pseudo-input locations are initialised to span the input range in the first dimension and the latent prior in the second.

[8]:

kernel = gpflow.kernels.SquaredExponential(lengthscales=[0.05, 0.2], variance=1.0)
inducing_variable = gpflow.inducing_variables.InducingPoints(
    np.concatenate(
        [
            np.linspace(X.min(), X.max(), NUM_INDUCING).reshape(-1, 1),
            np.random.randn(NUM_INDUCING, 1),
        ],
        axis=1,
    )
)
gp_layer = gpflux.layers.GPLayer(
    kernel,
    inducing_variable,
    num_data=num_data,
    num_latent_gps=1,
    mean_function=gpflow.mean_functions.Zero(),
)

Second GP layer#

The final GP layer uses a Squared Exponential kernel and an Identity mean function. We initialise its variational covariance q_sqrt to a small value, which helps optimisation.

[9]:

kernel = gpflow.kernels.SquaredExponential()
inducing_variable = gpflow.inducing_variables.InducingPoints(
    np.random.randn(NUM_INDUCING, 1),
)
gp_layer2 = gpflux.layers.GPLayer(
    kernel,
    inducing_variable,
    num_data=num_data,
    num_latent_gps=1,
    mean_function=gpflow.mean_functions.Identity(),
)
gp_layer2.q_sqrt.assign(gp_layer.q_sqrt * 1e-5);
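The last line above shrinks the variational covariance of this layer. A minimal check (an extra cell, not in the original notebook) confirms that the layer therefore starts out almost deterministic around its Identity mean function:

# Hedged check: q_sqrt of the final layer is now of order 1e-5, so early in
# training this layer passes its input through almost unchanged.
print(np.abs(gp_layer2.q_sqrt.numpy()).max())  # expected to be roughly 1e-5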
[10]:

likelihood_layer = gpflux.layers.LikelihoodLayer(gpflow.likelihoods.Gaussian(0.01))
gpflow.set_trainable(likelihood_layer, False)

dgp = gpflux.models.DeepGP([lv, gp_layer, gp_layer2], likelihood_layer)
gpflow.utilities.print_summary(dgp, fmt="notebook")
| name | class | transform | prior | trainable | shape | dtype | value |
|------|-------|-----------|-------|-----------|-------|-------|-------|
| DeepGP.f_layers[0]._layers[0].means / DeepGP.f_layers[0].encoder.means | Parameter | Identity |  | True | (94, 1) | float64 | [[2.01673752e-02... |
| DeepGP.f_layers[0]._layers[0].stds / DeepGP.f_layers[0].encoder.stds | Parameter | Softplus |  | True | (94, 1) | float64 | [[1.e-05... |
| DeepGP.f_layers[1].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0 |
| DeepGP.f_layers[1].kernel.lengthscales | Parameter | Softplus |  | True | (2,) | float64 | [0.05 0.2 ] |
| DeepGP.f_layers[1].inducing_variable.Z | Parameter | Identity |  | True | (20, 2) | float64 | [[0.04166667, 0.66191201... |
| DeepGP.f_layers[1].q_mu | Parameter | Identity |  | True | (20, 1) | float64 | [[0.... |
| DeepGP.f_layers[1].q_sqrt | Parameter | FillTriangular |  | True | (1, 20, 20) | float64 | [[[1., 0., 0.... |
| DeepGP.f_layers[2].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0 |
| DeepGP.f_layers[2].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0 |
| DeepGP.f_layers[2].inducing_variable.Z | Parameter | Identity |  | True | (20, 1) | float64 | [[2.48779388... |
| DeepGP.f_layers[2].q_mu | Parameter | Identity |  | True | (20, 1) | float64 | [[0.... |
| DeepGP.f_layers[2].q_sqrt | Parameter | FillTriangular |  | True | (1, 20, 20) | float64 | [[[1.e-05, 0.e+00, 0.e+00... |
| DeepGP.likelihood_layer.likelihood.variance | Parameter | Softplus + Shift |  | False | () | float64 | 0.009999999999999998 |

Fit#

We can now fit the model. Because we use a DirectlyParameterizedNormalDiag encoder, it is important to set the batch size to the full dataset and to turn off shuffling, so that each datapoint is always matched with its own latent variable. With an amortised encoder (recognition) network this would not be necessary.

[17]:
model = dgp.as_training_model()
model.compile(gpflow.keras.tf_keras.optimizers.Adam(0.005))
history = model.fit({"inputs": X, "targets": Y}, epochs=int(20e3), verbose=0, batch_size=num_data, shuffle=False)
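As before, it can be useful to inspect the training loss to check that 20,000 epochs were enough for the optimiser to converge. A minimal sketch (an extra cell, not in the original notebook) reusing the history object returned above:

# Plot the training loss (the negative ELBO, up to constants) over epochs.
fig, ax = plt.subplots()
ax.plot(history.history["loss"])
ax.set_xlabel("Epoch")
ax.set_ylabel("Loss")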
[18]:
gpflow.utilities.print_summary(dgp, fmt="notebook")
| name | class | transform | prior | trainable | shape | dtype | value |
|------|-------|-----------|-------|-----------|-------|-------|-------|
| DeepGP.f_layers[0]._metrics[0]._non_trainable_weights[0] / DeepGP.f_layers[0]._metrics[0].total | ResourceVariable |  |  | False | () | float64 | 1.0780249257681602 |
| DeepGP.f_layers[0]._metrics[0]._non_trainable_weights[1] / DeepGP.f_layers[0]._metrics[0].count | ResourceVariable |  |  | False | () | float64 | 1.0 |
| DeepGP.f_layers[0]._layers[0].means / DeepGP.f_layers[0].encoder.means | Parameter | Identity |  | True | (94, 1) | float64 | [[0.03641934... |
| DeepGP.f_layers[0]._layers[0].stds / DeepGP.f_layers[0].encoder.stds | Parameter | Softplus |  | True | (94, 1) | float64 | [[0.8204074... |
| DeepGP.f_layers[1]._metrics[0]._non_trainable_weights[0] / DeepGP.f_layers[1]._metrics[0].total | ResourceVariable |  |  | False | () | float64 | 0.4804222860958632 |
| DeepGP.f_layers[1]._metrics[0]._non_trainable_weights[1] / DeepGP.f_layers[1]._metrics[0].count | ResourceVariable |  |  | False | () | float64 | 1.0 |
| DeepGP.f_layers[1].kernel.variance | Parameter | Softplus |  | True | () | float64 | 0.8957803239319818 |
| DeepGP.f_layers[1].kernel.lengthscales | Parameter | Softplus |  | True | (2,) | float64 | [0.13181667 4.24491404] |
| DeepGP.f_layers[1].inducing_variable.Z | Parameter | Identity |  | True | (20, 2) | float64 | [[0.10892382, -0.15883231... |
| DeepGP.f_layers[1].q_mu | Parameter | Identity |  | True | (20, 1) | float64 | [[3.44379760e-02... |
| DeepGP.f_layers[1].q_sqrt | Parameter | FillTriangular |  | True | (1, 20, 20) | float64 | [[[8.69748551e-02, 0.00000000e+00, 0.00000000e+00... |
| DeepGP.f_layers[2]._metrics[0]._non_trainable_weights[0] / DeepGP.f_layers[2]._metrics[0].total | ResourceVariable |  |  | False | () | float64 | 0.14355680909786334 |
| DeepGP.f_layers[2]._metrics[0]._non_trainable_weights[1] / DeepGP.f_layers[2]._metrics[0].count | ResourceVariable |  |  | False | () | float64 | 1.0 |
| DeepGP.f_layers[2].kernel.variance | Parameter | Softplus |  | True | () | float64 | 0.10763915659754926 |
| DeepGP.f_layers[2].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 0.7436233151733747 |
| DeepGP.f_layers[2].inducing_variable.Z | Parameter | Identity |  | True | (20, 1) | float64 | [[3.28383233e+00... |
| DeepGP.f_layers[2].q_mu | Parameter | Identity |  | True | (20, 1) | float64 | [[-0.25372763... |
| DeepGP.f_layers[2].q_sqrt | Parameter | FillTriangular |  | True | (1, 20, 20) | float64 | [[[9.55995241e-01, 0.00000000e+00, 0.00000000e+00... |
| DeepGP.likelihood_layer.likelihood.variance | Parameter | Softplus + Shift |  | False | () | float64 | 0.009999999999999998 |

Prediction and plotting code#

[19]:
Xs = np.linspace(X.min() - X_MARGIN, X.max() + X_MARGIN, num_data_test).reshape(-1, 1)

def predict_y_samples(prediction_model, Xs, num_samples=25):
    samples = []
    for i in tqdm(range(num_samples)):
        out = prediction_model(Xs)
        s = out.y_mean + out.y_var ** .5 * tf.random.normal(tf.shape(out.y_mean), dtype=out.y_mean.dtype)
        samples.append(s)
    return tf.concat(samples, axis=1)


def plot_samples(ax, N_samples=25):
    samples = predict_y_samples(dgp.as_prediction_model(), Xs, N_samples).numpy().T
    Xs_tiled = np.tile(Xs, [N_samples, 1])
    ax.scatter(Xs_tiled.flatten(), samples.flatten(), marker='.', alpha=0.2, color='C0')
    ax.set_ylim(-2.5, 2.5)
    ax.set_xlim(min(Xs), max(Xs))
    ax.scatter(X, Y, marker='.', color='C1')


def plot_latent_variables(ax):
    for l in dgp.f_layers:
        if isinstance(l, gpflux.layers.LatentVariableLayer):
            m = l.encoder.means.numpy()
            s = l.encoder.stds.numpy()
            ax.errorbar(X.flatten(), m.flatten(), yerr=s.flatten(), fmt='o')
            return

[20]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
plot_samples(ax1)
plot_latent_variables(ax2)
100%|██████████| 25/25 [00:01<00:00, 24.88it/s]
../_images/notebooks_deep_cde_25_1.png

On the left we show the dataset together with posterior samples of \(y\). On the right we plot the mean and standard deviation of the latent variable associated with each datapoint.

[23]:
def plot_mean_and_var(ax, samples=None, N_samples=5_000):
    if samples is None:
        samples = predict_y_samples(dgp.as_prediction_model(), Xs, N_samples).numpy().T

    m = np.mean(samples, 0).flatten()
    v = np.var(samples, 0).flatten()

    ax.plot(Xs.flatten(), m, "C1")
    for i in [1, 2]:
        lower = m - i * np.sqrt(v)
        upper = m + i * np.sqrt(v)
        ax.fill_between(Xs.flatten(), lower, upper, color="C1", alpha=0.3)
    ax.plot(X, Y, "kx", alpha=0.5)
    ax.set_ylim(Y.min() - Y_MARGIN, Y.max() + Y_MARGIN)
    ax.set_xlabel("time")
    ax.set_ylabel("acceleration")
    return samples

[24]:
fig, ax = plt.subplots()
plot_mean_and_var(ax);
100%|██████████| 5000/5000 [03:24<00:00, 24.44it/s]
../_images/notebooks_deep_cde_28_1.png

The deep GP model captures both the heteroscedastic noise in the dataset and the relatively sharp transition around \(x = 0.3\).
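The samples can also be used to inspect the full conditional density \(p(y \mid x)\) at a fixed input, for example with the KernelDensity estimator imported at the top of the notebook. A minimal sketch (an extra cell, not in the original notebook); the slice location \(x = 0.4\), the number of samples, and the bandwidth are illustrative choices.

# Hedged sketch: kernel density estimate of p(y | x) at a single test location,
# using y-samples drawn with predict_y_samples.
samples = predict_y_samples(dgp.as_prediction_model(), Xs, num_samples=500).numpy()  # (num_data_test, 500)
idx = np.argmin(np.abs(Xs.squeeze() - 0.4))  # test point closest to x = 0.4
ys = samples[idx].reshape(-1, 1)

kde = KernelDensity(bandwidth=0.1).fit(ys)
y_grid = np.linspace(ys.min() - 0.5, ys.max() + 0.5, 200).reshape(-1, 1)
density = np.exp(kde.score_samples(y_grid))

fig, ax = plt.subplots()
ax.plot(y_grid, density)
ax.set_xlabel("y")
ax.set_ylabel("p(y | x = 0.4)")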

Conclusion#

In this notebook we used GPflux to build a two-layer deep Gaussian process with a latent-variable bottom layer, and showed that it models the heteroscedastic noise in the motorcycle dataset.

[1] Silverman, B. W. (1985) “Some aspects of the spline smoothing approach to non-parametric curve fitting”. Journal of the Royal Statistical Society series B 47, 1-52.