Using deep Gaussian processes with GPflux for Bayesian optimization.#

[1]:
import numpy as np
import tensorflow as tf

np.random.seed(1794)
tf.random.set_seed(1794)

Describe the problem#

In this notebook, we show how to use deep Gaussian processes (DGPs) for Bayesian optimization using Trieste and GPflux. DGPs may be better for modeling non-stationary objective functions than standard GP surrogates, as discussed in [DKvdH+17, HBB+19].

In this example, we look to find the minimum value of the two- and five-dimensional Michalewicz functions over the hypercubes \([0, pi]^2\)/\([0, pi]^5\). We compare a two-layer DGP model with GPR, using Thompson sampling for both.

The Michalewicz functions are highly non-stationary and have a global minimum that’s hard to find, so DGPs might be more suitable than standard GPs, which may struggle because they typically have stationary kernels that cannot easily model non-stationarities.

[2]:
import gpflow
from trieste.objectives import Michalewicz2, Michalewicz5
from trieste.objectives.utils import mk_observer
from trieste.experimental.plotting import plot_function_plotly

function = Michalewicz2.objective
F_MINIMIZER = Michalewicz2.minimum

search_space = Michalewicz2.search_space

fig = plot_function_plotly(function, search_space.lower, search_space.upper)
fig.update_layout(height=800, width=800)
fig.show()

Sample the observer over the search space#

We set up the observer as usual, using Sobol sampling to sample the initial points.

[3]:
import trieste

observer = mk_observer(function)

num_initial_points = 20
num_steps = 20
initial_query_points = search_space.sample_sobol(num_initial_points)
initial_data = observer(initial_query_points)

Model the objective function#

The Bayesian optimization procedure estimates the next best points to query by using a probabilistic model of the objective. We’ll use a two layer deep Gaussian process (DGP), built using GPflux. We also compare to a (shallow) GP.

Since DGPs can be hard to build, Trieste provides some basic architectures: here we use the build_vanilla_deep_gp function which returns a GPflux model of DeepGP class. As with other models (e.g. GPflow), we cannot use it directly in Bayesian optimization routines, we need to pass it through an appropriate wrapper, DeepGaussianProcess wrapper in this case. Additionally, since the GPflux interface does not currently support copying DGP architectures, if we wish to have the Bayesian optimizer track the model state, we need to pass in the DGP as a callable closure so that the architecture can be recreated when required (alternatively, we can set set_state=False on the optimize call).

A few other useful notes regarding building a DGP model: The DGP model requires us to specify the number of inducing points, as we don’t have the true posterior. To train the model we have to use a stochastic optimizer; Adam is used by default, but we can use other stochastic optimizers from TensorFlow. GPflux allows us to use the Keras fit method, which makes optimizing a lot easier - this method is used in the background for training the model.

[4]:
from functools import partial

from trieste.models.gpflux import DeepGaussianProcess, build_vanilla_deep_gp


def build_dgp_model(data, search_space):
    dgp = partial(
        build_vanilla_deep_gp,
        data,
        search_space,
        2,
        100,
        likelihood_variance=1e-5,
        trainable_likelihood=False,
    )
    return DeepGaussianProcess(dgp)


dgp_model = build_dgp_model(initial_data, search_space)
WARNING:tensorflow:From /opt/hostedtoolcache/Python/3.7.16/x64/lib/python3.7/site-packages/tensorflow_probability/python/distributions/distribution.py:342: calling MultivariateNormalDiag.__init__ (from tensorflow_probability.python.distributions.mvn_diag) with scale_identity_multiplier is deprecated and will be removed after 2020-01-01.
Instructions for updating:
`scale_identity_multiplier` is deprecated; please combine it into `scale_diag` directly instead.

Run the optimization loop#

We can now run the Bayesian optimization loop by defining a BayesianOptimizer and calling its optimize method.

The optimizer uses an acquisition rule to choose where in the search space to try on each optimization step. We’ll start by using Thompson sampling.

We’ll run the optimizer for twenty steps. Note: this may take a while!

[5]:
from trieste.acquisition.rule import DiscreteThompsonSampling

bo = trieste.bayesian_optimizer.BayesianOptimizer(observer, search_space)
grid_size = 1000
acquisition_rule = DiscreteThompsonSampling(grid_size, 1)

dgp_result = bo.optimize(
    num_steps,
    initial_data,
    dgp_model,
    acquisition_rule=acquisition_rule,
)
dgp_dataset = dgp_result.try_get_final_dataset()
WARNING:tensorflow:5 out of the last 5 calls to <function dgp_feature_decomposition_trajectory.__call__ at 0x7f9bf1f01b90> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
WARNING:tensorflow:6 out of the last 6 calls to <function dgp_feature_decomposition_trajectory.__call__ at 0x7f9bf1776e60> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
Optimization completed without errors

Explore the results#

We can now get the best point found by the optimizer. Note this isn’t necessarily the point that was last evaluated.

[6]:
dgp_query_points = dgp_dataset.query_points.numpy()
dgp_observations = dgp_dataset.observations.numpy()

dgp_arg_min_idx = tf.squeeze(tf.argmin(dgp_observations, axis=0))

print(f"query point: {dgp_query_points[dgp_arg_min_idx, :]}")
print(f"observation: {dgp_observations[dgp_arg_min_idx, :]}")
query point: [2.79739396 1.57236648]
observation: [-0.99991498]

We can visualise how the optimizer performed as a three-dimensional plot

[7]:
from trieste.experimental.plotting import add_bo_points_plotly

fig = plot_function_plotly(
    function, search_space.lower, search_space.upper, alpha=0.5
)
fig.update_layout(height=800, width=800)

fig = add_bo_points_plotly(
    x=dgp_query_points[:, 0],
    y=dgp_query_points[:, 1],
    z=dgp_observations[:, 0],
    num_init=num_initial_points,
    idx_best=dgp_arg_min_idx,
    fig=fig,
)
fig.show()

We can visualise the model over the objective function by plotting the mean and 95% confidence intervals of its predictive distribution. Note that the DGP model is able to model the local structure of the true objective function.

[8]:
import matplotlib.pyplot as plt
from trieste.experimental.plotting import (
    plot_regret,
    plot_model_predictions_plotly,
)

fig = plot_model_predictions_plotly(
    dgp_result.try_get_final_model(),
    search_space.lower,
    search_space.upper,
    num_samples=100,
)

fig = add_bo_points_plotly(
    x=dgp_query_points[:, 0],
    y=dgp_query_points[:, 1],
    z=dgp_observations[:, 0],
    num_init=num_initial_points,
    idx_best=dgp_arg_min_idx,
    fig=fig,
    figrow=1,
    figcol=1,
)
fig.update_layout(height=800, width=800)
fig.show()

We now compare to a GP model with priors over the hyperparameters. We do not expect this to do as well because GP models cannot deal with non-stationary functions well.

[9]:
import gpflow
import tensorflow_probability as tfp

from trieste.models.gpflow import GaussianProcessRegression, build_gpr

gpflow_model = build_gpr(initial_data, search_space, likelihood_variance=1e-7)
gp_model = GaussianProcessRegression(gpflow_model)

bo = trieste.bayesian_optimizer.BayesianOptimizer(observer, search_space)

result = bo.optimize(
    num_steps,
    initial_data,
    gp_model,
    acquisition_rule=acquisition_rule,
)
gp_dataset = result.try_get_final_dataset()

gp_query_points = gp_dataset.query_points.numpy()
gp_observations = gp_dataset.observations.numpy()

gp_arg_min_idx = tf.squeeze(tf.argmin(gp_observations, axis=0))

print(f"query point: {gp_query_points[gp_arg_min_idx, :]}")
print(f"observation: {gp_observations[gp_arg_min_idx, :]}")

fig = plot_model_predictions_plotly(
    result.try_get_final_model(),
    search_space.lower,
    search_space.upper,
)

fig = add_bo_points_plotly(
    x=gp_query_points[:, 0],
    y=gp_query_points[:, 1],
    z=gp_observations[:, 0],
    num_init=num_initial_points,
    idx_best=gp_arg_min_idx,
    fig=fig,
    figrow=1,
    figcol=1,
)
fig.update_layout(height=800, width=800)
fig.show()
Optimization completed without errors
query point: [2.19749989 1.58411493]
observation: [-1.79361068]

We see that the DGP model does a much better job at understanding the structure of the function. The standard Gaussian process model has a large signal variance and small lengthscales, which do not result in a good model of the true objective. On the other hand, the DGP model is at least able to infer the local structure around the observations.

We can also plot the regret curves of the two models side-by-side.

[10]:

gp_suboptimality = gp_observations - F_MINIMIZER.numpy() dgp_suboptimality = dgp_observations - F_MINIMIZER.numpy() _, ax = plt.subplots(1, 2) plot_regret( dgp_suboptimality, ax[0], num_init=num_initial_points, idx_best=dgp_arg_min_idx, ) plot_regret( gp_suboptimality, ax[1], num_init=num_initial_points, idx_best=gp_arg_min_idx, ) ax[0].set_yscale("log") ax[0].set_ylabel("Regret") ax[0].set_ylim(0.5, 3) ax[0].set_xlabel("# evaluations") ax[0].set_title("DGP") ax[1].set_title("GP") ax[1].set_yscale("log") ax[1].set_ylim(0.5, 3) ax[1].set_xlabel("# evaluations")
[10]:
Text(0.5, 0, '# evaluations')
../_images/notebooks_deep_gaussian_processes_19_1.png

We might also expect that the DGP model will do better on higher dimensional data. We explore this by testing a higher-dimensional version of the Michalewicz dataset.

Set up the problem.

[11]:

function = Michalewicz5.objective F_MINIMIZER = Michalewicz5.minimum search_space = Michalewicz5.search_space observer = mk_observer(function) num_initial_points = 50 num_steps = 50 initial_query_points = search_space.sample_sobol(num_initial_points) initial_data = observer(initial_query_points)

Build the DGP model and run the Bayes opt loop.

[12]:

dgp_model = build_dgp_model(initial_data, search_space) bo = trieste.bayesian_optimizer.BayesianOptimizer(observer, search_space) acquisition_rule = DiscreteThompsonSampling(grid_size, 1) dgp_result = bo.optimize( num_steps, initial_data, dgp_model, acquisition_rule=acquisition_rule, ) dgp_dataset = dgp_result.try_get_final_dataset() dgp_query_points = dgp_dataset.query_points.numpy() dgp_observations = dgp_dataset.observations.numpy() dgp_arg_min_idx = tf.squeeze(tf.argmin(dgp_observations, axis=0)) print(f"query point: {dgp_query_points[dgp_arg_min_idx, :]}") print(f"observation: {dgp_observations[dgp_arg_min_idx, :]}") dgp_suboptimality = dgp_observations - F_MINIMIZER.numpy()
Optimization completed without errors
query point: [2.18930493 2.30343265 1.30477392 0.87460425 1.74531586]
observation: [-2.57047565]

Repeat the above for the GP model.

[13]:

gpflow_model = build_gpr(initial_data, search_space, likelihood_variance=1e-7) gp_model = GaussianProcessRegression(gpflow_model) bo = trieste.bayesian_optimizer.BayesianOptimizer(observer, search_space) result = bo.optimize( num_steps, initial_data, gp_model, acquisition_rule=acquisition_rule, ) gp_dataset = result.try_get_final_dataset() gp_query_points = gp_dataset.query_points.numpy() gp_observations = gp_dataset.observations.numpy() gp_arg_min_idx = tf.squeeze(tf.argmin(gp_observations, axis=0)) print(f"query point: {gp_query_points[gp_arg_min_idx, :]}") print(f"observation: {gp_observations[gp_arg_min_idx, :]}") gp_suboptimality = gp_observations - F_MINIMIZER.numpy()
Optimization completed without errors
query point: [2.26823028 2.69381407 1.28974228 1.22838305 0.81679683]
observation: [-2.40526686]

Plot the regret.

[14]:

_, ax = plt.subplots(1, 2) plot_regret( dgp_suboptimality, ax[0], num_init=num_initial_points, idx_best=dgp_arg_min_idx, ) plot_regret( gp_suboptimality, ax[1], num_init=num_initial_points, idx_best=gp_arg_min_idx, ) ax[0].set_yscale("log") ax[0].set_ylabel("Regret") ax[0].set_ylim(1.5, 6) ax[0].set_xlabel("# evaluations") ax[0].set_title("DGP") ax[1].set_title("GP") ax[1].set_yscale("log") ax[1].set_ylim(1.5, 6) ax[1].set_xlabel("# evaluations")
[14]:
Text(0.5, 0, '# evaluations')
../_images/notebooks_deep_gaussian_processes_27_1.png

While still far from the optimum, it is considerably better than the GP.