Bayesian optimization with deep ensembles#
Gaussian processes as a surrogate models are hard to beat on smaller datasets and optimization budgets. However, they scale poorly with amount of data, cannot easily capture non-stationarities and they are rather slow at prediction time. Here we show how uncertainty-aware neural networks can be effective alternative to Gaussian processes in Bayesian optimisation, in particular for large budgets, non-stationary objective functions or when predictions need to be made quickly.
Check out our tutorial on Deep Gaussian Processes for Bayesian optimization as another alternative model type supported by Trieste that can model non-stationary functions (but also deal well with small datasets).
Let’s start by importing some essential packages and modules.
[1]:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import numpy as np
import tensorflow as tf
import trieste
# silence TF warnings and info messages, only print errors
# https://stackoverflow.com/questions/35911252/disable-tensorflow-debugging-information
tf.get_logger().setLevel("ERROR")
np.random.seed(1794)
tf.random.set_seed(1794)
Deep ensembles#
Deep neural networks typically output only mean predictions, not posterior distributions as probabilistic models such as Gaussian processes do. Posterior distributions encode mean predictions, but also epistemic uncertainty - type of uncertainty that stems from model misspecification, and which can be eliminated with further data. Aleatoric uncertainty that stems from stochasticity of the data generating process is not contained in the posterior, but can be learned from the data. Bayesian optimization requires probabilistic models because epistemic uncertainty plays a key role in balancing between exploration and exploitation.
Recently, however, there has been some development of uncertainty-aware deep neural networks. Ensembles of deep neural networks, introduced recently by [LPB16], is a type of such networks. Main ingredients are probabilistic feed-forward networks as members of the ensemble, where the final layers is a Gaussian distribution, training with maximum likelihood instead of typical root mean square error, and different random initialization of weights for generating diversity among the networks.
Monte carlo dropout ([GG16]), Bayes-by-backprop ([BCKW15]) or evidential deep regression ([ASSR19]) are some of the other types of uncertainty-aware deep neural networks. Systematic comparisons however show that deep ensembles represent the uncertainty the best and are probably the simplest of the major alternatives (see, for example, [OWA+21]). Good estimates of uncertainty makes deep ensembles a potentially attractive model for Bayesian optimization.
How good is uncertainty representation of deep ensembles?#
We will use a simple one-dimensional toy problem introduced by [HernandezLA15], which was used in [LPB16] to provide some illustrative evidence that deep ensembles do a good job of estimating uncertainty. We will replicate this exercise here.
The toy problem is a simple cubic function with some Normally distributed noise around it. We will randomly sample 20 input points from [-4,4] interval that we will use as a training data later on.
[2]:
from trieste.space import Box
from trieste.data import Dataset
def objective(x, error=True):
y = tf.pow(x, 3)
if error:
y += tf.random.normal(x.shape, 0, 3, dtype=x.dtype)
return y
num_points = 20
# we define the [-4,4] interval using a `Box` search space that has convenient sampling methods
search_space = Box([-4], [4])
inputs = search_space.sample_sobol(num_points)
outputs = objective(inputs)
data = Dataset(inputs, outputs)
Next we define a deep ensemble model and train it. Trieste supports neural network models defined as TensorFlow’s Keras models. Since creating ensemble models in Keras can be somewhat involved, Trieste provides some basic architectures. Here we use the build_keras_ensemble
function which builds a simple ensemble of neural networks in Keras where each network has the same architecture: number of hidden layers, nodes in hidden layers and activation function. It uses sensible defaults for many
parameters and finally returns a model of KerasEnsemble
class.
As with other supported types of models (e.g. Gaussian process models from GPflow), we cannot use KerasEnsemble
directly in Bayesian optimization routines, we need to pass it through an appropriate wrapper, DeepEnsemble
wrapper in this case. One difference with respect to other model types is that we need to use a Keras specific optimizer wrapper KerasOptimizer
where we need to specify a stochastic optimizer (Adam is used by default, but we can use other stochastic optimizers from
TensorFlow), objective function (here negative log likelihood) and we can provide custom arguments for the Keras fit
method (here we modify the default arguments; check Keras API documentation for a list of possible arguments).
For the cubic function toy problem we use the same architecture as in [LPB16]: ensemble size of 5 networks, where each network has one hidden layer with 100 nodes. All other implementation details were missing and we used sensible choices, as well as details about training the network.
[3]:
from trieste.models.keras import (
DeepEnsemble,
KerasPredictor,
build_keras_ensemble,
)
from trieste.models.optimizer import KerasOptimizer
def build_cubic_model(data: Dataset) -> DeepEnsemble:
ensemble_size = 5
num_hidden_layers = 1
num_nodes = 100
keras_ensemble = build_keras_ensemble(
data, ensemble_size, num_hidden_layers, num_nodes
)
fit_args = {
"batch_size": 10,
"epochs": 1000,
"verbose": 0,
}
optimizer = KerasOptimizer(tf.keras.optimizers.Adam(0.01), fit_args)
return DeepEnsemble(keras_ensemble, optimizer)
# building and optimizing the model
model = build_cubic_model(data)
model.optimize(data)
Let’s illustrate the results of the model training. We create a test set that includes points outside the interval on which the model has been trained. These extrapolation points are a good test of model’s representation of uncertainty. What would we expect to see? Bayesian inference provides a reference frame. Predictive uncertainty should increase the farther we are from the training data and the predictive mean should start returning to the prior mean (assuming standard zero mean function).
We can see in the figure below that predictive distribution of deep ensembles indeed exhibits these features. The figure also replicates fairly well Figure 1 (rightmost panel) from [LPB16] and provides a reasonable match to Bayesian neural network trained on same toy problem with Hamiltonian Monte Carlo (golden standard that is usually very expensive) as illustrated in Figure 1 (upper right panel) [HernandezLA15]. This gives us some assurance that deep ensembles might provide uncertainty that is good enough for trading off between exploration and exploitation in Bayesian optimization.
[4]:
import matplotlib.pyplot as plt
# test data that includes extrapolation points
test_points = tf.linspace(-6, 6, 1000)
# generating a plot with ground truth function, mean prediction and 3 standard
# deviations around it
plt.scatter(inputs, outputs, marker=".", alpha=0.6, color="red", label="data")
plt.plot(
test_points, objective(test_points, False), color="blue", label="function"
)
y_hat, y_var = model.predict(test_points)
y_hat_minus_3sd = y_hat - 3 * tf.math.sqrt(y_var)
y_hat_plus_3sd = y_hat + 3 * tf.math.sqrt(y_var)
plt.plot(test_points, y_hat, color="gray", label="model $\mu$")
plt.fill_between(
test_points,
tf.squeeze(y_hat_minus_3sd),
tf.squeeze(y_hat_plus_3sd),
color="gray",
alpha=0.5,
label="$\mu -/+ 3SD$",
)
plt.ylim([-100, 100])
plt.show()
Non-stationary toy problem#
Now we turn to a somewhat more serious synthetic optimization problem. We want to find the minimum of the two-dimensional version of the Michalewicz function. Even though we stated that deep ensembles should be used with larger budget sizes, here we will show them on a small dataset to provide a problem that is feasible for the scope of the tutorial.
The Michalewicz function is defined on the search space of \([0, \pi]^2\). Below we plot the function over this space. The Michalewicz function is interesting case for deep ensembles as it features sharp ridges that are difficult to capture with Gaussian processes. This occurs because lengthscale parameters in typical kernels cannot easily capture both ridges (requiring smaller lengthscales) and fairly flat areas everywhere else (requiring larger lengthscales).
[5]:
from trieste.objectives import Michalewicz2
from trieste.experimental.plotting import plot_function_plotly
search_space = Michalewicz2.search_space
function = Michalewicz2.objective
MINIMUM = Michalewicz2.minimum
MINIMIZER = Michalewicz2.minimum
# we illustrate the 2-dimensional Michalewicz function
fig = plot_function_plotly(
function, search_space.lower, search_space.upper, grid_density=20
)
fig.show()