Recovering from errors during optimization#
[1]:
import numpy as np
import tensorflow as tf
import random
np.random.seed(1793)
tf.random.set_seed(1793)
random.seed(3)
Sometimes the Bayesian optimization process encounters an error from which we can recover, without the need to restart the run from the beginning. In this tutorial, we’ll simulate such an error and show how to recover from it.
We’ll use a similar setup to the EI notebook, but use an observer that intermittently breaks when evaluated, and needs manual attention to get running again. We can simulate fixing the observer with its manual_fix
method.
[2]:
import trieste
from trieste.objectives import Branin
class FaultyBranin:
def __init__(self):
self._is_broken = False
def manual_fix(self):
self._is_broken = False
def __call__(self, x):
if random.random() < 0.05:
self._is_broken = True
if self._is_broken:
raise Exception("Observer is broken")
return trieste.data.Dataset(x, Branin.objective(x))
observer = FaultyBranin()
/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/gpflow/experimental/utils.py:43: UserWarning: You're calling gpflow.experimental.check_shapes.decorator.check_shapes which is considered *experimental*. Expect: breaking changes, poor documentation, and bugs.
f"You're calling {name} which is considered *experimental*."
/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/gpflow/experimental/utils.py:43: UserWarning: You're calling gpflow.experimental.check_shapes.inheritance.inherit_check_shapes which is considered *experimental*. Expect: breaking changes, poor documentation, and bugs.
f"You're calling {name} which is considered *experimental*."
Set up the problem#
We’ll use the same set up as before, except for the acquisition rule, where we’ll use TrustRegion
. TrustRegion
is stateful, and we’ll need to account for its state to recover, so using this rule gives the reader a more comprehensive overview of how to recover.
[3]:
from trieste.models.gpflow import build_gpr, GaussianProcessRegression
search_space = trieste.space.Box(
tf.cast([0.0, 0.0], tf.float64), tf.cast([1.0, 1.0], tf.float64)
)
initial_data = observer(search_space.sample(5))
gpr = build_gpr(initial_data, search_space)
model = GaussianProcessRegression(gpr)
acquisition_rule = trieste.acquisition.rule.TrustRegion()
/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/gpflow/experimental/utils.py:43: UserWarning: You're calling gpflow.experimental.check_shapes.checker.ShapeChecker.__init__ which is considered *experimental*. Expect: breaking changes, poor documentation, and bugs.
f"You're calling {name} which is considered *experimental*."
Run the optimization loop#
In this tutorial we’ll try to complete fifteen optimization loops, which, with the broken observer, may take more than one attempt. The optimizer returns an OptimizationResult
, which is simply a container for both:
the final result, which uses a
Result
type (not to be confused withOptimizationResult
) to safely encapsulate the final data, models and acquisition state if the process completed successfully, or an error if one occurredthe history of the successful optimization steps.
We can access these with the astuple
method.
[4]:
bo = trieste.bayesian_optimizer.BayesianOptimizer(observer, search_space)
num_steps = 15
result, history = bo.optimize(
num_steps, initial_data, model, acquisition_rule, None
).astuple()
Optimization failed at step 6, encountered error with traceback:
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/trieste/bayesian_optimizer.py", line 688, in optimize
observer_output = self._observer(query_points)
File "/tmp/ipykernel_8830/3852074970.py", line 17, in __call__
raise Exception("Observer is broken")
Exception: Observer is broken
Terminating optimization and returning the optimization history. You may be able to use the history to restart the process from a previous successful optimization step.
We can see from the logs that the optimization loop failed, and this can be sufficient to know what to do next if we’re working in a notebook. However, sometimes our setup means we don’t have access to the logs. We’ll pretend from here that’s the case.
Handling success#
We don’t know if the optimization completed successfully or not, so we’ll only try to access and plot the data if it was successful. We can find out if this was the case with result
’s is_ok
attribute. If it was successful, we know there is data in the result
, which we can unwrap
and view.
[5]:
if result.is_ok:
data = result.unwrap().dataset
print("best observation: ", tf.reduce_min(data.observations))
Handling failure#
If on the other hand, the optimization didn’t complete successfully, we can fix our observer, and try again. We can try again by using the data, model and acquisition state from the last successful step, which is the last element of the history
. Recall that we only need to account for the acquisition state because we’re using the stateful TrustRegion
rule. For most rules, we don’t need to account for this state.
Note can view any Result
by printing it. We’ll do that here to see what exception was caught.
[6]:
if result.is_err:
print("result: ", result)
observer.manual_fix()
result, new_history = bo.optimize(
15 - len(history),
history[-1].dataset,
history[-1].model,
acquisition_rule,
history[-1].acquisition_state,
).astuple()
history.extend(new_history)
result: Err(Exception('Observer is broken'))
Optimization completed without errors
We can repeat this until we’ve spent our optimization budget, using a loop if appropriate. But here, we’ll just plot the data if it exists, safely by using result
’s is_ok
attribute.
[7]:
from trieste.experimental.plotting import plot_bo_points, plot_function_2d
if result.is_ok:
data = result.unwrap().dataset
arg_min_idx = tf.squeeze(tf.argmin(data.observations, axis=0))
_, ax = plot_function_2d(
Branin.objective,
search_space.lower,
search_space.upper,
30,
contour=True,
)
plot_bo_points(data.query_points.numpy(), ax[0, 0], 5, arg_min_idx)
Saving results to disk#
For convenience, tracked state is stored in memory by default. However, this can potentially result in Out of Memory errors and also makes it difficult to recover from intentional or unintentional Python process shutdowns. You can instead store the result on disk by passing in a track_path
argument to optimize
.
Note that trieste currently saves models using pickling, which is not portable and not secure. You should only try to load optimization results that you generated yourself on the same system (or a system with the same version libraries).
[8]:
result, history = bo.optimize(
num_steps, initial_data, model, acquisition_rule, None, track_path="history"
).astuple()
Optimization failed at step 10, encountered error with traceback:
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/trieste/bayesian_optimizer.py", line 688, in optimize
observer_output = self._observer(query_points)
File "/tmp/ipykernel_8830/3852074970.py", line 17, in __call__
raise Exception("Observer is broken")
Exception: Observer is broken
Terminating optimization and returning the optimization history. You may be able to use the history to restart the process from a previous successful optimization step.
The returned history
records are now stored in files rather than in memory. Their constituents can be accessed just as before, which loads the content into memory only when required. The result
is automatically loaded into memory, but is also saved to disk with the rest of the history.
[9]:
print(history[-1])
print(history[-1].model)
FrozenRecord(path=PosixPath('history/step.10.pickle'))
GaussianProcessRegression(<gpflow.models.gpr.GPR object at 0x7f6f2de93650>, Optimizer(optimizer=<gpflow.optimizers.scipy.Scipy object at 0x7f6f2d68ae10>, minimize_args={}, compile=True),10, 1000,True)
It is also possible to reload the OptimizationResult
in a new Python process:
[10]:
trieste.bayesian_optimizer.OptimizationResult.from_path("history")
[10]:
OptimizationResult(final_result=Err(Exception('Observer is broken')), history=[FrozenRecord(path=PosixPath('history/step.01.pickle')), FrozenRecord(path=PosixPath('history/step.02.pickle')), FrozenRecord(path=PosixPath('history/step.03.pickle')), FrozenRecord(path=PosixPath('history/step.04.pickle')), FrozenRecord(path=PosixPath('history/step.05.pickle')), FrozenRecord(path=PosixPath('history/step.06.pickle')), FrozenRecord(path=PosixPath('history/step.07.pickle')), FrozenRecord(path=PosixPath('history/step.08.pickle')), FrozenRecord(path=PosixPath('history/step.09.pickle')), FrozenRecord(path=PosixPath('history/step.10.pickle'))])
Out of memory errors#
Since Out Of Memory errors normally result in the Python process shutting down, saving tracked state to disk as described above is an important tool in recovering from them. One possible cause of memory errors is trying to evaluate an acquisition function over a large dataset, e.g. when initializing our gradient-based optimizers. To work around this, you can specify that evaluations of the acquisition function be split up: this splits them (on the first dimension) into batches of a given size, then stitches them back together. To do this, you need to provide an explicit split optimizer and specify a desired batch size.
[11]:
from trieste.acquisition.optimizer import automatic_optimizer_selector
from trieste.acquisition.rule import EfficientGlobalOptimization
from trieste.acquisition.utils import split_acquisition_function_calls
optimizer = split_acquisition_function_calls(
automatic_optimizer_selector, split_size=10_000
)
query_rule = EfficientGlobalOptimization(optimizer=optimizer)
acquisition_rule = trieste.acquisition.rule.TrustRegion(rule=query_rule)