A GPflow hierarchical kernel over a HierarchicalSearchSpace#
The previous notebook introduced HierarchicalSearchSpace, the conditional / disjunctive search space that gates non-indicator subspaces by indicator values. To do Bayesian optimisation over such a space we need a covariance function that respects the activation structure: a point whose conditional feature is inactive should not be confused with one whose feature is active and equal in value — the two live in fundamentally different regions of the space.
GPflow ships such kernels: gpflow.kernels.ArcHierarchical (the Arc kernel of Swersky et al. 2014) and gpflow.kernels.WedgeHierarchical (the Wedge kernel of Horn et al. 2019). This notebook shows how to feed a Trieste HierarchicalSearchSpace straight into ArcHierarchical and fit a GP — i.e. the integration glue between Trieste’s search-space layer and GPflow’s kernel layer. The only real work is a small coordinate-system adapter, explained below.
[1]:
import gpflow
import numpy as np
import tensorflow as tf
from gpflow.kernels import ArcHierarchical, WedgeHierarchical
from trieste.space import (
BooleanSearchSpace,
Box,
HierarchicalSearchSpace,
HierarchyNode,
)
np.random.seed(1793)
tf.random.set_seed(1793)
2026-06-17 10:48:00,412 INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
Three axioms for a conditional distance#
Let \(\mathbf{x}^{nc}\) be the always-active (unconditional) coordinates and \(\mathbf{x}^{c}\) the indicator-gated (conditional) ones. A useful per-dimension distance \(d_i(\mathbf{x}, \mathbf{x}')\) on a conditional dimension \(i\) should satisfy:
both inactive: \(d_i = 0\) — the value is meaningless on either side.
both active: \(d_i\) is a function of the difference in feature value.
incomparable (one active, one inactive): \(d_i\) is positive and places the two points in distinct regions of the embedded space.
Both ArcHierarchical and WedgeHierarchical enforce these axioms by embedding each conditional dimension into :math:`mathbb{R}^2`: inactive points map to the origin, active points to a value-dependent point off the origin. A stationary base kernel (Matérn-5/2 by default) then evaluates the covariance in the joint embedded space. We do not have to implement any of this — we only have to describe the activation structure to the kernel.
Build the search space#
Canonical four-variable disjunction: \(x_1\) unconditional, \(y_1\) a Boolean indicator, \(x_2\) active when \(y_1 = 1\), \(x_3\) active when \(y_1 = 0\). This is the same space as the previous notebook.
[2]:
subspaces = {
"x1": Box([0.0], [1.0]), # unconditional
"y1": BooleanSearchSpace(), # Boolean indicator
"x2": Box([0.0], [5.0]), # active when y1 = 1
"x3": Box([-1.0], [1.0]), # active when y1 = 0
}
hierarchy = [
HierarchyNode(
"shared",
subspace_tags=["x1"],
),
HierarchyNode(
"branch_A",
subspace_tags=["x2"],
activity_condition_tags={"y1": 1},
),
HierarchyNode(
"branch_B",
subspace_tags=["x3"],
activity_condition_tags={"y1": 0},
),
]
space = HierarchicalSearchSpace(subspaces, hierarchy)
print("dimension: ", int(space.dimension))
print("indicator_tags:", space.indicator_tags)
print("indicator_dims:", space.indicator_dims)
dimension: 4
indicator_tags: ('y1',)
indicator_dims: [1]
From a HierarchicalSearchSpace to a GPflow hierarchy#
ArcHierarchical takes a sequence of gpflow.kernels.HierarchyNode plus an active_dims argument:
ArcHierarchical(hierarchy, base_kernel=None, *, active_dims, name=None)
Each node carries feature_dims (the real-valued columns it owns), feature_bounds (for normalisation), and an ActivityCondition whose requirements map gates those columns. The kernel infers the indicator columns as the union of all requirements keys, slices its input down to active_dims, and validates that the feature columns and the inferred indicator columns together tile the sliced vector contiguously.
HierarchicalSearchSpace.to_gpflow_hierarchy() resolves the tag-based nodes to exactly these gpflow.kernels.HierarchyNode objects: Trieste keys both feature_dims and the requirements map by global flat-vector column (the same convention the kernel uses), so the nodes can be passed straight to ArcHierarchical — no adapter needed. Since every column of an HSS flat vector is either a feature or an indicator, we slice on all columns — active_dims = range(dimension) — so
the sliced and global coordinate systems coincide.
[3]:
active_dims = list(range(int(space.dimension)))
for node in space.to_gpflow_hierarchy():
print(
f"{node.name:9s} feature_dims={list(node.feature_dims)} "
f"requirements={dict(node.activity_condition.requirements)}"
)
print("active_dims:", active_dims)
shared feature_dims=[0] requirements={}
branch_A feature_dims=[2] requirements={1: 1}
branch_B feature_dims=[3] requirements={1: 0}
active_dims: [0, 1, 2, 3]
Construct the Arc kernel#
With the hierarchy and active_dims in hand, the kernel is a one-liner. (The hierarchical kernels are flagged @experimental in GPflow 2.11, so constructing one emits a UserWarning — that is expected.)
[4]:
arc = ArcHierarchical(space.to_gpflow_hierarchy(), active_dims=active_dims)
print(type(arc).__name__, "built over active_dims", active_dims)
ArcHierarchical built over active_dims [0, 1, 2, 3]
/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/gpflow/experimental/utils.py:42: UserWarning: You're calling gpflow.kernels.hierarchical.HierarchicalEmbeddingKernel.__init__ which is considered *experimental*. Expect: breaking changes, poor documentation, and bugs.
warn(
Inspect the covariance on a hand-crafted batch#
The flat-vector layout is [x1, y1, x2, x3]. The two rows below share \(x_1 = 0.5\) and the same stored \(x_2 = 2.5\), but differ in the indicator: the first has \(y_1 = 1\) (so \(x_2\) is active, \(x_3\) inactive) and the second has \(y_1 = 0\) (so \(x_3\) is active, \(x_2\) inactive). A conditional kernel must place them in different regions, so their cross-covariance is well below the unit diagonal.
[5]:
X_demo = tf.constant(
[
[0.5, 1.0, 2.5, 0.0], # y1 = 1: x1 + x2 active
[0.5, 0.0, 2.5, 0.0], # y1 = 0: x1 + x3 active
],
dtype=tf.float64,
)
print("K(X_demo):")
print(arc.K(X_demo).numpy())
K(X_demo):
[[1. 0.31728336]
[0.31728336 1. ]]
Worked example: fit a GP on a synthetic disjunctive function#
Synthetic objective whose two branches carry different signal:
The conditional kernel must “switch off” the inactive branch’s contribution to similarity for that to be learnable from a finite sample.
[6]:
def objective(X):
X = np.asarray(X)
x1, y1, x2, x3 = X[:, 0], X[:, 1], X[:, 2], X[:, 3]
return (
np.sin(2.0 * np.pi * x1)
+ (y1 > 0.5).astype(float) * 0.5 * np.cos(np.pi * x2 / 5.0)
+ (y1 < 0.5).astype(float) * 0.5 * x3
).reshape(-1, 1)
X_train = space.sample(40).numpy()
Y_train = objective(X_train) + 0.05 * np.random.randn(40, 1)
# Wrap in a Constant() factor so the GP can learn an overall variance.
kernel = gpflow.kernels.Constant() * ArcHierarchical(
space.to_gpflow_hierarchy(), active_dims=active_dims
)
gpr = gpflow.models.GPR(
data=(X_train, Y_train), kernel=kernel, noise_variance=0.05
)
print(f"LML before fit: {gpr.log_marginal_likelihood().numpy():+.3f}")
gpflow.optimizers.Scipy().minimize(
gpr.training_loss, gpr.trainable_variables, options={"maxiter": 100}
)
print(f"LML after fit: {gpr.log_marginal_likelihood().numpy():+.3f}")
LML before fit: -47.427
LML after fit: +20.687
Predict on held-out points#
We evaluate on a small held-out batch covering both branches. The kernel places the two test rows on different parts of the embedded space, so their predicted means follow the corresponding branch’s signal.
[7]:
X_test = np.array(
[
[0.3, 1.0, 2.0, 0.0], # y1 = 1
[0.3, 0.0, 0.0, 0.5], # y1 = 0
[0.7, 1.0, 4.0, 0.0], # y1 = 1
[0.7, 0.0, 0.0, -0.4], # y1 = 0
]
)
mean, var = gpr.predict_f(X_test)
print("test predictions vs ground truth:")
truth = objective(X_test).ravel()
for x, m, v, t in zip(X_test, mean.numpy().ravel(), var.numpy().ravel(), truth):
print(
f" x = {x.tolist()} -> mean = {m:+.3f} var = {v:.3f} truth = {t:+.3f}"
)
test predictions vs ground truth:
x = [0.3, 1.0, 2.0, 0.0] -> mean = +1.076 var = 0.001 truth = +1.106
x = [0.3, 0.0, 0.0, 0.5] -> mean = +1.266 var = 0.001 truth = +1.201
x = [0.7, 1.0, 4.0, 0.0] -> mean = -1.404 var = 0.001 truth = -1.356
x = [0.7, 0.0, 0.0, -0.4] -> mean = -1.193 var = 0.001 truth = -1.151
Swapping in the Wedge kernel#
WedgeHierarchical has the identical constructor, so switching embeddings is a one-line change. Its “incomparable” distance scales with the active value \(v_c\) rather than being constant in it — often closer to what a practitioner expects near disjunction boundaries.
[8]:
wedge = WedgeHierarchical(space.to_gpflow_hierarchy(), active_dims=active_dims)
print("Wedge kernel matrix on demo points:")
print(wedge.K(X_demo).numpy())
Wedge kernel matrix on demo points:
[[1. 0.52399411]
[0.52399411 1. ]]
What this shows#
A conditional GP over a disjunctive space needs only:
BooleanSearchSpace,Box,HierarchicalSearchSpace,HierarchyNodefromtrieste.spaceto describe the structure;space.to_gpflow_hierarchy()passed straight to the kernel — Trieste keys the hierarchy in GPflow’s column convention, so no adapter is needed;gpflow.kernels.ArcHierarchical(orWedgeHierarchical) for the covariance.
No bespoke kernel code is required — the hierarchy carried by HierarchicalSearchSpace is exactly the information GPflow’s hierarchical kernels consume.