Module containing a model for importance-weighted variational inference.
Bases: markovflow.models.sparse_variational.SparseVariationalGaussianProcess
Performs importance-weighted variational inference (IWVI).
The key reference is:
@inproceedings{domke2018importance,
  title={Importance weighting and variational inference},
  author={Domke, Justin and Sheldon, Daniel R},
  booktitle={Advances in neural information processing systems},
  pages={4470--4479},
  year={2018}
}
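The idea is based on the observation that an estimator of the evidence lower bound (ELBO) can be obtained from an importance weight \(w\):
\[ L_1 = \log w(x_1), \qquad x_1 \sim q(x) \]
…where \(x\) is the latent variable of the model (a GP, or set of GPs in our case) and the function \(w\) is:
\[ w(x) = \frac{p(y \mid x)\, p(x)}{q(x)} \]
It follows that:
\[ \mathrm{ELBO} = \mathbb{E}_{x_1 \sim q}\left[ L_1 \right] \]
It turns out that there is a series of lower bounds given by taking multiple importance samples:
\[ L_n = \log \frac{1}{n} \sum_{i=1}^{n} w(x_i), \qquad x_i \sim q(x) \]
And we have the relation:
\[ \log p(y) \ge \mathbb{E}[L_n] \ge \mathbb{E}[L_{n-1}] \ge \mathrm{ELBO} \]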
This means that we can tighten the bound on the log marginal likelihood by increasing \(n\), which this class refers to as num_importance_samples. The trade-offs are:
The objective function is now always stochastic, even for cases where the ELBO of the parent class is non-stochastic
We have to do more computations (evaluate the weights \(n\) times)
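The tightening of the bound with the number of importance samples can be checked numerically on a toy problem. The sketch below is not Markovflow code: it assumes a one-dimensional conjugate model, x ~ N(0, 1) and y | x ~ N(x, 1), for which log p(y) is known in closed form, and a deliberately mismatched Gaussian q. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def log_weight(x, y, q_mean, q_std):
    """log w(x) = log p(y | x) + log p(x) - log q(x) for the toy model."""
    log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * x ** 2
    log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * (y - x) ** 2
    log_q = (-0.5 * math.log(2 * math.pi) - math.log(q_std)
             - 0.5 * ((x - q_mean) / q_std) ** 2)
    return log_lik + log_prior - log_q


def sample_bound(n, y, q_mean, q_std, rng):
    """Draw one sample of L_n = log (1/n) sum_i w(x_i), with x_i ~ q."""
    log_ws = [log_weight(rng.gauss(q_mean, q_std), y, q_mean, q_std)
              for _ in range(n)]
    m = max(log_ws)  # subtract the maximum before exponentiating
    return m + math.log(sum(math.exp(lw - m) for lw in log_ws) / n)


def mean_bound(n, y=1.0, q_mean=-0.5, q_std=1.0, repeats=20000, seed=0):
    """Monte Carlo estimate of E[L_n]; the estimate itself is stochastic."""
    rng = random.Random(seed)
    return sum(sample_bound(n, y, q_mean, q_std, rng)
               for _ in range(repeats)) / repeats


def log_marginal_likelihood(y):
    """Exact log p(y) for the toy model, where p(y) = N(y; 0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - y ** 2 / 4.0


log_marginal = log_marginal_likelihood(1.0)
estimates = {n: mean_bound(n) for n in (1, 10)}
for n, value in estimates.items():
    print(f"n={n}: E[L_n] ~ {value:.3f}   log p(y) = {log_marginal:.3f}")
```

With n = 1 the estimate corresponds to the ordinary ELBO; with n = 10 it should sit noticeably closer to log p(y) while staying below it, at the cost of ten weight evaluations per sample of the bound.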
kernel – A kernel that defines a prior over functions.
inducing_points – The points in time on which inference should be performed, with shape batch_shape + [num_inducing].
likelihood – A likelihood.
num_importance_samples – The number of samples for the importance-weighted estimator.
initial_distribution – An initial configuration for the variational distribution, with shape batch_shape + [num_inducing].
mean_function – The mean function for the GP. Defaults to no mean function.
Compute the importance-weighted ELBO using K samples (K is num_importance_samples). The procedure is:
    for k = 1...K:
        uₖ ~ q(u)
        sₖ ~ p(s | uₖ)
        wₖ = p(y | sₖ) p(uₖ) / q(uₖ)
    ELBO = log (1/K) Σₖ wₖ
Everything is computed in log-space for numerical stability. Note that gradients of this ELBO may have high variance with respect to the variational parameters; see the DREGS gradient estimator method.
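To illustrate why the final averaging step is done in log-space, here is a minimal standard-library sketch of a log-mean-exp reduction. The helper name log_mean_exp and the example log weights are assumptions made for illustration; this is not the Markovflow implementation, which operates on TensorFlow tensors.

```python
import math


def log_mean_exp(log_weights):
    """Return log((1/K) * sum_k exp(log_weights[k])) without leaving log-space."""
    m = max(log_weights)
    shifted_sum = sum(math.exp(lw - m) for lw in log_weights)
    return m + math.log(shifted_sum / len(log_weights))


# Log weights of this magnitude are plausible when many observations contribute.
log_w = [-1000.0, -1001.0, -1002.0]

# Naive evaluation underflows: exp(-1000) is 0.0 in double precision,
# so the average is 0.0 and its logarithm is undefined.
naive_mean = sum(math.exp(lw) for lw in log_w) / len(log_w)
print(naive_mean)

# Shifting by the maximum keeps every exponent in a safe range.
print(log_mean_exp(log_w))
```

Subtracting the largest log weight before exponentiating leaves the result mathematically unchanged but guarantees that at least one term equals exp(0) = 1, so the sum never underflows to zero.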
input_data – A tuple of time points and observations containing the data at which to calculate the loss for training the model:
A tensor of inputs with shape batch_shape + [num_data]
A tensor of observations with shape batch_shape + [num_data, observation_dim]
A scalar tensor.
Compute a scalar tensor that, when differentiated using tf.gradients, produces the DREGS variance-controlled gradient.
See “Doubly Reparameterized Gradient Estimators For Monte Carlo Objectives” for a derivation.
We recommend using these gradients for training variational parameters and gradients of the importance-weighted ELBO for training hyperparameters.
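The variance reduction can be seen on a toy problem. The sketch below is not the Markovflow implementation (which builds a TensorFlow scalar to be differentiated); it assumes a one-dimensional model x ~ N(0, 1), y | x ~ N(x, 1) with a Gaussian q, and writes out by hand the gradient with respect to the mean of q. The doubly reparameterized form used here, a sum of squared normalized weights times the derivative of log w with respect to the sample, follows the cited paper; the function name and parameter values are hypothetical.

```python
import math
import random


def gradient_estimates(y, q_mean, q_std, num_samples, rng):
    """Return (plain, dreg) single-draw gradient estimates w.r.t. q_mean.

    Samples are reparameterized as x = q_mean + q_std * eps, so dx/dq_mean = 1.
    """
    xs = [q_mean + q_std * rng.gauss(0.0, 1.0) for _ in range(num_samples)]

    # Log weights up to an additive constant, which cancels on normalization.
    log_ws = [-0.5 * (y - x) ** 2 - 0.5 * x ** 2
              + 0.5 * ((x - q_mean) / q_std) ** 2 for x in xs]
    m = max(log_ws)
    ws = [math.exp(lw - m) for lw in log_ws]
    total = sum(ws)
    normalized = [w / total for w in ws]

    # Plain reparameterized gradient: sum_k w~_k * d(log w_k)/d(q_mean).
    # After reparameterization log q(x_k) no longer depends on q_mean,
    # so only the model terms contribute.
    plain = sum(wn * ((y - x) - x) for wn, x in zip(normalized, xs))

    # Doubly reparameterized gradient: sum_k w~_k^2 * d(log w)/dx at x_k,
    # with the parameters of q held fixed inside log w.
    dreg = sum(wn ** 2 * ((y - x) - x + (x - q_mean) / q_std ** 2)
               for wn, x in zip(normalized, xs))
    return plain, dreg


# Evaluate both estimators at the exact posterior N(y / 2, 1 / 2),
# where the true gradient is zero.
rng = random.Random(0)
y = 1.0
draws = [gradient_estimates(y, y / 2.0, math.sqrt(0.5), 8, rng)
         for _ in range(5)]
for plain, dreg in draws:
    print(f"plain: {plain:+.4f}   dreg: {dreg:+.2e}")
```

At the optimum the importance weights are constant, so the doubly reparameterized estimate vanishes for every draw, while the plain estimate is only zero on average and still fluctuates from draw to draw.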