markovflow.models
Package containing ready-to-use GP models.
markovflow.models.gaussian_process_regression
markovflow.models.iwvi
markovflow.models.models
markovflow.models.pep
markovflow.models.sparse_pep
markovflow.models.sparse_variational
markovflow.models.sparse_variational_cvi
markovflow.models.spatio_temporal_variational
markovflow.models.variational
markovflow.models.variational_cvi
GaussianProcessRegression
Bases: markovflow.models.models.MarkovFlowModel
Performs GP regression.
The key reference is Chapter 2 of:
Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher K. I. Williams. The MIT Press, 2006. ISBN 0-262-18253-X.
This class uses the kernel and the time points to create a state space model. GP regression is then a Kalman filter on that state space model using the observations.
kernel – A kernel defining a prior over functions.
input_data – A tuple of (time_points, observations) containing the observed data: time points of observations, with shape batch_shape + [num_data], observations with shape batch_shape + [num_data, observation_dim].
chol_obs_covariance – A TensorType containing the Cholesky factor of the observation noise covariance, with shape [observation_dim, observation_dim]. If the default None is passed, an independent likelihood with variance 1.0 is assumed.
mean_function – The mean function for the GP. Defaults to no mean function.
time_points
Return the time points of observations.
A tensor with shape batch_shape + [num_data].
observations
Return the observations.
A tensor with shape batch_shape + [num_data, observation_dim].
kernel
Return the kernel of the GP.
mean_function
Return the mean function of the GP.
loss
Return the loss, which is the negative log likelihood.
posterior
Obtain a posterior process for inference.
For this class, this is the AnalyticPosteriorProcess built from the Kalman filter.
log_likelihood
Calculate the log likelihood of the observations given the kernel parameters.
In other words, \(log p(y_{1...T} | ϑ)\) for some parameters \(ϑ\).
A scalar tensor (summed over the batch shape and the whole trajectory).
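The quantity returned by log_likelihood can be illustrated with a dense-kernel computation. The sketch below (plain NumPy, with an assumed squared-exponential kernel rather than markovflow's state-space kernels) evaluates log p(y | ϑ) = log N(y; 0, K + σ²I) directly; markovflow obtains the same scalar in linear time by running a Kalman filter on the equivalent state space model.

```python
import numpy as np

def rbf_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    # Squared-exponential kernel matrix (an assumption for illustration).
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_log_likelihood(time_points, observations, noise_variance=1.0):
    # log p(y | theta) = log N(y; 0, K + noise_variance * I), computed
    # densely in O(T^3); the Kalman filter gives the same value in O(T).
    n = len(time_points)
    cov = rbf_kernel(time_points, time_points) + noise_variance * np.eye(n)
    chol = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(chol, observations)   # L^{-1} y
    return (-0.5 * alpha @ alpha                  # -0.5 y^T C^{-1} y
            - np.log(np.diag(chol)).sum()         # -0.5 log |C|
            - 0.5 * n * np.log(2.0 * np.pi))

t = np.array([0.0, 0.5, 1.0])
y = np.array([0.1, -0.2, 0.3])
log_lik = gp_log_likelihood(t, y)
```

The Cholesky-based evaluation is the standard numerically stable way to compute a Gaussian log density without forming the inverse or determinant explicitly.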
ImportanceWeightedVI
Bases: markovflow.models.sparse_variational.SparseVariationalGaussianProcess
Performs importance-weighted variational inference (IWVI).
The key reference is:
@inproceedings{domke2018importance,
  title={Importance weighting and variational inference},
  author={Domke, Justin and Sheldon, Daniel R},
  booktitle={Advances in neural information processing systems},
  pages={4470--4479},
  year={2018}
}
The idea is based on the observation that an estimator of the evidence lower bound (ELBO) can be obtained from an importance weight \(w\):
…where \(x\) is the latent variable of the model (a GP, or set of GPs in our case) and the function \(w\) is:
It follows that:
…and:
It turns out that there are a series of lower bounds given by taking multiple importance samples:
And we have the relation:
This means that we can improve tightness of the ELBO to the log marginal likelihood by increasing \(n\), which we refer to in this class as num_importance_samples. The trade-offs are:
The objective function is now always stochastic, even for cases where the ELBO of the parent class is non-stochastic.
We have to do more computations (evaluate the weights \(n\) times).
kernel – A kernel that defines a prior over functions.
inducing_points – The points in time on which inference should be performed, with shape batch_shape + [num_inducing].
likelihood – A likelihood.
num_importance_samples – The number of samples for the importance-weighted estimator.
initial_distribution – An initial configuration for the variational distribution, with shape batch_shape + [num_inducing].
elbo
Compute the importance-weighted ELBO using K samples. The procedure is:
for k = 1...K:
    uₖ ~ q(u)
    sₖ ~ p(s | uₖ)
    wₖ = p(y | sₖ) p(uₖ) / q(uₖ)
ELBO = log (1/K) Σₖ wₖ
Everything is computed in log-space for stability. Note that gradients of this ELBO may have high variance with regard to the variational parameters; see the DREGS gradient estimator method.
input_data –
A tuple of time points and observations containing the data at which to calculate the loss for training the model:
A tensor of inputs with shape batch_shape + [num_data]
A tensor of observations with shape batch_shape + [num_data, observation_dim]
A scalar tensor.
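The log-space computation of the importance-weighted bound can be sketched in plain NumPy. The toy model below is an assumption for illustration (not markovflow code): the proposal is the prior, so log wₖ = log p(y | xₖ). The sketch checks that the bound tightens as num_importance_samples grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def iw_elbo(y, num_importance_samples, num_estimates=20000):
    # Importance-weighted ELBO, averaged over many independent estimates.
    # Toy model (assumed): x ~ N(0, 1), y | x ~ N(x, 1), proposal q = prior,
    # so log w_k = log p(y | x_k).
    K = num_importance_samples
    x = rng.standard_normal((num_estimates, K))
    log_w = log_normal(y, x, 1.0)
    # ELBO_K = log (1/K) sum_k w_k, computed stably via log-sum-exp
    m = log_w.max(axis=1, keepdims=True)
    log_mean_w = m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1))
    return log_mean_w.mean()

y = 1.5
log_evidence = log_normal(y, 0.0, 2.0)   # exact log p(y) for the toy model
elbo_1 = iw_elbo(y, 1)                   # ordinary ELBO (K = 1)
elbo_10 = iw_elbo(y, 10)                 # tighter bound with K = 10
```

Subtracting the per-row maximum before exponentiating is the log-sum-exp trick referred to by "everything is computed in log-space for stability".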
dregs_objective
Compute a scalar tensor that, when differentiated using tf.gradients, produces the DREGS variance controlled gradient.
See “Doubly Reparameterized Gradient Estimators For Monte Carlo Objectives” for a derivation.
We recommend using these gradients for training variational parameters and gradients of the importance-weighted ELBO for training hyperparameters.
MarkovFlowModel
Bases: tf.Module, abc.ABC
Abstract class representing Markovflow models that depend on input data.
All Markovflow models are TensorFlow Modules, so it is possible to obtain trainable variables via the trainable_variables attribute. You can combine this with the loss() method to train the model. For example:
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=0.01)
for i in range(iterations):
    model.optimization_step(optimizer)
Call the predict_f() method to predict marginal function values at future time points. For example:
mean, variance = model.predict_f(validation_data_tensor)
Note
Markovflow models that extend this class must implement the loss() method and posterior attribute.
Obtain the loss, which you can use to train the model. It should always return a scalar.
NotImplementedError – Must be implemented in derived classes.
Return a posterior process from the model, which can be used for inference.
predict_state
Predict state at new_time_points. Note these time points should be sorted.
new_time_points – Time points to generate observations for, with shape batch_shape + [num_new_time_points,].
Predicted mean and covariance for the new time points, with respective shapes batch_shape + [num_new_time_points, state_dim] and batch_shape + [num_new_time_points, state_dim, state_dim].
predict_f
Predict marginal function values at new_time_points. Note these time points should be sorted.
new_time_points – Time points to generate observations for, with shape batch_shape + [num_new_time_points].
full_output_cov – Either full output covariance (True) or marginal variances (False).
Predicted mean and covariance for the new time points, with respective shapes batch_shape + [num_new_time_points, output_dim] and either batch_shape + [num_new_time_points, output_dim, output_dim] or batch_shape + [num_new_time_points, output_dim].
MarkovFlowSparseModel
Abstract class representing Markovflow models that do not need to store the training data (\(X, Y\)) in the model to approximate the posterior predictions \(p(f*|X, Y, x*)\).
This currently applies only to sparse variational models.
The optimization_step method should typically be used to train the model. For example:
input_data = (tf.constant(time_points), tf.constant(observations))
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=0.01)
for i in range(iterations):
    model.optimization_step(input_data, optimizer)
Obtain the loss, which can be used to train the model.
Obtain a posterior process from the model, which can be used for inference.
full_output_cov – Either full output covariance (True) or marginal variances (False).
predict_log_density
Compute the log density of the data. That is:
Predicted log density at input time points, with shape batch_shape + [num_data].
PowerExpectationPropagation
Bases: markovflow.models.variational_cvi.GaussianProcessWithSitesBase
This is an approximate inference method called Power Expectation Propagation (PEP).
It approximates the posterior of a model with a GP prior and a general likelihood, using a Gaussian posterior parameterized with Gaussian sites.
The following notation is used:
x - the time points of the training data.
y - observations corresponding to time points x.
s(.) - the latent state of the Markov chain
f(.) - the noise free predictions of the model
p(y | f) - the likelihood
t(f) - a site (indices will refer to the associated data point)
p(.) - the prior distribution
q(.) - the variational distribution
We use the state space formulation of Markovian Gaussian Processes that specifies:
The conditional density of neighbouring latent states p(sₖ₊₁| sₖ)
How to read out the latent process from these states: fₖ = H sₖ
The likelihood links data to the latent process via p(yₖ | fₖ). We would like to approximate the posterior over the latent states of this model.
We parameterize the approximate joint density using sites tₖ(fₖ), replacing each likelihood term:
p(s, y) ≈ p(s) ∏ₖ tₖ(fₖ)
where tₖ(fₖ) are univariate Gaussian sites parameterized in the natural form
t(f) = exp(𝞰ᵀφ(f) - A(𝞰)), where 𝞰=[η₁,η₂] and 𝛗(f)=[f,f²]
(note: the subscript k has been omitted for simplicity)
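The natural parameterization above can be sketched in plain Python (an illustration, not the markovflow API). The mapping to the familiar mean/variance parameterization is η₁ = μ/σ² and η₂ = -1/(2σ²):

```python
def natural_from_meanvar(mu, var):
    # Natural parameters of the site t(f) = exp(eta1*f + eta2*f^2 - A(eta)).
    return mu / var, -0.5 / var

def meanvar_from_natural(eta1, eta2):
    # Inverse mapping, valid when eta2 < 0.
    var = -0.5 / eta2
    return eta1 * var, var

# mu = 2, sigma^2 = 4  ->  eta = [0.5, -0.125], and back again
eta1, eta2 = natural_from_meanvar(2.0, 4.0)
mu, var = meanvar_from_natural(eta1, eta2)
```

A useful consequence of this form is that multiplying Gaussian sites corresponds to adding their natural parameters, which is what makes the EP updates below tractable.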
The site updates are given by the classic EP update rules, as described in:
@techreport{seeger2005expectation,
  title={Expectation propagation for exponential families},
  author={Seeger, Matthias},
  year={2005}
}
likelihood – A likelihood, with shape batch_shape + [num_inducing].
learning_rate – The learning rate of the algorithm.
alpha – The power, as in Power Expectation Propagation.
local_objective
Local objective of the PEP algorithm: log E_q(f) p(y|f)ᵃ
local_objective_gradients
Gradients of the local objective of the PEP algorithm with respect to the predictive mean.
mask_indices
Binary mask (cast to float): 0 for the excluded indices, 1 for the rest.
compute_cavity_from_marginals
Compute the cavity from the marginals.
marginals – A list of tensors.
compute_cavity
The cavity distributions for all data points. This corresponds to the marginal distribution qᐠⁿ(fₙ) of qᐠⁿ(f) = q(f)/tₙ(fₙ)ᵃ
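In natural parameters, dividing by a powered site reduces to subtraction, which is how a cavity such as qᐠⁿ(f) = q(f)/tₙ(fₙ)ᵃ can be formed. A minimal sketch (illustrative values, not the markovflow implementation):

```python
def cavity_natural(q_eta, site_eta, alpha):
    # Natural parameters of q(f) / t_n(f)^alpha: division of Gaussians in
    # natural form is subtraction, and the power alpha scales the site.
    return tuple(q - alpha * s for q, s in zip(q_eta, site_eta))

q_eta = (1.0, -0.5)      # q(f) with mean 1 and variance 1
site_eta = (0.4, -0.1)   # a hypothetical site t_n
cav_eta = cavity_natural(q_eta, site_eta, alpha=0.5)   # ≈ (0.8, -0.45)
```

For a valid cavity the resulting second natural parameter must stay negative (i.e. the cavity variance stays positive), which is one reason PEP uses fractional powers a < 1.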
compute_log_norm
Compute the log normalizer.
update_sites
Compute the site updates and perform one update step.
site_indices – A list of indices to be updated.
Compute the log marginal likelihood of the approximate joint p(s, y).
energy
The PEP energy.
Compute the log density of the data at the new data points.
input_data – A tuple of time points and observations containing the data at which to calculate the loss for training the model: a tensor of inputs with shape batch_shape + [num_data], a tensor of observations with shape batch_shape + [num_data, observation_dim].
SparsePowerExpectationPropagation
Bases: markovflow.models.models.MarkovFlowSparseModel
This class implements the Sparse Power Expectation Propagation algorithm.
It approximates the posterior of a model with a GP prior and a general likelihood, using a Gaussian posterior parameterized with Gaussian sites on inducing states u at inducing points z.
x - the time points of the training data.
z - the time points of the inducing/pseudo points.
y - observations corresponding to time points x.
s(.) - the continuous time latent state process
u = s(z) - the discrete inducing latent state space model
f(.) - the noise free predictions of the model
p(y | f) - the likelihood
t(u) - a site (indices will refer to the associated data point)
p(.) - the prior distribution
q(.) - the variational distribution
We use the state space formulation of Markovian Gaussian Processes that specifies:
The conditional density of neighbouring latent states p(sₖ₊₁| sₖ)
How to read out the latent process from these states: fₖ = H sₖ
To approximate the posterior, we maximise the evidence lower bound (ELBO) (ℒ) with respect to the parameters of the variational distribution, since:
log p(y) = ℒ(q) + KL[q(s) ‖ p(s | y)]
…where:
ℒ(q) = ∫ log(p(s, y) / q(s)) q(s) ds
We parameterize the variational posterior through M sites tₘ(vₘ)
q(s) = p(s) ∏ₘ tₘ(vₘ)
where tₘ(vₘ) are multivariate Gaussian sites on vₘ = [uₘ, uₘ₊₁], i.e. consecutive inducing states.
The sites are parameterized in the natural form
t(v) = exp(𝜽ᵀ𝛗(v) - A(𝜽)), where 𝜽=[θ₁, θ₂] and 𝛗(v)=[v, vvᵀ]
where 𝛗(v) are the sufficient statistics and 𝜽 the natural parameters.
learning_rate – The learning rate.
alpha – The power, as in Power Expectation Propagation.
Posterior Process
Binary mask to exclude data indices.
exclude_indices – The indices to exclude.
back_project_nats
Back-project the natural gradients associated with the time points onto their associated inducing sites.
fraction_sites
For each segment m of consecutive inducing points [zₘ, zₘ₊₁), this counts the time points t falling in that segment, c(m) = #{t : zₘ ≤ t < zₘ₊₁}, and returns 1/c(m), or 0 when c(m) = 0.
time_points – A tensor with shape batch_shape + [num_data].
A tensor with shape batch_shape + [num_data].
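The counting logic behind fraction_sites can be sketched with NumPy's searchsorted (an illustrative reimplementation, not the markovflow code):

```python
import numpy as np

def fraction_sites(time_points, inducing_times):
    # Returns 1/c(m) per segment [z_m, z_{m+1}), where c(m) counts the
    # time points t with z_m <= t < z_{m+1}; returns 0 for empty segments.
    seg = np.searchsorted(inducing_times, time_points, side="right") - 1
    num_segments = len(inducing_times) - 1
    valid = (seg >= 0) & (seg < num_segments)
    counts = np.bincount(seg[valid], minlength=num_segments)
    return np.where(counts > 0, 1.0 / np.maximum(counts, 1), 0.0)

z = np.array([0.0, 1.0, 2.0, 3.0])   # inducing times
t = np.array([0.1, 0.4, 1.5, 0.9])   # data times
fractions = fraction_sites(t, z)     # counts [3, 1, 0] -> [1/3, 1, 0]
```

side="right" makes the left edge of each segment inclusive and the right edge exclusive, matching zₘ ≤ t < zₘ₊₁.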
compute_posterior_ssm
Compute the variational posterior distribution on the vector of inducing states.
dist_q
compute_marginals
Compute the pairwise marginals.
remove_cavity_from_marginals
Remove the cavity from the marginals.
time_points – The time points.
marginals – Pairwise mean and covariance tensors.
compute_cavity_state
The cavity distributions for data points at the input time_points. This corresponds to the marginal distribution qᐠⁿ(fₙ) of qᐠⁿ(s) = q(s)/tₘ(vₘ)ᵝ, where β = a * (1 / #time points touching site tₘ).
Compute the cavity on f.
time_points – The time points.
compute_new_sites
Compute the site updates and perform one update step.
input_data – A tuple of time points and observations containing the data from which to calculate the updates: a tensor of inputs with shape batch_shape + [num_data], a tensor of observations with shape batch_shape + [num_data, observation_dim].
compute_num_data_per_interval
Compute the fraction of each site assigned to each data point.
compute_fraction
Apply the updates.
The PEP energy: ∫ ds p(s) ∏ₘ tₘ(vₘ).
input_data – The input data.
Return the loss, which is the negative evidence lower bound (ELBO).
input_data – A tuple of time points and observations containing the data at which to calculate the loss for training the model.
dist_p
Return the prior GaussMarkovDistribution.
classic_elbo
ℒ(q) = Σᵢ ∫ log(p(yᵢ | f)) q(f) df - KL[q(f) ‖ p(f)]
Note: this is mostly for testing purposes and is not to be used for optimization.
input_data – A tuple of time points and observations
A scalar tensor representing the ELBO.
SparseVariationalGaussianProcess
Approximate a GaussMarkovDistribution with a general likelihood using a Gaussian posterior. Additionally uses a number of pseudo, or inducing, points to represent the distribution over a typically larger number of data points.
\(x\) - the time points of the training data
\(z\) - the time points of the inducing/pseudo points
\(y\) - observations corresponding to time points \(x\)
\(s(.)\) - the latent state of the Markov chain
\(f(.)\) - the noise free predictions of the model
\(p(y | f)\) - the likelihood
\(p(.)\) - the true distribution
\(q(.)\) - the variational distribution
Subscript is used to denote dependence for notational convenience, for example \(fₖ === f(k)\).
With a prior generative model comprising a Gauss-Markov distribution, an emission model and an arbitrary likelihood on the emitted variables, these define:
\(p(xₖ₊₁| xₖ)\)
\(fₖ = H xₖ\)
\(p(yₖ | fₖ)\)
As per a VariationalGaussianProcess (VGP) model, we have:
…where \(f\) is defined over the entire function space.
Here this reduces to the joint of the evidence lower bound (ELBO) defined over both the data \(x\) and the inducing points \(z\), which we rewrite as:
This turns the inference problem into an optimisation problem: find the optimal \(q\).
The first term is the variational expectation, and has the same form as in a VGP model. However, we must now use the inducing states to predict the marginals of the variational distribution at the original data points.
The second is the KL from the prior to the approximation, but evaluated at the inducing points.
@inproceedings{adam2020doubly,
  title={Doubly Sparse Variational Gaussian Processes},
  author={Adam, Eleftheriadis, Artemev, Durrande, Hensman},
  booktitle={Artificial Intelligence and Statistics},
  year={2020}
}
Since this class extends MarkovFlowSparseModel, it does not depend on input data. Input data is passed during the optimisation step as a tuple of time points and observations.
num_data – The total number of observations (relevant when feeding in external minibatches).
Calculates the evidence lower bound (ELBO) \(log p(y)\). We rewrite this as:
The first term is the ‘variational expectation’ (VE), and has the same form as per a VariationalGaussianProcess (VGP) model. However, we must now use the inducing states to predict the marginals of the variational distribution at the original data points.
The second is the KL divergence from the prior to the approximation, but evaluated at the inducing points.
A scalar tensor (summed over the batch_shape dimension) representing the ELBO.
Return the time points of the sparse process which essentially are the locations of the inducing points.
A tensor with shape batch_shape + [num_inducing]. Same as inducing inputs.
likelihood
Return the likelihood of the GP.
Return the prior Gauss-Markov distribution.
Return the variational distribution as a Gauss-Markov distribution.
For this class this is the AnalyticPosteriorProcess built from the variational distribution. This will be a locally optimal variational approximation of the posterior after optimisation.
A tensor of observations with shape batch_shape + [num_data, observation_dim].
SparseCVIGaussianProcess
This is an alternative parameterization to the SparseVariationalGaussianProcess.
t(v) = exp(𝜽ᵀ𝛗(v) - A(𝜽)), where 𝜽=[θ₁, θ₂] and 𝛗(v)=[Wv, WvvᵀWᵀ]
where 𝛗(v) are the sufficient statistics, 𝜽 the natural parameters, and W the projection of the conditional mean E_p(f|v)[f] = W v.
Each data point indexed k contributes a fraction of the site it belongs to. If vₘ = [uₘ, uₘ₊₁], and zₘ < xₖ <= zₘ₊₁, then xₖ belongs to vₘ.
The natural gradient update of the sites is similar to that of the CVIGaussianProcess, except that it applies to a different parameterization of the sites.
learning_rate – the learning rate.
𝜽ₘ ← ρ𝜽ₘ + (1-ρ)𝐠ₘ
Here 𝐠ₘ is the sum of the gradients of the variational expectation for each data point indexed k, projected back to the site vₘ through the conditional p(fₖ|vₘ).
input_data – A tuple of time points and observations.
Obtain a Tensor representing the loss, which can be used to train the model.
Posterior object to predict outside of the training time points
local_objective_and_gradients
Return the local objective and its gradients with respect to the expectation parameters.
Fmu – Means μ with shape [..., latent_dim].
Fvar – Variances σ² with shape [..., latent_dim].
Y – Observations with shape [..., observation_dim].
Returns the local objective and its gradient with respect to [μ, σ² + μ²].
Calculate the local loss in CVI.
Fmu – Means with shape [..., latent_dim].
Fvar – Variances with shape [..., latent_dim].
Y – Observations with shape [..., observation_dim].
Returns a local objective with shape [...].
SpatioTemporalSparseVariational
Bases: SpatioTemporalBase
Model for variational spatio-temporal GP regression using a factor kernel k_space_time((s,t),(s',t')) = k_time(t,t') * k_space(s,s')
where k_time is a Markovian kernel.
The following notation is used:
X=(x,t) - the space-time points of the training data.
zₛ - the space inducing/pseudo points.
zₜ - the time inducing/pseudo points.
y - observations corresponding to points X.
f(.,.) - the spatio-temporal process
x(.,.) - the SSM formulation of the spatio-temporal process
u(.) = x(zₛ,.) - the spatio-temporal SSM marginalized at zₛ
p(y | f) - the likelihood
p(.) - the prior distribution
q(.) - the variational distribution
This can be seen as the temporal extension of gpflow.SVGP, where instead of fixed inducing variables u, they are now time dependent u(t) and follow a Markov chain.
For a fixed set of spatial inducing inputs zₛ, p(x(zₛ, .)) is a continuous time process of state dimension Mₛd.
For a fixed time slice t, p(x(.,t)) ~ GP(0, kₛ).
The following conditional independence holds: p(x(s,t) | x(zₛ, .)) = p(x(s,t) | x(zₛ, t)); that is, prediction at a new point at time t given x(zₛ, .) depends only on x(zₛ, t).
This builds a spatially sparse process as q(x(.,.)) = q(x(zₛ, .)) p(x(.,.) | x(zₛ, .)), where the multi-output temporal process q(x(zₛ, .)) is itself sparse: q(x(zₛ, .)) = q(x(zₛ, zₜ)) p(x(zₛ, .) | x(zₛ, zₜ)).
The marginal q(x(zₛ, zₜ)) is a multivariate Gaussian distribution parameterized as a state space model.
inducing_space – Inducing space points, with shape [Ms, D].
inducing_time – Inducing time points, with shape [Mt,].
kernel_space – A GPflow space kernel.
kernel_time – A Markovflow time kernel.
likelihood – A likelihood object.
num_data – The number of observations.
Posterior state space model on the inducing states.
Prior state space model on the inducing states.
Posterior process.
SpatioTemporalSparseCVI
Model for spatio-temporal GP regression using a factor kernel k_space_time((s,t),(s',t')) = k_time(t,t') * k_space(s,s')
This can be seen as the spatial extension of markovflow’s SparseCVIGaussianProcess for temporal (only) Gaussian Processes. The inducing variables u(x,t) are now space and time dependent.
For a fixed set of space points zₛ, p(x(zₛ, .)) is a continuous time process of state dimension Mₛd.
For a fixed time slice t, p(x(.,t)) ~ GP(0, kₛ).
The marginal q(x(zₛ, zₜ)) is parameterized as the product q(x(zₛ, zₜ)) = p(x(zₛ, zₜ)) t(x(zₛ, zₜ)), where p(x(zₛ, zₜ)) is a state space model and t(x(zₛ, zₜ)) are sites.
Compute the prior distribution on the vector of inducing states.
projection_inducing_states_to_observations
Compute the projection matrix of the conditional mean of f(x,t) | s(t).
input_data – Time points and associated spatial dimensions to generate observations for, with shape batch_shape + [space_dim + 1, num_time_points].
Returns the projection matrix with shape [num_time_points, obs_dim, num_inducing_time × state_dim].
Here 𝐠ₘ is the sum of the gradients of the variational expectation for each data point indexed k, projected back to the site vₘ = [uₘ, uₘ₊₁] through the conditional p(fₖ|vₘ).
input_data – A tuple of time points and observations.
Approximates a GaussMarkovDistribution with a general likelihood using a Gaussian posterior.
\(x\) - the time points of the training data
\(y\) - observations corresponding to time points \(x\)
\(s(.)\) - the latent state of the Markov chain
\(f(.)\) - the noise free predictions of the model
\(p(y | f)\) - the likelihood
\(p(.)\) - the true distribution
\(q(.)\) - the variational distribution
We would like to approximate the posterior of this generative model with a parametric model \(q\), comprising the same distribution as the prior.
To approximate the posterior, we maximise the evidence lower bound (ELBO) \(ℒ\) with respect to the parameters of the variational distribution, since:
Since the last term is non-negative, the ELBO provides a lower bound to the log-likelihood of the model. This bound is exact when \(KL[q ‖ p(f | y)] = 0\); that is, our approximation is sufficiently flexible to capture the true posterior.
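The bound and its tightness can be verified numerically in a one-dimensional conjugate model, where both the ELBO and log p(y) are available in closed form. The model below is an assumption for illustration: y = f + ε with f ~ N(0, v₀) and ε ~ N(0, σ²), and a Gaussian approximation q(f) = N(m, s).

```python
import numpy as np

def elbo(m, s, y, prior_var=1.0, noise_var=0.5):
    # Closed-form ELBO: E_q[log p(y|f)] + E_q[log p(f)] + entropy of q.
    exp_loglik = (-0.5 * np.log(2.0 * np.pi * noise_var)
                  - ((y - m) ** 2 + s) / (2.0 * noise_var))
    exp_logprior = (-0.5 * np.log(2.0 * np.pi * prior_var)
                    - (m ** 2 + s) / (2.0 * prior_var))
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * s)
    return exp_loglik + exp_logprior + entropy

y, v0, sig2 = 1.0, 1.0, 0.5
# Exact evidence: y ~ N(0, v0 + sig2)
log_evidence = (-0.5 * np.log(2.0 * np.pi * (v0 + sig2))
                - y ** 2 / (2.0 * (v0 + sig2)))

m_star = v0 * y / (v0 + sig2)      # exact posterior mean
s_star = v0 * sig2 / (v0 + sig2)   # exact posterior variance
```

For any (m, s) the ELBO stays below log p(y), and it touches it exactly at the true posterior (m_star, s_star), where the KL term vanishes.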
This turns the inference into an optimisation problem: find the optimal \(q\).
To calculate the ELBO, we rewrite it as:
The first term is the ‘variational expectation’ of the model likelihood; the second is the KL from the prior to the approximation.
Calculate the evidence lower bound (ELBO) \(log p(y)\). We rewrite the ELBO as:
The first term is the ‘variational expectation’ (VE); the second is the KL divergence from the prior to the approximation.
Return the time points of our observations.
Return the loss, which is the negative ELBO.
CVIGaussianProcess
Bases: GaussianProcessWithSitesBase
Provides an alternative parameterization to a VariationalGaussianProcess.
This class approximates the posterior of a model with a GP prior and a general likelihood using a Gaussian posterior parameterized with Gaussian sites.
\(x\) - the time points of the training data
\(y\) - observations corresponding to time points \(x\)
\(s(.)\) - the latent state of the Markov chain
\(f(.)\) - the noise free predictions of the model
\(p(y | f)\) - the likelihood
\(t(f)\) - a site (indices will refer to the associated data point)
\(p(.)\) - the prior distribution
\(q(.)\) - the variational distribution
We use the state space formulation of Markovian Gaussian Processes that specifies:
The conditional density of neighbouring latent states \(p(sₖ₊₁| sₖ)\)
How to read out the latent process from these states \(fₖ = H sₖ\)
The likelihood links data to the latent process via \(p(yₖ | fₖ)\). We would like to approximate the posterior over the latent states of this model.
We parameterize the variational posterior through sites \(tₖ(fₖ)\):
…where \(tₖ(fₖ)\) are univariate Gaussian sites parameterized in the natural form:
…and where \(𝜽=[θ₁,θ₂]\) and \(𝛗(f)=[f,f²]\).
Here, \(𝛗(f)\) are the sufficient statistics and \(𝜽\) are the natural parameters. Note that the subscript \(k\) has been omitted for simplicity.
The natural gradient update of the sites can be shown to be the gradient of the variational expectations:
…with respect to the expectation parameters:
That is, \(𝜽 ← ρ𝜽 + (1-ρ)𝐠\), where \(ρ\) is the learning rate.
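The damped update can be sketched for the special case of a Gaussian likelihood (a plain-NumPy illustration, not markovflow code). In that case the gradient \(𝐠\) of the variational expectation with respect to the expectation parameters \([μ, σ² + μ²]\) is constant in \(q\) and equals the natural parameters of the exact likelihood site, so the iteration converges to it:

```python
import numpy as np

def ve_gradient_gaussian(y, noise_var):
    # Gradient g of E_q[log N(y; f, noise_var)] with respect to the
    # expectation parameters [mu, sigma_q^2 + mu^2]. For a Gaussian
    # likelihood this is constant and equals the natural parameters
    # of the exact likelihood site.
    return np.array([y / noise_var, -0.5 / noise_var])

def site_update(theta, g, rho):
    # Damped natural-parameter update: theta <- rho*theta + (1-rho)*g
    return rho * theta + (1 - rho) * g

y, noise_var = 0.7, 0.25
g = ve_gradient_gaussian(y, noise_var)   # [2.8, -2.0]
theta = np.zeros(2)
for _ in range(10):
    theta = site_update(theta, g, rho=0.5)
# theta has converged (geometrically) to the exact Gaussian site g
```

For non-Gaussian likelihoods the gradient depends on the current \(q\), so the same damped iteration is run to a fixed point rather than converging to a fixed target.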
@inproceedings{khan2017conjugate,
  title={Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models},
  author={Khan, Mohammad and Lin, Wu},
  booktitle={Artificial Intelligence and Statistics},
  pages={878--887},
  year={2017}
}
A tuple containing the observed data:
Time points of observations with shape batch_shape + [num_data]
Observations with shape batch_shape + [num_data, observation_dim]
likelihood – A likelihood with shape batch_shape + [num_inducing].
learning_rate – The learning rate of the algorithm.
Calculate local loss in CVI.
Fmu – Means with shape [..., latent_dim].
Fvar – Variances with shape [..., latent_dim].
Y – Observations with shape [..., observation_dim].
A local objective with shape [...].
Return the local objective and its gradients with regard to the expectation parameters.
Fmu – Means \(μ\) with shape [..., latent_dim].
Fvar – Variances \(σ²\) with shape [..., latent_dim].
A local objective and gradient with regard to \([μ, σ² + μ²]\).
Perform one joint update of the Gaussian sites. That is:
Calculate the evidence lower bound (ELBO) \(log p(y)\).
This is done by computing the marginal of the model in which the likelihood terms were replaced by the Gaussian sites.
Compute the ELBO the classic way. That is:
This is mostly for testing purposes and should not be used for optimization.
Compute the log density of the data at the new data points.
input_data – A tuple of time points and observations containing the data at which to calculate the loss for training the model: a tensor of inputs with shape batch_shape + [num_data], a tensor of observations with shape batch_shape + [num_data, observation_dim].