GSoC #2: \(x_0\) marks the spot on the MAP

Over this past week, I’ve been working to get this issue closed out. INLA allows us to perform Bayesian inference over models with latent Gaussians. By “latent”, we mean that the Gaussian component is not what we directly observe, but is related to the observed data nonetheless.

Consider the example of a linear mixed model:

\begin{equation} y = X\beta + Zb + \epsilon \end{equation}

\(\beta\) and \(b\) have slightly different meanings here (\(\beta\) are the “deterministic” regression coefficients which we use to fit the model, whereas \(b\) describes the internal dynamics governing the system’s randomness), but we can model them both with zero-mean Gaussians \(\beta \sim N(0, R)\) and \(b \sim N(0, \Sigma_b)\), which lets us collect everything into a single linear term with \(A = [X:Z]\) and \(x=(\beta^T, b^T)^T\). Since we now have \(y = A x + \epsilon\), our observables \(y\) are directly linked to this unseen quantity \(x\), hence the term “latent”.
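Concretely, the stacking amounts to nothing more than this (a minimal NumPy sketch with made-up shapes, just to pin down the notation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 3, 5            # observations, fixed effects, random effects
X = rng.normal(size=(n, p))    # design matrix for beta
Z = rng.normal(size=(n, q))    # design matrix for b
beta = rng.normal(size=p)      # stand-in draw for beta ~ N(0, R), here R = I
b = rng.normal(size=q)         # stand-in draw for b ~ N(0, Sigma_b), here Sigma_b = I
eps = 0.1 * rng.normal(size=n)

A = np.hstack([X, Z])          # A = [X : Z]
x = np.concatenate([beta, b])  # x = (beta^T, b^T)^T

# y = A x + eps is exactly the same model as y = X beta + Z b + eps
y = A @ x + eps
assert np.allclose(y, X @ beta + Z @ b + eps)
```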

This gives rise to a three-tier hierarchical model: the observables \(y\) depend on the latent \(x\), and \(x\) is parameterised itself by some parameters \(\theta\):

\begin{equation} p(y,x,\theta) = p(y\mid x,\theta)p(x\mid \theta)p(\theta) \end{equation}
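Written as a (hypothetical) PyMC model, the three tiers line up like this: hyperparameters \(\theta\) at the top, the latent Gaussian \(x = (\beta, b)\) in the middle, and the observed \(y\) at the bottom. The priors and shapes below are made up purely for illustration:

```python
import numpy as np
import pymc as pm

# Toy data with made-up shapes (matching the NumPy sketch above).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Z = rng.normal(size=(100, 5))
y_obs = rng.normal(size=100)

with pm.Model() as lmm:
    # theta: hyperparameters, p(theta)
    sigma_b = pm.HalfNormal("sigma_b", 1.0)
    sigma_y = pm.HalfNormal("sigma_y", 1.0)

    # x = (beta, b): the latent Gaussian, p(x | theta)
    beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=3)
    b = pm.Normal("b", mu=0.0, sigma=sigma_b, shape=5)

    # y: the observables, p(y | x, theta)
    mu = pm.math.dot(X, beta) + pm.math.dot(Z, b)
    pm.Normal("y", mu=mu, sigma=sigma_y, observed=y_obs)
```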

Although we ultimately want to find \(x\), in the Bayesian setting \(x\) is described by a distribution, which we parametrise using \(\theta\), so what we really want is to do Bayesian inference over \(\theta\):

\begin{equation} p(\theta\mid y) = \frac{p(y\mid x,\theta)p(x\mid \theta)p(\theta)}{p(x\mid y,\theta)} \frac{1}{p(y)} \end{equation}

Computing the \(p(x\mid y,\theta)\) in the denominator is the tricky part: its normalising constant requires marginalising \(x\) out of the joint, which involves an integral we can’t generally do exactly. As I alluded to in last week’s post, integrals of exponentials can be approximated in a way whose result takes the form of a Gaussian evaluated at the mode of the system \(x_0\), which brings us full circle back to the issue I’ve been working on solving: implementing a mode-finder.
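For reference, the approximation in question is the Laplace approximation: expand the log-integrand \(g(x)\) to second order around its mode \(x_0\) and integrate the resulting Gaussian,

\begin{equation} \int e^{g(x)}\,dx \approx e^{g(x_0)} \int e^{-\frac{1}{2}(x - x_0)^T H (x - x_0)}\,dx = e^{g(x_0)} \frac{(2\pi)^{n/2}}{\lvert H \rvert^{1/2}}, \qquad H = -\nabla^2 g(x_0), \end{equation}

so the whole scheme hinges on actually being able to find \(x_0\).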

At a high level, this problem is easy to pose: minimise the negative log likelihood of \(y\) with respect to \(x\) and you get the mode \(x_0\)¹. The hard part is actually implementing a minimiser. Fortunately, SciPy already has functionality for this problem, and even better, Jesse and Ricardo had just finished implementing a PyTensor-native wrapper for scipy.optimize, which made this a piece of cake.
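For a sense of the mechanics, here is a rough sketch using plain scipy.optimize, with a toy Gaussian density standing in for the real model (this is not the code from the PR, which uses the PyTensor-native wrapper instead):

```python
import numpy as np
from scipy import optimize

# Toy stand-in for the negative log density we want to minimise over x:
# a Gaussian with precision Q and mean mu, so the mode should recover mu.
def neg_log_density(x, Q, mu):
    d = x - mu
    return 0.5 * d @ Q @ d

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
mu = np.array([1.0, -2.0])

res = optimize.minimize(neg_log_density, x0=np.zeros(2), args=(Q, mu), method="BFGS")
x_0 = res.x  # the mode; for this toy density it coincides with mu
```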

At least it should have.

As is always the case, unexpected roadblocks slowed me down. Getting the hang of PyTensor and PyMC, and getting a proper understanding of what the mathematical theory meant in practice, led to several dead ends and goose chases, particularly around deciding how to handle the arguments describing variables other than \(x\). With the week already behind us (writing this on Wednesday… oops), the PR is almost ready, but just needs a few quality-of-life touch-ups. The hope is to get it cleared for a merge by the end of this week. In the meantime, I’ve made a start digging into the next issue on my list.

  1. EDIT: This is wrong!! Please see this blog post for clarification.