1. Evidence Lower Bound

Suppose we want to model a dataset $x$ using a model with latent variables $z$. Our objective is to maximize the probability our model assigns to the actual data, meaning we want to maximize $p(x)$.

This quantity is called the marginal likelihood or evidence. A problem is that $p(x) = \int p(z)p(x|z)dz$, so the evidence is often intractable to even compute.

Let $q(z)$ be any distribution over $z$. By Jensen’s Inequality,

\begin{align*} \log p(x) = \log \mathop{\mathbb{E}}_{q(z)} \left[ \frac{p(x, z)}{q(z)} \right] \geq \mathop{\mathbb{E}}_{q(z)} \left[ \log \frac{p(x, z)}{q(z)} \right] \end{align*}

We call the right hand side the ELBO (Evidence Lower BOund)1. The ELBO is more useful because we can approximate it with samples from $q(z)$, which can be any distribution we want.

Since the ELBO is a lower bound of the evidence, iteratively maximizing it corresponds to sequentially obtaining a better and better ‘minimum estimate’ of the evidence. In other words, the ELBO can act as a tractable surrogate for the evidence when the inequality is sufficiently tight.
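
As a concrete sanity check, here is a sketch using a made-up conjugate Gaussian model (prior $z \sim \mathcal{N}(0,1)$, likelihood $x|z \sim \mathcal{N}(z,1)$), chosen because the evidence happens to be tractable there, so we can verify numerically that a Monte Carlo estimate of the ELBO sits below the true log evidence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (made up for illustration): z ~ N(0, 1), x | z ~ N(z, 1).
# The evidence is tractable here: p(x) = N(x; 0, 2).
def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

x = 1.3                              # observed datapoint
log_evidence = log_normal(x, 0.0, 2.0)

# Any q(z) gives a lower bound; pick q(z) = N(0.5, 1.0) arbitrarily.
q_mean, q_var = 0.5, 1.0
z = rng.normal(q_mean, np.sqrt(q_var), size=100_000)

# Monte Carlo estimate of E_q[log p(x, z) - log q(z)]
log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
elbo = np.mean(log_joint - log_normal(z, q_mean, q_var))

assert elbo <= log_evidence          # the bound holds
```

The gap between the two numbers is exactly $KL(q(z) \,||\, p(z|x))$, which shrinks to zero as $q$ approaches the true posterior.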

2. Variational Inference

Suppose we want to perform Bayesian inference on a probabilistic model $p(x, z)$. This means we want to compute the posterior

\begin{align*} p(z|x) = \frac{p(x, z)}{p(x)} = \frac{p(x, z)}{\int p(x, z) dz} \end{align*}

For most models that are even slightly interesting, the denominator $\int p(x, z) dz$ is intractable.

One way around this issue is variational inference, where we replace an intractable computation problem with an optimization problem. Let $q(z; \phi)$ be a family of distributions parameterized by $\phi$. We solve the optimization problem

\begin{align*} \phi^* = \mathop{\mathrm{argmin}}_{\phi} \; d \left( q(z; \phi), \, p(z|x) \right) \end{align*}

Here, $d(\cdot, \cdot)$ is some measure of closeness between probability distributions2.

Let’s use $KL(\cdot || \cdot)$ for $d(\cdot, \cdot)$ and see what happens:

\begin{align*} KL \left( q(z; \phi) \, || \, p(z|x) \right) &= \mathop{\mathbb{E}}_{q(z; \phi)} \left[ \log q(z; \phi) - \log p(z|x) \right] \\ &= \mathop{\mathbb{E}}_{q(z; \phi)} \left[ \log q(z; \phi) - \log p(x, z) \right] + \log p(x) \\ &= - \mathop{\mathbb{E}}_{q(z; \phi)} \left[ - \log q(z; \phi) + \log p(x, z) \right] + \log p(x) \end{align*}

For brevity, we will drop parameters $\phi$ from here on. The first term on the bottom line, $\mathop{\mathbb{E}}_{q(z)} \left[- \log q(z) + \log p(x, z) \right]$, is just a different way of writing the ELBO!

Since the generative process does not depend on the parameters $\phi$, $\log p(x)$ is a constant w.r.t. $\phi$. Therefore, the ELBO can be written as

\begin{align*} \mathrm{ELBO} = -KL \left( q(z) || p(z|x) \right) + \mathrm{const} \end{align*}

Thus, maximizing the ELBO is equivalent to minimizing our original variational objective $KL(q || p)$.
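
This identity, $\mathrm{ELBO} = \log p(x) - KL(q(z) \,||\, p(z|x))$, can be checked numerically. Below is a sketch on a made-up conjugate Gaussian model (all numbers are illustrative), where both the evidence and the KL to the true posterior have closed forms:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (made up): z ~ N(0, 1), x | z ~ N(z, 1),
# so p(x) = N(x; 0, 2) and the true posterior is p(z|x) = N(x/2, 1/2).
def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def kl_gauss(m1, v1, m2, v2):
    # KL(N(m1, v1) || N(m2, v2)) in closed form
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)

x = 1.3
q_mean, q_var = 0.5, 1.0

# Monte Carlo ELBO under q(z) = N(q_mean, q_var)
z = rng.normal(q_mean, np.sqrt(q_var), size=200_000)
elbo = np.mean(log_normal(z, 0, 1) + log_normal(x, z, 1)
               - log_normal(z, q_mean, q_var))

log_evidence = log_normal(x, 0, 2)
kl_to_post = kl_gauss(q_mean, q_var, x / 2, 0.5)

# ELBO = log p(x) - KL(q || posterior), up to Monte Carlo error
assert abs(elbo - (log_evidence - kl_to_post)) < 0.01
```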

3. Autoencoder

Note that the ELBO is an expectation with respect to the distribution $q(z)$:

\begin{align*} \mathrm{ELBO} = \mathop{\mathbb{E}}_{q(z)} \left[ \log p(x|z) + \log \frac{p(z)}{q(z)} \right] = \mathop{\mathbb{E}}_{q(z)} \left[ \log p(x|z) \right] - KL \left( q(z) \, || \, p(z) \right) \end{align*}

meaning we can estimate the ELBO using Monte Carlo samples

\begin{align*} \mathrm{ELBO} \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \log p(x|z_i) + \log \frac{p(z_i)}{q(z_i)} \right], \qquad z_i \sim q(z) \end{align*}

The connection with autoencoders becomes clear. The inference model $q(\cdot)$ can be seen as an encoder while the generative model $p(x|z)$ takes the role of the decoder. The process for computing this loss function is to get data $x_i$, encode to get a probabilistic “embedding” $z_i \sim q_\phi(z|x_i)$, and then to decode this embedding to get a distribution in data space $p(x_i | z_i)$.

Continuing with this interpretation, the first term of the ELBO can be seen as a negative reconstruction loss and the second term as a negative regularization loss. Note that these are negative losses since we aim to maximize the ELBO while losses are to be minimized.
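
Here is a minimal sketch of this loss computation on a one-dimensional linear-Gaussian toy model; the encoder outputs `mu` and `sigma` are made up for illustration rather than produced by a trained network:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical model: p(z) = N(0, 1), decoder p(x|z) = N(z, 1),
# encoder q(z|x) = N(mu, sigma^2) with made-up parameters.
def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

x = 0.7
mu, sigma = 0.4, 0.8                 # assumed encoder output for this x

# Encode: sample probabilistic "embeddings" z ~ q(z|x)
z = rng.normal(mu, sigma, size=100_000)

# Decode: evaluate the data likelihood under p(x|z)
reconstruction = np.mean(log_normal(x, z, 1.0))      # E_q[log p(x|z)]

# KL(q(z|x) || p(z)) has a closed form for two Gaussians
regularization = 0.5 * (np.log(1.0 / sigma**2) + sigma**2 + mu**2 - 1)

elbo = reconstruction - regularization
```

In a real VAE the expectation over $z$ is usually estimated with a single reparameterized sample per datapoint, and `mu`, `sigma` come from the encoder network.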

4. Free Energy

We rearrange to get

\begin{align*} \mathrm{ELBO} = \mathop{\mathbb{E}}_{q(z)} \left[ \log p(x, z) \right] + H(q) = - \left( \mathop{\mathbb{E}}_{q(z)} \left[ E_x(z) \right] - H(q) \right) \end{align*}

where $H(q)$ is the entropy of $q$.

The negative ELBO is exactly the Helmholtz free energy defined by the energy function $E_x(z)=-\log p(x, z)$. The state of this ‘physical system’ is $z$; a state has low energy when the joint likelihood $p(x, z)$ is high. Put another way, the free energy is low when the distribution over states $q(z)$ concentrates on the $z$s that, according to our generative model $p(x, z)$, co-occur often with the true data $x$.

As a simple consequence, recall the classical fact that the distribution over states minimizing the free energy is the Boltzmann distribution:

\begin{align*} q^*(z) = \frac{e^{-E_x(z)}}{\int e^{-E_x(z')} dz'} = \frac{p(x, z)}{\int p(x, z') dz'} = p(z|x) \end{align*}

where we have assumed that the inverse temperature is $1$, without loss of generality. We see that the distribution $q(z)$ that minimizes the free energy, and hence maximizes the ELBO, is $p(z|x)$, the true posterior.
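
Both claims are easy to verify on a discrete toy system (the joint probabilities below are made up): the Boltzmann distribution for $E_x(z) = -\log p(x, z)$ is exactly the posterior, it attains the minimum free energy $-\log p(x)$, and any other $q$ does worse:

```python
import numpy as np

# Made-up joint probabilities p(x, z) for one fixed observation x
# and three latent states z = 0, 1, 2.
p_joint = np.array([0.10, 0.25, 0.05])
energy = -np.log(p_joint)              # E_x(z) = -log p(x, z)

def free_energy(q):
    # F(q) = E_q[E_x(z)] - H(q), i.e. the negative ELBO
    entropy = -np.sum(q * np.log(q))
    return np.sum(q * energy) - entropy

# Boltzmann distribution at inverse temperature 1 = true posterior p(z|x)
boltzmann = np.exp(-energy) / np.exp(-energy).sum()
f_min = free_energy(boltzmann)

# The minimum free energy is -log p(x) = -log sum_z p(x, z)
assert np.isclose(f_min, -np.log(p_joint.sum()))

# Any other distribution over states has strictly higher free energy
uniform = np.array([1 / 3, 1 / 3, 1 / 3])
assert free_energy(uniform) > f_min
```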

5. Minimum Description Length (Bits-Back Coding)

Recall a basic result of information theory: if an object $x$ follows a distribution $p_1(x)$, and a distribution $p_2(x)$ is agreed upon beforehand, there exists a code that losslessly describes $x \sim p_1(x)$ with expected codelength $\mathbb{E}_{p_1(x)} \left[- \log p_2(x) \right]$.

Suppose we want to describe a datapoint $x$ using the fewest possible bits3. Let’s use our generative model $p(x, z)$ to come up with a concise description of $x$. Our strategy is to first describe $z$, and then describe $x$. This description of $x$ should cost fewer bits because the receiver already has a rough idea of what $x$ is (namely $p(x|z)$).

The expected length of this two-part code is

\begin{align*} \mathop{\mathbb{E}}_{q(z|x)} \left[ - \log p(z) - \log p(x|z) \right] \end{align*}

The bits-back argument is that this scheme should get a ‘refund’ since we can recover some information about $z$ from $x$. Assume that the sender’s inference model $q(z|x)$ is also agreed on beforehand. The two-part code is redundant by however many bits can be learned about $z$ from $x$ using $q$, which is $- \log q(z|x)$ bits. Thus the true expected description length is

\begin{align*} \mathop{\mathbb{E}}_{q(z|x)} \left[ - \log p(z) - \log p(x|z) + \log q(z|x) \right] \end{align*}

This is exactly the negative ELBO. Thus, maximizing the ELBO of a generative model is equivalent to minimizing the expected description length of the bits-back code using that model.
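
The accounting can be spelled out on a tiny discrete example (all probabilities below are made up); the net bits-back codelength matches the negative ELBO term by term:

```python
import numpy as np

# Made-up discrete model: binary latent z and one fixed observation x.
p_z = np.array([0.3, 0.7])            # prior p(z)
p_x_given_z = np.array([0.8, 0.4])    # p(x | z) for the observed x
q_z = np.array([0.5, 0.5])            # inference model q(z | x)

# Two-part code: describe z with p(z), then x with p(x|z),
# where the sender draws z from the inference model q.
two_part = np.sum(q_z * (-np.log(p_z) - np.log(p_x_given_z)))

# Bits-back refund: the receiver can recover -log q(z|x) bits about z.
refund = np.sum(q_z * (-np.log(q_z)))

net_length = two_part - refund

# The net description length equals the negative ELBO.
elbo = np.sum(q_z * (np.log(p_z) + np.log(p_x_given_z) - np.log(q_z)))
assert np.isclose(net_length, -elbo)
```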

1: Note that the only property required for this is that $p(x) = \mathbb{E} \frac{p(x, z)}{q(z)}$. One can derive a tighter lower bound by using different things in place of $\frac{p(x, z)}{q(z)}$, as pointed out in the IWVI paper.

2: One can immediately observe that the quality of the resulting “optimal” distribution $q(z; \phi^*)$ hinges on having a sufficiently expressive posterior family $q(z; \phi)$. The usual parameterization of $q(z; \phi)$ in VAEs is as the mean and variance of a Gaussian distribution, which is a very small class of distributions. Normalizing Flows have been proposed as a way to express a larger class of distributions in this context.

3: This is known as the minimum description length (MDL) principle.