# Five Ways to Dissect an ELBO

### 1. Evidence Lower Bound

Suppose we want to model a set of data with a model that has latent variables $z$. Our objective is to maximize the probability the model assigns to the observed data, meaning we want to maximize $p(x)$.

This quantity is called the marginal likelihood or **evidence**.
A problem is that $p(x) = \int p(z)p(x|z)dz$, so the evidence is often intractable to even compute.

Let $q(z)$ be any distribution over $z$. By Jensen’s Inequality,

$$\log p(x) = \log \mathop{\mathbb{E}}_{q(z)} \left[ \frac{p(x, z)}{q(z)} \right] \geq \mathop{\mathbb{E}}_{q(z)} \left[ \log \frac{p(x, z)}{q(z)} \right].$$

We call the right-hand side the **ELBO (Evidence Lower BOund)**^{1}.
The ELBO is more useful because we can approximate it with samples from $q(z)$, which can be any distribution we want.

Since the ELBO is a lower bound of the evidence, iteratively maximizing it corresponds to sequentially obtaining a better and better ‘minimum estimate’ of the evidence. In other words, the ELBO can act as a tractable surrogate for the evidence when the inequality is sufficiently tight.
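
To make the sampling idea concrete, here is a minimal numpy sketch (an illustration, not part of the original derivation): the toy model $z \sim \mathcal{N}(0, 1)$, $x|z \sim \mathcal{N}(z, 1)$, the single observation, and the variational distribution $q(z)$ below are all assumptions chosen so that the evidence $p(x) = \mathcal{N}(x; 0, 2)$ is known exactly and the bound can be checked.

```python
# Monte Carlo estimate of the ELBO for an assumed toy model:
#   z ~ N(0, 1),  x | z ~ N(z, 1),  so p(x) = N(x; 0, 2) in closed form.
import numpy as np

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(0)
x = 1.3                                   # a single (made-up) observed datapoint

mu_q, var_q = 0.5, 0.8                    # an arbitrary q(z) = N(mu_q, var_q)
z = rng.normal(mu_q, np.sqrt(var_q), size=100_000)        # samples from q(z)

log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)   # log p(x, z)
log_q = log_normal(z, mu_q, var_q)                             # log q(z)
elbo = np.mean(log_joint - log_q)

print(f"ELBO estimate: {elbo:.4f}")
print(f"log p(x):      {log_normal(x, 0.0, 2.0):.4f}")    # always >= the ELBO
```

Taking more samples reduces the Monte Carlo error of the estimate, but the gap between the ELBO and $\log p(x)$ only closes by choosing a better $q(z)$.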

### 2. Variational Inference Objective + const

Suppose we want to perform Bayesian inference on a probabilistic model $p(x, z)$. This means we want to compute the posterior

$$p(z|x) = \frac{p(x, z)}{p(x)} = \frac{p(x, z)}{\int p(x, z)\, dz}.$$

For most models that are even slightly interesting, the denominator $\int p(x, z) dz$ is intractable.

One way around this issue is **variational inference**, where we replace an intractable computation problem with an optimization problem.
Let $q(z; \phi)$ be a family of distributions parameterized by $\phi$.
We solve the optimization problem

$$\phi^* = \arg\min_{\phi} \; d\big(q(z; \phi),\, p(z|x)\big).$$

Here, $d(\cdot, \cdot)$ is some measure of closeness between probability distributions^{2}.

Let’s use $KL(\cdot || \cdot)$ for $d(\cdot, \cdot)$ and see what happens:

$$
\begin{aligned}
KL(q(z; \phi) \,||\, p(z|x)) &= \mathop{\mathbb{E}}_{q(z; \phi)} \left[ \log q(z; \phi) - \log p(z|x) \right] \\
&= \mathop{\mathbb{E}}_{q(z; \phi)} \left[ \log q(z; \phi) - \log p(x, z) \right] + \log p(x) \\
\implies \quad \log p(x) &= \mathop{\mathbb{E}}_{q(z; \phi)} \left[ - \log q(z; \phi) + \log p(x, z) \right] + KL(q(z; \phi) \,||\, p(z|x)).
\end{aligned}
$$

For brevity, we will drop the parameters $\phi$ from here on. The first term on the bottom line, $\mathop{\mathbb{E}}_{q(z)} \left[- \log q(z) + \log p(x, z) \right]$, is just a different way of writing the ELBO!

Since the generative process does not depend on the parameters $\phi$, $\log p(x)$ is a constant w.r.t. $\phi$. Therefore, the ELBO can be written as

$$\mathrm{ELBO} = \log p(x) - KL(q(z) \,||\, p(z|x)) = -KL(q \,||\, p) + \mathrm{const}.$$

Thus, maximizing the ELBO is equivalent to minimizing our original variational objective $KL(q || p)$.
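
The identity can also be checked numerically. The following sketch uses a made-up three-state discrete model (the probability tables and the choice of $q$ are illustrative assumptions) and verifies that the ELBO equals $\log p(x) - KL(q || p(z|x))$.

```python
# Verify ELBO = log p(x) - KL(q(z) || p(z|x)) on an assumed tiny discrete model.
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])            # prior p(z) over z in {0, 1, 2}
p_x_given_z = np.array([0.9, 0.4, 0.1])    # p(x = 1 | z)

p_xz = p_z * p_x_given_z                   # joint p(x = 1, z)
p_x = p_xz.sum()                           # evidence p(x = 1)
p_z_given_x = p_xz / p_x                   # exact posterior

q = np.array([0.6, 0.3, 0.1])              # an arbitrary variational q(z)

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))

print(np.isclose(elbo, np.log(p_x) - kl))  # True
```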

### 3. Autoencoder

Note that the ELBO is an expectation with respect to the distribution $q(z)$:

$$\mathrm{ELBO} = \mathop{\mathbb{E}}_{q(z)} \left[ \log p(x|z) + \log p(z) - \log q(z) \right] = \mathop{\mathbb{E}}_{q(z)} \left[ \log p(x|z) \right] - KL(q(z) \,||\, p(z)),$$

meaning we can estimate the ELBO using Monte Carlo samples $z_1, \dots, z_N \sim q(z)$:

$$\mathrm{ELBO} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \log p(x|z_n) + \log p(z_n) - \log q(z_n) \right].$$

The connection with autoencoders now becomes clear. The inference model $q(\cdot)$ can be seen as an encoder, while the generative model $p(x|z)$ takes the role of the decoder. Computing this objective for a datapoint $x_i$ amounts to encoding it into a probabilistic “embedding” $z_i \sim q_\phi(z|x_i)$, then decoding this embedding into a distribution over data space and evaluating $p(x_i | z_i)$.

Continuing with this interpretation, the first term of the ELBO can be seen as a negative reconstruction loss and the second term as a negative regularization loss. Note that these are negative losses since we aim to maximize the ELBO while losses are to be minimized.
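
As a sketch of how these two terms are assembled in practice, the following numpy snippet walks through the encode/sample/decode steps for one-dimensional data. The encoder and decoder here are hand-written toy Gaussian maps rather than trained networks, and the batch of datapoints is made up; the KL regularizer is computed in closed form against a standard normal prior, as is common for Gaussian VAEs.

```python
# Encode -> sample -> decode, then assemble reconstruction and KL terms.
# The encoder/decoder are assumed toy functions, not a trained model.
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def encoder(x):
    # q(z | x): a Gaussian whose mean and variance are simple functions of x.
    return 0.5 * x, 0.7 * np.ones_like(x)              # (mu, var)

def decoder(z):
    # p(x | z): a Gaussian in data space centred at 2 * z with unit variance.
    return 2.0 * z, np.ones_like(z)                    # (mu, var)

def kl_to_standard_normal(mu, var):
    # Closed-form KL( N(mu, var) || N(0, 1) ), the usual regularizer.
    return 0.5 * (var + mu ** 2 - 1.0 - np.log(var))

x = np.array([1.2, -0.3, 0.8])                         # a made-up batch of data

mu_q, var_q = encoder(x)                               # "encode"
z = mu_q + np.sqrt(var_q) * rng.standard_normal(x.shape)   # z ~ q(z | x)
mu_p, var_p = decoder(z)                               # "decode"

reconstruction = log_normal(x, mu_p, var_p)            # first ELBO term
regularization = kl_to_standard_normal(mu_q, var_q)    # second ELBO term
elbo = np.mean(reconstruction - regularization)        # loss to minimize: -elbo
print(elbo)
```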

### 4. Free Energy

We rearrange the ELBO to get

$$\mathrm{ELBO} = \mathop{\mathbb{E}}_{q(z)} \left[ \log p(x, z) \right] + H(q) = -\Big( \mathop{\mathbb{E}}_{q(z)} \left[ E_x(z) \right] - H(q) \Big),$$

where $H(q) = \mathop{\mathbb{E}}_{q(z)}[-\log q(z)]$ is the entropy of $q$. The negative ELBO is exactly the Helmholtz free energy defined by the energy function $E_x(z) = -\log p(x, z)$. The state of this ‘physical system’ is $z$; it has low energy when the joint likelihood $p(x, z)$ is high. Another way to say this is that the expected energy is small when the distribution over states $q(z)$ places its mass on the $z$s that, according to our generative model $p(x, z)$, appear often together with the true data $x$.

As a simple consequence, consider the fact that the distribution over states that minimizes the free energy is the Boltzmann distribution:

$$q^*(z) = \frac{e^{-E_x(z)}}{\int e^{-E_x(z')}\, dz'} = \frac{p(x, z)}{\int p(x, z')\, dz'} = p(z|x),$$

where we have assumed that the inverse temperature is $1$, without loss of generality. We see that the distribution $q(z)$ that maximizes the ELBO (equivalently, minimizes the free energy) is $p(z|x)$, the true posterior.
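
Both claims are easy to check numerically. The sketch below uses a small made-up discrete model (the probability tables are illustrative assumptions): the Boltzmann distribution for $E_x(z) = -\log p(x, z)$ coincides with the exact posterior, and no randomly drawn $q$ achieves a lower free energy $F(q) = \mathop{\mathbb{E}}_{q}[E_x(z)] - H(q)$.

```python
# Check that the Boltzmann distribution for E_x(z) = -log p(x, z) is the
# posterior p(z|x), and that it minimizes the free energy, on an assumed model.
import numpy as np

rng = np.random.default_rng(0)

p_z = np.array([0.5, 0.3, 0.2])            # prior over z in {0, 1, 2}
p_x_given_z = np.array([0.9, 0.4, 0.1])    # p(x = 1 | z)
p_xz = p_z * p_x_given_z                   # joint p(x = 1, z)

energy = -np.log(p_xz)                     # E_x(z) = -log p(x, z)
boltzmann = np.exp(-energy) / np.exp(-energy).sum()
posterior = p_xz / p_xz.sum()
print(np.allclose(boltzmann, posterior))   # True

def free_energy(q):
    # F(q) = E_q[E_x(z)] - H(q), with H(q) the entropy of q.
    return np.sum(q * energy) - np.sum(-q * np.log(q))

for _ in range(5):
    q = rng.dirichlet(np.ones(3))          # a random alternative q(z)
    assert free_energy(q) >= free_energy(posterior)
print("the posterior minimizes the free energy on this example")
```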

### 5. Minimum Description Length (Bits-Back Coding)

Recall a basic result of information theory: assuming that an object $x$ follows a distribution $p_1(x)$, and a distribution $p_2(x)$ is agreed upon beforehand, there exists a code to losslessly describe $x \sim p_1(x)$ with expected codelength $\mathbb{E}_{p_1(x)} \left[- \log p_2(x) \right]$.

Suppose we want to describe a datapoint $x$ using as few bits as possible^{3}.
Let’s use our generative model $p(x, z)$ to come up with a concise description of $x$.
Our strategy is to first describe $z$ (coded under the prior $p(z)$), and then describe $x$ given $z$ (coded under $p(x|z)$).
This description of $x$ should cost fewer bits because the receiver already has a rough idea of what $x$ is.

The expected length of this two-part code, when the sender draws $z$ from a distribution $q(z|x)$, is

$$\mathop{\mathbb{E}}_{q(z|x)} \left[ -\log p(z) - \log p(x|z) \right].$$

The bits-back argument is that this scheme should get a ‘refund’, since the receiver can recover some information about $z$ from $x$. Assume that the sender’s inference model $q(z|x)$ is also agreed upon beforehand. The two-part code is then redundant by however many bits can be learned about $z$ from $x$ using $q$, which is $- \log q(z|x)$ bits. Thus the true expected description length is

$$\mathop{\mathbb{E}}_{q(z|x)} \left[ -\log p(z) - \log p(x|z) + \log q(z|x) \right] = -\mathop{\mathbb{E}}_{q(z|x)} \left[ \log \frac{p(x, z)}{q(z|x)} \right],$$

which is equal to the negative ELBO. Thus, maximizing the ELBO of a generative model is equivalent to minimizing the expected description length of the bits-back code using that model.
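
The accounting can be verified on a small example. The sketch below uses made-up probability tables and an imperfect inference model $q(z|x)$: it computes the naive two-part codelength, subtracts the bits-back refund, and confirms that the net cost equals the negative ELBO measured in bits and never drops below the ideal codelength $-\log_2 p(x)$.

```python
# Bits-back bookkeeping on an assumed tiny discrete model (lengths in bits).
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])            # prior p(z)
p_x_given_z = np.array([0.9, 0.4, 0.1])    # p(x = 1 | z)
p_xz = p_z * p_x_given_z
p_x = p_xz.sum()

q = np.array([0.7, 0.2, 0.1])              # an (imperfect) inference model q(z|x)

two_part = np.sum(q * (-np.log2(p_z) - np.log2(p_x_given_z)))  # send z, then x
refund = np.sum(q * (-np.log2(q)))                             # bits recovered
net = two_part - refund                                        # bits-back length

neg_elbo_bits = -np.sum(q * np.log2(p_xz / q))
print(np.isclose(net, neg_elbo_bits))      # True
print(bool(net >= -np.log2(p_x)))          # True: never beats -log2 p(x)
```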

1: Note that the only property required for this is that $p(x) = \mathop{\mathbb{E}}_{q(z)} \left[ \frac{p(x, z)}{q(z)} \right]$. One can derive a tighter lower bound by using different things in place of $\frac{p(x, z)}{q(z)}$, as pointed out in the IWVI paper.

2: One can immediately observe that the quality of the resulting “optimal” distribution $q(z; \phi^*)$ hinges on having a sufficiently expressive posterior family $q(z; \phi)$. The usual parameterization of $q(z; \phi)$ in VAEs is as the mean and variance of a Gaussian distribution, which is a very restricted class of distributions. Normalizing Flows have been proposed as a way to express a larger class of distributions in this context.

3: This is known as the minimum description length (MDL) principle.