A generalized KL divergence, Helmholtz free energy, and bits-back coding

Variational Bayesian methods optimize a quantity called the Evidence Lower BOund (ELBO). I think the standard way of motivating and deriving this objective fails to capture what it is and why we should care.

Here is the standard setup and derivation of the ELBO. Unobserved variables \(z\) generate observed data \(x\) according to \(p(x|z)\). We use a distribution \(q(z)\) to approximate the posterior \(p(z|x) \propto p(z)p(x|z)\). We can get a lower bound of the evidence using Jensen’s inequality:

\[\begin{align} \log p(x) &= \log \mathbb{E}_{z \sim q(z)} \left[\frac{p(x, z)}{q(z)} \right] \\ &\geq \mathbb{E}_{z \sim q(z)} \left[ \log \frac{p(x, z)}{q(z)} \right] \triangleq \mathrm{ELBO}. \end{align}\]A natural question is: why is this called “the” evidence lower bound? There are infinitely many quantities that lower bound the evidence. What is the significance of this one?

In this post, I will motivate the ELBO from three different viewpoints. The ELBO directly arises as a quantity of interest in each interpretation, better motivating its use as an objective.

Let \(f(z)\) be an unnormalized density function, meaning that \(f(z) \geq 0\) everywhere and \(0 < \int f(z) dz < \infty\). Normalizing this function to integrate to \(1\) would result in a probability density function:

\[p_f(z) \triangleq \frac{f(z)}{\int f(z) dz} = \frac{f(z)}{Z}.\]Note that the KL divergence between \(q(z)\) and \(p_f(z)\) can be expressed using \(q(z)\) and \(f(z)\):

\[\begin{align} \mathrm{KL} \left( q(z) \rVert p_f(z) \right) &= \mathbb{E}_{z \sim q(z)} \left[ \log q(z) - \log p_f(z) \right] \\ &= \mathbb{E}_{z \sim q(z)} \left[ \log q(z) - \log f(z) \right] + Z \\ &\triangleq \mathrm{KL}^* \left( q(z) \rVert f(z) \right) + Z. \end{align}\]The constant \(Z\) can be ignored if we want to minimize this w.r.t. the distribution \(q(z)\). Note that we have introduced the new symbol \(\mathrm{KL}^* \left( \cdot \rVert \cdot \right)\) to signify the equation’s similarity to the KL divergence. This can be seen as a direct generalization of the KL divergence which measures the similarity between any distribution \(q(z)\) and nonnegative function \(f(z)\). \(\mathrm{KL}^* \left( q(z) \rVert f(z) \right)\) has minimum value when \(q(z) \propto f(z)\), just as \(\mathrm{KL} \left( q(z) \rVert p_f(z) \right)\) has minimum value when \(q(z)=p_f(z)\).

The ELBO is just a special case of this similarity measure where \(f_p(z) = p(x,z)\):

\[\begin{align} -\mathrm{ELBO} = \mathbb{E}_{z \sim q(z)} \left[ \log q(z) - \log p(x, z) \right] = \mathrm{KL}^* \left( q(z) \rVert f_p(z) \right). \end{align}\]Think of the possible values of \(z\) as the phase space of a physical system.
We assume \(x\) is observed and that \((z, x) \sim p(z)p(x|z)\).
The statistical description of \(z\) through Bayes rule is that \(p(z|x) \propto p(z)p(x|z)\).
Equivalently, we can say that \(z\) follows the Boltzmann distribution for energy function

A fundamental result from statistical mechanics: given energy function \(E\), the *free energy* defined as

is minimized when \(q(z)\) is the Boltzmann distribution w.r.t. \(E\)

This interpretation of the ELBO has its roots in the concept of Helmholtz free energy in thermodynamics.
Free energy has since been shown to emerge from the principle of maximum entropy*free energy principle*

We invoke a basic lemma from information theory. Assume that a sender and receiver agree on a distribution \(p_1(x)\) and constructed an optimal code for it. The expected description length of a sample from \(x \sim p_2(x)\) is

\[\mathbb{E}_{x \sim p_2(x)} [-\log p_1(x)].\]We construct a coding scheme for transmitting observations \(x\) generated from unobserved latent variables \(z\). The sender and receiver share knowledge of the generative distribution \(p(x|z)\) along with an approximate posterior \(q(z|x)\). The sender uses the following coding scheme:

- Observe \(x\) from data, sample \(z \sim p(z \vert x)\)
- Encode \(z\) according to \(p(z)\)
- Encode \(x\) according to \(p(x \vert z)\)
- Send encoded \(z, x\)

Naively applying the lemma above to the two parts of the code, it may seem that the expected description length for this two-part code is

\[\mathbb{E}_{x \sim \textrm{data}, z \sim q(z \vert x)} \left[ -\log p(z) -\log p(x|z) \right].\]The **Bits-back argument** states that one can get a “refund” from this code because of duplicate information in \(z\) and \(x\).
The two-part code is redundant because the sender’s inference model \(q(z|x)\) is also known by the receiver.
This redundancy is the number of bits one can learn about \(z\) from \(x\) using \(q(z|x)\), which is \(-\log q(z|x)\) bits.
Therefore, the true expected description length is

so maximizing the ELBO can be seen as minimizing the expected code length of this protocol.

The bits-back argument was first introduced by Hinton and van Camp