Variational Inference (VI)
Variational Inference (VI) is a mathematical framework for approximating complex posterior distributions in probabilistic models. Instead of sampling (as in Monte Carlo methods), VI transforms inference into an optimization problem, making it scalable and efficient for deep learning.
1. Problem Setup
We consider a probabilistic model with observed variables \( x \) and latent variables \( z \). The joint distribution is defined as
\[ p_\theta(x, z) = p_\theta(x \mid z)\, p_\theta(z), \]
where:
\( x \): observed data
\( z \): latent (hidden) variable
\( p_\theta(x \mid z) \): likelihood (decoder)
\( p_\theta(z) \): prior distribution
\( \theta \): model parameters
The goal is to find parameters that maximize the marginal likelihood of the observed data:
\[ p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz. \]
However, this integral is usually intractable: the latent space is high-dimensional and the likelihood typically depends on \( z \) through a nonlinear decoder, so no closed form exists.
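To see the issue concretely, here is a minimal sketch of the naive Monte Carlo estimate \( p_\theta(x) \approx \frac{1}{S}\sum_{s} p_\theta(x \mid z_s) \) with \( z_s \sim p_\theta(z) \); the toy one-dimensional model, the function names, and the numbers are my own illustrative assumptions, not part of the setup above. The estimator works in one dimension, but when \( z \) is high-dimensional almost no prior samples land where the likelihood is large, so its variance explodes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model (illustrative assumption): p(z) = N(0, 1),
# p(x | z) = N(f(z), 0.5^2) with a nonlinear "decoder" f.
def f(z):
    return np.tanh(2.0 * z)

def likelihood(x, z, sigma=0.5):
    return np.exp(-0.5 * ((x - f(z)) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Naive Monte Carlo estimate of p(x) = E_{p(z)}[p(x | z)].
# Feasible in 1D; hopeless when z has many dimensions.
x_obs = 0.3
z_samples = rng.standard_normal(100_000)
print("estimated p(x) ~", likelihood(x_obs, z_samples).mean())
```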
2. The Posterior Distribution
The posterior distribution represents our belief about \( z \) after observing \( x \):
\[ p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(x)} = \frac{p_\theta(x, z)}{p_\theta(x)}. \]
Directly computing this posterior is difficult because it requires evaluating \( p_\theta(x) \), which involves the intractable integral above.
3. Variational Approximation
We introduce a variational distribution \( q_\phi(z \mid x) \), parameterized by \( \phi \), as a tractable approximation to the true posterior:
\( q_\phi(z \mid x) \): approximate posterior (encoder)
\( \phi \): parameters of the variational distribution
Goal: make \( q_\phi(z \mid x) \) close to \( p_\theta(z \mid x) \); a minimal sketch of such an encoder follows this list.
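As a concrete picture of this amortized approximation, the sketch below represents \( q_\phi(z \mid x) \) as a diagonal Gaussian whose mean and log-variance come from a tiny linear encoder; the weight names, sizes, and initialization are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x_dim, z_dim = 4, 2                      # toy sizes (assumption)

# phi = encoder parameters: two linear maps, one for the mean and one for the
# log-variance of the diagonal Gaussian q_phi(z | x).
phi = {
    "W_mu": 0.1 * rng.standard_normal((z_dim, x_dim)), "b_mu": np.zeros(z_dim),
    "W_lv": 0.1 * rng.standard_normal((z_dim, x_dim)), "b_lv": np.zeros(z_dim),
}

def encode(x, phi):
    """Return mean and log-variance of q_phi(z | x)."""
    mu = phi["W_mu"] @ x + phi["b_mu"]
    log_var = phi["W_lv"] @ x + phi["b_lv"]
    return mu, log_var

def sample_q(x, phi, rng):
    """Draw z ~ q_phi(z | x) = N(mu, diag(exp(log_var)))."""
    mu, log_var = encode(x, phi)
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(z_dim)

x = rng.standard_normal(x_dim)
print("sample from q_phi(z | x):", sample_q(x, phi, rng))
```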
4. Kullback–Leibler Divergence
We measure the difference between the two distributions using the Kullback–Leibler (KL) divergence:
\[ \mathrm{KL}\big(q_\phi(z \mid x)\,\Vert\, p_\theta(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right]. \]
The KL divergence is non-negative and equals zero only when the two distributions coincide, so minimizing it brings \( q_\phi \) closer to the true posterior.
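For the common special case where both distributions are diagonal Gaussians (an assumption for this sketch, not a requirement of VI), the KL divergence has a closed form, and the snippet below confirms that it is positive when the distributions differ and zero when they coincide.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) in closed form."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

mu_q, var_q = np.array([0.5, -1.0]), np.array([0.8, 1.2])
mu_p, var_p = np.zeros(2), np.ones(2)

print(kl_diag_gaussians(mu_q, var_q, mu_p, var_p))   # > 0: distributions differ
print(kl_diag_gaussians(mu_p, var_p, mu_p, var_p))   # = 0: identical distributions
```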
5. Deriving the Evidence Lower Bound (ELBO)
Using the identity \( p_\theta(z \mid x) = \frac{p_\theta(x, z)}{p_\theta(x)} \), we can rewrite the KL divergence as:
\[ \mathrm{KL}\big(q_\phi(z \mid x)\,\Vert\, p_\theta(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(x, z)\big] + \log p_\theta(x). \]
Rearranging terms gives:
\[ \log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]}_{\mathcal{L}(\theta, \phi; x)} + \mathrm{KL}\big(q_\phi(z \mid x)\,\Vert\, p_\theta(z \mid x)\big). \]
Because the KL term is always non-negative, we obtain a lower bound on the log evidence:
\[ \log p_\theta(x) \;\geq\; \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]. \]
This lower bound is called the Evidence Lower Bound (ELBO).
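The bound is easy to verify numerically. The sketch below uses a one-dimensional conjugate Gaussian model of my own choosing (prior \( z \sim \mathcal{N}(0, 1) \), likelihood \( x \mid z \sim \mathcal{N}(z, 1) \)), for which \( \log p_\theta(x) \) is available in closed form, and compares it with Monte Carlo estimates of the ELBO for an arbitrary Gaussian \( q \) and for the exact posterior \( \mathcal{N}(x/2, 1/2) \).

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(y, mean, var):
    """Log-density of N(mean, var), evaluated elementwise."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

x = 1.3  # a single observed data point

def elbo(mu_q, var_q, n=200_000):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)] for q = N(mu_q, var_q)."""
    z = mu_q + np.sqrt(var_q) * rng.standard_normal(n)
    log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
    return np.mean(log_joint - log_normal(z, mu_q, var_q))

log_evidence = log_normal(x, 0.0, 2.0)   # for this toy model, p(x) = N(0, 2)

print("log p(x)                :", log_evidence)
print("ELBO, arbitrary q        :", elbo(mu_q=0.0, var_q=1.0))       # strictly below
print("ELBO, q = exact posterior:", elbo(mu_q=x / 2.0, var_q=0.5))   # tight (up to MC noise)
```

With the exact posterior the KL term vanishes and the bound is tight, which is exactly the rearranged identity above.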
6. ELBO Simplification
We can expand \( p_\theta(x, z) = p_\theta(x \mid z)\, p_\theta(z) \) to express the ELBO as:
\[ \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\Vert\, p_\theta(z)\big), \]
where:
\( \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] \): reconstruction term
\( \mathrm{KL}(q_\phi(z \mid x) \Vert p_\theta(z)) \): regularization term
The ELBO thus balances reconstruction accuracy against keeping the approximate posterior close to the prior.
7. Optimization Objective
We maximize the ELBO (or, equivalently, minimize its negative) with respect to both \( \theta \) and \( \phi \), summed over the dataset \( \{x^{(i)}\}_{i=1}^{N} \):
\[ \max_{\theta, \phi} \; \sum_{i=1}^{N} \mathcal{L}\big(\theta, \phi; x^{(i)}\big). \]
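In practice both sets of parameters are trained jointly with stochastic gradients. The sketch below is a minimal VAE-style training step, assuming the Gaussian setup of the next section, a unit decoder noise variance, an Adam optimizer, and the reparameterization trick of Kingma & Welling (2014); the architecture and sizes are placeholders, not a prescription.

```python
import torch
import torch.nn as nn

# Toy encoder/decoder; sizes and architecture are illustrative assumptions.
x_dim, z_dim, h_dim = 784, 16, 128
encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
enc_mu, enc_logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
decoder = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

params = [*encoder.parameters(), *enc_mu.parameters(),
          *enc_logvar.parameters(), *decoder.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

def training_step(x):
    h = encoder(x)
    mu, logvar = enc_mu(h), enc_logvar(h)
    # Reparameterize z = mu + sigma * eps so gradients flow to phi.
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    x_hat = decoder(z)
    # Reconstruction term: Gaussian log-likelihood with unit noise variance
    # (constants dropped); regularization term: closed-form KL(q || N(0, I)).
    recon = -0.5 * ((x - x_hat) ** 2).sum(dim=1)
    kl = 0.5 * (torch.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(dim=1)
    loss = -(recon - kl).mean()       # minimize negative ELBO = maximize ELBO
    opt.zero_grad()
    loss.backward()
    opt.step()
    return -loss.item()               # the (estimated) ELBO for this batch

elbo_value = training_step(torch.randn(32, x_dim))   # e.g. a batch of 32 inputs
```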
8. Gaussian Example
Assume all distributions are Gaussian, with a standard normal prior (as in the variational autoencoder of Kingma & Welling, 2014):
\[ p_\theta(z) = \mathcal{N}(z; 0, I), \qquad p_\theta(x \mid z) = \mathcal{N}\big(x; f_\theta(z), \sigma^2 I\big), \qquad q_\phi(z \mid x) = \mathcal{N}\big(z; \mu_\phi(x), \Sigma_\phi(x)\big). \]
Then, up to additive constants, the ELBO becomes:
\[ \mathcal{L}(\theta, \phi; x) = -\frac{1}{2\sigma^2}\, \mathbb{E}_{q_\phi(z \mid x)}\big[\lVert x - f_\theta(z) \rVert^2\big] \;-\; \mathrm{KL}\big(\mathcal{N}(\mu_\phi(x), \Sigma_\phi(x)) \,\Vert\, \mathcal{N}(0, I)\big), \]
where:
\( f_\theta(z) \): decoder network mapping latent \(z\) to reconstructed \(x\)
\( \mu_\phi(x), \Sigma_\phi(x) \): encoder outputs (mean and variance)
\( \sigma^2 \): decoder noise variance (a numerical sketch of this objective follows this list)
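Below is a minimal numerical sketch of this objective, assuming a diagonal \( \Sigma_\phi(x) \) and a placeholder decoder \( f_\theta \); the reconstruction term is estimated by Monte Carlo and the KL term uses its closed form against the standard normal prior.

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_elbo(x, mu_phi, var_phi, f_theta, sigma2, n_samples=10_000):
    """Gaussian ELBO of this section, up to additive constants.

    mu_phi, var_phi : mean and diagonal covariance of q_phi(z | x)
    f_theta         : decoder mapping z to a reconstruction of x
    sigma2          : decoder noise variance
    """
    # Reconstruction: Monte Carlo estimate of -E_q[||x - f_theta(z)||^2] / (2 sigma^2).
    z = mu_phi + np.sqrt(var_phi) * rng.standard_normal((n_samples, mu_phi.size))
    recon = -np.mean(np.sum((x - f_theta(z)) ** 2, axis=1)) / (2.0 * sigma2)
    # Regularization: closed-form KL( N(mu_phi, diag(var_phi)) || N(0, I) ).
    kl = 0.5 * np.sum(var_phi + mu_phi ** 2 - 1.0 - np.log(var_phi))
    return recon - kl

# Placeholder decoder and numbers, for illustration only.
f_theta = lambda z: np.tanh(z @ np.full((2, 3), 0.5))
x = np.array([0.2, -0.1, 0.4])
print(gaussian_elbo(x, mu_phi=np.zeros(2), var_phi=0.5 * np.ones(2),
                    f_theta=f_theta, sigma2=0.1))
```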
9. Summary
1. \(p_\theta(x, z) = p_\theta(x \mid z)p_\theta(z)\): defines the generative model.
2. \(q_\phi(z \mid x)\): approximates the true posterior.
3. The ELBO provides a computable lower bound on the data likelihood.
4. Training maximizes the ELBO to learn both encoder (\(\phi\)) and decoder (\(\theta\)) parameters.
10. References
Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.
Nakajima, S., Watanabe, K., & Sugiyama, M. (2019). Variational Bayesian Learning Theory. Cambridge University Press.
