Annealed importance sampling

Posted: 2017-07-20 , Modified: 2017-07-20

Tags: sampling, annealing, temperature

Papers

log p

We can get an unbiased estimator for \(p(x)\), say \(\wh p\), but we often want \(\ln p(x)\). By Jensen’s inequality, \(\ln \wh p\) underestimates \(\ln p\) in expectation; by Markov’s inequality (applied to \(\wh p\)), it rarely overshoots by much. So \(\ln \wh p\) is a probabilistic lower bound: \[ \E[\ln \wh p] \le \ln p, \qquad \Pj(\ln \wh p> \ln p + b) <e^{-b}. \] (This holds no matter what the variance is. However, the bound can be very loose. There is no good way of estimating \(\E \ln X\) from draws of \(X\) (why not?). Oddly, there is a good way of estimating \(\E e^X\) from \(X\), by power series expansion. (The power series for \(\ln\) is terrible over long distances.))
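As a sanity check, here is a small simulation of both inequalities; the lognormal estimator is an arbitrary strictly positive unbiased estimator of \(p\), chosen only for illustration:

```python
import math
import random

random.seed(0)

p = 0.5          # true value; we want ln p
sigma = 1.0      # noise scale of the estimator

# A strictly positive unbiased estimator of p:
# p_hat = p * exp(sigma*Z - sigma^2/2) with Z ~ N(0,1) has E[p_hat] = p.
def p_hat():
    z = random.gauss(0.0, 1.0)
    return p * math.exp(sigma * z - sigma ** 2 / 2)

draws = [p_hat() for _ in range(100_000)]
mean_p_hat = sum(draws) / len(draws)                      # ~ p (unbiased)
mean_ln = sum(math.log(x) for x in draws) / len(draws)    # <= ln p (Jensen)

print(mean_p_hat, p)
print(mean_ln, math.log(p))   # here the gap is sigma^2/2

# Markov-type tail bound: P(ln p_hat > ln p + b) < e^{-b}
b = 1.0
frac = sum(math.log(x) > math.log(p) + b for x in draws) / len(draws)
print(frac, math.exp(-b))     # empirical overshoot rate vs. the bound
```

Averaging the draws recovers \(p\), but averaging their logs sits strictly below \(\ln p\), and overshoots of size \(b\) occur with frequency below \(e^{-b}\).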

Note that log-likelihood estimates built this way are prone to overestimation with little indication anything is wrong: when the estimated quantity is a partition function \(Z\) in the denominator, a stochastic lower bound on \(\ln Z\) becomes a stochastic upper bound on the log-likelihood.

ELBO

Goal: posterior distribution \[ p(z|x,\al) = \fc{p(z,x|\al)}{\int_z p(z,x|\al)}. \] Pick a family of distributions with variational parameters, \(q(z_{1:m}|\nu)\). Use \(q\) with fitted parameters as a proxy for the posterior.

So want to minimize \(KL(q||p)\).

Why \(q||p\), not \(p||q\)?

\[\begin{align} KL(q||p) &=\EE_q\ba{\ln \fc{q(z)}{p(z|x)}}\\ \ln p(x) &=\ln \EE_{z\sim q}\ba{\fc{p(x,z)}{q(z)}}\\ & \ge \EE_q [\ln p(x,z)] - \EE_q [\ln q(z)] =: ELBO\\ KL(q||p) &= \ln p(x) - ELBO \end{align}\]

\(\ln p\) doesn’t depend on \(q\), so maximizing the ELBO over \(q\) is equivalent to minimizing \(KL(q||p)\).
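A toy check on a conjugate Gaussian model, \(z\sim N(0,1)\), \(x|z\sim N(z,1)\) (chosen because \(\ln p(x)\) is available in closed form): the Monte Carlo ELBO sits below \(\ln p(x)\) for a loose \(q\) and matches it exactly when \(q\) is the true posterior \(N(x/2, 1/2)\).

```python
import math
import random

random.seed(0)

x = 1.3  # observed data point

# Model: z ~ N(0,1), x|z ~ N(z,1)  =>  p(x) = N(x; 0, 2), p(z|x) = N(x/2, 1/2)
def ln_norm(v, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)

ln_px = ln_norm(x, 0.0, 2.0)  # exact log evidence

def elbo(m, s2, n=200_000):
    # Monte Carlo estimate of E_q[ln p(x,z)] - E_q[ln q(z)] with q = N(m, s2)
    total = 0.0
    for _ in range(n):
        z = random.gauss(m, math.sqrt(s2))
        ln_joint = ln_norm(z, 0.0, 1.0) + ln_norm(x, z, 1.0)  # ln p(x,z)
        total += ln_joint - ln_norm(z, m, s2)
    return total / n

elbo_loose = elbo(0.0, 1.0)   # q = prior: strictly below ln p(x)
elbo_opt = elbo(x / 2, 0.5)   # q = exact posterior: equals ln p(x)
print(elbo_loose, elbo_opt, ln_px)
```

With \(q\) equal to the posterior, the integrand \(\ln p(x,z)-\ln q(z)=\ln p(x)\) is constant in \(z\), so the estimate has zero variance; the gap for the loose \(q\) is exactly \(KL(q||p)\).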

AIS

Given annealed distributions \(p_i\propto f_i\), \(0\le i\le K\), with \(p_K=p\), \(Z_0\) known, and Markov chains \(M_i\) (with transition kernels \(T_i\) leaving \(p_i\) invariant), to estimate \(Z=Z_K\): draw \(x\sim p_0\), then for \(i=1,\ldots,K\) multiply the weight by \(\fc{f_i(x)}{f_{i-1}(x)}\) and update \(x\) with \(T_i\). The resulting weight \(w\) satisfies \(\E[w] = \fc{Z_K}{Z_0}\), so averaging \(w\) over independent runs gives an unbiased estimate of \(\fc{Z_K}{Z_0}\).

For probabilistic neural nets, use Gibbs sampler (alternately sample \(h\) and \(x\)) as transition.

Think of this as proposing the distribution given by applying the \(T_i\) in sequence.
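A minimal sketch of the procedure on a 1-D Gaussian example; the geometric annealing path and random-walk Metropolis transitions are illustrative choices, not part of the definition:

```python
import math
import random

random.seed(0)

# Anneal from f_0(x) = exp(-x^2/8), Z_0 = sqrt(8*pi),
# to         f_K(x) = exp(-x^2/2), Z_K = sqrt(2*pi),
# along the geometric path f_beta = f_0^(1-beta) * f_K^beta.
def ln_f(x, beta):
    return -(1 - beta) * x * x / 8 - beta * x * x / 2

K = 50
betas = [k / K for k in range(K + 1)]

def ais_weight():
    x = random.gauss(0.0, 2.0)   # exact sample from p_0
    ln_w = 0.0
    for i in range(1, K + 1):
        ln_w += ln_f(x, betas[i]) - ln_f(x, betas[i - 1])
        # One random-walk Metropolis step leaving p_{beta_i} invariant.
        prop = x + random.gauss(0.0, 1.0)
        if ln_f(prop, betas[i]) - ln_f(x, betas[i]) > math.log(random.random()):
            x = prop
    return math.exp(ln_w)

n = 5000
z_ratio = sum(ais_weight() for _ in range(n)) / n   # estimates Z_K / Z_0 = 1/2
z0 = math.sqrt(8 * math.pi)
print(z_ratio * z0, math.sqrt(2 * math.pi))         # estimate of Z_K vs. truth
```

More intermediate distributions (larger \(K\)) and better-mixing transitions both reduce the variance of the weights.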

Giving a stochastic lower bound for \(Z\) means we overestimate log-likelihood.

RAISE

Go the other way, using samples from the target distribution. This gives a probabilistic lower bound on \(\fc{Z_0}{Z}\), hence a probabilistic upper bound on \(Z\), so we underestimate the log-likelihood.
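A toy sketch of the reverse direction on the same kind of 1-D Gaussian path. Here the target can be sampled exactly, so this is only the reverse-annealing mechanism, not the full RAISE algorithm (which works per data case on latent-variable models):

```python
import math
import random

random.seed(1)

# Start from an exact sample of the target p_K = N(0,1) (f_K(x) = exp(-x^2/2))
# and anneal back to p_0 = N(0,4) (f_0(x) = exp(-x^2/8)); estimates Z_0 / Z_K.
def ln_f(x, beta):
    return -(1 - beta) * x * x / 8 - beta * x * x / 2

K = 50
betas = [k / K for k in range(K + 1)]

def reverse_weight():
    x = random.gauss(0.0, 1.0)   # exact draw from the target p_K
    ln_w = 0.0
    for i in range(K, 0, -1):
        # Reversible Metropolis step leaving p_{beta_i} invariant.
        prop = x + random.gauss(0.0, 1.0)
        if ln_f(prop, betas[i]) - ln_f(x, betas[i]) > math.log(random.random()):
            x = prop
        ln_w += ln_f(x, betas[i - 1]) - ln_f(x, betas[i])
    return math.exp(ln_w)

n = 5000
ratio = sum(reverse_weight() for _ in range(n)) / n   # estimates Z_0 / Z_K = 2
print(ratio, math.sqrt(8 * math.pi) / math.sqrt(2 * math.pi))
```

Since \(\E[w]=\fc{Z_0}{Z_K}\), \(\ln w\) is a stochastic lower bound on \(\ln\fc{Z_0}{Z_K}\), i.e. a stochastic upper bound on \(\ln Z_K\).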

\(p(h|v)\) needs to be tractable (\(p(z|x)\) in the notation above).

For intractable \(p(h|v)\), combine the AIS (fixing \(v\)) and RAISE steps to get a single estimate. See Algorithm 3 for details.

Notes