Annealed importance sampling
Posted: 2017-07-20 , Modified: 2017-07-20
Tags: sampling, annealing, temperature
We can get an unbiased estimator for \(p(x)\), say \(\wh p\). But we often want \(\ln p(x)\). By Jensen’s inequality and Markov’s inequality, \(\ln \wh p\) is a probabilistic lower bound: \[ \E[\ln \wh p] \le \ln \E[\wh p] = \ln p, \qquad \Pj(\ln \wh p > \ln p + b) = \Pj(\wh p > e^b p) \le e^{-b}. \] (This holds no matter what the variance is; however, the bound can be very loose. There is no good way of estimating \(\E \ln X\) from draws of \(X\) (why not?). Oddly, there is a good way of estimating \(\E e^X\) from \(X\), by power series expansion. (The power series for \(\ln\) is terrible over long distances.))
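As a quick numerical check (a sketch; the log-normal estimator below is just a stand-in for any unbiased \(\wh p\), and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 2.0        # true value to be estimated
sigma = 1.5    # noise scale of the estimator

# An unbiased estimator of p: E[p * exp(sigma*Z - sigma^2/2)] = p for Z ~ N(0,1).
z = rng.standard_normal(100_000)
p_hat = p * np.exp(sigma * z - sigma**2 / 2)

print(np.mean(p_hat), p)                   # unbiased: ~ p
print(np.mean(np.log(p_hat)), np.log(p))   # Jensen: E[ln p_hat] = ln p - sigma^2/2 < ln p

b = 1.0        # Markov: P(ln p_hat > ln p + b) <= e^{-b}
print(np.mean(np.log(p_hat) > np.log(p) + b), np.exp(-b))
```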
Note that when the log-likelihood is obtained by subtracting \(\ln \wh Z\) (as for the models below), a stochastic lower bound on \(\ln Z\) makes the log-likelihood prone to overestimation, with little indication that anything is wrong.
Goal: the posterior distribution \[ p(z|x,\al) = \fc{p(z,x|\al)}{\int_z p(z,x|\al)}. \] Pick a family of distributions \(q(z_{1:m}|\nu)\) with variational parameters \(\nu\). Use \(q\) with fitted parameters as a proxy for the posterior.
So we want to minimize \(KL(q||p)\).
Why \(q||p\), not \(p||q\)? Expectations under \(q\) are tractable, while expectations under the true posterior are not. Moreover, \(\ln p(x)\) doesn’t depend on \(q\), so minimizing \(KL(q||p)\) is equivalent to maximizing a tractable objective (the ELBO; see the identity below).
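The identity behind this (the standard ELBO decomposition, stated here for completeness): \[ \ln p(x|\al) = \E_q[\ln p(z,x|\al) - \ln q(z|\nu)] + KL(q(z|\nu)\,||\,p(z|x,\al)). \] The left side is constant in \(q\), so minimizing the KL term is the same as maximizing the expectation, which only involves the tractable joint \(p(z,x|\al)\).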
Given annealed distributions \(p_i\propto f_i\) for \(i=0,\dots,K\), with \(p_K=p\), and Markov chains \(M_i\) (with transition kernels \(T_i\) leaving \(p_i\) invariant), to estimate \(Z=Z_K\): sample \(x_1\sim p_0\), sample \(x_{i+1}\sim T_i(x_i,\cdot)\) for \(i=1,\dots,K-1\), and form the importance weight \[ w = \prod_{i=1}^K \fc{f_i(x_i)}{f_{i-1}(x_i)}, \] which satisfies \(\E[w]=\fc{Z_K}{Z_0}\).
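A minimal sketch of this procedure (the specifics are assumptions, not from the notes: a 1D toy problem annealing from \(p_0 = N(0,1)\) to an unnormalized Gaussian target, a geometric path \(f_\beta = f_0^{1-\beta} f_K^{\beta}\), and Metropolis moves as the \(T_i\)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized endpoints: f0 = exp(-x^2/2), so Z0 = sqrt(2*pi);
# fK = exp(-(x-3)^2/8), so ZK = sqrt(2*pi*4) and ZK/Z0 = 2.
log_f0 = lambda x: -x**2 / 2
log_fK = lambda x: -(x - 3)**2 / 8

K = 100
betas = np.linspace(0, 1, K + 1)
log_f = lambda x, b: (1 - b) * log_f0(x) + b * log_fK(x)  # geometric path

def metropolis(x, b, n=5, step=1.0):
    # Transition kernel T_i leaving p_i propto f_i invariant.
    for _ in range(n):
        y = x + step * rng.standard_normal(x.shape)
        accept = np.log(rng.random(x.shape)) < log_f(y, b) - log_f(x, b)
        x = np.where(accept, y, x)
    return x

n_chains = 1000
x = rng.standard_normal(n_chains)   # x_1 ~ p_0
log_w = np.zeros(n_chains)
for i in range(1, K + 1):
    log_w += log_f(x, betas[i]) - log_f(x, betas[i - 1])  # weight update at x_i
    x = metropolis(x, betas[i])                           # move under T_i

print(np.mean(np.exp(log_w)))   # unbiased for ZK/Z0; true value here is 2.0
```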
For probabilistic neural nets (e.g. RBMs), use the Gibbs sampler (alternately sampling \(h\) and \(x\)) as the transition kernel; a sketch follows.
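A sketch of one such transition, assuming a binary RBM tempered by simply scaling its parameters by \(\beta\) (one simple choice; AIS for RBMs often instead anneals from a base-rate RBM):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_transition(v, W, b, c, beta, rng):
    """One block-Gibbs sweep for an RBM with energy scaled by beta:
    sample h | v, then v | h. Shapes: v (nv,), W (nv, nh), b (nv,), c (nh,)."""
    h = (rng.random(c.shape) < sigmoid(beta * (v @ W + c))).astype(float)
    v = (rng.random(b.shape) < sigmoid(beta * (h @ W.T + b))).astype(float)
    return v
```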
Think of this as importance sampling over whole trajectories: the proposal is the distribution given by applying the \(T_i\) in sequence, and the target is the corresponding reversed chain started from \(p\).
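Concretely (Neal’s extended-space view, sketched): defining reversal kernels \(\tilde T_i(x,x') = T_i(x',x)\fc{f_i(x')}{f_i(x)}\), the importance ratio between the reversed chain started from \(f_K\) and the forward proposal telescopes down to the AIS weight on each trajectory \((x_1,\dots,x_K)\): \[ \fc{f_K(x_K)\prod_{i=1}^{K-1}\tilde T_i(x_{i+1},x_i)}{f_0(x_1)\prod_{i=1}^{K-1}T_i(x_i,x_{i+1})} = \prod_{i=1}^{K}\fc{f_i(x_i)}{f_{i-1}(x_i)} = w. \]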
Since \(\ln \wh Z\) is a stochastic lower bound for \(\ln Z\), and the log-likelihood is computed by subtracting \(\ln Z\), AIS tends to overestimate the log-likelihood.
Go the other way (reverse AIS, RAISE), using samples from the target distribution: anneal from \(p\) back to \(p_0\). This gives an unbiased estimate of \(\fc{Z_0}{Z}\), hence a probabilistic lower bound on \(\ln \fc{Z_0}{Z}\), i.e., a probabilistic upper bound on \(\ln Z\); so we underestimate the log-likelihood.
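Continuing the toy sketch above (reusing `log_f`, `metropolis`, `betas`, `K`, `n_chains`, and `rng` from the forward example; exact target samples are available there since \(p_K = N(3,4)\)):

```python
# Reverse direction: start from exact samples of p_K and anneal back to p_0.
x = 3 + 2 * rng.standard_normal(n_chains)   # x ~ p_K = N(3, 4)
log_w = np.zeros(n_chains)
for i in range(K, 0, -1):
    log_w += log_f(x, betas[i - 1]) - log_f(x, betas[i])  # weight update
    x = metropolis(x, betas[i - 1])                       # move under T_{i-1}

print(np.mean(np.exp(log_w)))   # unbiased for Z0/ZK; true value here is 0.5
```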
For this, \(p(h|v)\) needs to be tractable (\(p(z|x)\) in the notation above).
For intractable \(p(h|v)\), combine the AIS (fixing \(v\)) and RAISE steps to get a single estimate. See Algorithm 3 for details.