Introduction
Two approaches for calculating partition functions
- Markov chains to sample
- (Jerrum04, JerrumSinclair1993) Certain Markov chains mix rapidly; used to approximate permanent with nonnegative entries and partition function for ferromagnetic Ising.
- Variational methods: partition function is the solution of a (intractable) optimization problem over the polytope of valid distributions.
- Popular, faster, easier to parallelize
- Little theoretical understanding
- Another algorithm: Belief propagation (solving non-convex relaxation)
- Regime of correlation decay and locally tree-like graphs.
Use Sherali-Adams and Lasserre convex programming hierarchies to get the first provable, convex variational methods. They work because local correlations propagate to global correlations, the opposite of correlation decay.
(See hierarchies.)
Definitions and setting
Ising model \(p(x)\propto \exp\pa{\sum_{i,j} J_{i,j}x_ix_j}\), \(x\in \{-1,1\}^n\) is \(\De\)-dense if letting \(J_T=\sum_{i,j}|J_{i,j}|\), \[
\forall i,j\in [n], \De |J_{i,j}|\le \fc{J_T}{n^2},
\] (Ex. if the \(J_{i,j}\) is 1 for an edge and 0 for a non-edge, and the graph has density \(cn^2\), then \(\De = \rc{c}\).)
\(p\) is regular if \(\sum_j |J_{i,j}|=J'\). The adjacency matrix is \(A_{i,j} = \fc{|J_{i,j}|}{J'}\). Let \(\rank_\tau(A)\) be the number of eigenvalues with \(\ad\ge\tau\).
Why dense Ising models?
- There are PTAS for CSPs for dense constraint graphs.
- Mean-field ferromagnetic Ising model: each spin interacts with every other spin.
Result
- Algorithm based on SA hierarchy achieving additive approximation of \(\ep J_T\) to \(\ln Z\) in time \(n^{O\prc{\De \ep^2}}\).
- Algorithm based on Lasserre hierarchy achieving additive approximation of \(\ep n J'\) to \(\ln Z\) in time \(n^{\rank_{\Om(\ep^2)}(A)/\Om(\ep^2)}\).
Method
Dense graphs, SA
- (Variational characterization of \(\ln Z\)) \(\ln Z = \max_{\mu\in M} \sum_{i\sim j} J_{i,j} \E_\mu [x_ix_j] + H(\mu)\) where \(M\) is the polytope of distributions over \(\{-1,1\}^n\). (Maximum achieved at \(\mu=p\).)
- Proof. The KL divergence to \(p\) is \(\ge 0\).
- Relax to \(M'\supeq M\).
- Need to design surrogates for \(H(\mu)\).
- Popular choice: Bethe entropy. (But it is not an upper bound in general.)
- Instead, design \(\wt H(\mu) \ge H(\mu)\) whenever \(\mu\in M\), then round to distributions.
- Define the augmented mean-field pseudo-entropy functional \[
H_{aMF,k}(\mu) = \min_{|S|\le k}H(\mu_S) + \sum_{i\nin S} H(\mu_i |\mu_S).
\]
- By the chain rule and “conditioning reduces entropy”, \(H(\mu)\le H_{aMF,k}(\mu)\).
- \(H(\mu_i|\mu_S)\) is concave, so \(H_{aMF,k}(\mu)\) is concave. (Proof omitted.)
- The relaxation is \[
\max_{\mu\in SA(O(1/(\De \ep^2)))} \bc{
\sum_{i,j} J_{i,j} \E_\mu[x_ix_j] + H_{aMF,k}(\mu)
}
\]
- Correlation rounding: pick a seed set, condition on it, and round other variables independently. (YoshidaZhou2014) There is a set of size \(k=O\prc{\De \ep^2}\) such that \[
\ab{\sum_{i,j} J_{i,j} \E_\mu[x_ix_j|x_S] - \sum_{i,j} J_{i,j} \E_\mu [x_i|x_S] \E_\mu[x_j|x_S]}\le \fc{100}{\De k}J_T.
\]
- Letting \(\wt\mu(x) = \mu(x_S) \prod_{i\nin S} \mu(x_i|x_S)\), \(\E_{\wt\mu} [x_ix_j|x_S] = \E_\mu[x_i|x_S]\mu[x_j|x_S]\).Combine YZ14 with \(H\le H_{aMF,k}\).
Low threshold rank, Lasserre
- The correlation rounding bound changes to: \(\exists |S|\le \rank_{\Om(\ep^2)}(A)\), \(...\le \ep J_T\).
Interpretation
- High temperature \(|H| = O\prc{d}\)
- Dobrushin’s uniqueness criterion: correlation decay
- MC methods work and give \((1+\ep)\) factor approximation, stronger.
- Transition threshold \(|J|=\Te\prc{d}\).
- MC provide no nontrivial guarantee.
- We get \(\ep n\) additive factor approximation.
- Low temperature \(|J|=\om\prc{d}\). :(