(WJ08) Graphical Models, Exponential Families, and Variational Inference

Posted: 2016-09-30 , Modified: 2016-09-30

Tags: probabilistic models, graphical models, exponential families, variational inference

1 Introduction

Inference: computing marginal probabilities.

2 Graphical models

Probability distributions that factorize according to an underlying graph.

2.1

2.2 Conditional independence

2.3 Statistical inference and exact algorithms

Computational inference problems

For undirected graphs without cycles, or directed graphs where each node has at most one parent, these problems can be solved by recursive message-passing algorithms (cf. DP).

2.4 Applications

2.5 Exact inference algorithms

The sum-product and junction-tree algorithms are dynamic-programming (DP) algorithms based on a method for sharing intermediate terms.

Transform a directed graphical model into an undirected one by moralization: link the parents of each child, then drop edge directions.

Tree-structured graph: For \(u\in N(s)\) let \(T_u=(V_u,E_u)\) be the component containing \(u\) after deleting \(s\). \[\begin{align} p(x) &= \rc Z \prod_{s\in V} \psi_s(x_s) \prod_{(s,t)\in E}\psi_{st}(x_s,x_t)\\ \mu_s(x_s) & = \ka \psi_s(x_s) \prod_{t\in N(s)} M_{ts}^* (x_s)\\ M_{ts}^*(x_s) &= \sum_{x_{V_t}'} \psi_{st}(x_s,x_t') p(x_{V_t}'; T_t) \end{align}\]

Compute marginals for all nodes simultaneously in parallel. \(M_{tu}(x_u)\) is passed from \(t\) to \(u\in N(t)\). \[ M_{ts}(x_s) \leftarrow \ka \sum_{x_t'} \ba{ \psi_{st}(x_s,x_t') \psi_t(x_t') \prod_{u\in N(t)\bs \{s\}} M_{ut}(x_t') }. \] (To see this holds for \(M^*\) up to constant, expand out the \(p(x_{V_t}';T_t)\).) For tree-structured graphs, converges to unique fixed point after finite iterations, up to normalization equal to \(M_{ts}^*(x_s)\). (Why? TODO: derive this.)
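A minimal numpy sketch of these parallel updates (the tree, potentials, and variable names are all illustrative), with the fixed point checked against brute-force marginals; replacing the sum by a max gives max-product.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K = 2                                  # states per node
n = 4
edges = [(0, 1), (1, 2), (1, 3)]       # a small tree (illustrative)
psi_node = [rng.uniform(0.5, 1.5, K) for _ in range(n)]
psi_edge = {e: rng.uniform(0.5, 1.5, (K, K)) for e in edges}   # indexed [x_s, x_t] for e = (s, t)

nbrs = {s: set() for s in range(n)}
for s, t in edges:
    nbrs[s].add(t); nbrs[t].add(s)

def pot(s, t):                         # psi_{st} as an array indexed [x_s, x_t]
    return psi_edge[(s, t)] if (s, t) in psi_edge else psi_edge[(t, s)].T

# M[(t, s)] is the message M_{ts}(x_s); initialize to uniform, update in parallel
M = {(t, s): np.ones(K) / K for s in range(n) for t in nbrs[s]}
for _ in range(n):                     # at least diameter-many sweeps suffice on a tree
    M_new = {}
    for (t, s) in M:
        rest = nbrs[t] - {s}
        incoming = np.prod([M[(u, t)] for u in rest], axis=0) if rest else np.ones(K)
        m = pot(s, t) @ (psi_node[t] * incoming)   # sum over x_t (a max here gives max-product)
        M_new[(t, s)] = m / m.sum()
    M = M_new

def marginal(s):
    b = psi_node[s] * np.prod([M[(t, s)] for t in nbrs[s]], axis=0)
    return b / b.sum()

# brute-force check against the normalized product of potentials
p = np.zeros((K,) * n)
for x in itertools.product(range(K), repeat=n):
    v = np.prod([psi_node[s][x[s]] for s in range(n)])
    v *= np.prod([psi_edge[e][x[e[0]], x[e[1]]] for e in edges])
    p[x] = v
p /= p.sum()
for s in range(n):
    exact = p.sum(axis=tuple(i for i in range(n) if i != s))
    assert np.allclose(marginal(s), exact)
```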

Max-product: replace \(\sum\) with max. Backtracking computes the argmax.

(Updates apply to arbitrary commutative semirings.)

A clique tree is an acyclic graph whose nodes are maximal cliques of \(G\). Running intersection property: for any 2 clique nodes \(C_1\), \(C_2\), all nodes on the unique path joining them contain \(C_1\cap C_2\). Then it’s a junction tree.

\(G\) has a junction tree iff it is triangulated.

  1. Triangulate \(G\). (HOW)
  2. Form a junction tree for \(G\). (HOW? A naive way: take 2 vertices, sort into those in between, recurse.)
  3. Run tree inference on junction tree.

The separators are the intersection of cliques adjacent in the junction tree.
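A rough, self-contained sketch of the "HOW" steps above: triangulate by variable elimination with a min-fill heuristic (not optimal in general), keep the maximal elimination cliques, and connect them by a maximum-weight spanning tree on intersection sizes, which gives a junction tree when the cliques come from a triangulation. All names and the example graph are illustrative.

```python
import itertools

def triangulate_cliques(nodes, edges):
    """Eliminate nodes with a min-fill heuristic; return the maximal elimination cliques."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    cliques, remaining = [], set(nodes)
    while remaining:
        def fill(v):                          # number of edges eliminating v would add
            nb = adj[v] & remaining
            return sum(1 for a, b in itertools.combinations(nb, 2) if b not in adj[a])
        v = min(remaining, key=fill)
        nb = adj[v] & remaining
        cliques.append(frozenset(nb | {v}))
        for a, b in itertools.combinations(nb, 2):    # fill-in edges
            adj[a].add(b); adj[b].add(a)
        remaining.remove(v)
    maximal = []
    for c in cliques:                         # keep only maximal cliques
        if not any(c < d for d in cliques) and c not in maximal:
            maximal.append(c)
    return maximal

def junction_tree(cliques):
    """Maximum-weight spanning tree on intersection sizes (Kruskal with union-find)."""
    parent = {c: c for c in cliques}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]; c = parent[c]
        return c
    tree = []
    for a, b in sorted(itertools.combinations(cliques, 2), key=lambda p: -len(p[0] & p[1])):
        if find(a) != find(b):
            parent[find(a)] = find(b)
            tree.append((a, b, a & b))        # the separator is the intersection
    return tree

# example: a 4-cycle; one chord is added, width = max clique size - 1 = 2
cliques = triangulate_cliques([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)])
for a, b, sep in junction_tree(cliques):
    print(sorted(a), sorted(b), "separator:", sorted(sep))
```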

Passing a message from \(B\) to \(C\) (through \(S\)): \[\begin{align} \wt \phi_S(x_S) &\leftarrow \sum_{x_{B\bs S}} \phi_B(x_B)\\ \phi_C(x_C) &\leftarrow \fc{\wt\phi_S(x_S)}{\phi_S(x_S)} \phi_C(x_C). \end{align}\]

After a round of message passing, clique potentials are proportional to marginal probabilities.

For a separator set \(S\), let \(d(S)\) be the number of maximal cliques to which it is adjacent. \[p(x) = \fc{\prod_{C\in \mathcal C}\mu_C(x_C)}{\prod_{S\in \mathcal S} [\mu_S(x_S)]^{d(S)-1}}.\] Ex. for \(x_1\to x_2\to x_3\), \(p(x) = \fc{p(x_1,x_2)p(x_2,x_3)}{p(x_2)}\).
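A quick numerical check of the chain example, with the joint built from an arbitrary Markov chain (the distributions are random and illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
p1 = rng.dirichlet(np.ones(K))                 # p(x1)
p21 = rng.dirichlet(np.ones(K), size=K)        # p21[x1, x2] = p(x2 | x1)
p32 = rng.dirichlet(np.ones(K), size=K)        # p32[x2, x3] = p(x3 | x2)
p = p1[:, None, None] * p21[:, :, None] * p32[None, :, :]   # joint p(x1, x2, x3)

mu_12 = p.sum(axis=2)                          # clique marginal for C = {1, 2}
mu_23 = p.sum(axis=0)                          # clique marginal for C = {2, 3}
mu_2 = p.sum(axis=(0, 2))                      # separator marginal, d(S) - 1 = 1

recon = mu_12[:, :, None] * mu_23[None, :, :] / mu_2[None, :, None]
assert np.allclose(recon, p)                   # p = p(x1,x2) p(x2,x3) / p(x2)
```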

Proposition 2.1. A candidate set of marginal distributions \(\tau_C\), \(C\in \mathcal C\cup \mathcal S\), is globally consistent iff \[\begin{align} \forall S, \sum_{x_S'} \tau_S(x_S') &= 1\\ \forall S\subeq C, \sum_{x_C'|x_S'=x_S} \tau_{C}(x_C') &=\tau_S(x_S). \end{align}\]

Computational cost grows exponentially in the size of the maximal clique in the junction tree. The treewidth is the minimum, over all possible triangulations, of the maximal clique size minus 1.

Alternate definition (why equivalent?): the graph is covered by subgraphs of \(k+1\) nodes arranged in a treelike fashion.

2.6 Message-passing algorithms for approximate inference

3 Exponential families

3.1 Exponential representations via maximum entropy

Let \(\{\phi_\al\}\) be a family of functions (called potential functions or sufficient statistics). Say \(p\) is consistent with the data if \[ \EE_p [\phi_\al (X)] = \wh \mu_\al.\] Ex. \(\phi=\{x,x^2\}\). The maximum entropy distribution is \[\begin{align} p^*:&= \amax_{\EE_p [\phi_\al (X)] = \wh \mu_\al} H(p)\\ H(p):&= -\int_X p(x) \ln p(x)\,\nu(dx). \end{align}\]

Under certain conditions, the optimal solution is an exponential family.
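To make this concrete, a small sketch solving the max-entropy problem through its Lagrangian dual, minimizing \(A(\te)-\an{\te,\wh\mu}\) over \(\te\) (anticipating the conjugate duality of Section 3.6); the die-valued variable, the statistics \(\phi=(x,x^2)\), and the target moments are all made up.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

xs = np.arange(1, 7)                       # X = {1,...,6} with counting measure
phi = np.stack([xs, xs ** 2], axis=1)      # sufficient statistics (x, x^2)

q = np.array([.05, .10, .15, .20, .25, .25])    # any distribution supplying the moments
mu_hat = q @ phi                           # target moments E_q[phi(X)]

def A(theta):                              # log partition function
    return logsumexp(phi @ theta)

# dual of the max-entropy problem: minimize A(theta) - <theta, mu_hat>
theta = minimize(lambda th: A(th) - th @ mu_hat, x0=np.zeros(2)).x
p = np.exp(phi @ theta - A(theta))         # exponential-family solution

H = lambda r: -np.sum(r * np.log(r))
print(p @ phi, mu_hat)                     # moments match (up to optimizer tolerance)
print(H(p), ">=", H(q))                    # p has at least as much entropy as q
```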

3.2 Basics of exponential families

The exponential family associated with \(\phi\) is the parametrized collection of density functions \[\begin{align} p_\te(x_1,\ldots, x_m) &= \exp[\an{\te,\phi(x)} - A(\te)]\\ A(\te) &= \ln \int_{X^m} \exp[\an{\te, \phi(x)}]\nu(dx). \end{align}\]

(\(A\) is the log partition or cumulant function.) Here \(\te\in \Om:=\set{\te\in \R^d}{A(\te)<\iy}\). A regular exponential family has \(\Om\) open. Say \(\phi\) is minimal if \(\phi\cup \{1\}\) is linearly independent; otherwise it is overcomplete.

3.3 Examples of graphical models in exponential form

  1. Ising model: \(p_\te(x\in \{\pm 1\}^V) = \exp\ba{\sum_{s\in V} \te_s x_s + \sum_{(s,t)\in E} \te_{st}x_sx_t-A(\te)}\).
  2. Potts model: Nodes take values in \(\{0,\ldots, r-1\}\). Sufficient statistics are \(\one_{st;jk}(x_s,x_t) = \one_{x_s=j, x_t=k}\). (Note this is overcomplete—the standard overcomplete representation. Note the state space of nodes can also be different.)
    • Metric labeling problem: \(\te_{st;jk} = -\rh(j,k)\) for \(\rh\) a metric.
  3. Sufficient statistics \(x_s,x_s^2, x_sx_t, (s,t)\in E\). \(\Te = -(\E xx^T)^{-1}\) has \((s,t)\) entry 0 for \((s,t)\nin E\).
    • Multivariate Gaussian is \(\exp\ba{\an{\te,x} + \rc 2 \Tr(\Te xx^T) - A(\te, \Te)}\). (\(\Om\) consists of symmetric \(\Te\prec 0\).)
  4. Mixture models: Ex. mixture of Gaussians. Let \(\one_j[x_s], j\in [0,r-1]\cap \N\) be sufficient statistics with parameters \(\al_{s;0},\ldots, \al_{s;r-1}\).
    • Let the distributions have parameters \(\ga_{s;i}, \ga_{s;i}'\). The density is \[\propto \exp\ba{\sumz j{r-1} \al_{s;j} \one_j(x_s) + \sumz j{r-1} \one_j(x_s) [\ga_{s;j}y_s+\ga_{s;j}'y_s^2]}.\]
  5. Latent Dirichlet Allocation: \(\al \to U\to Z \to W, \ga \to W\).
    • Topic distribution: \(U\) follows Dirichlet distribution with parameter \(\al\).
    • Topics of document: \(Z\) multinomial drawn from \(U\).
    • Words: \(W\) multinomial drawn from the topic distribution of \(Z\). \(\ga\) specifies distribution of words in topics.
    • Equations \[\begin{align} \Pj(W=j|Z=i;\ga) &= \exp(\ga_{ij})\\ p_\ga(w|z) &= \exp\ba{\sumz i{r-1}\sumz j{k-1} \ga_{ij} \one_i(z)\one_j(w)}\\ p(z|u) &\propto \exp\ba{\sumz i{r-1} \one_i(z)\ln u_i}.\\ p_\al(u) &= \exp\pa{\sumz i{r-1} \al_i\ln u_i}. \end{align}\]
  6. Hard-core constraints: subset of configurations that are forbidden. Ex. decoding linear codes, combinatorial optimization problems.
    • To express these as an exponential family, take the density with respect to an appropriate base measure. Ex. restrict to valid codewords. \[\begin{align} p(y_i|x_i) &= \exp(\te_ix_i-\ln (1+\exp(\te_i)))\\ \te_i &=(2y_i-1) \ln \prc{1-\ep}{\ep} \end{align}\] for the binary symmetric channel (BSC) with crossover probability \(\ep\).
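A two-line check of the BSC expression above (\(\ep\) is an arbitrary crossover probability):

```python
import numpy as np

eps = 0.1                                  # BSC crossover probability (arbitrary)
for y in (0, 1):
    theta = (2 * y - 1) * np.log((1 - eps) / eps)
    for x in (0, 1):
        p = np.exp(theta * x - np.log(1 + np.exp(theta)))
        assert np.isclose(p, 1 - eps if x == y else eps)   # matches p(y | x) for the BSC
```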

Simple distributions

Family \((X,\nu)\) Sufficient statistics
Bernoulli \(\{0,1\}\) \(x\)
Gaussian \(\R\) \(x,x^2\)
Exponential \((0,\iy)\) \(x\)
Poisson \(\N\) \(x\)
Beta \((0,1)\) \(\ln x, \ln (1-x)\)

3.4 Mean parameterization and inference problems

Any exponential family has an alternative parameterization in terms of a vector of mean parameters.

Realizable mean parameters: \[\mathcal M :=\set{\mu\in \R^d}{\exists p \forall \al, \EE_p [\phi_\al(X)] = \mu_\al}.\]

Ex. Gaussian MRF mean parameters are \(\Si=\E[XX^T]\) and \(\mu=\E[X]\). Realizable parameters are those with \(\Si-\mu\mu^T\succsim 0\). \(\mathcal M\) is always convex. (Take the convex combination of the distributions.) When \(|X^m|\) is finite, \(\mathcal M\) is a convex polytope, \[ \mathcal M = \conv \set{\phi(x)}{x\in X^m}.\]

Ex. Ising mean parameters \[\begin{align} \mu_s &= \EE_p[X_s]=\Pj(X_s=1)\\ \forall (s,t)\in E, \mu_{s,t} &= \EE_p[X_sX_t] = \Pj[(X_s,X_t)=(1,1)] \end{align}\]

The set \(\mathcal M\) is the set of singleton and pairwise marginal probabilities that can be realized by some distribution over \(\{0,1\}^m\), called the correlation polytope (or cut polytope). For a single edge \(s-t\) this is \(\conv\{(0,0,0),(1,0,0),(0,1,0),(1,1,1)\}\).
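For the single-edge example, the listed vertices are exactly \(\set{\phi(x)}{x\in\{0,1\}^2}\) with \(\phi(x)=(x_s,x_t,x_sx_t)\); a brute-force sketch (the distribution is arbitrary):

```python
import itertools
import numpy as np

phi = lambda x: np.array([x[0], x[1], x[0] * x[1]])   # (x_s, x_t, x_s x_t)
configs = list(itertools.product([0, 1], repeat=2))
vertices = np.array([phi(x) for x in configs])
print(vertices)                  # rows (0,0,0), (0,1,0), (1,0,0), (1,1,1)

p = np.random.default_rng(2).dirichlet(np.ones(4))    # any distribution on {0,1}^2
mu = p @ vertices                # mean parameters: a convex combination of the vertices
print(mu)                        # (mu_s, mu_t, mu_st) lies in the correlation polytope
```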

Ex. For binary linear codes, \(\mathcal M\) is the codeword polytope, convex hull of codewords.

Ex. For the standard overcomplete representation, \(\mathcal M\) is the marginal polytope.

Q: Can the codeword polytope be converted to an MRF?

The facet complexity—number of constraints required to express—depends on the graph structure. For non-trees, this grows quickly.

3.5 Properties of \(A\)

Proposition: \(A(\te)=\ln \int_{X^m} \exp[\an{\te,\phi(x)}] \,\nu(dx)\) has derivatives of all orders and is convex on \(\Om\), and strictly convex if the representation is minimal. \[\begin{align} \fc{\partial A}{\partial \te_\al} &= \EE_{\te}[\phi_\al(X)]\\ \fc{\partial^2 A}{\partial \te_\al\,\partial \te_\be} &= \Cov(\phi_\al(X),\phi_\be(X)) \end{align}\]

Proof. DCT, differentiate through integral.

The range of \(\nb A:\Om \to \R^d\) is contained in \(\mathcal M\).
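A finite-difference check of the proposition for the Bernoulli family (\(A(\te)=\ln(1+e^\te)\), \(\phi(x)=x\); the step size is arbitrary):

```python
import numpy as np

A = lambda theta: np.log(1 + np.exp(theta))    # Bernoulli: phi(x) = x

theta, h = 0.7, 1e-5
p1 = np.exp(theta - A(theta))                  # P_theta(X = 1)
dA = (A(theta + h) - A(theta - h)) / (2 * h)
d2A = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h ** 2
assert np.isclose(dA, p1, atol=1e-6)           # first derivative = E_theta[phi(X)]
assert np.isclose(d2A, p1 * (1 - p1), atol=1e-4)   # second derivative = Var_theta(phi(X))
```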

Questions:

  1. When is \(\nb A\) one-to-one?
    • Iff exponential representation is minimal.
    • Proof. Minimality implies strict convexity, which gives \(\an{\nb A(\te^1) - \nb A(\te^2), \te^1-\te^2}>0\) for \(\te^1\ne \te^2\), so \(\nb A\) is injective.
    • If not minimal, there is 1-to-1 correspondence after modding out by nullspace.
  2. When does image of \(\Om\) fully cover \(\mathcal M\)?
    • When \(\Om\) is open, \(\nb A(\Om)=\mathcal M^{\circ}\).
    • Corollary: except for boundary points, all mean parameters realizable by some distribution can be realized by a member of the exponential family.
    • Proof.
      • Sufficient to prove for minimal \(\phi\). (Just reparametrize.)
      • \(\nb A(\Om)\subeq \mathcal M^{\circ}\): This follows from \(\nb A(\Om)\subeq \mathcal M\), \(\Om\) open, and \(A\) strictly convex.
      • \(\mathcal M^{\circ}\subeq \nb A(\Om)\): WLOG, \(0\in \mathcal M^{\circ}\) and we show \(0\in \nb A(\Om)\).
      • Let \(H_{\ga,\ep} = \set{x\in X^m}{\an{\ga,\phi(x)}\ge \ep}\). Since \(0\in \mathcal M^{\circ}\), for small enough \(\ep\) this has measure \(>0\). Now \[\begin{align} A(\te^0 + t\ga) &\ge \ln \int_{H_{\ga,\ep}} e^{\an{\te^0 + t\ga,\phi(x)}}\nu(dx)\\ & \ge t\ep + C_0, \end{align}\] so \(A(\te^0+t\ga)\to \iy\) as \(t\to \iy\), in every direction \(\ga\). Hence \(A\) attains its minimum on the open set \(\Om\), at a point where \(\nb A=0\), i.e., \(0\in \nb A(\Om)\).

3.6 Conjugate duality: maximum likelihood and maximum entropy

Conjugate dual function

\[A^*(\mu) :=\sup_{\te\in \Om} [\an{\mu, \te} - A(\te)].\]

Let \(\te(\mu)\) be such that \(\EE_{\te(\mu)} [\phi(X)] = \nb A(\te(\mu)) = \mu\).

Theorem.

  1. The conjugate dual function satisfies \[ A^*(\mu) = \begin{cases} -H(p_{\te(\mu)}), &\text{if }\mu\in \mathcal M^{\circ}\\ \lim_{n\to\iy} A^*(\mu^n)\text{ for any }\mu^n\to\mu,\ \mu^n\in\mathcal M^{\circ}, &\text{if }\mu\in \ol{\mathcal M}\bs \mathcal M^{\circ}\\ +\iy, &\text{if }\mu\nin \ol{\mathcal M}. \end{cases} \]
  2. \(A(\te) = \sup_{\mu\in \mathcal M} [\an{\te,\mu} - A^*(\mu)]\). (This holds for any convex function?)
  3. The sup is attained uniquely at \(\mu=\mu(\te)\).
  4. \(\nb A^*\) is the inverse of \(\nb A\).

Family \(\Om\) \(A(\te)\) \(\mathcal M\)
Bernoulli \(\R\) \(\ln(1+e^{\te})\) \([0,1]\)
Gaussian \(\R\times \R_{<0}\) \(-\fc{\te_1^2}{4\te_2} - \rc 2\ln (-2\te_2)\) \(\set{(\mu_1,\mu_2)}{\mu_2-\mu_1^2>0}\)
Exponential \((-\iy,0)\) \(-\ln (-\te)\) \((0,\iy)\)
Poisson \(\R\) \(e^\te\) \((0,\iy)\)

(For the Bernoulli family, \(A^*(\mu) = \mu\ln \mu + (1-\mu)\ln(1-\mu) = -H(\mu)\).)
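A numerical check of items 1 and 4 of the theorem for the Bernoulli entry above: \(A^*(\mu)=\sup_\te[\mu\te - A(\te)]\) should equal \(\mu\ln\mu+(1-\mu)\ln(1-\mu)\), with the sup attained at \(\te(\mu)=\ln\fc{\mu}{1-\mu}\).

```python
import numpy as np
from scipy.optimize import minimize_scalar

A = lambda theta: np.log(1 + np.exp(theta))          # Bernoulli log partition

for mu in (0.2, 0.5, 0.9):
    res = minimize_scalar(lambda th: A(th) - mu * th)    # minimizes -(mu*theta - A(theta))
    A_star, theta_mu = -res.fun, res.x
    assert np.isclose(A_star, mu * np.log(mu) + (1 - mu) * np.log(1 - mu), atol=1e-6)
    assert np.isclose(theta_mu, np.log(mu / (1 - mu)), atol=1e-4)   # inverse of nabla A
```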

3.7 Computational challenges with high-dimensional models

We can compute \(A,\mu\) by solving the variational problem.

The optimization problem maximizes a concave function over a convex set; what's hard? For general graphs, \(\mathcal M\) is hard to characterize explicitly (its facet complexity grows quickly), and \(A^*\) typically has no explicit form.

Instead approximate!

Thoughts

4 Bethe approximation and sum-product algorithm

4.1 Sum-product and Bethe approximation

Pairwise Markov random field: undirected graphical model with potential functions involving at most pairs of variables. (Often discrete or Gaussian.) \[\begin{align} p_\te(x)&\propto\exp\ba{\sum_{s\in V} \te_s(x_s) + \sum_{(s,t)\in E} \te_{st}(x_s,x_t)}\\ \mu_{st}(x_s,x_t) :&= \sum_{(j,k)\in X_s\times X_t} \mu_{st;jk} \one_{st;jk}(x_s,x_t)\\ \mathbb{M}(G) :&= \set{\mu\in \R^d}{\exists p\text{ with marginals }\mu_s(x_s), \mu_{st}(x_s,x_t)} \end{align}\]

(the set of globally realizable marginals)

(Q: how to convert an undirected GM into an equivalent pairwise form?)

Locally consistent marginal distributions: \[\begin{align} \sum_{x_s} \tau_s(x_s) &= 1\\ \sum_{x_t'} \tau_{st}(x_s,x_t') &=\tau_s(x_s), \end{align}\]

and similarly \(\sum_{x_s'} \tau_{st}(x_s',x_t) = \tau_t(x_t)\). These constraints, together with \(\tau\ge 0\), define \(\mathbb L(G)\).

In general \(\mathbb M(G)\sub \mathbb L(G)\). (Elements in \(\mathbb L(G)\bs \mathbb M(G)\) are pseudomarginals.) For trees they are equal.
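The classic example on the 3-cycle: uniform singleton pseudomarginals with perfectly anti-correlated pairwise tables are locally consistent but not globally realizable. A sketch that checks local consistency directly and poses membership in \(\mathbb M(G)\) as a small linear feasibility problem (via scipy's linprog); the numbers are the standard example.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

edges = [(0, 1), (1, 2), (0, 2)]
tau_s = {s: np.array([0.5, 0.5]) for s in range(3)}
tau_st = {e: np.array([[0.0, 0.5], [0.5, 0.0]]) for e in edges}   # perfect disagreement

# local consistency: each pairwise table marginalizes to the singleton pseudomarginals
for (s, t), T in tau_st.items():
    assert np.allclose(T.sum(axis=1), tau_s[s]) and np.allclose(T.sum(axis=0), tau_s[t])

# global realizability: is there a joint p on {0,1}^3 with these pairwise marginals?
configs = list(itertools.product([0, 1], repeat=3))
A_eq, b_eq = [], []
for (s, t), T in tau_st.items():
    for j, k in itertools.product([0, 1], repeat=2):
        A_eq.append([float((x[s], x[t]) == (j, k)) for x in configs])
        b_eq.append(T[j, k])
A_eq.append([1.0] * len(configs)); b_eq.append(1.0)               # normalization
res = linprog(np.zeros(len(configs)), A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(configs))
print("globally realizable?", res.success)      # False: tau is a pseudomarginal
```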

Proof. Set \[ p_\mu(x) := \prod_{s\in V} \mu_s(x_s) \prod_{(s,t)\in E}\fc{\mu_{st}(x_s,x_t)}{\mu_s(x_s)\mu_t(x_t)}. \]

The negative entropy \(A^*\) as a function of \(\mu\) has a closed form for trees.

\[\begin{align} H(p_\mu) &= \sum_{s\in V} H_s(\mu_s) - \sum_{(s,t)\in E} I_{st}(\mu_{st})\\ H_s(\mu_s) :&= -\sum_{x_s\in X_s} \mu_s(x_s)\ln \mu_s(x_s)\\ I_{st}(\mu_{st}):&=\sum_{(x_s,x_t)\in X_s\times X_t} \mu_{st}(x_s,x_t) \ln \fc{\mu_{st}}{\mu_s\mu_t}. \end{align}\]

Check me!
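A numerical check of the factorization and the entropy decomposition (addressing the note above), on a small chain with an arbitrary Markov-chain joint.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 3
p1 = rng.dirichlet(np.ones(K))
p21 = rng.dirichlet(np.ones(K), size=K)        # p(x2 | x1)
p32 = rng.dirichlet(np.ones(K), size=K)        # p(x3 | x2)
p = p1[:, None, None] * p21[:, :, None] * p32[None, :, :]   # chain 1 - 2 - 3

mu1, mu2, mu3 = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))
mu12, mu23 = p.sum(2), p.sum(0)

# tree factorization p_mu = prod_s mu_s * prod_st mu_st / (mu_s mu_t)
p_mu = (mu1[:, None, None] * mu2[None, :, None] * mu3[None, None, :]
        * mu12[:, :, None] / (mu1[:, None, None] * mu2[None, :, None])
        * mu23[None, :, :] / (mu2[None, :, None] * mu3[None, None, :]))
assert np.allclose(p_mu, p)

H = lambda q: -np.sum(q * np.log(q))
I = lambda pst, ps, pt: np.sum(pst * np.log(pst / np.outer(ps, pt)))
assert np.isclose(H(p), H(mu1) + H(mu2) + H(mu3) - I(mu12, mu1, mu2) - I(mu23, mu2, mu3))
```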

Bethe entropy approximation (\(\tau\) a pseudomarginal) \[ H_{\text{Bethe}}(\tau) := \sum_{s\in V} H_s(\tau_s) - \sum_{(s,t)\in E} I_{st}(\tau_{st}) = -\sum_{s\in V} (d_s-1)H_s(\tau_s) + \sum_{(s,t)\in E} H_{st}(\tau_{st}). \]

Bethe variational problem: \[ \max_{\tau\in \mathbb L(G)}\ba{ \an{\te, \tau} +H_{\text{Bethe}}(\tau) } \]

The corresponding Lagrangian is \[ \mathcal L(\tau,\la;\te) = \an{\te,\tau} + H_{\text{Bethe}}(\tau) + \sum_{s\in V} \la_{ss} C_{ss}(\tau) + \sum_{(s,t)\in E} \ba{ \sum_{x_s} \la_{ts}(x_s) C_{ts}(x_s;\tau) +\sum_{x_t} \la_{st}(x_t) C_{st}(x_t;\tau) }, \] where \(C_{ss}(\tau) := 1 - \sum_{x_s}\tau_s(x_s)\) and \(C_{ts}(x_s;\tau) := \tau_s(x_s) - \sum_{x_t}\tau_{st}(x_s,x_t)\) are the normalization and marginalization constraints.

Theorem (Sum-product and Bethe).

  1. Any fixed point of sum-product updates satisfies \[ \nb_{\tau,\la} \mathcal L(\tau^*, \la^*; \te)=0. \]
  2. For a tree-structured MRF, the unique solution has \(\tau\) the singleton and pairwise marginal distributions. The optimal value is \(A(\te)\).
Proof (of 1). Solving the Lagrange multiplier problem (note the multipliers are \(\la_{ss}\) and \(\la_{ts}(x_s)\)) gives \[\begin{align} \nb_\tau \ub{(\an{\te,\tau} + \sum H_s - \sum I_{st})}{F} &= \la^T\nb_\tau C\\ \nb_\tau F&= (\te_s - \cancel{1} - \ln \tau_s +\cancel{\sum_t \tau_{st} \rc{\tau_s}}, \te_{st} - 1 - \ln \fc{\tau_{st}}{\tau_s\tau_t})\\ \la^T \nb_\tau C &= (-\la_{ss} + \sum_{t\in N(s)}\la_{ts}, -\la_{st}(x_s) - \la_{st}(x_t))\\ \implies \ln \tau_s &= \la_{ss} + \te_s \fixme{+} \sum_{t\in N(s)} \la_{ts}(x_s)\\ \ln \fc{\tau_{st}}{\wt\tau_s \wt\tau_t} & = \te_{st} \fixme{-} \la_{ts}(x_s) - \la_{st}(x_t) \end{align}\] where the last 2 equations are viewed as functions on \(X_s\), \(X_s\times X_t\). (I'm having sign errors and an extra \(-1\) added…) Substitute the first equation into the second to get \[ \ln \tau_{st} = \la_{ss} + \la_{tt} + \te_{st}(x_s,x_t) + \te_s(x_s) + \te_t(x_t) + \sum_{u\in N(s)\bs t} \la_{us}(x_s) + \sum_{u\in N(t)\bs s} \la_{ut}(x_t). \] Letting \(M_{ts}(x_s) = \exp(\la_{ts}(x_s))\), \[\begin{align} \tau_s(x_s) &= \ka_s e^{\te_s(x_s)} \prod_{t\in N(s)}M_{ts}(x_s)\\ \tau_{st}(x_s,x_t) &= \ka_{st} e^{\te_{st}(x_s,x_t)+\te_s(x_s) + \te_t(x_t)} \prod_{u\in N(s)\bs t}M_{us}(x_s) \prod_{u\in N(t)\bs s} M_{ut}(x_t). \end{align}\]

The constraint \(C_{st}\) becomes \[ M_{st}(x_s) \propto \sum_{x_t} \ba{\exp\ba{\te_{st}(x_s,x_t) + \te_t(x_t)} \prod_{u\in N(t)\bs s} M_{ut}(x_t)}. \] This is exactly the sum-product update.

\(A_{\text{Bethe}}\) is simply an approximation to \(A\). It is a lower bound only for certain classes of models.
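To see the approximation concretely, a rough loopy sum-product sketch on a 3-cycle with random weak couplings (assuming the updates converge, which they typically do in this regime): the resulting pseudomarginals are close to, but in general different from, the exact marginals.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
K, n = 2, 3
edges = [(0, 1), (1, 2), (0, 2)]                      # a 3-cycle, not a tree
theta_s = [rng.normal(0, 0.5, K) for _ in range(n)]
theta_st = {e: rng.normal(0, 0.5, (K, K)) for e in edges}

nbrs = {s: set() for s in range(n)}
for s, t in edges:
    nbrs[s].add(t); nbrs[t].add(s)

def edge_pot(s, t):                                   # exp(theta_st), indexed [x_s, x_t]
    T = theta_st[(s, t)] if (s, t) in theta_st else theta_st[(t, s)].T
    return np.exp(T)

M = {(t, s): np.ones(K) / K for s in range(n) for t in nbrs[s]}
for _ in range(200):                                  # iterate the update toward a fixed point
    M_new = {}
    for (t, s) in M:
        inc = np.prod([M[(u, t)] for u in nbrs[t] - {s}], axis=0)
        m = edge_pot(s, t) @ (np.exp(theta_s[t]) * inc)
        M_new[(t, s)] = m / m.sum()
    M = M_new

def tau(s):                                           # pseudomarginal at node s
    b = np.exp(theta_s[s]) * np.prod([M[(t, s)] for t in nbrs[s]], axis=0)
    return b / b.sum()

p = np.zeros((K,) * n)
for x in itertools.product(range(K), repeat=n):
    lp = sum(theta_s[s][x[s]] for s in range(n)) + sum(theta_st[e][x[e[0]], x[e[1]]] for e in edges)
    p[x] = np.exp(lp)
p /= p.sum()
for s in range(n):
    exact = p.sum(axis=tuple(i for i in range(n) if i != s))
    print(s, tau(s), exact)                           # close, but in general not equal
```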

Proposition (reparameterization properties). Let \(\tau^*\) be an optimum of BVP defined by \(p_\te\). Then \[ p_{\tau^*}(x) = \rc{Z(\tau^*)} \prod_{s\in V} \tau_s^*(x_s) \prod_{(s,t)\in E} \fc{\tau_{st}^*(x_s,x_t)}{\tau_s^*(x_s)\tau_t^*(x_t)} = p_\te(x). \]

(Note: we had \(\te\) to begin with, so this is not surprising. The goal of BVP is to compute \(A\) which can be used to compute marginals. But we don’t get the real marginals.)

A generalized loop is a subgraph where every vertex has degree \(\ge2\).

Proposition 4.4 (Loop series expansion). For the Ising model, \[\begin{align} A(\te) &= A_{\text{Bethe}}(\te) + \ln \ba{ 1 + \sum_{\emptyset\ne \wt E\subeq E} \be_{\wt E} \prod_{s\in V}\EE_{\tau_s} [(X_s-\tau_s)^{d_s(\wt E)}] }\\ \be_{st} :&= \fc{\tau_{st}-\tau_s\tau_t}{\tau_s(1-\tau_s)\tau_t(1-\tau_t)}\\ \be_{\wt E} :&=\prod_{(s,t)\in \wt E}\be_{st}\\ \tau_s &= \tau_s(1), \quad \tau_{st} =\tau_{st}(1,1). \end{align}\]

Note the term is nonzero only for generalized loops.

Proof. Note if \(\te,\te'\) are such that \(p_\te=p_{\te'}\) then \(A(\te)-A(\te') = A_{\text{Bethe}}(\te)-A_{\text{Bethe}}(\te')\). Q: Why? (First show \(H_{\text{Bethe}}(p_{\te(\mu)}) = H_{\text{Bethe}}(p_{\te'(\mu)})\). Then note for \(\mu\in \mathbb L(G)\), \(\an{\te'-\te,\mu}=0\).)

Check this for a BP fixed point where \(A_{\text{Bethe}}=0\). (See p.3 of 11/16. Check signs.) Calculate (here \(\E\) is wrt \(\prod_s \tau_s\)) \[\begin{align} \fc{\tau_{st}}{\tau_s\tau_t} &= 1+\be_{st}(x_s-\tau_s)(x_t-\tau_t)\\ \exp(A(\wt \te)) &=\E\ba{\prod_{(s,t)\in E} (1+\be_{st}(x_s-\tau_s)(x_t-\tau_t))}. \end{align}\]

Expand and use \(\E[X_s-\tau_s]=0\).

Q: is there a nice (nonvariational) expression for \(A_{\text{Bethe}}\) like \(A\)?

4.2 Kikuchi and Hypertree-based models

BVP involves approximating both the entropy and \(\mathbb M(G)\). Improve both of these simultaneously. Use more complex junction trees.

A hypergraph can be drawn using a poset diagram of its edges (reverse inclusion).

A hypergraph is acyclic if you can specify (??) a junction tree using maximal hyperedges and intersections. The width is the size of the largest hyperedge minus 1. A \(k\)-hypertree is an acyclic hypergraph of width \(k\) with a single connected component.

5 Mean field methods

6 Variational methods in parameter estimation

7 Variational methods based on convex relaxations

8 Mode computation

9 Conic programming relaxations

Review

1

2

3

  1. Define an exponential family. What does it mean for it to be regular? Minimal? (3.2)
  2. Write the Ising model as an exponential family. (3.3)
  3. Given \(\phi\), define the set of realizable mean parameters \(\mathcal M\). What is the forward/backward mapping? (3.4)
  4. What are the (realizable) mean parameters for Gaussians? The Ising model? (3.4)
  5. Define the correlation (cut) polytope and marginal polytope. (3.4)
  6. What are the derivatives of \(A\)? (3.5)
  7. When is \(A\) convex/strictly convex? When is \(\nb A\) one-to-one? (3.5)
  8. What is the relationship between \(\nb A\) and \(\mathcal M\)? (3.6)
  9. Define the conjugate dual function \(f^*\). When is the double dual the original function? (3.6)
  10. Relate \(A^*\) to the entropy. (3.6)
  11. What is the relationship between \(\nb A^*\) and \(\nb A\)? (3.6)

4

  1. Define \(\mathbb M(G)\), \(\mathbb L(G)\).
  2. Define the Bethe approximation and variational problem. What is its Lagrangian?
  3. Explain the relationship with the sum-product updates.
  4. What is the solution to the problem for tree-structured MRF’s?
  5. What is the loop series expansion?

Misc

The treewidth

Having trouble finding the complete algorithm! I want: given a graph of treewidth \(k\), find a tree decomposition of width \(O(k)\) in time \(e^{O(k)}\,\poly(n)\).