Reference: Barak’s notes. Barak and Steurer’s survey.
“Sum-of-squares” is an algorithm that attempts to find feasible solutions to systems of polynomial inequalities, or a proof that there is no solution. It is widely applicable because many computational problems can naturally be put into the form of polynomial inequalities (e.g., \(k\)-SAT can be encoded with degree \(k\)). The degree \(d\) of sum-of-squares is a tunable parameter; the algorithm runs in \(n^{O(d)}\) time. We can ask:
Why is SoS a natural notion?
There are many equivalent notions of sum-of-squares; we give four. The first definition introduces an algorithm that searches for a pseudo-expectation operator; on the Boolean cube this is essentially equivalent to a pseudo-distribution. We also give equivalences to a sum-of-squares proof and a sum-of-squares representation.
In the next section we will motivate these definitions and show equivalences.
Let \(x=(x_1,\ldots, x_n)\) and \(\R[x]_{\le d}\) denote polynomials in \(x_1,\ldots, x_n\) of degree \(\le d\).
\(1\iff 2\): See exercise 1.7.
\(2\iff 3\): This is the SOS Theorem. It encompasses the Positivstellensatz, which says that every unsatisfiable system of equalities has a finite-degree proof of unsatisfiability. (See exercises 1.14–1.15 for part of the theorem.)
\(3\iff 4\): See exercise 1.11.
The most common use of sum-of-squares is in SDP relaxations of combinatorial problems, as follows. Let \(f\) be a function that can be written as \(f(x)=g(xx^T)\) for a convex function \(g\). The goal is to find \[ \min_{x\in \{-1,1\}^n}f(x). \] One way to relax this is to find \[ \min_{M\succeq 0, M_{ii}=1} g(M). \] This is a relaxation because if \(x\) is an optimal solution to the first problem, \(xx^T\) is a feasible point for the second problem achieving the same value.2
Think of \(M_{ii}=1\) as degree-2 constraints that shrink the space we’re minimizing over to be closer to just the set \(\set{xx^T}{x\in \{-1,1\}^n}\).
Note this is the degree-2 SoS relaxation of this problem!3 The degree-\(l\) SoS relaxation is the optimization problem \[\min_{M\text{ a pseudo-expectation satisfying }x_i^2=1} g(M)\] (note we showed that the set of pseudo-expectations is convex; it is defined by \(M\succeq 0\) and some linear equations). (The constraint \(x_i^2=1\) corresponds to \(M_{ii}=1\).) In the equation above we thought of \(M\) as a bilinear form on \(\R^n\times \R^n = \R[x]_{=1}\times \R[x]_{=1}\); here we’re just adding 1 to the vector space to get \(\R[x]_{\le 1}\times \R[x]_{\le 1}\) and stipulating \(M(1,1)=1\), which doesn’t change anything.
(If \(f\) is a linear function \(f(x_1,\ldots, x_n,x_1^2,x_1x_2,\ldots, x_1^3,\ldots)\), then \(g(M)\) is the function \(f(M(x_1,1),\ldots, M(x_n,1), M(x_1,x_1),\ldots)\).)
For example, the degree-4 SoS relaxation would correspond to optimizing over \((\R^{n^2}\oplus\R^n\oplus\R)^{\ot 2}\), where the solutions corresponding to solutions of the original problem are of the form \((x^{\ot 2},x,1)^{\ot 2}\), \(x\in \{-1,1\}^n\). For \(M\in(\R^{n^2}\oplus\R^n\oplus\R)^{\ot 2}\), the solutions satisfy equations like \(M_{ii,jk}=M_{j,k}\), \(M_{i,ij}=M_{1,j}\), etc.—corresponding to multiples of the polynomials \(x_i^2-1\).
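To make the degree-2 relaxation concrete, here is a minimal numerical sketch (assuming numpy and cvxpy with an SDP-capable solver are installed; the objective matrix \(W\) is hypothetical data): minimize a quadratic \(x^TWx\) over the hypercube by relaxing \(xx^T\) to a psd matrix \(M\) with unit diagonal.

```python
import numpy as np
import cvxpy as cp

# Degree-2 SoS (basic SDP) relaxation of min_{x in {-1,1}^n} x^T W x.
# M plays the role of the pseudo-moment matrix E~[x x^T]; g(M) = <W, M>.
n = 6
rng = np.random.default_rng(0)
W = rng.standard_normal((n, n))
W = (W + W.T) / 2                                # hypothetical symmetric objective

M = cp.Variable((n, n), PSD=True)                # M >= 0
prob = cp.Problem(cp.Minimize(cp.trace(W @ M)),  # g(M) = <W, M>
                  [cp.diag(M) == 1])             # M_ii = 1, i.e. E~[x_i^2] = 1
prob.solve()

# The SDP value lower-bounds the true minimum, since M = x x^T is feasible
# for every x in {-1,1}^n and achieves the same objective value.
brute = min((2 * np.array(b) - 1) @ W @ (2 * np.array(b) - 1)
            for b in np.ndindex(*([2] * n)))
print(prob.value, "<=", brute)
```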
Define, for a \(d\)-regular graph \(G=(V,E)\) and \(S\subeq V\), \(\text{cut}(S)=\fc{E(S,\ol S)}{|E|}\), the fraction of edges crossing between \(S\) and \(\ol S\); let \(L=I-\rc d A\) be the normalized Laplacian.
We can express this in terms of the characteristic vector \(x\) of \(S\), defined by \(x_i=\one(i\in S)-\one(i\notin S)\): \[ \an{x,Lx} = \rc{2d}\sum_{i\sim j} (x_i-x_j)^2 =\fc{4E(S,\ol S)}{d} = 2n\,\text{cut}(S). \]
Lemma: Let \(\wt{\E}\) be a degree-2 pseudo-expectation operator (assume \(\wt\E x=0\); otherwise shift by the mean). Then there is a Gaussian distribution \(N\) such that for every polynomial \(p\) of degree \(\le 2\), \[\wt{\E}_{x\sim\mu} p(x) = \E_{y\sim N} p(y).\] The pseudo-expectation operator is a bilinear form on \(\R[x]_{\le 1}\times \R[x]_{\le 1}\), hence is associated with a psd matrix \(A\); write \(A=B^2\). Then take \(y=Bg\) where \(g\sim N(0,I)\), i.e., \(y\) is the centered Gaussian with covariance matrix \(A\).
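A minimal numpy sketch of this correspondence (hypothetical pseudo-moment matrix, \(\wt\E x=0\), scipy assumed available): take a matrix square root of \(A=\wt\E[xx^T]\) and push a standard Gaussian through it.

```python
import numpy as np
from scipy.linalg import sqrtm

# Given a psd degree-2 pseudo-moment matrix A = E~[x x^T] (with E~[x] = 0),
# the Gaussian y = B g, B = A^{1/2}, g ~ N(0, I) has E[y y^T] = B B^T = A,
# hence matches E~ on every polynomial of degree <= 2.
rng = np.random.default_rng(1)
n = 5
R = rng.standard_normal((n, n))
A = R @ R.T                                    # hypothetical psd matrix
dscale = np.sqrt(np.diag(A))
A = A / np.outer(dscale, dscale)               # normalize so A_ii = E~[x_i^2] = 1

B = np.real(sqrtm(A))                          # symmetric square root of A
y = B @ rng.standard_normal((n, 200000))       # samples from N(0, A)
print(np.allclose(np.cov(y), A, atol=0.05))    # empirical covariance ~= A
```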
A lemma about Gaussians.
Lemma: If \((x,y)\) are \((1-\ep)\)-correlated Gaussians, then \((x,y)^T =Av\) where \(A=\rc2 \smatt{\sqrt{2-\ep}+\sqrt{\ep}}{\sqrt{2-\ep}-\sqrt{\ep}}{\sqrt{2-\ep}-\sqrt{\ep}}{\sqrt{2-\ep}+\sqrt{\ep}}\), \(v\sim N(0,I)\). In other words, if \((x,y)\) are \(\cos(\te)\)-correlated Gaussians, then we can take \(A=\smatt{\sin(\fc\pi4-\fc\te2)}{\sin(\fc\pi4+\fc\te2)}{\cos(\fc\pi4-\fc\te2)}{\cos(\fc\pi4+\fc\te2)}\).
(Sanity check: when \(\te=0\) both rows point in direction \(\pi/4\); when \(\te=\fc \pi2\) they point in directions \(\fc\pi2\) and \(0\), so \(x,y\) are independent.)
Lemma: Let \(y,y'\) be Gaussians with variance 1 such that \[\E(y-y')^2 \ge 4(1-\de) = 2+2\cos(\te).\] Then \[\Pj(\sgn(y) = \sgn(y')) \le \fc{\te}{\pi} = O(\sqrt{\de}).\]
Theorem: There is a polynomial-time algorithm which, given an \(n\)-vertex \(d\)-regular graph \(G=(V,E)\) and a degree-2 pseudo-distribution \(\mu\) with \(\wt \E_{x\sim \mu} x_i^2=1\) and \(\wt \E\an{x,Lx} \ge 2n(1-\ep)\), outputs \(z\in \{\pm1\}^n\) such that \(\an{z,Lz} \ge 2n(1-f_{GW}(\ep))\), where \(f_{GW}(\ep)=O(\sqrt \ep)\). (Add the precise form of \(f_{GW}\).)
Proof: Let \(y\) be the Gaussian matching the degree-2 pseudo-moments (first lemma), set \(z_i=\sgn(y_i)\), and apply the lemma on Gaussian variables to each edge term.
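A sketch of the whole rounding pipeline on a toy instance (the 5-cycle; assuming numpy and cvxpy with an SDP solver): solve for the optimal degree-2 pseudo-moments, sample a Gaussian with that covariance, and take signs.

```python
import numpy as np
import cvxpy as cp

# GW-style rounding of a degree-2 pseudo-distribution on the 5-cycle (toy instance).
n, d = 5, 2
Adj = np.zeros((n, n))
for i in range(n):
    Adj[i, (i + 1) % n] = Adj[(i + 1) % n, i] = 1
L = np.eye(n) - Adj / d                              # normalized Laplacian

# Best degree-2 pseudo-moments: maximize E~<x, L x> subject to E~[x_i^2] = 1.
M = cp.Variable((n, n), PSD=True)
cp.Problem(cp.Maximize(cp.trace(L @ M)), [cp.diag(M) == 1]).solve()
A = M.value

# Rounding: sample y ~ N(0, A) and output z = sgn(y).
rng = np.random.default_rng(0)
w, V = np.linalg.eigh(A)
B = V @ np.diag(np.sqrt(np.clip(w, 0, None)))        # A^{1/2}, clipping tiny negatives
best = -np.inf
for _ in range(100):
    z = np.where(B @ rng.standard_normal(n) >= 0, 1.0, -1.0)
    best = max(best, z @ L @ z)
print("SDP value:", np.trace(L @ A), "best rounded value:", best)
```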
This is optimal. To see optimality in order of magnitude, take the odd cycle on \(n=\rc{\sqrt{\ep}}\) vertices, or a union of such cycles. To see optimality in an additive sense, use the Feige–Schechtman graph (random points on the sphere, connected when \(\an{v_i,v_j}\le -1+\ep\)).
Exercise: show that the odd cycle \(C_n\) (the degree-2 gap instance) is solved by degree-4 SoS. This is the same as saying there is no degree-4 pseudodistribution over \(\R^n\) consistent with \(\{x_i^2=1\}\) such that \[\wt \E\sum_i (x_i-x_{i+1})^2>4(n-1).\] (We had a degree-2 pseudodistribution with \(\wt \E>4n\pa{1-O\prc{n^2}}\), which gave the gap.)
Proof: The squared triangle inequality holds for degree-4 pseudodistributions. Sum up the inequalities \(\wt E(x_i-x_{i+1})^2 \le \sum_{j\ne i} \wt E(x_j+x_{j+1})^2\) over \(i\), and use \(\wt E(x_j+x_{j+1})^2 = 4-\wt E(x_j-x_{j+1})^2\).
Applications
Sum-of-squares is a powerful meta-algorithm.
Other meta-algorithms:
Gradient descent, integer linear programming: for a given problem there are many ways of formulating it, and performance depends crucially on how you set it up. SoS doesn’t suffer from this; there is a mechanical way to apply it.
The vanilla version is impractical, but there are efforts to extract practical algorithms. They resemble heuristic algorithms used in practice.
Given \(p\in \R[x_1,\ldots, x_n]\) with \(p\ge 0\) over \(\R^n\), can \(p\) be written as a sum of squares?
Krivine characterized systems of polynomial inequalities without solutions over \(\R^n\). (Positivstellensatz, cf. Farkas lemma)
Given \(f:\{0,1\}^n\to \R\) (given as coefficients in the monomial basis up to degree \(\deg f\)), is \(f\ge 0\), or is there \(x\in \{0,1\}^n\) with \(f(x)<0\)? The number of coefficients is \(\le n^{\deg f}\). (Every monomial is effectively a subset of variables, since \(x_i^2=x_i\) on \(\{0,1\}^n\).) This is NP-hard already for degree 2.
Max-cut: \[ maxcut(G) \le c \iff c - \sum_{\{i,j\}\in E(G)} (x_i-x_j)^2 \ge 0 \text{ over } \{0,1\}^n. \]
Given \(f\), the algorithm outputs either a short proof that \(f\ge 0\), or an object that pretends to be a distribution over points \(x\in \{0,1\}^n\) with \(f(x)<0\).
Degree-\(d\) SoS certificate for \(f\): polynomials \(g_i\) with \(\deg g_i\le \fc d2\) and \[\forall x\in \{0,1\}^n,\quad f(x) = \sumo iv g_i(x)^2.\] It can be the case that \(\deg g_i>\deg f\)! Over \(\R\) this could not happen (the top-degree parts of a sum of squares cannot cancel), but we only require the identity over \(\{0,1\}^n\).
For degree 2, we can do this efficiently over \(\R\); this question is hard because we restrict to the hypercube.
If \(f\) has a degree-\(d\) sos certificate, we can find a degree-\(d\) certificate for \(f+2^{-n^d}\) in time \(n^{O(d)}\). (We can’t solve general convex problems exactly.)
Theorem. For all \(G\), there exists a degree-2 sos certificate for \(\max(f_G) - 0.878\, f_G\ge 0\). This estimates MAX-CUT up to a factor \(0.878\).
Open question: if we replace 2 by 4, can we replace \(0.878\) by a larger constant? This would refute the Unique Games Conjecture.
Theorem 5. \(f\) has a degree-\(d\) sos certificate iff there exists a positive semidefinite \(A\) such that \[\forall x\in \{0,1\}^n,\quad f(x) = \an{(1,x)^{\ot d/2}, A (1,x)^{\ot d/2}}.\] (Proof. \(A\) has a square root.)
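A minimal sketch of turning Theorem 5 into an SDP search for a certificate (assuming numpy and cvxpy; we brute-force the cube, so small \(n\) only): for the Max-Cut polynomial of the 5-cycle we look for the smallest \(c\) such that \(c-f_G\) has a degree-2 certificate.

```python
import itertools
import numpy as np
import cvxpy as cp

# Smallest c such that c - f_G has a degree-2 SoS certificate over {0,1}^n,
# where f_G(x) = sum_{edges}(x_i - x_j)^2 counts cut edges.  This c is the
# degree-2 SoS upper bound on maxcut(G).  (Brute-forces the cube: small n only.)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]     # the 5-cycle, as an illustration
n = 5

A = cp.Variable((n + 1, n + 1), PSD=True)            # Gram matrix of the certificate
c = cp.Variable()
constraints = []
for x in itertools.product([0, 1], repeat=n):
    v = np.array([1.0, *x])                          # (1, x)
    fx = sum((x[i] - x[j]) ** 2 for i, j in edges)
    constraints.append(v @ A @ v == c - fx)          # <v, A v> = c - f_G(x) on the cube
cp.Problem(cp.Minimize(c), constraints).solve()
print("degree-2 SoS bound:", c.value)                # maxcut(C_5) = 4; bound ~ 4.5
```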
Theorem. If \(f\ge 0\), then it has a degree \(2n\) sos certificate.
Proof. Either take \(g=\sqrt f\) as a function on the cube; it is represented by a multilinear polynomial with \(\deg g\le n\), and \(f=g^2\) on \(\{0,1\}^n\). Or write \(f(x) = \sum_{y\in \{0,1\}^n} (\sqrt{f(y)}\,\one_y(x))^2\), where \(\one_y\) is the (degree-\(n\)) indicator of \(y\).
Proof (finding sos certificates). Write \(f(x) = \an{(1,x)^{\ot d/2}, F(1,x)^{\ot d/2}}\) for some (not necessarily psd) \(F\). We want \(A\succeq 0\) with \[ A-F\in W = \set{Q}{\an{(1,x)^{\ot d/2}, Q(1,x)^{\ot d/2}}\equiv 0\text{ over }\{0,1\}^n}, \] so look at \((F+W)\cap\{\text{psd matrices}\}\).
We just need an efficient separation oracle.
Cheeger: Let \(G\) be a \(d\)-regular graph and \(L_G=I-\rc dA\) be its Laplacian, and \(\la\) be the second smallest eigenvalue. Then \[\la \ge \Om(\ph(G)^2).\]
The second smallest eigenvalue of \(L_{C_n}\) is \(O\prc{n^2}\). (Proof: compute.) This shows Cheeger is tight, as we have \[ \la = \Te\prc{n^2}, \quad \ph(G) = \Te\prc{n}. \]
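A quick numerical check of the claim (numpy only): the eigenvalues of \(L_{C_n}\) are \(1-\cos(2\pi k/n)\), so \(\la_2 = 1-\cos(2\pi/n)\approx 2\pi^2/n^2\).

```python
import numpy as np

# lambda_2 of the normalized Laplacian of the cycle C_n scales like 1/n^2.
for n in [50, 100, 200, 400]:
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1
    L = np.eye(n) - A / 2                          # d = 2 for the cycle
    lam2 = np.sort(np.linalg.eigvalsh(L))[1]       # second smallest eigenvalue
    print(n, lam2, lam2 * n ** 2)                  # last column -> 2*pi^2 ~ 19.74
```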
Maxcut: We showed that \[maxcut (G)\le 1-\ep \implies (\forall \deg \mu=2, \wt{\EE_{\mu}} f_G(x) \le 1-\Om(\ep^2)).\]
To show tightness up to constants, show that for \(G=C_n\) there is \(\mu\) with: \[\begin{align} maxcut(C_n) &\le 1-\rc n =:1- \ep\\ \wt{\EE_\mu} f_G(x) &= 1-\Te\prc{n^2}. \end{align}\]Proof. Let \(n=2k+1\) and \(\om=e^{2\pi i/n}\). Define \(u, v, w, X, \mu\) as follows. \[\begin{align} u &=(\om^{jk})_{j=0}^{n-1} = v+iw\\ X&=vv^T+ww^T\\ \wt{\EE_\mu} x &= \rc2 \one\\ \wt{\EE_{\mu}} xx^T &= \rc 4 \one \one^T + \rc 4 X \end{align}\](We can find a degree-2 pseudodistribution whenever \(\wt{\EE_\mu} xx^T - (\wt{\EE_\mu} x)(\wt{\EE_{\mu}}x)^T\) is psd and the diagonal of \(\wt{\EE_\mu} xx^T\) equals \(\wt{\EE_\mu}x\). (Do we want it to be in \([0,1]\)?)) Calculation: \[ \wt{\EE_{\mu}} f_G(x) = \rc {4n} \sum_{(i,j)\in E} |u_i-u_j|^2 = 1-\Te\prc{n^2}. \] (Working backwards: setting \(X=vv^T+ww^T\) and then finding what the values of \(v,w\) should be, we want \(\sum_{(i,j)\in E}|u_i-u_j|^2\) to be large. This motivates hops of \(\om^k\). Calculation: \(\wt\E(x_i-x_j)^2 = \rc2-\fc{v_iv_j+w_iw_j}2\).)
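A numpy check of this construction (with \(f_G\) the fraction of cut edges): build \(X\), verify the pseudo-moments are consistent (correct diagonal, psd covariance), and compute the pseudo-expected cut value.

```python
import numpy as np

# Degree-2 pseudo-moments for the odd cycle C_n, n = 2k+1, as defined above.
n = 101
k = (n - 1) // 2
omega = np.exp(2j * np.pi / n)
u = omega ** (k * np.arange(n))                  # u_j = omega^{jk} = v_j + i w_j
v, w = u.real, u.imag
X = np.outer(v, v) + np.outer(w, w)

Ex  = 0.5 * np.ones(n)                           # E~[x]
Exx = 0.25 * np.ones((n, n)) + 0.25 * X          # E~[x x^T]

print(np.allclose(np.diag(Exx), Ex))                               # x_i^2 = x_i
print(np.linalg.eigvalsh(Exx - np.outer(Ex, Ex)).min() >= -1e-9)   # covariance psd

# Pseudo-expected fraction of cut edges: E~(x_i - x_j)^2 = Ex_i + Ex_j - 2 Exx_ij.
edges = [(i, (i + 1) % n) for i in range(n)]
val = sum(Ex[i] + Ex[j] - 2 * Exx[i, j] for i, j in edges) / n
print(val, ">", 1 - 1 / n)    # val = 1 - Theta(1/n^2), strictly above maxcut(C_n)
```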
\[\begin{align} \al_{GW} &= \min_{0\le x\le 1} \fc{\cos^{-1}(1-2x)}{\pi x}\\ &=\min_{-1\le \rh\le 1} \fc{2\cos^{-1}\rh}{(1-\rh)\pi}\\ \rh_{GW} &= \amin_{-1\le \rh\le 1} \fc{2\cos^{-1}\rh}{(1-\rh)\pi}, \qquad x_{GW}=\fc{1-\rh_{GW}}2. \end{align}\] Theorem (Tightness of GW): For every \(\ep>0\) there are \(G\) and \(\mu\) such that \[\begin{align} maxcut(G) &\le \al_{GW} x_{GW} + \ep\\ \wt{\EE_{\mu}} f_G(x) &\ge x_{GW}. \end{align}\]Proof. Feige–Schechtman graph: connect up points on the sphere if \(\an{v,w}\le \rh_{GW}+\ep\). (Later reduce to a discrete graph by sampling.) We have variables \(\{X_v\}_{v\in \R^d}\), \(X_v = \rc 2 + \fc{\an{v,g}}2\) for \(g\sim N(0,I_d)\); then \(\Cov(X_u,X_v) = \rc4\E \an{u,g}\an{v,g} = \fc{\an{u,v}}4 \le \fc{\rh_{GW} + \ep}4\) if \(\an{u,v}\le \rh_{GW}+\ep\).
Theorem. It is NP-hard to solve \(\text{Max-3XOR}_{\rc 2+\de, 1-\ep}\).
There are reductions \(3\SAT\stackrel{\wt O(n)}{\le}\text{LabelCover} \stackrel{O(n)}\le \text{3XOR}\).
The fraction of constraints of the 3XOR instance \(\psi\) (constraints \(x_i+x_j+x_k \equiv a_{ijk}\pmod 2\)) satisfied by \(x\) is \[ f_\psi(x) = \rc 2 + \rc{2|\psi|}\sum_{\{i,j,k\}\in \psi} (1-2a_{ijk})(1-2x_i)(1-2x_j)(1-2x_k). \]
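A quick check of this formula against direct counting (small hypothetical random instance, standard library only):

```python
import itertools
import random

# Check the Fourier formula for the fraction of satisfied 3XOR constraints
# against direct counting.
random.seed(0)
n, m = 6, 20
psi = [(random.sample(range(n), 3), random.randint(0, 1)) for _ in range(m)]

def frac_direct(x):
    return sum((x[i] + x[j] + x[k]) % 2 == a for (i, j, k), a in psi) / m

def frac_fourier(x):
    s = sum((1 - 2 * a) * (1 - 2 * x[i]) * (1 - 2 * x[j]) * (1 - 2 * x[k])
            for (i, j, k), a in psi)
    return 0.5 + s / (2 * m)

for x in itertools.product([0, 1], repeat=n):
    assert abs(frac_direct(x) - frac_fourier(x)) < 1e-9
print("formula verified")
```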
Theorem (Grigoriev). For every \(\ep>0\), for large enough \(n\), there is an instance \(\psi\) of Max-3XOR such that at most a \(\rc2+\ep\) fraction of its constraints can be simultaneously satisfied, yet there is a degree-\(\Om(n)\) pseudodistribution \(\mu\) with \(\wt{\EE_\mu} f_\psi(x)=1\).
Intuition: unlike Gaussian elimination, the SoS algorithm cannot distinguish between a perfectly satisfiable system and one that is only \(1-o(1)\) satisfiable (indeed, by the theorem, even one that is only \(\rc2+o(1)\) satisfiable).
Proof. Take a random \((m,n)\) bipartite graph with left-degree 3. Left is constraints, right is variables.
Completeness (cf. expander codes): Let \(\chi_S = \prod_{i\in S}(1-2x_i)\). Idea: make the pseudodistribution “uniform” subject to the constraints.
We define \(\wt{\EE_{\mu}}\chi_S\) for \(|S|\le \ep n\). If \(\wt \E \chi_S\), \(\wt \E \chi_T\) have been defined and \(|S\triangle T|\le d=\ep n\), set \(\wt \E \chi_{S\triangle T} = (\wt \E \chi_S)(\wt \E \chi_T)\). Set remaining \(\wt \E \chi_S=0\).
A degree-\(d\) derivation is a set that is reached after some number of steps, each set along the way having size \(\le d\).
Whp the graph is an expander. If we had degree-\(d\) derivations of both \(\sum_{i\in S} x_i\equiv 0\) and \(\sum_{i\in S} x_i\equiv 1\) for the same \(S\), then we would get a degree-\(2d\) derivation of \(\sum_{i\in \emptyset} x_i=1\), i.e., \(0=1\). Impossible—similar to the proof for expander codes: in order for \(\bigopl_{l\in T_t}\Ga(l)\) to be \(\emptyset\), \(T_t\) must be large; take the first large \(T_i\) in the derivation; by expansion it has a large neighbor set, a contradiction.
Show that \(\deg p\le \fc d2\implies \wt \E p^2\ge 0\) by Fourier expansion. Break the sets \(|S|\le \fc d2\) into equivalence classes \(C_i\), where \(S\sim T\) iff \(\wt\E[\chi_{S\triangle T}]\ne 0\); let \(S_i\) be a representative of \(C_i\). For \(p_i=\sum_{S\in C_i} p_S\chi_S\), \[ \wt \E p^2 = \sum_i \wt\E p_i^2 = \sum_i \pa{\sum_{S\in C_i} p_S\,\wt{\E} \chi_{S\triangle S_i}}^2\ge 0.\]
Exercise 11. DO this!
To reduce from Max 3SAT \((\fc 78+\ep, 1)\), convert \(a_{\ell_i}x_i \vee a_{\ell_j}x_j \vee a_{\ell_k}x_k=1\) to \(x_i+x_j+x_k \equiv a_{\ell_i}+a_{\ell_j} + a_{\ell_k}\).
A nice subspace \(V\subeq \F_2^k\) is one such that every \(u\in V^{\perp}\bs \{0\}\) has \(\ve{u}_0\ge 3\). A nice subspace predicate is a predicate \(P\) such that there exists a nice \(V\) with \(P(x)=1\iff x\in V\).
Theorem (SoS hardness for nice-subspace CSP, SoS PCP). Let \(k,\ep>0\) be given. There exist \(\be = 2^k\prc{\ep^2}\), \(c=\poly\prc{\be}\) such that there is an instance \(\psi\) of Max-P for [\(k\)-variate predicate \(P\) with \(\le 2k\) satisfying assignments] on \(n\gg \rc c\) variables and \(m=\be n\) constraints such that
Read the rest (15-21)
Theorem. Let \(G\) be a \(d\)-regular graph with vertex set \([n]\) and \(\mu:B^n\to \R\) be a degree-4 pseudo-distribution. Then \[ \ph(G) = \min_{x\in B^n} \fc{f_G(x)}{\fc dn |x|(n-|x|)} \le O(\sqrt{\ln n})\, \ph_\mu(G). \] There is a polynomial-time algorithm that, given \(G\) and \(\mu\), finds \(x\) with \(\fc{f_G(x)}{\fc dn|x|(n-|x|)} \le O(\sqrt{\ln n})\,\ph_\mu(G)\).
Proof.
Let \(d(i,j)=\wt{\EE_{\mu}} (x_i-x_j)^2\). For degree 4, we have \(d(i,j) + d(j,k) \ge d(i,k)\).
This gives the lower bound \(\E \max_{i,j\in [n]} (X_j-X_i)\ge \fc{\Phi(k)}n\).
Lemma (chaining). \[\Phi(k+1) \ge \Phi(k) + \E |M| - O(n) \max_{i\in [n], j\in H^{k+1}(i)} \pa{\E(X_i-X_j)^2}^{\rc 2}.\]

We want \(\min_{x\in B^n}f\); degree-\(d\) SoS finds \(\min_{\deg \mu = d} \wt{\EE_{\mu}} f\).
Often the global minimum is attained at a single point, but the pseudodistribution pretends to be a high-entropy distribution over nonexistent solutions, “supported on unicorns”.
Replace \(B^n\) with \(\Om\subeq \R^n\) defined by polynomial inequalities \(A=\{f_1\ge 0,\ldots, f_m\ge 0\}\). Questions:
A degree-\(l\) SoS proof of \(g\ge0\) from \(A\) is a representation \(g=\sum_{S\subeq [m]} p_S \prod_{i\in S} f_i\), where each \(p_S\) is a sum of squares and each \(\deg(p_S \prod_{i\in S} f_i)\le l\). Write \(A\vdash_l \{g\ge 0\}\) (there is an SoS proof of degree \(l\) of \(g\ge 0\) from \(A\)).
Theorem (Positivstellensatz). Every \(A\) is either feasible, or \(A\vdash_l\{-1\ge 0\}\) for some \(l\in \N\).
SoS has inference rules for addition, multiplication (add degrees), transitivity (multiply degrees), and substitution (multiply with degree of substituted \(H\)).
Lemma. Let \(\mu\) have finite support, \(\wt{\EE_{\mu}} 1=1\). Then \(\mu\) is a degree-\(d\) pseudodistribution iff \(\wt{\EE_{\mu}} ((1,x)^{\ot d/2})((1,x)^{\ot d/2})^T\) is psd.
Proof. Writing \(p = \an{v,(1,x)^{\ot d/2}}\), we have \(\wt{\EE_{\mu}} p^2 = v^T\pa{\wt{\EE_{\mu}} ((1,x)^{\ot d/2})((1,x)^{\ot d/2})^T}v\), which is nonnegative for every \(v\) iff the moment matrix is psd.
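A small numpy illustration of the lemma: build the moment matrix \(\wt\E[(1,x)^{\ot d/2}((1,x)^{\ot d/2})^T]\) for a finitely supported (possibly signed) \(\mu\) with total mass 1 and test psd-ness. The points and weights below are hypothetical.

```python
import numpy as np

# Moment matrix of a finitely supported signed measure mu with total mass 1;
# mu is a degree-d pseudo-distribution iff this matrix is psd.
def moment_matrix(points, weights, d):
    total = 0.0
    for x, wt in zip(points, weights):
        v = np.array([1.0, *x])
        for _ in range(d // 2 - 1):
            v = np.kron(v, np.array([1.0, *x]))      # (1, x)^{tensor d/2}
        total = total + wt * np.outer(v, v)
    return total

pts = [(0, 0), (0, 1), (1, 0), (1, 1)]
good = moment_matrix(pts, [0.25, 0.25, 0.25, 0.25], d=4)   # an actual distribution
bad  = moment_matrix(pts, [0.5, 0.5, 0.5, -0.5],   d=4)    # signed, total mass 1
print(np.linalg.eigvalsh(good).min())   # ~ 0 up to rounding: psd, valid pseudo-distribution
print(np.linalg.eigvalsh(bad).min())    # negative: E~[(x_1 x_2)^2] = -0.5, so not one
```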
Write \(\mu\vDash_l A\) (\(\mu\) satisfies \(A=\{f_i\ge0\}\) at degree \(l\)) if for every \(S\subeq [m]\) and every sum of squares \(h\) with \(\deg h + \sum_{i\in S} \max(\deg f_i,l)\le d\) (where \(d=\deg\mu\)), we have \(\wt \E_\mu h \prod_{i\in S}f_i\ge 0\). (E.g., for \(A=\{f\ge 0\}\), take \(S=\{f\}\).)
Question: how does this mesh with \(B^n\) definition?
Theorem (Duality). Suppose the constraint \(\ve{x}^2\le M\) is in \(A\). For all \(d\in \N\) and \(f\in \R[x]_{\le d}\), either there is a degree-\(d\) SoS proof \(A\vdash_d\{f\ge 0\}\), or there is a degree-\(d\) pseudodistribution \(\mu\vDash_d A\) with \(\wt{\EE_\mu} f\le 0\) (up to an arbitrarily small additive error in either alternative).
Proof. Consider 2 cases.
Think of \(\vdash\) as a proof, and \(\mu \vDash\) as saying \(\mu\) is a model.
Lemma
Theorem (SoS algorithm).
Planted clique: Take \(G(n,\rc 2)\), add a random \(\om\)-clique.
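A minimal sketch of sampling the planted model \(G(n,\rc2,\om)\) (numpy; parameters hypothetical):

```python
import numpy as np

# Planted clique instance: sample G(n, 1/2), then add a clique on omega random vertices.
def planted_clique(n, omega, seed=0):
    rng = np.random.default_rng(seed)
    A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
    A = A + A.T                                    # adjacency matrix of G(n, 1/2)
    S = rng.choice(n, size=omega, replace=False)   # planted vertex set
    A[np.ix_(S, S)] = 1                            # turn S into a clique
    np.fill_diagonal(A, 0)
    return A, S

A, S = planted_clique(n=200, omega=20)
```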
Whp the max clique of \(G(n,\rc 2)\) has size \(c\ln n\) for a constant \(c\), so for planted cliques of size \(\Om(\ln n)\) there is a quasipolynomial-time algorithm.
Theorem (SoS hardness). Let \(d(n)\) be a function. There is \(c\) such that for \(\om = n^{\rc 2 - c\pf{d}{\ln n}^{\rc 2}}\) and large enough \(n\), wp \(1-\rc n\) over \(G\sim G(n,\rc 2)\), there is a degree \(d\) pseudodistribution \(\mu\) over \(B^n\) that is consistent with \(x_ix_j=0\) for every \(i\nsim j\) in \(G\) and such that \(\wt{\EE_{\mu}} \sumo in x_i\ge \om\).
In contrast with Max-3XOR, in planted clique, each variable has a weak but GLOBAL effect on all other variables.
Definition. A degree-\(d\) pseudoexpectation map is a map \(G\mapsto \mu_G\) with \(\deg (\mu_G)=d\). It is pseudocalibrated with respect to \(f\) if \[\EE_{G\sim G(n,\rc 2)} \wt{\EE_{\mu_G}} f_G(x) = \EE_{(G,x)\sim G(n,\rc 2,\om)}f_G(x).\] (\(x\) is the characteristic vector of the planted clique.)
(Q: Why is the LHS \(G(n,\rc2)\) rather than \(G(n,\rc 2,\om)\)?)
Example: If \(f_G(x)=x_{17}\), then a first estimate could be \(\wt E x_{17}=\fc{\om}n\). But if we want to match higher-degree functions as well (e.g., the covariance of \(x_i\) with the degree of vertex \(i\)), then we should not set each \(\wt E x_i\) to exactly \(\fc{\om}n\); it has to depend on \(G\).
Proof (Sketch). Let \(\mu(G,x) = \mu_{planted}(G,x)_{\deg_x\le d, \deg_G\le \tau}\), the truncation of the planted density to low degree in \(x\) and in the edge indicators of \(G\). Show it still satisfies normalization, etc. The hard part is psd-ness.
Finish
Tensor decomposition with and without SoS
Plan
Problem
Application
Algorithm (high-level)
Estimate \(\sum a_i^{\ot 3}\) from samples
Random contraction algorithm (Jennrich)
Consider the orthogonal case first. Note that the most difficult case is if all the norms are the same (if they are all different, we can use SVD).
… (see notebook)
Decompose the random Gaussian along 2 directions. If the correlation is large, we are more likely to recover \(e_1\).
Claim. \(\E_g \ve{(I^{\ot 2}\ot g^T)X} \le \sqrt{\ln d} \si\) where \(\si = \max\{\ve{X}_{13,2},\ve{X}_{1,23}\}\).
This is a linear function of \(g\), so we can write it in the form required by the concentration inequality.
\[\begin{align} (I^{\ot 2} \ot g^T)X &=\sum g_i \ub{(I^{\ot 2} \ot e_i^T)X}{A_i}\\ \ve{X} &=\ve{X^TX}^{\rc 2} = \ve{XX^T}^{\rc 2}\\ \sum A_iA_i^T &= X_{3,12}^T X_{3,12}\\ \sum A_i^TA_i &= X_{1,23}^T X_{1,23} \end{align}\](check the indices).
Davis–Kahan: \(\ve{A-aa^T}\le o(1) \ve{a}^2\) implies that the top eigenvector of \(A\) is \(o(1)\)-close to \(a\). So the top eigenvector of \((I\ot I\ot a^T)X\) is close to \(a\).
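A numpy sketch of Jennrich / random contraction in the orthogonal case (orthonormal components; the \(a_i\) here are hypothetical data): contract the tensor with a random Gaussian vector and read the components off the eigenvectors of the resulting matrix.

```python
import numpy as np

# Random contraction (Jennrich) for T = sum_i a_i^{(x)3} with orthonormal a_i.
rng = np.random.default_rng(0)
d, r = 8, 5
A, _ = np.linalg.qr(rng.standard_normal((d, r)))      # columns a_1, ..., a_r (orthonormal)
T = np.einsum('pi,qi,ri->pqr', A, A, A)               # sum_i a_i (x) a_i (x) a_i

g = rng.standard_normal(d)
M = np.einsum('pqr,r->pq', T, g)                      # (I (x) I (x) g^T) T = sum_i <a_i,g> a_i a_i^T
vals, vecs = np.linalg.eigh(M)
idx = np.argsort(-np.abs(vals))[:r]                   # eigenvalues <a_i, g> are distinct a.s.
A_hat = vecs[:, idx]                                  # recovered components, up to sign/order
print(np.sort(np.abs(A_hat.T @ A).max(axis=1)))       # all entries ~ 1: recovery succeeded
```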
Theorem. Let \(a_1,\ldots, a_n\in \R^d\), \(\ve{a_i}=1\). Suppose \(\sum a_i a_i^T \preceq \si I\) (\(\si\) measures overcompleteness) and \(\max_{i\ne j} |\an{a_i,a_j}|\le \rh\). Then, given \(\sumo in a_i^{\ot3}\), we can recover vectors close to \(a_1,\ldots, a_n\) in polynomial time whenever \(\rh \si^2=o(1)\).
E.g., \(\si=d^{0.1}\), \(\rh = d^{-0.3}\). This is satisfied by \(d^{1.1}\) random unit vectors, or by \(n=d^{1.24}\) random unit vectors, where \((\si,\rh)=(d^{0.24},d^{-0.5})\).
This algorithm can be poly in \(d\) and \(n\).
Open question: Is this true whenever \(\rh\si =o(1)\)? This would cover \(d^{1.5}\) random vectors. (Though there is another argument that can handle up to \(d^{1.5}\).) (Maybe with a \(-\ep\) in the exponent.)
Algorithm: given \(M_3\), use SoS to find a tensor, living in a higher-dimensional space, on which we can run random contraction.
Under the conditions of the theorem, let \(b_i=a_i^{\ot 2}\); one can show that random contraction works for \(X=\sum_i b_i^{\ot 3}\). (Show \(\ve{X}_{12,3}\le O(1)\) if \(\rh\si^2\ll 1\).) Recover the \(b_i\)’s, then recover the \(a_i\)’s. But we aren’t given \(\sum_i b_i^{\ot 3}\); we are only given \(\sum_i a_i^{\ot 3}\). This is where SoS comes in: given \(\sum_i a_i^{\ot 3}\), it computes \(\sum_i b_i^{\ot 3}\) by finding a pseudodistribution matching the 3rd moments and outputting its 6th moments.
The intended \(\mu\) is uniform over \(a_1,\ldots, a_n\): \(\wt \EE_{\mu} x^{\ot 3} = \rc n \sum a_i^{\ot 3}\) and \(\wt\EE_\mu x^{\ot 6} = \rc n \sum a_i^{\ot 6} = \rc n X\).
Philosophy: first prove the statement for actual distributions, then verify that the steps fall within the SoS proof system.8
When are 3rd moments enough for overcomplete tensors, \(n\gg d\)? Robustness against noise in input tensor?
We use lower-degree moments, have better error robustness, and develop new improved analyses of classical tensor decomposition algorithms.
SoS a priori not practical but we can extract practical algorithms
Dream up higher-degree moments and apply classical algorithms.
Algorithm: A magic box lifts 3rd moments to 6th moments, \(M_3=\sum_i a_i^{\ot 3}\) to \(M_6=\sum_i a_i^{\ot 6}\). Then apply Jennrich to \(M_6\) to get the \(a_i^{\ot 2}\).
It’s not even clear this is information theoretically possible!
Ideal implementation: find \(D\) over unit sphere subject to \(\EE_D x^{\ot 3} = \rc n M_3\), then use \(\EE_D x^{\ot 6}\). Claim is that this is \(\approx \rc n M_6\).
Key inequality: by niceness of the vectors, \(a_1,\ldots, a_n\) are approximate global maximizers of \(P(x)=\sumo in \an{a_i,x}^3\). For all \(x\in \bS^{d-1}\), \[\begin{align} P(x) &\le \max_{i\in [n]} \an{a_i,x} + O(\rh\si^2),\\ 1 &\approx \EE_D \sumo in \an{a_i,x}^3,\\ \EE_{D} P(x) &\ge 1-o(1). \end{align}\]So \(D\) is also close to uniform over the \(a_i\).
Remaining questions:
SoS is slow. Instead, find a direct algorithm to fool Jennrich: \[ \sum_{ijk} \an{a_i,a_k}\an{a_j,a_k} (a_i\ot a_j) \ot (a_i\ot a_j) \ot a_k\approx \sumo in a_i^{\ot 5} = M_5. \]
Tensor structure to implement Jennrich in time \(O(d^{1+\om})\).
Open: 3-tensors with \(d^{1+\ep}\) generic components? \(d^{1.5+\ep}\) random components?
Tensor power iteration: fewer guarantees, but does better in the amount of time needed to drive down the error.
There is another way to relax this kind of problem, by changing the values \(x_i\) could take from \(\pm1\) to vectors, like in Goemans-Williamson. We don’t consider this kind of relaxation here. (Though we can do GW here too…?)↩
Not really. I’m confused here: why don’t we factor through \((\R[x]/I)_{\le l}\) instead of \((\R[x]/I)_{\le l/2}\times (\R[x]/I)_{\le l/2}\)?↩