(HM16) A non-generative framework and convex relaxations for unsupervised learning

Posted: 2016-12-27 , Modified: 2016-12-27

Tags: unsupervised, convex relaxation

Framework

Given

  1. Instance domain \(X\) and target space \(Y\) (ex. \(\R^d\), \(\R^k\), \(d\gg k\)).
  2. Unknown distribution \(D\) on domain \(X\). (NOT assumed to be generated by a function in \(H\).)
  3. Hypothesis class of encoding and decoding pairs \[ H \subeq \{X\to Y\}\times \{Y\to X\}.\]
  4. Loss function \(l:H\times X\to \R_{\ge0}\). (Ex. the \(\ell_2\) reconstruction loss \(\ve{g(h(x))-x}_2^2\).)
    • Define \[ \text{loss}_D(f) = \EE_{x\sim D} l(f,x). \]
    • Define the sample loss \[ \text{loss}_S(f) = \rc m \sum_{x\in S}l(f,x) \] for a sample \(S\) of size \(m\). (A small numeric sketch follows this list.)
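A minimal numeric sketch of the sample loss with the \(\ell_2\) reconstruction loss; the toy encoder/decoder below (coordinate projection) is an illustration only, not something from the paper.

```python
import numpy as np

def sample_loss(h, g, S):
    """loss_S((h, g)) = (1/m) * sum_{x in S} ||g(h(x)) - x||_2^2."""
    return np.mean([np.sum((g(h(x)) - x) ** 2) for x in S])

# Toy (h, g): keep the first k coordinates, pad the rest with zeros.
d, k, m = 10, 3, 100
S = np.random.randn(m, d)
h = lambda x: x[:k]                                   # encoder X -> Y
g = lambda y: np.concatenate([y, np.zeros(d - k)])    # decoder Y -> X
print(sample_loss(h, g, S))
```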

\((D,X)\) is \((k,\ga)\)-\(C\)-learnable with respect to \(H\) if there is an algorithm that, given \(\de,\ep>0\) and after seeing \(m(\ep, \de) = \poly\pa{\rc \ep, \ln \prc{\de}, d}\) examples, returns \((h,g)\) (NOT necessarily in \(H\)) such that

  1. With probability \(\ge 1-\de\), \[\text{loss}_D((h,g)) \le \min_{(h',g')\in H} \text{loss}_D((h',g')) + \ep + \ga.\]
  2. The encoding \(h(x)\) has an explicit representation of length \(\le k\) bits.

(\(\ga\) is the bias.)

(Q: how to understand “\(\le k\)” bits?)

Generalization theory

\[\begin{align} \wh f_{ERM} &= \amin_{f\in H} \text{loss}_S(f)\\ R_{S,l}(H) &= \EE_{\si\sim \{\pm 1\}^m} \ba{ \sup_{f\in H} \rc m \sumo im \si_i l(f,x_i) }\\ \Pj\pa{ \text{loss}_D(\wh f_{ERM}) \le \min_{f\in H} \text{loss}_D(f) + 6 R_m(H) + \sfc{4\ln \prc{\de}}{2m}}&\ge 1-\de, \end{align}\] where \(S=(x_1,\ldots,x_m)\) and \(R_m(H) = \EE_{S\sim D^m} R_{S,l}(H)\).
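A hedged sketch of estimating the empirical Rademacher complexity \(R_{S,l}(H)\) by Monte Carlo over the sign vector \(\si\), for a small finite hypothesis class stored as a loss matrix (so the sup is a max); the class and numbers are illustrative only.

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_trials=2000, seed=0):
    """Estimate R_{S,l}(H) = E_sigma sup_{f in H} (1/m) sum_i sigma_i l(f, x_i),
    where loss_matrix[f, i] = l(f, x_i) for a finite class H."""
    rng = np.random.default_rng(seed)
    n_hyp, m = loss_matrix.shape
    estimates = []
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)       # Rademacher signs
        estimates.append(np.max(loss_matrix @ sigma) / m)
    return float(np.mean(estimates))

# Toy usage: 5 hypotheses, 50 samples, losses in [0, 1].
L = np.random.default_rng(1).random((5, 50))
print(empirical_rademacher(L))
```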

PCA

For \(H_k^{pca}\), \[\begin{align} X&=\R^d\\ Y&=\R^k\\ h(x) &= Ax, & A&\in \R^{k\times d}\\ g(y) &= A^+y\\ l((h,g),x) & = \ve{(A^+A - I)x}^2. \end{align}\]

For \(H_{k,s}^{pca}\), replace \(d\) by \(d^s\), \(x\) by \(x^{\ot s}\).

\(H_{k,s}^{pca}\) is efficiently \((k,0)\) C-learnable.

(Proof: Use SVD. Rademacher complexity is \(O\pf{d^s}{m}\).)
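A minimal sketch of the \(s=1\) case via SVD, assuming the loss above: the rows of \(A\) are the top-\(k\) right singular vectors of the sample matrix, so \(A^+=A^T\) and \(g(h(x))=A^TAx\).

```python
import numpy as np

def pca_pair(S, k):
    """Fit H_k^{pca}: h(x) = A x, g(y) = A^+ y = A^T y, where the rows of A are
    the top-k right singular vectors of the m x d sample matrix S."""
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    A = Vt[:k]                     # k x d with orthonormal rows, so A^+ = A^T
    return (lambda x: A @ x), (lambda y: A.T @ y)

m, d, k = 200, 10, 3
S = np.random.randn(m, k) @ np.random.randn(k, d) + 0.01 * np.random.randn(m, d)
h, g = pca_pair(S, k)
print(np.mean([np.sum((g(h(x)) - x) ** 2) for x in S]))   # mean ||(A^T A - I)x||^2
```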

Spectral autoencoder

For \(H_{k,s}^{sa}\), \[\begin{align} h(x) &= Ax^{\ot s},& A&\in \R^{k\times d^s}\\ g(y) &= v_{\max}(By),& B&\in \R^{d^s\times k} \end{align}\]

where \(v_{\max}(By)\) is the top eigenvector of \(By\) viewed as an order-\(s\) tensor (for \(s=2\), the top eigenvector of the \(d\times d\) matrix \(M(By)\)).
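For \(s=2\), a hedged sketch of the two maps: encode \(x\mapsto A\,\mathrm{vec}(xx^T)\), decode by reshaping \(By\) into a \(d\times d\) matrix and returning its top eigenvector. With random \(A,B\) as below nothing is recovered; the point is only the shape of the computation.

```python
import numpy as np

d, k = 8, 4
A = np.random.randn(k, d * d)        # encoder matrix, k x d^2
B = np.random.randn(d * d, k)        # decoder matrix, d^2 x k

def encode(x):
    return A @ np.kron(x, x)         # h(x) = A x^{tensor 2}

def decode(y):
    M = (B @ y).reshape(d, d)        # matrix view M(By)
    M = (M + M.T) / 2                # symmetrize before eigendecomposition
    w, V = np.linalg.eigh(M)
    return V[:, np.argmax(np.abs(w))]  # v_max(By): top eigenvector, up to sign/scale

x = np.random.randn(d)
y = encode(x)                        # k-dimensional code
x_hat = decode(y)                    # unit vector in R^d
print(y.shape, x_hat.shape)
```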

For \(s=2\):

\(D\) is \((k,\ep)\)-regularly spectrally decodable if there exist \(A\in \R^{k\times d^2}\), \(B\in \R^{d^2\times k}\) with \(\ve{BA}\le \tau\) such that for \(x\sim D\), \[ M(BAx^{\ot 2}) = xx^T + E, \quad \ve{E}\le \ep.\] Here \(M(\cdot)\) denotes the \(d\times d\) matrix reshaping of a vector in \(\R^{d^2}\).

(Q: what is the order of quantifiers for \(\tau\)?)

\(H_{k,2}^{sa}\) is \(\pa{O\pf{\tau^4k^4}{\de^4}, \de}\)-\(C\)-learnable in polynomial time.

Proof.

  1. Rademacher complexity bound for \(\Phi = \set{\ve{Rx^{\ot 2}-x^{\ot 2}}}{\ve{R}_{S_1}\le \tau k}\): \[R_m(\Phi) \le 2\tau k \sfc1m.\] (CHECK. Omitted.)
    • \(L^2\) bound on \(x\) gives \(L^1\) bound on \(x^{\ot 2}\), which is why we need \(S_1\) bound.
  2. \(f(R) := \E[\ve{Rx^{\ot 2}-x^{\ot 2}}]\) is convex and 1-Lipschitz.
    • Proof. \((u\ot v)(x^{\ot 2})^T\) is a subgradient, where \(u,v\in \R^d\) are the top left/right singular vectors of \(M(Rx^{\ot 2} - x^{\ot 2})\).
  3. Optimization: use the (non-smooth) Frank-Wolfe algorithm (CHECK); a sketch follows this list.
    • Each update is rank 1.
    • \(R^* = BA\) is a feasible solution with \(\ve{R^*}_{S_1}\le \rank(R^*)\ve{R^*}\le \tau k\), so the feasible set \(\{R: \ve{R}_{S_1}\le \tau k\}\) contains \(R^*\) and has diameter \(O(\tau k)\).
    • \(f\) is 1-Lipschitz.
    • In \(O\pf{\tau^4k^4}{\de^4}\) steps get \(\wh R\) with \(f(\wh R)\le \de + \ep\).
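A hedged sketch of Frank-Wolfe over the nuclear-norm ball \(\{\ve{R}_{S_1}\le \tau k\}\), using the rank-one subgradient from step 2 and the standard \(2/(t+2)\) step size. The paper analyzes a non-smooth variant, so this is structural only; all sizes are toy.

```python
import numpy as np

def subgradient(R, X):
    """Subgradient of f(R) = E ||M(R z - z)||_spec over the sample X, z = x kron x:
    average of (u kron v) z^T with u, v the top singular vectors of M(Rz - z)."""
    d = X.shape[1]
    G = np.zeros_like(R)
    for x in X:
        z = np.kron(x, x)
        U, _, Vt = np.linalg.svd((R @ z - z).reshape(d, d))
        G += np.outer(np.kron(U[:, 0], Vt[0]), z)
    return G / len(X)

def frank_wolfe(X, tau_k, n_steps=100):
    """Conditional gradient over {||R||_{S_1} <= tau_k}; the linear minimization
    oracle returns a rank-one vertex, so each update is rank one."""
    d2 = X.shape[1] ** 2
    R = np.zeros((d2, d2))
    for t in range(n_steps):
        G = subgradient(R, X)
        U, _, Vt = np.linalg.svd(G)
        S = -tau_k * np.outer(U[:, 0], Vt[0])   # argmin_{||S||_{S_1} <= tau_k} <G, S>
        gamma = 2.0 / (t + 2)
        R = (1 - gamma) * R + gamma * S
    return R

X = np.random.randn(50, 4)                      # 50 samples in R^4 (so R is 16 x 16)
R_hat = frank_wolfe(X, tau_k=3.0, n_steps=50)
```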

Dictionary learning

\[\begin{align} h_A(x) &= \amin_{\ve{y}_\be\le k} \rc d |x-Ay|_1\\ g_A(y) &= Ay\\ A&= \amin_{\ve{A}_\al \le c_\al} \EE_{x\sim D} \ba{ \min_{y\in \R^r: \ve{y}_\be \le k} \rc d |x-Ay|_1 }. \end{align}\]

Ex. \(\ved_\al\) is the max column norm (\(\ell_2\) or \(\ell_\iy\)) and \(\ved_\be\) is \(\ell_0\).

Here, use the max column \(\ell_\iy\) norm (i.e., the \(\ell_1\to \ell_\iy\) operator norm) for \(A\) and the \(\ell_1\) norm for the code \(y\). (Q: How does this compare to the usual setting? The \(\ell_1\) constraint is looser than \(\ell_0\), OK. \(\ell_\iy\) is also looser than \(\ell_2\).)

For any \(\de>0, p\ge 1\), \(H_k^{dict}\) is C-learnable with encoding length \(\wt O\pf{k^2 r^{\rc p}}{\de^2}\), bias \(\de+ O(\ep^*)\), and sample complexity \(d^{O(p)}\) in time \(n^{O(p^2)}\).
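A hedged sketch of just the relaxed encoding step \(h_A(x)\) for a fixed dictionary \(A\), with \(\be = \ell_1\), written as a small convex program in cvxpy; learning \(A\) itself (the outer minimization over \(\ve{A}_\al\le c_\al\)) is not shown.

```python
import cvxpy as cp
import numpy as np

def encode(A, x, k):
    """h_A(x): minimize (1/d)|x - A y|_1 subject to |y|_1 <= k."""
    d, r = A.shape
    y = cp.Variable(r)
    cp.Problem(cp.Minimize(cp.norm1(x - A @ y) / d), [cp.norm1(y) <= k]).solve()
    return y.value

d, r, k = 20, 40, 3
A = np.clip(np.random.randn(d, r), -1, 1)   # entries in [-1, 1]: l1 -> l_inf norm <= 1
y_true = np.zeros(r); y_true[:k] = 1.0      # a k-sparse code with |y_true|_1 = k
x = A @ y_true
print(np.round(encode(A, x, k), 2))         # any optimum reconstructs x exactly; it need not equal y_true
```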

Interpretation:

Note this isn’t doing “dictionary learning” in the usual sense of finding the dictionary. It only shows that you can reduce dimension and still preserve information. It does not “undo” the dictionary multiplication and give a useful representation, only a succinct one.

Group encoding and decoding

This is a simpler algorithm that achieves the weaker goal of encoding multiple examples at a time.

Given \(N\) points \(X\in \R^{d\times N}\), \(X\sim D^N\) (columns drawn i.i.d.), and a convex set \(Q\):

  1. Group encoding \(h(X)\):
    • Compute \(Z=\amin_{C\in Q} |X-C|_1\).
    • Randomly sample a set of entries \(\Om\), keeping each entry independently with probability \(\rh\); output \(Y=P_{\Om}(Z)\).
  2. Group decoding \(g(Y)\):
    • Output \(\amin_{C\in Q} |P_\Om(C)-Y|_1\). (A code sketch of both steps follows.)
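A hedged cvxpy sketch of the two steps. Important caveat: \(Q\) below is a nuclear-norm ball, used only as a simple convex stand-in so the code runs; the paper's \(Q\) is the \(\Ga_{1,\iy,1,1}\) ball (resp. its SoS relaxation \(Q_p^{sos}\) from item 3 below), not this.

```python
import cvxpy as cp
import numpy as np

def project_Q(X, k):
    """argmin_{C in Q} |X - C|_1 for the stand-in Q = {C : nuclear_norm(C) <= k}."""
    C = cp.Variable(X.shape)
    cp.Problem(cp.Minimize(cp.sum(cp.abs(X - C))), [cp.normNuc(C) <= k]).solve()
    return C.value

def group_encode(X, k, rho, rng):
    Z = project_Q(X, k)                                # Z = argmin_{C in Q} |X - C|_1
    Omega = (rng.random(X.shape) < rho).astype(float)  # keep each entry w.p. rho
    return Omega, Omega * Z                            # Y = P_Omega(Z)

def group_decode(Omega, Y, k):
    C = cp.Variable(Y.shape)
    cp.Problem(cp.Minimize(cp.sum(cp.abs(cp.multiply(Omega, C) - Y))),
               [cp.normNuc(C) <= k]).solve()           # argmin_{C in Q} |P_Omega(C) - Y|_1
    return C.value

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 4)) @ rng.standard_normal((4, 30))   # low-rank-ish data
Omega, Y = group_encode(X, k=25.0, rho=0.5, rng=rng)
X_hat = group_decode(Omega, Y, k=25.0)
print(np.abs(X - X_hat).mean())
```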

We need a convex relaxation \(Q\) with \[ \set{g_{A^*}(h_{A^*}(X))}{X\in \R^{d\times N}} \sub \set{AY}{\ve{A}_{\ell_1\to \ell_\iy}\le 1, \ve{Y}_{\ell_1\to \ell_1}\le k}\sub Q \] that has low sampling Rademacher width and is efficiently optimizable.

  1. (Lemma 5.2) If a convex set has low sampling Rademacher width, then we can compress it by random sampling.
  2. Define the factorable “norm” \(\Ga_{\al,\be}(Z) = \inf_{Z=AB}\ve{A}_\al\ve{B}_\be\). \(\Ga_{1,q,1,t}(\cdot) = \Ga_{\ell_1\to \ell_q, \ell_1\to \ell_t}\) is a norm (5.3-4). \(Q_{1,\iy,1,1} = \set{C\in \R^{d\times N}}{\Ga_{1,\iy,1,1}(C)\le k}\) has low SRW (5.5).
  3. This is not efficiently optimizable. Instead consider the SoS relaxation \[\begin{align} Q_p^{sos} &= \set{C\in \R^{d\times N}}{\exists \text{ degree-}O(p^2)\text{ pseudo-expectation $\wt\E$ that satisfies }A(C)}\\ A(C) & = \bc{C=AB}\cup \bc{ \forall i,j,\ B_{ij} = b_{ij}^{p-1},\ \sumo lr b_{lj}^p \le k^{\fc p{p-1}},\ \forall i,j,\ A_{ij}^2\le 1}. \end{align}\] Now redo the proof of low SRW using an SoS proof. Motivation: for \(Q\) we used an inequality involving \(L^1,L^{\iy}\); in order to convert it to an SoS proof, we need to turn this into \(L^{\fc{p}{p-1}},L^p\).