(AGMM15) Simple, efficient, and neural algorithms for sparse coding

Posted: 2016-03-25 , Modified: 2016-03-25

Tags: dictionary learning, sparse coding

Problem/Model and Assumptions

Given many independent samples \(y=A^*x^*+N\), recover \((A^*,x^*)\). Here,

  1. \(A^*\in \R^{n\times m}\) is a fixed matrix. We assume:
    • \(A^*\) has unit norm columns.
    • (Norm is like that of a random matrix) \(\ve{A^*}=O\pa{\sfc{m}{n}}\).
    • \(A^*\) is \(\mu\)-incoherent, i.e., \(|\an{A_i^*,A_j^*}|\le \fc{\mu}{\sqrt n}\) for \(i\ne j\).
  2. \(x^*\) is drawn from a sparse distribution. Specifically, \(x^*\sim D\) where
    • Always, \(|\Supp(x^*)|\le k\), i.e., \(x^*\) is \(k\)-sparse.
    • \(\Pj(i\in S)=\Te\pf{k}{m}\).
    • \(\Pj(i,j\in S) = \Te\pf{k^2}{m^2}\).
    • \(\E[x_i^*|x_j^*\ne 0]=0\) (why do we care about the conditioning here?) and \(\E[x_i^{*2}| x_i^*\ne0]=1\).
    • When \(x_i^*\ne 0\), \(|x_i^*|\ge C\) for some constant \(C\le 1\). (This is an unreasonable assumption, but it makes sure the thresholding doesn’t cut off actual coordinates.)
    • Conditioned on the support, the nonzero entries are pairwise independent and subgaussian. Subgaussian means \(\exists C,v>0, \Pj(|X|>t)\le Ce^{-vt^2}\) or equivalently \(\exists b, \E e^{tX}\le e^{b^2t^2/2}\).
  3. The noise \(N\) is Gaussian and independent.1
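A toy sampler for this model, in Python. The sizes \(n,m,k\), the noise level, and the Gaussian-plus-shift prior on the nonzero entries are illustrative assumptions, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 64, 256, 5        # illustrative dimensions (assumption)
sigma = 0.01                # illustrative noise magnitude (assumption; cf. footnote 1)

# A fixed dictionary with unit-norm, random (hence incoherent-looking) columns.
A_star = rng.standard_normal((n, m))
A_star /= np.linalg.norm(A_star, axis=0)

def sample(p):
    """Draw p samples y = A* x* + N with k-sparse x*."""
    X = np.zeros((m, p))
    for t in range(p):
        S = rng.choice(m, size=k, replace=False)   # uniform support: Pr(i in S) = k/m
        vals = rng.standard_normal(k)              # zero-mean, subgaussian nonzeros
        vals += np.sign(vals)                      # crude way to force |x_i| >= C = 1
        X[S, t] = vals
    Y = A_star @ X + sigma * rng.standard_normal((n, p))  # Gaussian noise
    return Y, X

Y, X = sample(1000)
```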

Algorithm

The algorithm is alternating minimization. In alternating minimization, we want to minimize a function \(f(x,y)\). It is difficult to minimize with respect to \((x,y)\) together (e.g., because \(f\) is nonconvex when taken as a joint function of \(x,y\)) but easy to minimize with respect to \(x\) or \(y\) while keeping the other fixed (e.g., because \(f\) is convex in \(x\) and in \(y\)). Alternate between updating \(x\) with \(y\) held fixed and updating \(y\) with \(x\) held fixed.

For example, we can take gradient steps.

There are no results for general nonconvex \(f\); theoretical results tend to be ad hoc (for specific \(f\)). AGMM give a general technique, approximate gradient descent, that can be used to analyze alternating minimization algorithms.
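As a schematic of the generic pattern (the update arguments are placeholders for either an exact minimization or a gradient step; nothing here is specific to sparse coding):

```python
def alternating_minimization(update_x, update_y, x0, y0, iters=50):
    """Generic alternation: improve x with y fixed, then y with x fixed."""
    x, y = x0, y0
    for _ in range(iters):
        x = update_x(x, y)   # e.g. argmin_x f(x, y), or a gradient step in x
        y = update_y(x, y)   # e.g. argmin_y f(x, y), or a gradient step in y
    return x, y
```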

Version 1

The algorithm repeats two steps: decode \(x^{(i)} = \mathrm{threshold}_{C/2}((A^s)^T y^{(i)})\), where \(\mathrm{threshold}_{C/2}\) zeroes out all entries of magnitude at most \(C/2\); then update \(A^s\) by a step of size \(\eta\) in the direction \(\rc p \sumo ip (y^{(i)} - A^s x^{(i)}) \sign(x^{(i)})^T\).

The decoding step is doing a minimization with respect to \(x\)2. Intuitively, since \(A^*\) is incoherent (random-like), \((A^*)^TA^*\approx I\), so multiplying by \((A^*)^T\) has the effect of inverting. The threshold acts as a denoiser.

Note that we don’t quite take the gradient step: the gradient of \(\ve{y^{(i)} - Ax^{(i)}}^2\) with respect to \(A\) is \(-2(y^{(i)} - Ax^{(i)})(x^{(i)})^T\); in the update we only take the sign of \(x^{(i)}\).3

(There’s no regularization here?)
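A minimal sketch of one decode/update round in Python, with \(\wh g\) taken as the gradient of \(\fc{1}{2}\ve{y-Ax}^2\) in \(A\) with \(x^{(i)}\) replaced by \(\sign(x^{(i)})\), so that \(A-\eta \wh g\) is a descent step; the function names, step size, and batch handling are illustrative, not from the paper:

```python
import numpy as np

def hard_threshold(Z, tau):
    """Zero out entries with |z| <= tau; keep the rest as-is."""
    return np.where(np.abs(Z) > tau, Z, 0.0)

def version1_round(A, Y, eta=0.1, C=1.0):
    """One decode/update round on a batch Y of shape (n, p)."""
    # Decode: A^T A ~ I for an incoherent A near A*, so A^T y ~ x* on the support;
    # thresholding at C/2 kills the small off-support and noise terms.
    X = hard_threshold(A.T @ Y, C / 2)
    p = Y.shape[1]
    # Approximate gradient of (1/2)||y - Ax||^2 in A, with x replaced by sign(x).
    g_hat = -(Y - A @ X) @ np.sign(X).T / p
    return A - eta * g_hat
```

Iterating `version1_round` over fresh batches of samples is the whole of Version 1.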

Version 2

A better version of the above is to use an actual gradient step (Olshausen-Field), rather than just the sign of \(x\). Let \(A'=A - \eta g\) where \(g=\E [(y-Ax) x^T]\). The complication is that this may not preserve \((\de,2)\)-nearness, so the update rule is replaced by

\[\begin{align} \wh g^s &=\rc p \sumo ip (y^{(i)} - A^{(s)}x^{(i)}) \sign(x^{(i)})^T\\ A^{s+1} &= \Proj_B(A^s - \eta \wh g^s) \end{align}\]

where \(B\) is the set of \((\de_0,2)\)-close matrices to \(A^*\).
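A sketch of the projected variant, reusing `hard_threshold` from the previous sketch; `proj_B` is left abstract (in the analysis \(B\) is defined relative to \(A^*\)), and \(\wh g\) is again taken as the gradient of \(\fc{1}{2}\ve{y-Ax}^2\) in \(A\), here with \(x^{(i)}\) itself as in the Olshausen-Field description above:

```python
def version2_round(A, Y, proj_B, eta=0.1, C=1.0):
    """One projected Olshausen-Field-style round on a batch Y of shape (n, p)."""
    X = hard_threshold(A.T @ Y, C / 2)   # same decoding as Version 1
    p = Y.shape[1]
    g_hat = -(Y - A @ X) @ X.T / p       # gradient of (1/2)||y - Ax||^2 in A
    return proj_B(A - eta * g_hat)       # project back onto the set B of close matrices
```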

Instantiation

The algorithms above require a matrix that is \(O^*\prc{\log n}\)-close to the true matrix. Algorithm 3 gives such a matrix.

Approximate gradient descent

A general condition for a “descent”-type algorithm to work is that we make a step correlated with the direction to the optimum each time.

Definition: \(g\) is \((\al,\be,\ep)\)-correlated with \(z^*\) if \[\an{g,z-z^*} \ge \al\ve{z-z^*}^2 + \be \ve{g}^2 - \ep.\]

An algorithm which makes correlated steps has geometric convergence. For example, gradient descent on strongly convex, smooth functions gives geometric convergence.

Proposition: If \(f\) is \(2\al\)-strongly convex and \(\rc{2\be}\)-smooth, then \(\nb f(z)\) is \((\al,\be,0)\)-correlated with \(z^*\).

Proof: Note that \(L\)-smoothness implies \(f(x) - f(x^*) \ge \rc{2L} \ve{\nb f(x)}^2\) when \(x^*\) is the minimizer (cf. the gradient descent lemma). Then \[\begin{align} \an{g,x-x^*} &\ge f(x)-f(x^*) + \al\ve{x-x^*}^2 && \text{(strong convexity)}\\ &\ge \be \ve{\nb f}^2 + \al\ve{x-x^*}^2 && \text{(smoothness)}. \end{align}\]
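For a concrete example, take \(f(z)=\al\ve{z-z^*}^2\), which is \(2\al\)-strongly convex and \(2\al\)-smooth, so \(\be=\rc{4\al}\). Then \(\nb f(z)=2\al(z-z^*)\) and \[\an{\nb f(z),z-z^*} = 2\al\ve{z-z^*}^2 = \al\ve{z-z^*}^2 + \rc{4\al}\ve{2\al(z-z^*)}^2,\] so \(\nb f\) is \((\al,\rc{4\al},0)\)-correlated with \(z^*\) (with equality).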

Theorem (Theorem 6): If \(g\) is \((\al,\be,\ep)\)-correlated with \(z^*\) and \(z'=z-\eta g\) and \(0<\eta\le 2\be\), then \[\ve{z'-z^*}^2 \le (1-2\al \eta)\ve{z-z^*}^2 + 2\eta \ep.\] If \(z_{t+1}=z_t - \eta g_t\) and each \(g_t\) is \((\al,\be,\ep)\)-correlated with \(z^*\), then \[\ve{z_t-z^*}^2\le (1-2\al \eta)^t \ve{z_0-z^*}^2 + \fc{\ep}{\al}.\]

(Note the usual gradient step size for \(\rc{2\be}\)-smooth functions is \(2\be\).)

Proof: Break up \(z'-z^* = (z-z^*) - \eta g\) in order to use the correlation. \[\begin{align} \ve{z'-z^*}^2 &=\ve{z-z^*}^2 - 2\an{z-z^*, \eta g} + \eta^2\ve{g}^2\\ &\le \ve{z-z^*}^2 - 2\eta (\al\ve{z-z^*}^2+\be \ve{g}^2-\ep) + \eta^2\ve{g}^2\\ &= (1-2\eta \al)\ve{z-z^*}^2 +\ub{ \eta (\eta - 2\be) \ve{g}^2}{\le 0} + 2\eta \ep. \end{align}\]
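Unrolling the one-step bound gives the multi-step statement: writing \(a_t=\ve{z_t-z^*}^2\), \[a_t \le (1-2\al\eta)^t a_0 + 2\eta\ep \sum_{i=0}^{t-1}(1-2\al\eta)^i \le (1-2\al\eta)^t a_0 + \fc{\ep}{\al},\] i.e., geometric decay down to an \(O(\ep/\al)\) floor.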

In actuality, we’ll need a weaker form of this, “\((\al,\be,\ep_s)\)-correlated with high probability” (Definition 38). There is an analogue of the theorem in this setting (Theorem 40).

Theorems

Say \(A\) is \((\de,\ka)\)-near to \(A^*\) if

  • \(\ve{A_i - A_i^*}\le \de\) for all \(i\)5, and
  • \(\ve{A - A^*}\le \ka \ve{A^*}\).

In detail: Let \(\de = O^*\prc{\log n}\). (Here \(x=O^*(y)\) means there exists \(c\) such that the statement is true for \(x\le cy\).4)

Proofs

We cheat in this proof; for the correct version see Appendix D. We calculate \(\E g\), the expected value of the gradient step, and pretend that is the actual gradient step (this is OK for infinite sample size). For the real version, we’d have to consider concentration around \(\E g\).

The strategy is to show that step is correlated with the actual minimum, and then use the theorem on approximate gradient descent. First, we need to show that the minimum \(x\) will have the right support with high probability.
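Roughly, why the decoding recovers the support: ignoring noise, for \(i\in S\), \[\an{A_i, y} = \an{A_i, A_i^*} x_i^* + \sum_{j\in S,\, j\ne i} \an{A_i, A_j^*} x_j^*.\] The first coefficient is close to \(1\) since \(A_i\) is \(\de\)-close to \(A_i^*\), so the first term is \(\approx x_i^*\), which has magnitude \(\ge C\). Each cross-term coefficient is at most \(\fc{\mu}{\sqrt n}\) plus a \(\de\)-sized error, and since the nonzero \(x_j^*\) are zero-mean, pairwise independent, and subgaussian, the sum of the \(\le k\) cross terms stays below \(C/2\) with high probability. So coordinates in \(S\) survive the threshold, and a similar bound shows coordinates outside \(S\) are cut off.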

Initialization

Bird’s eye view

Questions


  1. What is the magnitude? I think this is responsible for the \(\ga\)’s that appear, but which are swept under the rug.

  2. Is it the minimum?

  3. Is this important, or just for simplicity of analysis?

  4. This is to disambiguate from the other meaning of \(O\), which is “whenever \(x\le cy\) for some \(c\), there are constants such that the result holds.”

  5. Are we implicitly reordering the columns here?

  6. Do we need concentration too?