Representation learning

Posted: 2016-07-28 , Modified: 2016-08-13

Tags: none

Representation learning means first finding a good representation—some hidden structure—of the data, and then learning on top of it.

Low-dimensional structure in high-dimensional space

Consider a distribution on a low-dimensional space. Now suppose this low-dimensional space is embedded in a high-dimensional space, and noise is added in the complementary (possibly orthogonal) subspace. The goal is to recover the low-dimensional space and the structure on it.
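
For concreteness, a minimal sketch of this setup in code, using PCA as a baseline recovery method (one possible choice, not the only one):

```python
# Data on an r-dimensional subspace of R^d, plus noise confined to the orthogonal
# complement; recover the subspace from the top-r principal directions.
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 100, 5, 2000
U, _ = np.linalg.qr(rng.normal(size=(d, r)))             # orthonormal basis of the latent subspace
Z = rng.normal(size=(n, r))                              # low-dimensional structure (Gaussian here)
noise = rng.normal(size=(n, d)) @ (np.eye(d) - U @ U.T)  # noise in the complementary subspace
Y = Z @ U.T + 0.1 * noise

_, _, Vt = np.linalg.svd(Y - Y.mean(axis=0), full_matrices=False)
U_hat = Vt[:r].T                                         # estimated basis for the subspace
# Cosines of the principal angles between span(U) and span(U_hat); close to 1 if recovered.
print(np.linalg.svd(U.T @ U_hat, compute_uv=False).min())
```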

Consider a dictionary-learning setting. There are many variations here. What we want to say is the following: given that there is a transformation in a certain class—e.g., a neural net—that makes the data structured, can you learn the structure/parameters?

Thoughts 8-13

The problem here is really about unsupervised learning: given that there is a representation of a certain form on top of which you could train a classifier, etc., find that representation in the absence of labels.

Kernel dictionary learning

Let the \(x\)’s be the data points and \(X=[\Phi(x_1),\ldots,\Phi(x_n)]\) the matrix of their images under \(\Phi\), where \(\Phi\) is such that \(K(x,y) = \an{\Phi(x),\Phi(y)}\). Kernel dictionary learning seeks a factorization \(X=DA\) with \(D\in \R^{\iy\times k}\). But assume \(D=XE\), i.e., each dictionary atom is a linear combination of the \(\Phi(x)\)’s. Then the objective becomes minimizing \(\ve{X-XEA}\) (in the Frobenius norm, say) subject to sparsity constraints on \(A\). Do alternating minimization, etc., using the fact that the norm (and gradients) can be evaluated cheaply via the kernel trick.
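
To see the kernel trick at work: with that choice of norm, the objective depends on the data only through the Gram matrix \(K\) with \(K_{ij}=K(x_i,x_j)\), since \[\ve{X-XEA}_F^2 = \operatorname{tr}(K) - 2\operatorname{tr}(KEA) + \operatorname{tr}(A^TE^TKEA),\] so each alternating-minimization step (and its gradients in \(E\) and \(A\)) needs only \(K\), never \(\Phi\) explicitly.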

Try to use kernel dictionary learning

Warning: these steps are not justified.

Suppose the \(x\)’s are latent variables that are sparse\(+\)noise, and we observe \[y=Af(x).\] In our case, \(f=\ln\) (so that sparse\(+\)positive noise becomes few large entries). Then (maybe we should have \(A^T\)? Not sure) \[\begin{align} A^+ y &= f(x)\\ f^{-1}(A^+y) &= x\\ \Phi(A^+)\Phi(y) &= x \\ \Phi(y) &=\Phi(A^+)^+ x. \end{align}\]

This suggests doing dictionary learning on \(\Phi(y)\) (or a finite approximation of it), finding the \(x\)’s, and then solving the system of equations \(A^TY=f(X)\) (take many samples \((x,y)\) and do least squares). In order for this to work, we need to…

This does DL in the larger space though. Can we use the kernel trick to make it more efficient?

So, an attempt at an algorithm (a rough code sketch follows the list):

  1. KSVD
  2. DL/ICA to find \(X\).
  3. Solve for \(A\).
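
A sketch of these three steps, with many blanks filled in by assumption: kernel PCA stands in for the finite approximation of \(\Phi(y)\), scikit-learn’s DictionaryLearning stands in for the DL step, and the small guard inside \(f\) is only there to keep \(\ln\) defined.

```python
# Hypothetical sketch of the 3-step attempt above; every concrete choice here
# (kernel PCA, sklearn DictionaryLearning, the epsilon guard for ln) is an assumption.
import numpy as np
from sklearn.decomposition import KernelPCA, DictionaryLearning

def recover(Y, k, f=lambda t: np.log(np.abs(t) + 1e-8)):
    """Y: (n_samples, d) observations of y = A f(x); k: number of latent coordinates."""
    # Step 1: finite-dimensional approximation of Phi(y).
    feats = KernelPCA(n_components=min(100, Y.shape[0] - 1), kernel="rbf").fit_transform(Y)
    # Step 2: dictionary learning on the features; the sparse codes play the role of X.
    X = DictionaryLearning(n_components=k, transform_algorithm="lasso_lars").fit_transform(feats)
    # Step 3: least squares for A from A^T y = f(x), i.e. Y A = f(X) with rows as samples.
    A, *_ = np.linalg.lstsq(Y, f(X), rcond=None)
    return X, A
```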

Generalizing AGMM

Different optimization problem? (E.g., add sparsity regularization, such as an \(\ell_1\) penalty.) SDP relaxation? It’s hard to integrate the sparsity constraints.

Redo analysis with continuous observations?

Tensor decomposition for mixed community detection [Anandkumar, Ge]: look at the third-moment tensor \((\E_{x\in X} \an{x,a}\an{x,b}\an{x,c})_{a,b,c}\).
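
The empirical version of that tensor is a one-liner (rows as samples is my convention):

```python
# Empirical third-moment tensor T[a,b,c] = E_{x in X} x_a x_b x_c; contracting it
# with vectors a, b, c gives E <x,a><x,b><x,c> as above.
import numpy as np

def third_moment(X):
    """X: (n_samples, d) array of samples; returns the d x d x d moment tensor."""
    return np.einsum('ia,ib,ic->abc', X, X, X) / X.shape[0]
```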

I think I’m trying to do too many things at once. Allow larger noise, agnostic noise, arbitrary mean/cutoff…

Look at the simplest problem inspired by this one. E.g., assume there are \(k=10\) functions \(\si_k(A_k^Tx+b_k)\) (normalized) that map the data close to \(e_1,\ldots, e_{10}\)? Completely unsupervised learning. AGM is much simpler with sparsity 1…

DL vs. ICA, tensor decomposition

Why can’t we just solve dictionary learning with tensor decomposition? Suppose the \(x_i\) are independent, symmetric, non-Gaussian, and mean 0. (Symmetric mean 0 can probably be loosened if we modify this.) Then the fourth cumulant tensor of \(y\) (that is, \(\E y^{\ot 4}\) minus the three pairings of \(\E y^{\ot 2}\)) satisfies \[\kappa_4(y) = \sum_i A_i^{\ot 4} \big[\E x_i^4 - 3(\E x_i^2)^2\big].\] We don’t even need the \(x_i\) to be from a sparse distribution—any non-Gaussian distribution (really, we want the 4th moments to be non-Gaussian) will do!
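
An empirical version of that cumulant tensor (assuming mean-zero \(y\) and rows as samples; only sensible for small \(d\), since it has \(d^4\) entries):

```python
# Empirical fourth-cumulant tensor of mean-zero y: fourth moment minus the three
# pairings of the covariance.  In population, under y = Ax with independent x_i,
# this equals sum_i A_i^{(tensor 4)} * (E x_i^4 - 3 (E x_i^2)^2).
import numpy as np

def fourth_cumulant(Y):
    """Y: (n_samples, d), assumed mean zero; returns the d x d x d x d tensor."""
    n = Y.shape[0]
    M4 = np.einsum('ia,ib,ic,id->abcd', Y, Y, Y, Y) / n
    M2 = Y.T @ Y / n
    return (M4
            - np.einsum('ab,cd->abcd', M2, M2)
            - np.einsum('ac,bd->abcd', M2, M2)
            - np.einsum('ad,bc->abcd', M2, M2))
```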

Reasons why this doesn’t trivialize the problem?

  1. Overcomplete tensor decomposition is hard/not well-understood.
  2. The \(x_i\) aren’t completely independent. (But we can assume almost 4-wise independence, etc.—the AGMM algorithm also requires similar assumptions.)
  3. Tensor decomposition is inefficient.
  4. It doesn’t find \(x\)’s. (But we can do inference separately.)
  5. Tensor decomposition in this regime just doesn’t work in practice. (Practically, what can you expect???)

I think I need to familiarize myself with practical tensor decomposition and DL!

Previous work

Unsupervised SVM

Relaxing independent sparsity

Independent sparsity—required for AGM14 and AGMM15 (actually, 2-wise “almost independence” suffices)—is unrealistic because features may be correlated or co-occur. A first step may be to consider group sparsity.
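
One standard way to formalize group sparsity (group lasso; just one option, not something these notes commit to) is to replace the \(\ell_1\) penalty on a code \(a\) with a sum of \(\ell_2\) norms over predefined groups \(g\): \[\sum_{g} \ve{a_g}_2.\] This zeroes out whole groups at once, so correlated/co-occurring features can enter or leave the support together.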

Google for “dictionary learning group sparsity”: