Weekly summary 2016-08-06

Posted: 2016-08-06, Modified: 2016-08-06

Tags: none

Representation learning

In dictionary learning, we assume we have samples \(y = Ax + e\) where \(x\) comes from a sparse distribution (e.g. the \(x_i\) are independent, each \(x_i\neq 0\) with probability \(s/n\), and the nonzero entries are drawn from some distribution not concentrated at 0) and \(e\) is error (e.g. Gaussian).
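For concreteness, here is a minimal sketch of this generative model in Python/NumPy; the dimensions \(d, n\), the sparsity \(s\), the \(\pm 1\) distribution for the nonzero entries, and the noise level are illustrative choices, not taken from the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, s = 20, 50, 5                            # observation dim, code dim, expected sparsity
A = rng.standard_normal((d, n)) / np.sqrt(d)   # dictionary; columns are the atoms

def sample(sigma=0.01):
    # each x_i is nonzero independently with probability s/n,
    # and the nonzero entries come from a distribution not concentrated at 0
    support = rng.random(n) < s / n
    x = support * rng.choice([-1.0, 1.0], size=n)
    e = sigma * rng.standard_normal(d)          # Gaussian error
    return A @ x + e, x

y, x = sample()
```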

The way we stated our problem is that \(x\cdot a_i\) is large for only a few \(i\). This is similar to dictionary learning with dictionary \((A^+)^T\), where the columns of \(A\) are the \(a_i\): the vector \(z = A^Tx\), whose entries are the \(x\cdot a_i\), is sparse, and when the \(a_i\) span the space we have \(x = (A^+)^Tz\), so \(x\) has a sparse representation in the dictionary \((A^+)^T\). (I.e., the \(x\)'s here are really the \(y\)'s in DL.)
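A quick NumPy check of this correspondence (a sketch with arbitrary dimensions, taken overcomplete so that the \(a_i\) span the space):

```python
import numpy as np

rng = np.random.default_rng(1)

d, m = 10, 30                       # ambient dimension, number of a_i (overcomplete: m > d)
A = rng.standard_normal((d, m))     # columns are the a_i

x = rng.standard_normal(d)
z = A.T @ x                         # z_i = x . a_i; in our problem, large for only a few i

# dictionary-learning view: x is the dictionary (A^+)^T applied to the code z
x_back = np.linalg.pinv(A).T @ z    # (A^+)^T z = (A^T)^+ A^T x

print(np.allclose(x, x_back))       # True when the a_i span R^d
```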

I may be wrong, but I think that what's different is that…

(Actually, I think the undercomplete case, where the number of \(a_i\) is less than the dimension \(n\), doesn't quite correspond to DL because the map \(x\mapsto (x\cdot a_i)_i\) is not invertible…)
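To see the non-invertibility concretely, here is a small sketch (again with made-up dimensions) showing that with fewer \(a_i\) than the dimension, distinct \(x\)'s can give the same values \((x\cdot a_i)_i\):

```python
import numpy as np

rng = np.random.default_rng(2)

d, m = 10, 4                              # undercomplete: fewer a_i than the dimension d
A = rng.standard_normal((d, m))           # columns are the a_i

x = rng.standard_normal(d)
# any vector in the null space of A^T is invisible to the map x -> (x . a_i)_i
null_basis = np.linalg.svd(A.T)[2][m:]    # rows spanning the null space of A^T
x2 = x + null_basis[0]

print(np.allclose(A.T @ x, A.T @ x2))     # True: same (x . a_i)_i
print(np.allclose(x, x2))                 # False: different x
```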