Posted: 2016-09-26
Modified: 2016-09-26
Tags: none
Threads
- PMI - get some results!
- CKN - understood theory, RKHS (9/26)
- SoS - lectures 2 and 3
- DL: do experiments suggested in DL generalization
- NN learns DL? Preliminary calculations.
- Stability of SGD
- @Elad: do local minima generalize?
- RL and cousins: see notes below.
Meeting with Arora
Directions
- Learning noisy-OR networks with 3-tensors (\(n^4\) time) (@Andrej, @Tengyu)
- Matrix factorization assuming separability
- Mike Collins NLP. Words as features.
- Loglinear/NN
- For bigram classifiers, it’s fine to use word embeddings. (@Tengyu)
- Systems with memory
- Learning HMMs with tensor methods
- Reinforcement learning (MDPs)
- Theoretical algorithms are polynomial in state size and mixing time.
- What if we have a succinct representation? @Mengdi Wang. Vector in \(\R^n\).
- Stochastic Primal-Dual Methods and Sample Complexity of Markov Decision Process. (recent results on learning in MDPs via a primal-dual approach, and new ideas on finding compact representations of MDPs via low-rank approximation.)
- cf. contextual bandits (@Elad, @Karan)
- Representation learning
- PMI for images
- Dictionary learning
- Relax incoherence: only correlated with \(n^{\delta}\) others.
- @Elad: Training on top of improperly learned dictionary.
Thoughts about PMI
(9/28)
In the CKN, given that one layer is \(x\), the next layer (before pooling) is computed as \(y = (e^{v_i^T x + b_i})_i\) for some \(v_i, b_i\). The dimension of \(y\) is larger than the dimension of \(x\).
This looks very much like PMI for word vectors, where the probability of a word with vector \(v\) given context \(x\) is proportional to \(e^{v^T x}\), and a low-rank approximation to the PMI matrix recovers the \(v\)'s.
But does that mean that applying weighted SVD to the PMI matrix of the CKN feature vectors is somehow just trying to recover the \(v_i\)? In that case the dimension reduction would just take us from \(y\) back to \(x\), which doesn't help classification.
(This doesn’t take into account the Gaussian pooling though.)
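A minimal sketch of what "weighted SVD for PMI on the CKN feature vectors" could look like; all names and data here are stand-ins, and a plain truncated SVD of an unweighted PMI estimate stands in for a weighted SVD:

```python
import numpy as np

def pmi_matrix(F, eps=1e-8):
    """PMI(i, j) = log( E[f_i f_j] / (E[f_i] E[f_j]) ), estimated over the rows of F."""
    co = F.T @ F / F.shape[0]          # E[f_i f_j]
    marg = F.mean(axis=0)              # E[f_i]
    return np.log(co + eps) - np.log(np.outer(marg, marg) + eps)

def candidate_vectors(P, k):
    """Rank-k truncated SVD of P; rows of U_k sqrt(s_k) are candidate v_i's."""
    U, s, _ = np.linalg.svd(P)
    return U[:, :k] * np.sqrt(s[:k])

F = np.abs(np.random.randn(1000, 128))      # stand-in for nonnegative CKN features
V_hat = candidate_vectors(pmi_matrix(F), k=32)
print(V_hat.shape)                          # (128, 32): one candidate vector per feature coordinate
```

The question above is then whether the rows of `V_hat`, computed on real CKN features, line up with the \(v_i\) of the layer or with anything more interpretable than the raw image.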
What would be the “test” for interpretability? For word embeddings, the test was analogy completion.
If the PMI matrix is low-rank, then what do we get beyond the fact that the feature vectors (7200-dim) came from a lower-dimensional (28x28) space? (In what sense would we expect the dimension-reduced feature vectors to be more interpretable than the original image?)
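One way to make the "nothing beyond the input dimension" worry concrete (my own back-of-envelope step, about the log-features rather than the PMI matrix itself, using the dimensions quoted above): since \(\log y_i^{(j)} = v_i^T x^{(j)} + b_i\) for example \(j\),
\[
\bigl[\log y_i^{(j)}\bigr]_{i,j} \;=\; V X + b\,\mathbf{1}^T, \qquad V \in \R^{7200 \times 784},\quad X \in \R^{784 \times m},
\]
so the log-feature matrix over any \(m\) examples has rank at most \(785\), no matter what the \(v_i\) are. Low rank by itself is therefore not evidence of anything beyond "the features came from a \(28\times 28\) image."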
TODO:
- Try thresholding + WSVD + SVM (see the sketch after this list).
- Run the pipeline (data exploration) on other MNIST and CIFAR architectures.
- Think about the interpretability test above.
- @Tengyu on interpreting CKN as NMF.
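A rough sketch of the first TODO item (thresholding + WSVD + SVM), with plain truncated SVD standing in for a weighted SVD and random stand-in data; every name and parameter here is a guess, not the actual pipeline:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC

def threshold(F, tau=1e-3):
    """Zero out small feature coordinates before factorizing."""
    return np.where(F > tau, F, 0.0)

# Stand-ins for CKN features and labels (MNIST would give 7200-dim features, 10 classes).
F_train = np.abs(np.random.randn(500, 7200))
y_train = np.random.randint(0, 10, size=500)

pipe = make_pipeline(
    FunctionTransformer(threshold),     # thresholding
    TruncatedSVD(n_components=100),     # stand-in for weighted SVD of the PMI matrix
    LinearSVC(C=1.0),                   # linear SVM on the reduced features
)
pipe.fit(F_train, y_train)
print(pipe.score(F_train, y_train))
```

The real experiment would swap in the actual CKN feature matrices and the PMI/weighted-SVD step sketched earlier.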