Posted: 2016-08-30, Modified: 2016-08-30
Tags: none
Threads
- Relaxing sparsity assumption: Independent sparsity is an unrealistic assumption. For example, many features tend to co-occur with each other. Re-analyze algorithms that rely on independent sparsity under looser conditions on the distribution.
- Two ways to loosen the conditions:
- Drop the condition of independence.
- A first step is “group sparsity”. See representation learning for references.
- A next step is to do away with independence entirely. (Certain distributions may be intractable… if so, obtain a hardness result and then add some looser assumption.)
- Drop the condition of sparsity.
- Instead have some condition on moments/tails so that each vector is well-approximated by a sparse vector, cf. “flatness”, e.g. 99% of the weight is on \(o(\sqrt n)\) of the coordinates.
- The difference from simply adding noise is that:
- Noise is added to \(x\) (pre-coding) rather than to \(y\) (post-coding).
- Noise can be correlated with the vector rather than independent. (If it were independent, there would not be much difference from adding it on the \(y\)-side.)
- Sparse recovery, e.g. basis pursuit. (See the sketch after this thread.)
- Dictionary learning
- Q (@Arora): How does dictionary learning perform in real life? What is it applied to? When applied to problems traditionally solved with SVMs and neural nets, how does it compare? How do AGM14, AGMM15 perform?
- Re-analyze AGM14, AGMM15 with group sparsity.
- Re-analyze with an arbitrary sparse-like distribution.
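A minimal sketch of the two primitives above, assuming the standard synthetic setup (random unit-norm dictionary, i.i.d. \(k\)-sparse codes): sparse coding by \(\ell_1\)-regularized regression (a lasso stand-in for basis pursuit), and dictionary learning by naive alternating minimization. The random initialization, the lasso penalty, and all sizes are placeholder choices; AGM14/AGMM15 use a careful initialization that is omitted here.

```python
# Minimal sketch: sparse coding (lasso ~ noisy basis pursuit) and
# dictionary learning by alternating minimization. Small sizes and
# illustrative parameters only.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k, N = 32, 64, 3, 800       # signal dim, dictionary size, sparsity, samples

# Ground-truth dictionary with unit-norm columns (atoms).
A_true = rng.standard_normal((n, m))
A_true /= np.linalg.norm(A_true, axis=0)

# Independent-sparsity generative model: k random nonzero coordinates per sample.
X_true = np.zeros((m, N))
for j in range(N):
    supp = rng.choice(m, size=k, replace=False)
    X_true[supp, j] = rng.standard_normal(k)
Y = A_true @ X_true + 0.01 * rng.standard_normal((n, N))

def sparse_code(A, Y, alpha=0.05):
    """L1-regularized regression per column (a proxy for basis pursuit)."""
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    return np.column_stack([lasso.fit(A, y).coef_ for y in Y.T])

# Alternating minimization: sparse-code against the current dictionary,
# then refit the dictionary by least squares and renormalize its columns.
A = rng.standard_normal((n, m)); A /= np.linalg.norm(A, axis=0)
for it in range(10):
    X = sparse_code(A, Y)
    A = Y @ np.linalg.pinv(X)
    A /= np.linalg.norm(A, axis=0) + 1e-12

# Crude recovery check: best correlation of each true atom with a learned atom.
corr = np.abs(A_true.T @ A)
print("mean best-match correlation:", corr.max(axis=1).mean())
```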
- Representation learning: general problem
- The main bottleneck is that the problem “sort of” reduces to tensor decomposition, but I don’t understand much about how TD performs in theory or in practice. (A minimal CP decomposition sketch follows below.)
- Read papers on TD (Ge, Anandkumar).
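To make the TD primitive concrete, here is a minimal sketch of CP decomposition of a synthetic third-order tensor by alternating least squares. This is only the basic primitive on an exactly low-rank, noiseless tensor with the rank assumed known; it is not the SoS or tensor-power methods from the Ge/Anandkumar papers.

```python
# Minimal sketch: CP decomposition of a 3rd-order tensor by alternating
# least squares (ALS). Illustrative only; exact low-rank, no noise.
import numpy as np

rng = np.random.default_rng(0)
d, r = 20, 5                                   # dimension and (known) rank
A, B, C = (rng.standard_normal((d, r)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', A, B, C)        # T = sum_r a_r (x) b_r (x) c_r

def khatri_rao(U, V):
    """Column-wise Kronecker product: output[i*d + j, r] = U[i, r] * V[j, r]."""
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

# Random initialization, then cycle through the three factors,
# solving a linear least-squares problem for each in turn.
Ah, Bh, Ch = (rng.standard_normal((d, r)) for _ in range(3))
for it in range(100):
    Ah = np.linalg.lstsq(khatri_rao(Bh, Ch), T.reshape(d, -1).T, rcond=None)[0].T
    Bh = np.linalg.lstsq(khatri_rao(Ah, Ch), T.transpose(1, 0, 2).reshape(d, -1).T, rcond=None)[0].T
    Ch = np.linalg.lstsq(khatri_rao(Ah, Bh), T.transpose(2, 0, 1).reshape(d, -1).T, rcond=None)[0].T

T_hat = np.einsum('ir,jr,kr->ijk', Ah, Bh, Ch)
print("relative reconstruction error:", np.linalg.norm(T - T_hat) / np.linalg.norm(T))
```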
- PMI for images (a basic PMI + SVD sketch follows this thread)
- I don’t have any positive results when I build on top of the CKN.
- Possible alternatives: work with CIFAR instead, try some thresholding, work in a middle layer… (but these are unlikely to work if the first attempt didn’t?)
- Do experiments on a more basic level.
- Try unsupervised multiclass SVM on both the original images and on the CKN features. How does this compare to clustering? (Look at the actual pictures.)
- Have a programming setup where you can visualize clusters and features.
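A basic sketch of the PMI idea at the feature level: binarize feature activations (random placeholder data standing in for CKN features), form the pairwise co-occurrence/PMI matrix, and take a truncated SVD to get feature embeddings. The thresholding rule, smoothing constant, and embedding dimension are all arbitrary choices.

```python
# Minimal sketch: PMI of co-occurring (binarized) features + truncated SVD.
# F is a placeholder for per-image feature activations (e.g. CKN features).
import numpy as np

rng = np.random.default_rng(0)
N, m, dim = 1000, 200, 20                      # images, features, embedding dimension
F = rng.standard_normal((N, m))                # placeholder activations

# Binarize: a feature "occurs" in an image if its activation is large.
B = (F > np.quantile(F, 0.9, axis=0)).astype(float)   # N x m indicator matrix

# Empirical occurrence and co-occurrence probabilities (with smoothing).
p = B.mean(axis=0) + 1e-6                      # P(feature i occurs)
P = (B.T @ B) / N + 1e-6                       # P(i and j co-occur)
PMI = np.log(P / np.outer(p, p))               # pointwise mutual information

# Truncated SVD of the (symmetric) PMI matrix gives feature embeddings.
U, S, _ = np.linalg.svd(PMI)
emb = U[:, :dim] * np.sqrt(S[:dim])            # one row per feature
print("embedding shape:", emb.shape)
```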
- DL for images
- If you do DL, do you obtain the categories (e.g., digits)? (See the patch-level sketch after this thread.)
- If you do DL, is training on top of the DL easier?
- How does (H)DL compare to the CKN and other unsupervised methods? Make DL convolutional.
- What is the typical sparsity of a trained NN? (cf. dropout?)
- Use DL/weighted SVD with the natural proximity/distance in images. Get “context vectors” for different parts of the image. (Hierarchical fashion?) A global context vector would be the classification?
- (For images, plain DL/SVM may not be enough, because adding a component in a perpendicular direction wrecks things?)
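A patch-level sketch of the “do you obtain the categories?” question, using scikit-learn’s MiniBatchDictionaryLearning on 4x4 patches of the small scikit-learn digits as a stand-in dataset; the idea is to inspect whether the learned atoms look like strokes/parts of digits. The dataset, patch size, and all parameters are placeholder choices.

```python
# Minimal sketch: dictionary learning on digit image patches, then a look
# at whether the learned atoms resemble strokes/parts of digits.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

digits = load_digits()                          # 8x8 grayscale digit images
images = digits.images / 16.0                   # scale to [0, 1]

# Collect small patches from every image.
rng = np.random.RandomState(0)
patches = np.vstack([
    extract_patches_2d(img, (4, 4), max_patches=10, random_state=rng).reshape(-1, 16)
    for img in images
])
patches -= patches.mean(axis=1, keepdims=True)  # remove per-patch DC component

dico = MiniBatchDictionaryLearning(n_components=36, alpha=1.0, random_state=0)
D = dico.fit(patches).components_               # 36 atoms, each a 4x4 patch

# Sparse codes of the patches in the learned dictionary.
codes = dico.transform(patches)
print("dictionary shape:", D.shape)
print("mean nonzeros per patch:", (np.abs(codes) > 1e-8).sum(axis=1).mean())
# Visual check (e.g. with matplotlib): plot each row of D reshaped to 4x4.
```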
@Arora: in what way is DL “just” tensor decomposition?
(Is there a relationship between weighted SVD for PMI and DL? It’s not at all clear to me. Somehow neural nets do something “like” DL (does the CKN do this, though?), but then we’re trying to see if the features can be SVD’d with PMI. Learning with 2 “layers”:
- Representation: with NN/DL.
- Classification: on top, with SVD/PMI.
Also, how do hierarchical methods come in?)
Things meriting a further look
- Traditional learning theory (e.g., learning SAT)
- Probabilistic view of dictionary learning (LDA, hierarchical DP, etc.) @Bianca
- Pattern theory: how it models images. Traditional image-processing tools: wavelets, etc.
Summary
Conversation with Arora
- “Show that backprop works.”
- Assume a generative model: \(x\) comes from a sparse distribution, we observe \(y\approx Ax\), and there is an SVM classifier depending on \(x\). Show that backprop for a 2-layer NN works. (@Tengyu) (A simulation sketch follows at the end.)
- PMI: restrict to features that co-occur.
- SoS for TD + AM for DL gives an \(n^{O(\log n)}\)-time algorithm for sparsity \(n^{1-\epsilon}\). Check this (sparsity is not needed after initialization).
- (Hard) Kernel SVMs can’t learn DL + classifier (essentially a 2-layer NN).
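A simulation sketch of the backprop item above (\(x\) \(k\)-sparse, \(y\approx Ax\) observed, labels from a fixed linear SVM-like classifier on \(x\)): train a 2-layer ReLU network on \((y, \text{label})\) with plain backprop/SGD and check test accuracy. This is only an empirical sanity-check harness, not the analysis being asked for; the architecture, loss, and all hyperparameters are arbitrary.

```python
# Simulation sketch: x sparse, y = Ax observed, label = sign(w* . x);
# train a 2-layer ReLU net on (y, label) with hand-written backprop/SGD.
import numpy as np

rng = np.random.default_rng(0)
n, m, k, h, N = 50, 100, 5, 256, 20000   # obs dim, code dim, sparsity, hidden, samples

A = rng.standard_normal((n, m)); A /= np.linalg.norm(A, axis=0)
w_star = rng.standard_normal(m)          # the "SVM" direction on the sparse code x

def sample(num):
    X = np.zeros((num, m))
    for i in range(num):
        supp = rng.choice(m, size=k, replace=False)
        X[i, supp] = rng.standard_normal(k)
    Y = X @ A.T                          # y = A x
    labels = np.sign(X @ w_star)         # labels depend on x, not on y directly
    return Y, labels

Y_tr, l_tr = sample(N)
Y_te, l_te = sample(2000)

# 2-layer network: score(y) = w2 . relu(W1 y + b1) + b2, trained by SGD.
W1 = rng.standard_normal((h, n)) / np.sqrt(n); b1 = np.zeros(h)
w2 = rng.standard_normal(h) / np.sqrt(h); b2 = 0.0
lr = 0.05

for epoch in range(20):
    perm = rng.permutation(N)
    for start in range(0, N, 100):
        idx = perm[start:start + 100]
        Yb, lb = Y_tr[idx], l_tr[idx]
        pre = Yb @ W1.T + b1                       # forward pass
        H = np.maximum(pre, 0)
        scores = H @ w2 + b2
        margins = lb * scores                      # logistic loss on +/-1 labels
        dscores = -lb / (1.0 + np.exp(np.clip(margins, -30, 30))) / len(idx)
        grad_w2 = H.T @ dscores                    # backprop through the two layers
        grad_b2 = dscores.sum()
        dpre = np.outer(dscores, w2) * (pre > 0)
        grad_W1 = dpre.T @ Yb
        grad_b1 = dpre.sum(axis=0)
        W1 -= lr * grad_W1; b1 -= lr * grad_b1
        w2 -= lr * grad_w2; b2 -= lr * grad_b2

pred = np.sign(np.maximum(Y_te @ W1.T + b1, 0) @ w2 + b2)
print("test accuracy:", (pred == l_te).mean())
```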