Show mixing for Dirichlet process sampling. (This is a basic distribution and Markov chain, which is often combined with other things, e.g. in Dirichlet process mixture models.)
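As a concrete starting point, here is a minimal sketch (my own illustration; the choice of chain is an assumption) of a Gibbs-style Markov chain on partitions whose stationary distribution is the Chinese restaurant process with concentration \(\alpha\), i.e., the "basic" DP chain with no likelihood attached:

```python
import random
from collections import Counter

def crp_gibbs_sweep(labels, alpha, rng=random):
    """One Gibbs sweep: reassign each item's table given all the other items."""
    for i in range(len(labels)):
        counts = Counter(labels[:i] + labels[i + 1:])    # table sizes excluding item i
        tables = list(counts)
        new_table = max(labels) + 1                      # a label nobody currently uses
        weights = [counts[t] for t in tables] + [alpha]  # existing tables, then a new one
        labels[i] = rng.choices(tables + [new_table], weights=weights)[0]
    return labels

labels = [0] * 20                                        # start with one big table
for _ in range(200):
    labels = crp_gibbs_sweep(labels, alpha=1.0)
print(Counter(labels))                                   # CRP-like table sizes
```

Studying mixing for this chain in isolation separates the prior's behavior from the likelihood terms that a full DP mixture model adds.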
Show that Gibbs sampling (keep track of cluster membership, sample the means, then re-cluster) with merge/split moves succeeds for a mixture of Gaussians.
Big problem: how to do the merge/split step.
In general this is NP-hard. (2-means is NP-hard.)
Use a good enough proposal distribution, like 1 step of Gibbs in each coordinate.
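For reference (standard Metropolis–Hastings, not specific to these notes), a split or merge move proposed this way is corrected by the acceptance probability

\[
\alpha(c \to c') \;=\; \min\!\left(1,\; \frac{p(c' \mid x)\, q(c \mid c')}{p(c \mid x)\, q(c' \mid c)}\right),
\]

where \(c, c'\) are the clusterings before and after the move and \(q\) is the proposal; one advantage of using a fixed number of Gibbs steps as the proposal is that \(q(c' \mid c)\) can actually be evaluated.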
First do the well-separated case. Instead of canonical paths, use a “Lyapunov function” argument to show that the chain is, on average, getting closer to the right clustering.
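One way to make “on average getting closer” precise (my formalization, not taken from elsewhere in these notes) is a Foster–Lyapunov drift condition: for a potential \(V\) measuring distance from the correct clustering,

\[
\mathbb{E}\big[\,V(X_{t+1}) \mid X_t = x\,\big] \;\le\; V(x) - \delta \qquad \text{whenever } V(x) > B,
\]

for some \(\delta > 0\) and threshold \(B\), which bounds the expected time to reach the set \(\{V \le B\}\) by roughly \(V(X_0)/\delta\).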
Note: I keep track of the cluster assignments, not the means. With cluster assignments, the mixing coefficients can be integrated out; this does not seem possible with the means (?). (My original idea was to do Gibbs on each mean, hoping the conditional is a tractable multimodal distribution, i.e., a mixture of Gaussians.)
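To spell out the “integrate out” claim in the finite-\(K\) case (a standard computation, assuming a symmetric \(\mathrm{Dir}(\alpha/K, \ldots, \alpha/K)\) prior on the mixing coefficients): marginalizing the weights gives

\[
P(c_i = k \mid c_{-i}) \;=\; \frac{n_{-i,k} + \alpha/K}{n - 1 + \alpha},
\]

where \(n_{-i,k}\) is the number of other points in cluster \(k\), so the assignment chain only ever needs the counts, never the weights themselves.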
This is like clustering, but with an exponential factor that disincentivizes switches at adjacent times (the graph is a line). Does that make it easier or harder? It seems like it could be easier, since you don't effectively get \(2^n\) possible splits. The split step is simple: you can check all possibilities. (Problem: what if a local solution is a lot better than the global one… but maybe if the generative model is true this won't happen?)
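A small sketch of why the split step is cheap on a line (illustrative only; the squared-error score stands in for whatever likelihood the model actually uses): a contiguous segment of \(n\) points has only \(n-1\) possible binary splits, so all of them can be scored exactly.

```python
import numpy as np

def best_contiguous_split(x):
    """Return (index, cost) of the split minimizing total within-segment squared error."""
    x = np.asarray(x, dtype=float)
    best = None
    for k in range(1, len(x)):                     # boundary between x[:k] and x[k:]
        left, right = x[:k], x[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or cost < best[1]:
            best = (k, cost)
    return best

print(best_contiguous_split([0.1, -0.2, 0.0, 5.1, 4.9, 5.2]))   # splits at k = 3
```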
Use log-concavity.
Or: do sequential Monte Carlo, e.g. with particle filters, estimating the normalizing constant \(Z\).
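A minimal sketch of the “estimate \(Z\)” idea with a bootstrap particle filter, using a 1-D linear-Gaussian state-space model as a stand-in (all model choices here are my own assumptions): the product of the per-step average importance weights is an unbiased estimate of \(Z = p(y_{1:T})\).

```python
import numpy as np

def bootstrap_pf_logZ(y, n_particles=1000, sigma_x=1.0, sigma_y=0.5, seed=0):
    """Bootstrap PF for x_t = x_{t-1} + N(0, sigma_x^2), y_t = x_t + N(0, sigma_y^2);
    returns an estimate of log p(y_1, ..., y_T)."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 1.0, n_particles)                        # x_0 ~ N(0, 1)
    log_Z = 0.0
    for y_t in y:
        particles = particles + rng.normal(0.0, sigma_x, n_particles)   # propagate
        log_w = (-0.5 * ((y_t - particles) / sigma_y) ** 2
                 - np.log(sigma_y * np.sqrt(2.0 * np.pi)))              # observation likelihood
        log_Z += np.logaddexp.reduce(log_w) - np.log(n_particles)       # log of mean weight
        w = np.exp(log_w - log_w.max())
        particles = particles[rng.choice(n_particles, n_particles, p=w / w.sum())]  # resample
    return log_Z

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(0.0, 1.0, 50))                                  # simulated trajectory
y = x + rng.normal(0.0, 0.5, 50)
print(bootstrap_pf_logZ(y))
```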
In the generative case, each correct mode will have enough mass (?).
Other possibilities
Dictionary learning, sparse coding (?). Problem: if the dictionary is not incoherent, can the algorithm get stuck in a local minimum?
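For concreteness, the textbook sparse coding / dictionary learning objective (stated only as a reference point) is

\[
\min_{D,\, \{x^{(j)}\}} \;\sum_{j=1}^{m} \big\| y^{(j)} - D x^{(j)} \big\|_2^2 \;+\; \lambda \sum_{j=1}^{m} \big\| x^{(j)} \big\|_1,
\]

and the incoherence worry above is about columns of \(D\) with large pairwise inner products.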
For discrete distributions (the main thing: the \(f\)'s need to change slowly; this is OK with coordinate-by-coordinate Gibbs sampling, but not OK with bipartite Gibbs sampling)
Understand RBM training; how can we help there? What is the right way to take the stochastic gradient, and can we reduce its variance? (This may not satisfy the conditions above…)
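To make the question concrete, here is a minimal sketch (assumptions mine; this is plain CD-1, not necessarily the “right” estimator the note is asking about) of one contrastive-divergence stochastic-gradient step for a binary RBM:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, b, c, v_data, lr=0.01, seed=None):
    """One CD-1 update. W: (nv, nh) weights, b: visible biases, c: hidden biases,
    v_data: (batch, nv) binary visible vectors."""
    rng = np.random.default_rng(seed)
    # Positive phase: E[h | v] on the data.
    ph_data = sigmoid(v_data @ W + c)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: one Gibbs step h -> v' -> E[h' | v'].
    pv_model = sigmoid(h_sample @ W.T + b)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ W + c)
    # Gradient estimate: <v h>_data - <v h>_model, and likewise for the biases.
    n = v_data.shape[0]
    W += lr * (v_data.T @ ph_data - v_model.T @ ph_model) / n
    b += lr * (v_data - v_model).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(784, 64)); b = np.zeros(784); c = np.zeros(64)
v_batch = (rng.random((32, 784)) < 0.1).astype(float)   # fake binary data
W, b, c = cd1_step(W, b, c, v_batch)
```

The variance question is about the negative-phase term \(\langle v h^\top \rangle_{\text{model}}\), which this estimator approximates with a single Gibbs step.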
Check out ICML paper on stochastic gradient Gibbs sampling.
LDS (linear dynamical systems)
The main obstacle was the “projection to a random subspace”; use the insight that random directions in high dimension are almost orthogonal.
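A quick numerical check of the almost-orthogonality insight (illustration only): inner products of independent random unit vectors in \(\mathbb{R}^d\) concentrate around 0 at scale \(1/\sqrt{d}\).

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000, 10000):
    u = rng.normal(size=d); u /= np.linalg.norm(u)      # random unit vector
    v = rng.normal(size=d); v /= np.linalg.norm(v)
    print(d, abs(u @ v), 1 / np.sqrt(d))                # |<u, v>| vs. 1/sqrt(d)
```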
Identification, using Laplace instead of Fourier, @Musco
The relationship between sampling and optimization (see [RN], Cesa-Bianchi…)
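One standard way to formalize the sampling–optimization link (a textbook fact, noted here only as a reminder): for

\[
p_\beta(x) \;=\; \frac{e^{-\beta f(x)}}{\int e^{-\beta f(x')} \, dx'},
\]

the distribution \(p_\beta\) concentrates on \(\arg\min_x f(x)\) as \(\beta \to \infty\), so (roughly) a sampler that mixes at large \(\beta\) yields an optimizer, and hardness of optimization transfers back to hardness of low-temperature sampling.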
What is a Bayesian relaxation? What are examples of its use? (Check out “sleeping”: is it an example?) What is the relationship between the finite (e.g., \(k\)-means) and infinite (e.g., CRP) versions?
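On the finite-vs-infinite relationship (a standard fact, stated for reference): the CRP predictive

\[
P(c_{n+1} = k \mid c_{1:n}) \;=\; \frac{n_k}{n + \alpha}, \qquad P(c_{n+1} = \text{new} \mid c_{1:n}) \;=\; \frac{\alpha}{n + \alpha},
\]

arises as the \(K \to \infty\) limit of a finite mixture with a symmetric \(\mathrm{Dir}(\alpha/K, \ldots, \alpha/K)\) prior on the weights, which is one precise sense in which the infinite (CRP) model is the limit of the finite one.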