Show mixing for Dirichlet process sampling. (This is a basic distribution and Markov chain, which is often combined with other things, e.g. in Dirichlet process mixture models.)
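As a concrete starting point, here is a minimal sketch (my own illustration; the choice of chain is an assumption) of a Gibbs-style Markov chain on partitions whose stationary distribution is the Chinese restaurant process with concentration \(\alpha\), i.e., the "basic" DP chain with no likelihood attached:

```python
import random
from collections import Counter

def crp_gibbs_sweep(labels, alpha, rng=random):
    """One Gibbs sweep: reassign each item's table given all the other items."""
    for i in range(len(labels)):
        counts = Counter(labels[:i] + labels[i + 1:])    # table sizes excluding item i
        tables = list(counts)
        new_table = max(labels) + 1                      # a label nobody currently uses
        weights = [counts[t] for t in tables] + [alpha]  # existing tables, then a new one
        labels[i] = rng.choices(tables + [new_table], weights=weights)[0]
    return labels

labels = [0] * 20                                        # start with one big table
for _ in range(200):
    labels = crp_gibbs_sweep(labels, alpha=1.0)
print(Counter(labels))                                   # CRP-like table sizes
```

Studying mixing for this chain in isolation separates the prior's behavior from the likelihood terms that a full DP mixture model adds.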
Show that Gibbs sampling (keep track of cluster membership, sample the means, then re-cluster) with merge/split moves succeeds for a mixture of Gaussians.
Big problem: how to do the merge/split step.
In general this is NP-hard. (2-means is NP-hard.)
Use a good enough proposal distribution, like 1 step of Gibbs in each coordinate.
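For reference (standard Metropolis–Hastings, not specific to these notes), a split or merge move proposed this way is corrected by the acceptance probability

\[
\alpha(c \to c') \;=\; \min\!\left(1,\; \frac{p(c' \mid x)\, q(c \mid c')}{p(c \mid x)\, q(c' \mid c)}\right),
\]

where \(c, c'\) are the clusterings before and after the move and \(q\) is the proposal; one advantage of using a fixed number of Gibbs steps as the proposal is that \(q(c' \mid c)\) can actually be evaluated.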
First do the well-separated case. Instead of canonical paths, use a “Lyapunov function” argument to show that the chain is, on average, getting closer to the right clustering.
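One way to make “on average getting closer” precise (my formalization, not taken from elsewhere in these notes) is a Foster–Lyapunov drift condition: for a potential \(V\) measuring distance from the correct clustering,

\[
\mathbb{E}\big[\,V(X_{t+1}) \mid X_t = x\,\big] \;\le\; V(x) - \delta \qquad \text{whenever } V(x) > B,
\]

for some \(\delta > 0\) and threshold \(B\), which bounds the expected time to reach the set \(\{V \le B\}\) by roughly \(V(X_0)/\delta\).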
Note: I keep track of the cluster assignments, not the means. With cluster assignments, the mixing coefficients can be integrated out; this does not seem possible with the means (?). (My original idea was to do Gibbs on each mean, hoping the conditional is a tractable multimodal distribution, i.e., a mixture of Gaussians.)
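To spell out the “integrate out” claim in the finite-\(K\) case (a standard computation, assuming a symmetric \(\mathrm{Dir}(\alpha/K, \ldots, \alpha/K)\) prior on the mixing coefficients): marginalizing the weights gives

\[
P(c_i = k \mid c_{-i}) \;=\; \frac{n_{-i,k} + \alpha/K}{n - 1 + \alpha},
\]

where \(n_{-i,k}\) is the number of other points in cluster \(k\), so the assignment chain only ever needs the counts, never the weights themselves.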
This is like clustering, but with an exponential factor that disincentivizes switches at adjacent times (the graph is a line). Does that make it easier or harder? It seems like it could be easier, since you don't effectively get \(2^n\) possible splits. The split step is simple: you can check all possibilities. (Problem: what if a local solution is a lot better than the global one… but maybe if the generative model is true this won't happen?)
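A small sketch of why the split step is cheap on a line (illustrative only; the squared-error score stands in for whatever likelihood the model actually uses): a contiguous segment of \(n\) points has only \(n-1\) possible binary splits, so all of them can be scored exactly.

```python
import numpy as np

def best_contiguous_split(x):
    """Return (index, cost) of the split minimizing total within-segment squared error."""
    x = np.asarray(x, dtype=float)
    best = None
    for k in range(1, len(x)):                     # boundary between x[:k] and x[k:]
        left, right = x[:k], x[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or cost < best[1]:
            best = (k, cost)
    return best

print(best_contiguous_split([0.1, -0.2, 0.0, 5.1, 4.9, 5.2]))   # splits at k = 3
```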
Use log-concavity.
Or: do sequential Monte Carlo, e.g. with particle filters, estimating the normalizing constant \(Z\).
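A minimal sketch of the “estimate \(Z\)” idea with a bootstrap particle filter, using a 1-D linear-Gaussian state-space model as a stand-in (all model choices here are my own assumptions): the product of the per-step average importance weights is an unbiased estimate of \(Z = p(y_{1:T})\).

```python
import numpy as np

def bootstrap_pf_logZ(y, n_particles=1000, sigma_x=1.0, sigma_y=0.5, seed=0):
    """Bootstrap PF for x_t = x_{t-1} + N(0, sigma_x^2), y_t = x_t + N(0, sigma_y^2);
    returns an estimate of log p(y_1, ..., y_T)."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 1.0, n_particles)                        # x_0 ~ N(0, 1)
    log_Z = 0.0
    for y_t in y:
        particles = particles + rng.normal(0.0, sigma_x, n_particles)   # propagate
        log_w = (-0.5 * ((y_t - particles) / sigma_y) ** 2
                 - np.log(sigma_y * np.sqrt(2.0 * np.pi)))              # observation likelihood
        log_Z += np.logaddexp.reduce(log_w) - np.log(n_particles)       # log of mean weight
        w = np.exp(log_w - log_w.max())
        particles = particles[rng.choice(n_particles, n_particles, p=w / w.sum())]  # resample
    return log_Z

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(0.0, 1.0, 50))                                  # simulated trajectory
y = x + rng.normal(0.0, 0.5, 50)
print(bootstrap_pf_logZ(y))
```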
In the generative case, each correct mode will have enough mass (?).
Other possibilities
Dictionary learning, sparse coding (?). Problem: if the dictionary is not incoherent, can the algorithm get stuck in a local minimum?
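For concreteness, the textbook sparse coding / dictionary learning objective (stated only as a reference point) is

\[
\min_{D,\, \{x^{(j)}\}} \;\sum_{j=1}^{m} \big\| y^{(j)} - D x^{(j)} \big\|_2^2 \;+\; \lambda \sum_{j=1}^{m} \big\| x^{(j)} \big\|_1,
\]

and the incoherence worry above is about columns of \(D\) with large pairwise inner products.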
For discrete distributions (the main thing: the \(f\)'s need to change slowly; this is OK with coordinate-by-coordinate Gibbs sampling, but not OK with bipartite Gibbs sampling)
Understand RBM training; how can we help there? What is the right way to take the stochastic gradient, and can we reduce its variance? (This may not satisfy the conditions above…)
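To make the question concrete, here is a minimal sketch (assumptions mine; this is plain CD-1, not necessarily the “right” estimator the note is asking about) of one contrastive-divergence stochastic-gradient step for a binary RBM:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, b, c, v_data, lr=0.01, seed=None):
    """One CD-1 update. W: (nv, nh) weights, b: visible biases, c: hidden biases,
    v_data: (batch, nv) binary visible vectors."""
    rng = np.random.default_rng(seed)
    # Positive phase: E[h | v] on the data.
    ph_data = sigmoid(v_data @ W + c)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: one Gibbs step h -> v' -> E[h' | v'].
    pv_model = sigmoid(h_sample @ W.T + b)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ W + c)
    # Gradient estimate: <v h>_data - <v h>_model, and likewise for the biases.
    n = v_data.shape[0]
    W += lr * (v_data.T @ ph_data - v_model.T @ ph_model) / n
    b += lr * (v_data - v_model).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(784, 64)); b = np.zeros(784); c = np.zeros(64)
v_batch = (rng.random((32, 784)) < 0.1).astype(float)   # fake binary data
W, b, c = cd1_step(W, b, c, v_batch)
```

The variance question is about the negative-phase term \(\langle v h^\top \rangle_{\text{model}}\), which this estimator approximates with a single Gibbs step.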
Check out ICML paper on stochastic gradient Gibbs sampling.
LDS (linear dynamical systems)
The main obstacle was the “projection to a random subspace”; use the insight that random directions in high dimension are almost orthogonal.
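A quick numerical check of the almost-orthogonality insight (illustration only): inner products of independent random unit vectors in \(\mathbb{R}^d\) concentrate around 0 at scale \(1/\sqrt{d}\).

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000, 10000):
    u = rng.normal(size=d); u /= np.linalg.norm(u)      # random unit vector
    v = rng.normal(size=d); v /= np.linalg.norm(v)
    print(d, abs(u @ v), 1 / np.sqrt(d))                # |<u, v>| vs. 1/sqrt(d)
```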
Identification, using Laplace instead of Fourier, @Musco
The relationship between sampling and optimization (see [RN], Cesa-Bianchi…)
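One standard way to formalize the sampling–optimization link (a textbook fact, noted here only as a reminder): for

\[
p_\beta(x) \;=\; \frac{e^{-\beta f(x)}}{\int e^{-\beta f(x')} \, dx'},
\]

the distribution \(p_\beta\) concentrates on \(\arg\min_x f(x)\) as \(\beta \to \infty\), so (roughly) a sampler that mixes at large \(\beta\) yields an optimizer, and hardness of optimization transfers back to hardness of low-temperature sampling.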
What is a Bayesian relaxation? What are examples of its use? (Check out “sleeping”: is it an example?) What is the relationship between the finite (e.g., \(k\)-means) and infinite (e.g., CRP) versions?
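On the finite-vs-infinite relationship (a standard fact, stated for reference): the CRP predictive

\[
P(c_{n+1} = k \mid c_{1:n}) \;=\; \frac{n_k}{n + \alpha}, \qquad P(c_{n+1} = \text{new} \mid c_{1:n}) \;=\; \frac{\alpha}{n + \alpha},
\]

arises as the \(K \to \infty\) limit of a finite mixture with a symmetric \(\mathrm{Dir}(\alpha/K, \ldots, \alpha/K)\) prior on the weights, which is one precise sense in which the infinite (CRP) model is the limit of the finite one.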