(BH16) Algorithms for generalized topic modeling
Posted: 2016-08-12 , Modified: 2016-08-12
Tags: topic modeling
Posted: 2016-08-12 , Modified: 2016-08-12
Tags: topic modeling
Each document has two views \((x^1,x^2)\), ex., representing two paragraphs. There are \(k\) topic vectors \(v_1,\ldots, v_k\).
Problem: find \(w_i\).
Proof.
With \(O\pa{\fc{n-k}{\ze}\ln \prc{\de}}\) samples, with probability \(1-\de\), \(\rank(Z)=n-k\).
Proof. Induct: \(\rank\set{x_i^1-x_i^2}{i<\fc{j}{ze}\ln \pf{n}{\de}}<j\) with probability \(\fc{j\de}{n}\).Takig \(m=\Om(\rc{\ze}\ln \pf{k}{\de})\) samples, wp \(1-\de\), \(\{a_1,\ldots, a_k\}\) are extreme points of \(\conv(S_{\parallel})\).
Take \(m_2\) fresh samples and let \(\wh S_{\parallel} = \{\wh P \wh x_i^1\}\).
Let \(\ep' = \fc{\ep}{8r}\). Remove \(\wh x_{\parallel}\) from \(\wh S_{\parallel}\) if \(<\fc{p_0 \xi m_2}2\) points in \(B_{\fc{\ep'}2}(\wh x_{\parallel})\).If \(d(\wh x_{\parallel}, \conv (\wh S_{\parallel}\bs B_{6r \ep'}(\wh x))\ge 2\ep'\), add \(\wh x_{\parallel}\) to \(C\).
Cluster \(C\) by single linkage (i.e. put in same cluster if \(\le\)) \(16r\ep'=2\ep\). Return any point from cluster \(i\) as \(\wh a_i\).
Here, \(r\) is the smallest value such that \(d(a_i, \conv(\De\bs B(a_i,r\ep))) \ge \ep\).
Proof.
Show whp \(\ve{P-\wh P}_2\le \ep\).
Let \(D=X_1-X_2\). Split \[ \rc m \wh D \wh D^T - \si^2 I_n -\rc{m} DD^T = \pa{\rc m EE^T - \si^2 I_n} + \pa{\rc m DE^T + \rc m ED^T}. \] Establish 2 properties:
(Denoising) For appropriate \(\ep'\), \(m\), for any \(x\in \wh S_{\parallel}\), \(d(x,\De)\le \ep'\) (points that are kept are close), and for all \(i\in [k]\), there is \(\wh a_i\in \wh S_{\parallel}\) such that \(\ve{\wh a_i-a_i}\le \ep'\).
Proof Use Gaussian anticoncentration, and the fact that in a large sample set, the fraction of samples within any region is close to the density of the region.Returns \(\ve{a_i-\wh a_i}\le \ep\). Use the definition of \(r\) and some geometry.
Any algorithm requires \(\Om(n)\), not \(\Om(k)\) samples.
Further study: