NMF algorithm
Posted: 2016-04-22 , Modified: 2016-04-22
Tags: NMF
The setup for alternating minimization is as follows.
The plan is to show that the update satisfies the conditions needed for approximate gradient descent: each step is correlated with the right direction. We want to find \(\al,\be,\ep\) so that \[ \an{g,A-A^*} \ge \al \ve{A-A^*}^2 + \be \ve{g}^2 - \ep\] (for some norm).
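A minimal numerical sketch of this condition (not from the post; it assumes the KL-type loss \(\sum_i ((Ax)_i - y_i\ln (Ax)_i)\), whose gradient \((1-y/(Ax))x^T\) agrees in expectation with the expression in the first lemma, plus exact decoding \(x=x^*\) and noiseless \(y=A^*x^*\)):

```python
import numpy as np

# Sketch: estimate g = E[(1 - y/(Ax)) x^T] by sampling and compare
# <g, A - A*> against ||A - A*||_F^2 and ||g||_F^2.
rng = np.random.default_rng(0)
n, r, samples = 50, 10, 2000

A_star = rng.random((n, r))
A_star /= A_star.sum(axis=0)                    # column-normalized ground truth
A = np.clip(A_star + 0.05 * rng.standard_normal((n, r)), 1e-3, None)

g = np.zeros_like(A)
for _ in range(samples):
    x = rng.random(r) * (rng.random(r) < 0.3) + 1e-3   # sparse nonnegative code (x = x*)
    y = A_star @ x                                      # noiseless observation, E y = A* x
    g += np.outer(1 - y / (A @ x), x)
g /= samples

print("<g, A - A*>    =", np.sum(g * (A - A_star)))     # should be positive
print("||A - A*||_F^2 =", np.linalg.norm(A - A_star) ** 2)
print("||g||_F^2      =", np.linalg.norm(g) ** 2)
```

Here one just eyeballs that the correlation is positive and comparable to \(\ve{A-A^*}_F^2\) and \(\ve{g}_F^2\); the lemmas below are about making this quantitative.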
We have to be careful about 2 things.
We proceed in 2 lemmas.
As a first step, show that \[\an{\EE_y \pa{-y\odot \pa{\rc{Ax} -\rc{A^*x}}x^T} , A-A^*}>0,\] using \(\E y = A^* x\).
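Spelling out the computation, using \(\E y = A^* x\) and \(\an{uv^T,M} = u^TMv\): \[\an{\EE_y \pa{-y\odot \pa{\rc{Ax} -\rc{A^*x}}x^T} , A-A^*} = -\sum_i (A^*x)_i\pa{\rc{(Ax)_i}-\rc{(A^*x)_i}}((A-A^*)x)_i = \sum_i \fc{((Ax)_i-(A^*x)_i)^2}{(Ax)_i} \ge 0,\] with equality only when \(Ax = A^*x\).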
Now relax to \(x\approx x^*\). (Warning: \(x\) simply being close to \(x^*\) may not imply convergence to \(A^*\) without additional assumptions on \(A^*\), because of nonuniqueness.) \[\begin{align} \EE_{x,y} \an{-D yx^T, A-A^*} &= \EE_{x,y} [-y^T D(A-A^*) x]\\ &= \EE_{x,y} [y^T DA^*x]-1,\\ \end{align}\]using \(D = \diag\prc{Ax}\) and \(\sum_i y_i = 1\). The dependencies of the random variables are \(x^* \to y \to x\) (\(\E y = A^*x^*\)). We can’t simply replace \(\E y = A^*x^*\) because we have to average over \(x\) first, which depends on the decoding. (If we could replace \(y\) like that, we would get \(\an{x,x-x^*}_M\).)
Recall that if we’re decoding by multiplying by \(B\), we also have to threshold: the decoding is \(\text{Th}(By)\).
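For concreteness, a minimal sketch of such a decoder (the threshold value and the choice of \(B\) are illustrative assumptions; the post only specifies the \(\text{Th}(B\,\cdot)\) form):

```python
import numpy as np

def decode(y, B, tau=0.1):
    """Thresholded linear decoding Th(By): keep entries of By above tau, zero the rest.
    tau is an illustrative value, not one from the post."""
    z = B @ y
    return np.where(z > tau, z, 0.0)
```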
In Theorem 4.1, if \(A\) is biased, then we instead obtain a bound \[ (By)_i - \E x_i = \ub{(By)_i - BA^*x}{\text{w.h.p. }\le 2\la(A) \sfc{\ln r}{n}} + B(A^*-A) x^*. \] For the second term, \[ \ve{B(A-A^*)x}_{\iy} \le |B|_{\iy} \max_i \ve{A_{\cdot i} - A_{\cdot i}^*}_1 \le \la \ep.\]

We want to lower bound \[\begin{align} \an{\nb A, A-A^*} &= \sum_{i,j} \fc{y_iA_{ij}^*x_j}{(Ax)_i} - 1\\ &= \sum_i \fc{y_i(A^*x)_i}{(Ax)_i} - 1\\ &= \sum \fc{b_i(A^* x)_i}{(Ax)_i} + \sum_i \fc{(A^*(x^*-x))_i (A^*x)_i}{(Ax)_i} + \sum_{i,j} \pa{(A^*x)_i\sfc{(Ax)_j}{(Ax)_i} - (A^*x)_j\sfc{(Ax)_i}{(Ax)_j}}^2. \end{align}\]We may suppose \(\ve{A_{\bullet i} - A_{\bullet i}^*}\le \rc{\poly\log(n)}\), or something like this.
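As a quick numerical check of the second-term bound \(\ve{B(A-A^*)x}_{\iy} \le |B|_{\iy} \max_i \ve{A_{\cdot i} - A_{\cdot i}^*}_1\) (a sketch; I read \(|B|_{\iy}\) as the entrywise max and additionally assume \(\ve{x}_1\le 1\), which the bound seems to use implicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 50, 10
A_star = rng.random((n, r)); A_star /= A_star.sum(axis=0)
A = A_star + 0.01 * rng.standard_normal((n, r))
B = rng.standard_normal((r, n))
x = rng.random(r); x /= x.sum()               # assume ||x||_1 = 1

lhs = np.max(np.abs(B @ (A - A_star) @ x))                        # ||B(A - A*)x||_inf
rhs = np.max(np.abs(B)) * np.max(np.abs(A - A_star).sum(axis=0))  # |B|_inf * max_i ||A_.i - A*_.i||_1
assert lhs <= rhs
print(lhs, "<=", rhs)
```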
Try 2: INCORRECT: I mixed up \(x,x^*\) here. \[\an{\nb A, A-A^*} = \an{\pa{y_i \sfc{(Ax)_j}{(Ax)_i} - y_j \sfc{(Ax)_i}{(Ax)_j}}_{ij}, \pa{(A^*x)_i \sfc{(Ax)_j}{(Ax)_i} - (A^*x)_j \sfc{(Ax)_i}{(Ax)_j}}_{ij}}. \] It’s tempting to take \(\E_y\) first, but we can’t do that.
If \(x=x^*\) and \(y= A^*x^*\), then this can be written as a sum of squares, as above. (This is probably the same as writing \(\ve{x}_M^2\) from the previous section…) This is Lagrange’s identity.
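A numerical sanity check of that sum-of-squares step via Lagrange’s identity (a sketch; \(u,v\) stand in for \(Ax\) and \(A^*x\)):

```python
import numpy as np

# Check: sum_{i,j} (v_i sqrt(u_j/u_i) - v_j sqrt(u_i/u_j))^2
#        = 2 * [ (sum_i v_i^2/u_i) (sum_i u_i) - (sum_i v_i)^2 ]
# which is Lagrange's identity with a_i = v_i/sqrt(u_i), b_i = sqrt(u_i).
rng = np.random.default_rng(2)
m = 20
u = rng.random(m) + 0.1       # stands in for Ax (positive)
v = rng.random(m)             # stands in for A*x (nonnegative)

lhs = sum((v[i] * np.sqrt(u[j] / u[i]) - v[j] * np.sqrt(u[i] / u[j])) ** 2
          for i in range(m) for j in range(m))
rhs = 2 * ((v ** 2 / u).sum() * u.sum() - v.sum() ** 2)
assert np.isclose(lhs, rhs)
```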
We want to lower-bound this by \[ \al \ve{A-A^*}_F^2 + \be \ve{\pf{y_i}{(Ax)_i}}_2^2 \ve{x}_2^2 - \ep. \]
(5-12)
Idea: use a different potential function. \(L^2\) doesn’t make so much sense here.
Suppose \(x= (1)\) and do MWU (multiplicative weights). The gradient is \[-\diag\prc{Ax} yx^T = -\fc{y}{a}.\] We get \[a_i' = \fc{a_ie^{\eta \fc{y_i}{a_i}}}{\sumo kn a_ke^{\eta \fc{y_k}{a_k}}}.\] The metric is \[d_{KL}(y,a) = \sumo in y_i \ln \pf{y_i}{a_i}.\] The decrease is \[\begin{align} d_{KL}(y,a) - d_{KL}(y,a') &= \sumo in y_i \ln\pf{a_i'}{a_i}\\ &=\sumo in y_i \ln \fc{e^{\eta\fc{y_i}{a_i}}}{\sumo jn a_j e^{\eta \fc{y_j}{a_j}}}\\ &=\eta \sumo in \fc{y_i^2}{a_i} - \pa{\sumo in y_i}\ln \sumo jn a_j e^{\eta\fc{y_j}{a_j}} \end{align}\]Sanity check: For \(\eta\to 0\) (using \(\sumo in a_i = \sumo in y_i = 1\)), this is \(\eta(\sum \fc{y_i^2}{a_i}-\sum y_i)>0\) by \(T_2\)’s inequality.
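A small sketch of this multiplicative-weights step (the normalizations \(\sum_i a_i = \sum_i y_i = 1\) and the step size are assumptions made for the sanity check):

```python
import numpy as np

rng = np.random.default_rng(3)
n, eta = 20, 0.05
a = rng.random(n) + 0.5; a /= a.sum()     # current iterate, sum(a) = 1
y = rng.random(n) + 0.5; y /= y.sum()     # observation, sum(y) = 1

def kl(p, q):
    return np.sum(p * np.log(p / q))

# exponentiated-gradient step for the gradient -y/a, then renormalize
w = a * np.exp(eta * y / a)
a_new = w / w.sum()

# for small eta the KL potential to y should decrease
assert kl(y, a_new) < kl(y, a)
print(kl(y, a), "->", kl(y, a_new))
```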