Task
Given a corpus (a long sequence of words, e.g. the text of Wikipedia) to learn from, answer analogy questions such as ?:queen::man:woman.
Previous work, observed phenomena
The usual approach is to learn “word vectors.”
Define a context of a word \(w\) in the corpus to be the two words (say) on either side of \(w\). (Thus, for a context \(\chi=(w_{-2},w_{-1},w_1,w_2)\), \(\Pj(\chi|w)\) means the probability of observing the words of \(\chi\) in the window of two words on either side, given that the middle word is \(w\).) (Mikolov et al.)
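The methods below work with windowed co-occurrence statistics estimated from the corpus. As a minimal illustrative sketch (the function name, the window size, and the toy corpus are my own choices, not from the cited papers), the word–word co-occurrence counts \(X_{w,w'}\) used later can be tabulated as follows:

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count X[(w, w')] = number of times w' occurs within `window`
    positions of w in the token stream (symmetric window)."""
    X = Counter()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[(w, tokens[j])] += 1
    return X

# Toy usage on a tiny corpus
corpus = "the queen rules the land the king rules the land".split()
X = cooccurrence_counts(corpus, window=2)
print(X[("king", "rules")])  # how often "rules" appears within 2 words of "king"
```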
Pennington et al. and Levy and Goldberg posit the following approach.
- Model: If \(a:b::c:d\) is an analogy, then for every context \(\chi\), \[ \fc{\Pj(\chi|a)}{\Pj(\chi|b)} \approx \fc{\Pj(\chi|c)}{\Pj(\chi|d)}.\]
- Thus, the solution to the analogy \(?:b::c:d\) is \[ \amin_w \sum_\chi \pa{ \ln \pf{\Pj(\chi|w)}{\Pj(\chi|b)} - \ln \pf{\Pj(\chi|c)}{\Pj(\chi|d)} }^2.\]
- Define the pointwise mutual information \[ PMI(w,\chi) = \ln \pf{\Pj(w,\chi)}{\Pj(w)\Pj(\chi)} = \ln \pf{\Pj(\chi|w)}{\Pj(\chi)}.\] Let \(C\) be the set of possible contexts. For a word \(w\), let \(v_w\) be the vector indexed by contexts \(\chi \in C\), \[ v_w(\chi)= {PMI}(w,\chi) = \ln \pf{ \Pj(w,\chi)}{\Pj(w) \Pj(\chi)} = \ln \pf{ \Pj(\chi|w)}{\Pj(\chi)}.\] Under this embedding, the quantity being squared in the sum above is the \(\chi\)-th coordinate of \(v_w-v_b-v_c+v_d\) (the \(\ln \Pj(\chi)\) terms cancel), so the objective equals \(\ve{v_w-v_b-v_c+v_d}^2\). To find \(a\), solve \[ \amin_a \ve{v_a-v_b-v_c+v_d}^2.\]
- Algorithm: GloVe (global vectors) method (Pennington et al.). Let \(X_{w,w'}\) be the co-occurrence count for words \(w,w'\). Find low-dimensional vectors \(\wt v_w, \wt v_{w'}\) and scalar biases \(\wt b_w, \wt b_{w'}\) to minimize \[ \sum_{w,w'} f(X_{w,w'}) (\an{\wt v_w, \wt v_{w'}} - \wt b_w- \wt b_{w'} - \ln X_{w,w'})^2\] for some weighting function \(f\). They choose \(f(x) = \min\bc{\pf{x}{x_{\max}}^{.75}, 1}\) with \(x_{\max}=100\), based on experiments.
(How to optimize this?)
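One concrete answer to the question above: the objective is a weighted least-squares problem over the nonzero entries of \(X\), which can be minimized by (stochastic) gradient descent; the GloVe paper trains with AdaGrad over sampled nonzero entries. Below is a minimal full-batch sketch, together with the analogy solver \(\amin_a \ve{v_a-v_b-v_c+v_d}^2\) from the bullet above. The function names, hyperparameters, and the dense-matrix representation of \(X\) are illustrative assumptions.

```python
import numpy as np

def glove_fit(X, dim=20, steps=500, lr=0.05, x_max=100, alpha=0.75, seed=0):
    """Gradient descent on
        sum_{w,w'} f(X[w,w']) * (<v_w, v_w'> - b_w - b_w' - ln X[w,w'])^2
    over the nonzero entries of a dense co-occurrence matrix X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = 0.1 * rng.standard_normal((n, dim))   # word vectors \tilde v_w
    b = np.zeros(n)                           # biases \tilde b_w
    rows, cols = np.nonzero(X)
    logX = np.log(X[rows, cols])
    f = np.minimum((X[rows, cols] / x_max) ** alpha, 1.0)  # GloVe weighting f(x)
    for _ in range(steps):
        err = np.einsum("ij,ij->i", V[rows], V[cols]) - b[rows] - b[cols] - logX
        g = 2 * f * err                       # derivative of the squared residuals
        dV_rows = g[:, None] * V[cols]        # precompute both gradients before updating
        dV_cols = g[:, None] * V[rows]
        np.add.at(V, rows, -lr * dV_rows)
        np.add.at(V, cols, -lr * dV_cols)
        np.add.at(b, rows, lr * g)            # "+" because the biases enter with a minus sign
        np.add.at(b, cols, lr * g)
    return V, b

def solve_analogy(V, words, b_, c_, d_):
    """Return argmin_a ||v_a - v_b - v_c + v_d|| over the vocabulary,
    i.e., the answer to the analogy  ? : b :: c : d."""
    idx = {w: i for i, w in enumerate(words)}
    target = V[idx[b_]] + V[idx[c_]] - V[idx[d_]]
    return words[int(np.argmin(np.linalg.norm(V - target, axis=1)))]
```

For instance, after building a dense \(X\) from a large corpus, `solve_analogy(V, vocab, "queen", "man", "woman")` would (ideally) return `"king"` for the analogy `?:queen::man:woman`. In practice GloVe shuffles and samples the nonzero entries and adapts the step size per parameter (AdaGrad) rather than taking full-batch steps.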
Mysteries
- There is a disconnect between the definition of \(v_w\) and the estimate \(\wt v_w\). Namely, the \(v_w\) are (high-dimensional) vectors giving the PMI with all contexts, while the \(\wt v_w\) are low-dimensional vectors whose inner products \(\an{\wt v_w,\wt v_{w'}}\) approximate the log co-occurrence counts \(\ln X_{w,w'}\) (up to the biases). Why do the learned \(\wt v_w\) help in solving analogies? Why is the optimization problem in the algorithm a good proxy?
- Why is this method stable to noise?
Model
Let the dimension of the underlying space be \(d\).
- There are \(n\) words \(w\) in the dictionary \(W\), and they are associated with vectors \(v_w\in \R^d\).
- The vectors \(v_w\) are generated i.i.d. as \(v=s\cdot \wh v\), where
- \(\wh v\sim N(0,I_d)\),
- \(s\) is a random scalar with \(\si^2 := \EE[s^2]\le d\) and \(|s|\le \ka \sqrt d\) for some constant \(\ka\).
- At each time \(t\in \N\), there is a context vector \(c_t\in \R^d\).
- The vectors \(c_t\) follow a random walk, satisfying the following:
- (Uniform stationary distribution) The stationary distribution is uniform on \([-\rc{\sqrt d},\rc{\sqrt d}]^d\).
- (Small drift) The drift of the context vector satisfies \(\ve{c_{t+1}-c_t}_1\le \rc{\ln n}\). (There are more general conditions under which the theorems hold; we take this one for simplicity.)
- At each time \(t\), word \(w\) is emitted with probability
\[\begin{align}
\Pj[w|c_t] &= \rc{Z_{c_t}} \exp(\an{v_w,c_t}),\\
\text{where } Z_{c} &:= \sum_w \exp(\an{v_w,c}).
\end{align}\]
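A minimal numerical sketch of this generative model (the particular distribution of \(s\), the clipped random-walk step, and all parameter values are my own illustrative choices, chosen to roughly respect the constraints above):

```python
import numpy as np

def sample_corpus(n_words=500, d=10, T=20_000, drift=0.01, seed=0):
    """Sample a corpus from the model: word vectors v_w = s * N(0, I_d),
    a slowly drifting context vector c_t in [-1/sqrt(d), 1/sqrt(d)]^d,
    and emission P[w | c_t] = exp(<v_w, c_t>) / Z_{c_t}."""
    rng = np.random.default_rng(seed)
    s = np.sqrt(d) * rng.uniform(0.5, 1.0, size=n_words)  # |s| <= sqrt(d), E[s^2] <= d
    V = s[:, None] * rng.standard_normal((n_words, d))    # v_w = s * v_hat
    lo, hi = -1 / np.sqrt(d), 1 / np.sqrt(d)
    c = rng.uniform(lo, hi, size=d)                       # start at the (assumed) stationary distribution
    corpus = np.empty(T, dtype=int)
    for t in range(T):
        logits = V @ c
        p = np.exp(logits - logits.max())
        p /= p.sum()                                      # P[w | c_t], with softmax normalizer Z_{c_t}
        corpus[t] = rng.choice(n_words, p=p)
        # small-drift step, clipped to stay in the box (only approximately
        # uniform as a stationary distribution -- a simplification)
        c = np.clip(c + rng.uniform(-drift, drift, size=d), lo, hi)
    return V, corpus
```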
Explanation
Let \(\Pj(w,w')\) be the probability that \(w,w'\) appear consecutively at times \(t\) and \(t+1\), when \(c_t\sim U_{[-\rc{\sqrt d},\rc{\sqrt d}]^d}\) is drawn from the stationary distribution. Then the following hold.
Theorem 1: With high probability over the choice of the \(v_w\)'s,
\[\begin{align}
\forall w,w'\colon\quad \ln \Pj(w,w') &\approx \rc{2d}\ve{v_w+v_{w'}}^2 - 2\ln Z \pm o(1),\\
\forall w\colon\quad \ln \Pj(w) &\approx \rc{2d} \ve{v_w}^2 - \ln Z \pm o(1),\\
\therefore\quad \ln \fc{\Pj[w,w']}{\Pj[w]\Pj[w']} &\approx \rc d \an{v_w,v_{w'}} \pm o(1).
\end{align}\]
The left-hand side of the last line is exactly the PMI of \(w,w'\), so the theorem “explains” the first mystery: under the model, there exist low-dimensional vectors (the \(v_w\) themselves) whose inner products recover the PMI.
Proof idea:
Let \(c\) be the context vector at time \(t\) and \(c'\) be the vector at time \(t+1\).
- The main difficulty is that \(Z_c\) is intractable to compute. However, because the \(v_w\) are random (isotropic), \(Z_c\) concentrates around its mean, so we can approximate it by a constant \(Z\) (Theorem 2 in the paper).
- Because drift is small, we can make the approximation \(c'\approx c\). Then
\[\begin{align}
\Pj(w,w') &= \int_{c,c'} \Pj(w|c)\,\Pj(w'|c')\,\Pj(c,c')\,dc\,dc'\\
&\approx \rc{Z^2} \EE_{c} \exp(\an{v_w+v_{w'},c})\\
&\approx \rc{Z^2}\exp\pf{\ve{v_w+v_{w'}}^2}{2d}
\end{align}\]
using some calculus in the last step (exercise).
The calculation for \(\Pj(w)\) is even simpler.
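For the “exercise,” here is a heuristic version of the last step. Write \(u=v_w+v_{w'}\). Under the stationary distribution, \(\an{u,c}=\sum_i u_i c_i\) is a sum of \(d\) independent mean-zero terms, so for large \(d\) it is approximately Gaussian with variance \(\ve{u}^2\cdot \mathrm{Var}(c_i)\). Assuming the stationary distribution is normalized so that \(\mathrm{Var}(c_i)=\rc d\) (the normalization that matches the \(\rc{2d}\) factor in Theorem 1), the Gaussian moment generating function \(\EE e^X = e^{\tau^2/2}\) for \(X\sim N(0,\tau^2)\) gives
\[ \EE_c \exp(\an{u,c}) \approx \exp\pf{\ve{u}^2}{2d}.\]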
Algorithm
What algorithm does the theory suggest to estimate the \(v_w\)’s?
It suggests minimizing an objective as in GloVe. The weights \(f(X_{w,w'})\) are selected to compensate for the noise in \(X_{w,w'}\): when \(X_{w,w'}\) is larger on average, \(\ln X_{w,w'}\) has lower variance, so the weight is larger.
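To make this concrete: under the (purely illustrative) assumption that \(X_{w,w'}\) is roughly Poisson with mean \(\lambda\), the delta method gives \(\mathrm{Var}(\ln X_{w,w'}) \approx \mathrm{Var}(X_{w,w'})/\lambda^2 = 1/\lambda\), so the log count of a frequently co-occurring pair is less noisy and can safely be given a larger weight.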
(What improvement does it suggest?)
Followup
- Polysemy
- Weighted SVD (Yuanzhi Li)