(ALLMR16) RAND-WALK - A latent variable model approach to word embeddings

Posted: 2016-03-15 , Modified: 2016-03-15

Tags: nlp, word embeddings

Task

Given a corpus (a long sequence of words, e.g. the text of Wikipedia) to learn from, answer analogy questions such as ?:queen::man:woman.
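
For concreteness, here is the standard vector-arithmetic way such analogy queries are answered once word vectors are in hand (a minimal sketch; `vec` is a hypothetical dict mapping each word to a unit-norm numpy array):

```python
# Answer a:b :: c:d by returning the word a whose vector is closest (in cosine
# similarity) to v_b + v_c - v_d. For ?:queen::man:woman the target is
# vec["queen"] + vec["man"] - vec["woman"], which should land near "king".
import numpy as np

def solve_analogy(vec, b, c, d):
    target = vec[b] + vec[c] - vec[d]
    target /= np.linalg.norm(target)
    candidates = (w for w in vec if w not in {b, c, d})
    return max(candidates, key=lambda w: float(vec[w] @ target))
```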

Previous work, observed phenomena

The usual approach is to learn “word vectors.”

Define a context of a word \(w\) in the corpus to be the two words (say) on either side of \(w\). (Thus, for a context \(\chi=(w_{-2},w_{-1},w_1,w_2)\), \(\Pj(\chi|w)\) means the probability of observing the words of \(\chi\) in a window of two words on either side, given that the middle word is \(w\).) (Mikolov)
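
As a concrete sketch of how the co-occurrence statistics used below can be collected, here is one way to count how often each pair of words appears within two positions of each other (the function name and the choice to count unordered pairs are my own):

```python
from collections import Counter

def cooccurrence_counts(corpus, window=2):
    """corpus: a list of word tokens. Counts each unordered pair (w, w')
    once per occurrence within `window` positions of each other."""
    X = Counter()
    for i, w in enumerate(corpus):
        for j in range(i + 1, min(len(corpus), i + window + 1)):
            X[tuple(sorted((w, corpus[j])))] += 1
    return X
```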

Pennington et al. and Levy and Goldberg posit the following approach.

(How to optimize this?)

Mysteries

  1. There is a disconnect between the definition of \(v_w\) and the estimate \(\wt v_w\). Namely, the \(v_w\) are vectors giving the PMI with all contexts, while the \(\wt v_w\) are vectors such that \(\an{\wt v_w,\wt v_{w'}}\) gives the word-word co-occurrences. Why do the learned \(\wt v_w\) help in solving analogies? Why is the optimization problem in the algorithm a good proxy?
  2. Why is this method stable to noise?

Model

Let the dimension of the underlying space be \(d\). Words are generated by a slow random walk of a \(d\)-dimensional discourse (context) vector \(c_t\): at time \(t\), word \(w\) is emitted with probability \(\Pj(w|c_t)=\fc{\exp(\an{v_w,c_t})}{Z_{c_t}}\), where \(Z_c=\sum_w \exp(\an{v_w,c})\) is the partition function.
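
Here is a toy simulation of this generative model (my own sketch: the step size, the Gaussian word vectors, and projecting \(c_t\) back to the unit sphere to keep the walk roughly stationary are all illustrative choices, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d, T = 1000, 50, 10000

V = rng.normal(size=(n_words, d))           # latent word vectors v_w
c = rng.normal(size=d)
c /= np.linalg.norm(c)                      # discourse vector c_0, coordinates ~ 1/sqrt(d)

corpus = []
for t in range(T):
    logits = V @ c                          # <v_w, c_t> for every word
    p = np.exp(logits - logits.max())
    p /= p.sum()                            # Pj(w | c_t) = exp(<v_w, c_t>) / Z_{c_t}
    corpus.append(int(rng.choice(n_words, p=p)))   # emit one word at time t
    c += 0.05 * rng.normal(size=d) / np.sqrt(d)    # slow drift, so c_{t+1} is close to c_t
    c /= np.linalg.norm(c)                  # stay on the unit sphere (illustrative choice)
```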

Explanation

Let \(\Pj(w,w')\) be the probability that \(w,w'\) appear consecutively at times \(t\) and \(t+1\), where \(c_t\sim U_{[-\rc{\sqrt d},\rc{\sqrt d}]^d}\) is drawn from the stationary distribution. Then the following hold.

Theorem 1: With high probability over the choice of the \(v_w\)’s, \[\begin{align} \forall\, &w,w', & \lg \Pj (w,w')&\approx \rc{2d}|v_w+v_{w'}|^2 - 2\lg Z \pm o(1)\\ \forall\, & w,& \lg \Pj(w) &\approx \rc{2d} |v_w|^2 - \lg Z \pm o(1)\\ \therefore && \lg \fc{\Pj(w,w')}{\Pj(w)\Pj(w')} &\approx \rc d \an{v_w,v_{w'}}\pm o(1). \end{align}\]

The left-hand side of the last line is exactly the PMI of \(w\) and \(w'\), so the theorem “explains” Mystery 1.
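
A quick numerical sanity check of that last line (my own sketch; I approximate the stationary distribution by \(N(0, I_d/d)\) so that coordinates have the \(\rc{\sqrt d}\) scale, and I use the \(c'\approx c\) approximation from the proof idea below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 100, 5000                    # words, dimension, Monte Carlo samples of c
V = rng.normal(size=(n, d))                  # random word vectors, |v_w| ~ sqrt(d)
C = rng.normal(size=(m, d)) / np.sqrt(d)     # discourse vectors c ~ N(0, I/d)

P = np.exp(C @ V.T)                          # exp(<v_w, c>) for every (c, w)
P /= P.sum(axis=1, keepdims=True)            # Pj(w | c) = exp(<v_w, c>) / Z_c
p_w = P.mean(axis=0)                         # Pj(w) = E_c Pj(w | c)

pmis, preds = [], []
for w, w2 in rng.integers(0, n, size=(200, 2)):
    if w == w2:
        continue
    p_pair = np.mean(P[:, w] * P[:, w2])     # Pj(w, w') under the c' ~ c approximation
    pmis.append(np.log(p_pair / (p_w[w] * p_w[w2])))
    preds.append(V[w] @ V[w2] / d)           # the theorem's prediction <v_w, v_w'>/d
print(np.corrcoef(pmis, preds)[0, 1])        # should be close to 1 (agreement up to o(1))
```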

Proof idea:

Let \(c\) be the context vector at time \(t\) and \(c'\) be the vector at time \(t+1\).

  1. The main difficulty is that \(Z_c\) is intractable to compute. However, because the \(v_w\) are random (or isotropic), \(Z_c\) concentrates around its mean, so we can approximate it by a constant \(Z\) (Theorem 2 in the paper).
  2. Because the drift is small, we can make the approximation \(c'\approx c\). Then \[\begin{align} \Pj(w,w') &= \int_{c,c'} \Pj(w|c)\Pj(w'|c')\Pj(c,c')\,dc\,dc'\\ &\sim \rc{Z^2} \EE_{c} \exp(\an{v_w+v_{w'},c})\\ &\sim \rc{Z^2}\exp\pf{|v_w+v_{w'}|^2}{2d} \end{align}\] using some calculus in the last step (exercise; a sketch appears below).

The calculation for \(\Pj(w)\) is even simpler.
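
One way to carry out the “exercise” above, and the analogous computation for \(\Pj(w)\), under the simplifying assumption that the stationary distribution is \(N(0, I_d/d)\) (so coordinates have the \(\rc{\sqrt d}\) scale; the uniform distribution above gives the same form with a different constant in the exponent):

\[\begin{align} \EE_c \exp(\an{v,c}) &= \prod_{i=1}^d \EE \exp(v_i c_i) = \prod_{i=1}^d \exp\pf{v_i^2}{2d} = \exp\pf{|v|^2}{2d}. \end{align}\]

Taking \(v = v_w + v_{w'}\) gives the last step above, and taking \(v = v_w\) (with a single factor of \(\rc{Z}\)) gives \(\Pj(w)\approx \rc{Z}\exp\pf{|v_w|^2}{2d}\), which is the second line of Theorem 1 after taking logs.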

Algorithm

What algorithm does the theory suggest to estimate the \(v_w\)’s?

It suggests minimizing a weighted least-squares objective as in GloVe. The weights \(f_{w,w'}\) are selected to compensate for the noise in \(X_{w,w'}\): when \(X_{w,w'}\) is larger on average, it has lower (relative) variance, so the weight \(f_{w,w'}\) is larger.
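
A minimal sketch of such a weighted objective (my own illustrative code, not the paper's exact training procedure; the \(\rc{2d}\) scaling from Theorem 1 is absorbed into the scale of the vectors, and the capped weight is an assumed GloVe-style choice):

```python
import numpy as np

def objective(V, C, X, x_max=100.0):
    """V: dict word -> embedding vector; C: a shared constant playing the role of 2*log Z;
    X: dict (w, w') -> co-occurrence count (positive counts only)."""
    total = 0.0
    for (w, w2), x in X.items():
        f = min(x, x_max)            # larger (less noisy) counts get larger weight, capped
        residual = np.log(x) - np.sum((V[w] + V[w2]) ** 2) - C
        total += f * residual ** 2
    return total
```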

(What improvement does it suggest?)

Followup


  1. The results in the paper are stated as “with high probability over the choice of \(v_w\)”. This can probably be relaxed to “For all \(v_w\) that are isotropic”, where isotropic means \(\EE_w [v_wv_w^T]\) has all eigenvalues in \([1,1+\de]\).
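
A quick way to check this kind of isotropy condition empirically (sketch; \(V\) is an \(n\times d\) array of the learned vectors, and the rescaling that puts the eigenvalues near 1 is up to the user):

```python
import numpy as np

def isotropy_eigenvalues(V):
    """V: (n_words, d) array of word vectors. Returns the eigenvalues of the
    empirical second-moment matrix E_w[v_w v_w^T]; isotropy asks that they all
    lie in a narrow band [1, 1 + delta] after rescaling."""
    M = (V.T @ V) / V.shape[0]
    return np.linalg.eigvalsh(M)
```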