Task
Given a corpus (a long sequence of words, e.g. the text of Wikipedia) to learn from, answer analogy questions such as ?:queen::man:woman.
Previous work, observed phenomena
The usual approach is to learn “word vectors.”
Define a context of a word \(w\) in the corpus to be the two words (say) on either side of \(w\). (Thus, for a context \(\chi=(w_{-2},w_{-1},w_1,w_2)\), \(\Pj(\chi|w)\) means the probability of observing the words of \(\chi\) in the window of two words on either side, given that the middle word is \(w\).) (Mikolov et al.)
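The methods below work with windowed co-occurrence statistics estimated from the corpus. As a minimal illustrative sketch (the function name, the window size, and the toy corpus are my own choices, not from the cited papers), the word–word co-occurrence counts \(X_{w,w'}\) used later can be tabulated as follows:

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count X[(w, w')] = number of times w' occurs within `window`
    positions of w in the token stream (symmetric window)."""
    X = Counter()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[(w, tokens[j])] += 1
    return X

# Toy usage on a tiny corpus
corpus = "the queen rules the land the king rules the land".split()
X = cooccurrence_counts(corpus, window=2)
print(X[("king", "rules")])  # how often "rules" appears within 2 words of "king"
```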
Pennington et al. and Levy and Goldberg posit the following approach.
- Model: If \(a:b::c:d\) is an analogy, then for every context \(\chi\), \[ \fc{\Pj(\chi|a)}{\Pj(\chi|b)} \approx \fc{\Pj(\chi|c)}{\Pj(\chi|d)}.\]
- Thus, the solution to the analogy \(?:b::c:d\) is \[ \amin_w \sum_\chi \pa{ \ln \pf{\Pj(\chi|w)}{\Pj(\chi|b)} - \ln \pf{\Pj(\chi|c)}{\Pj(\chi|d)} }^2.\]
- Define the pointwise mutual information \[ PMI(w,\chi) = \ln \pf{\Pj(w,\chi)}{\Pj(w)\Pj(\chi)} = \ln \pf{\Pj(\chi|w)}{\Pj(\chi)}.\] Let \(C\) be the set of possible contexts. For a word \(w\), let \(v_w\) be the vector indexed by contexts \(\chi \in C\), \[ v_w(\chi)= {PMI}(w,\chi) = \ln \pf{ \Pj(w,\chi)}{\Pj(w) \Pj(\chi)} = \ln \pf{ \Pj(\chi|w)}{\Pj(\chi)}.\] Under this embedding, the quantity being squared in the sum above is the \(\chi\)-th coordinate of \(v_w-v_b-v_c+v_d\) (the \(\ln \Pj(\chi)\) terms cancel), so the objective equals \(\ve{v_w-v_b-v_c+v_d}^2\). To find \(a\), solve \[ \amin_a \ve{v_a-v_b-v_c+v_d}^2.\]
- Algorithm: GloVe (global vectors) method (Pennington et al.). Let \(X_{w,w'}\) be the co-occurrence count for words \(w,w'\). Find low-dimensional vectors \(\wt v_w, \wt v_{w'}\) and scalar biases \(\wt b_w, \wt b_{w'}\) to minimize \[ \sum_{w,w'} f(X_{w,w'}) (\an{\wt v_w, \wt v_{w'}} - \wt b_w- \wt b_{w'} - \ln X_{w,w'})^2\] for some weighting function \(f\). They choose \(f(x) = \min\bc{\pf{x}{x_{\max}}^{.75}, 1}\) with \(x_{\max}=100\), based on experiments.
(How to optimize this?)
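One concrete answer to the question above: the objective is a weighted least-squares problem over the nonzero entries of \(X\), which can be minimized by (stochastic) gradient descent; the GloVe paper trains with AdaGrad over sampled nonzero entries. Below is a minimal full-batch sketch, together with the analogy solver \(\amin_a \ve{v_a-v_b-v_c+v_d}^2\) from the bullet above. The function names, hyperparameters, and the dense-matrix representation of \(X\) are illustrative assumptions.

```python
import numpy as np

def glove_fit(X, dim=20, steps=500, lr=0.05, x_max=100, alpha=0.75, seed=0):
    """Gradient descent on
        sum_{w,w'} f(X[w,w']) * (<v_w, v_w'> - b_w - b_w' - ln X[w,w'])^2
    over the nonzero entries of a dense co-occurrence matrix X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = 0.1 * rng.standard_normal((n, dim))   # word vectors \tilde v_w
    b = np.zeros(n)                           # biases \tilde b_w
    rows, cols = np.nonzero(X)
    logX = np.log(X[rows, cols])
    f = np.minimum((X[rows, cols] / x_max) ** alpha, 1.0)  # GloVe weighting f(x)
    for _ in range(steps):
        err = np.einsum("ij,ij->i", V[rows], V[cols]) - b[rows] - b[cols] - logX
        g = 2 * f * err                       # derivative of the squared residuals
        dV_rows = g[:, None] * V[cols]        # precompute both gradients before updating
        dV_cols = g[:, None] * V[rows]
        np.add.at(V, rows, -lr * dV_rows)
        np.add.at(V, cols, -lr * dV_cols)
        np.add.at(b, rows, lr * g)            # "+" because the biases enter with a minus sign
        np.add.at(b, cols, lr * g)
    return V, b

def solve_analogy(V, words, b_, c_, d_):
    """Return argmin_a ||v_a - v_b - v_c + v_d|| over the vocabulary,
    i.e., the answer to the analogy  ? : b :: c : d."""
    idx = {w: i for i, w in enumerate(words)}
    target = V[idx[b_]] + V[idx[c_]] - V[idx[d_]]
    return words[int(np.argmin(np.linalg.norm(V - target, axis=1)))]
```

For instance, after building a dense \(X\) from a large corpus, `solve_analogy(V, vocab, "queen", "man", "woman")` would (ideally) return `"king"` for the analogy `?:queen::man:woman`. In practice GloVe shuffles and samples the nonzero entries and adapts the step size per parameter (AdaGrad) rather than taking full-batch steps.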
Mysteries
- There is a disconnect between the definition of \(v_w\) and the estimate \(\wt v_w\). Namely, the \(v_w\) are (high-dimensional) vectors giving the PMI with all contexts, while the \(\wt v_w\) are low-dimensional vectors whose inner products \(\an{\wt v_w,\wt v_{w'}}\) approximate the log co-occurrence counts \(\ln X_{w,w'}\) (up to the biases). Why do the learned \(\wt v_w\) help in solving analogies? Why is the optimization problem in the algorithm a good proxy?
- Why is this method stable to noise?
Model
Let the dimension of the underlying space be \(d\).
- There are \(n\) words \(w\) in the dictionary \(W\), and they are associated with vectors \(v_w\in \R^d\).
- The vectors \(v_w\) are generated i.i.d. as \(v=s\cdot \wh v\), where
- \(\wh v\sim N(0,I_d)\),
- \(s\) is a random scalar with \(\si^2 := \EE[s^2]\le d\) and \(|s|\le \ka \sqrt d\) for some constant \(\ka\).
- At each time \(t\in \N\), there is a context vector \(c_t\in \R^d\).
- The vectors \(c_t\) follow a random walk, satisfying the following:
- (Uniform stationary distribution) The stationary distribution is uniform on \([-\rc{\sqrt d},\rc{\sqrt d}]^d\).
- (Small drift) The drift of the context vector satisfies \(\ve{c_{t+1}-c_t}_1\le \rc{\ln n}\). (There are more general conditions under which the theorems hold; we take this one for simplicity.)
- At each time \(t\), word \(w\) is emitted with probability
\[\begin{align}
\Pj[w|c_t] &= \rc{Z_{c_t}} \exp(\an{v_w,c_t}),\\
\text{where } Z_{c} &:= \sum_w \exp(\an{v_w,c}).
\end{align}\]
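A minimal numerical sketch of this generative model (the particular distribution of \(s\), the clipped random-walk step, and all parameter values are my own illustrative choices, chosen to roughly respect the constraints above):

```python
import numpy as np

def sample_corpus(n_words=500, d=10, T=20_000, drift=0.01, seed=0):
    """Sample a corpus from the model: word vectors v_w = s * N(0, I_d),
    a slowly drifting context vector c_t in [-1/sqrt(d), 1/sqrt(d)]^d,
    and emission P[w | c_t] = exp(<v_w, c_t>) / Z_{c_t}."""
    rng = np.random.default_rng(seed)
    s = np.sqrt(d) * rng.uniform(0.5, 1.0, size=n_words)  # |s| <= sqrt(d), E[s^2] <= d
    V = s[:, None] * rng.standard_normal((n_words, d))    # v_w = s * v_hat
    lo, hi = -1 / np.sqrt(d), 1 / np.sqrt(d)
    c = rng.uniform(lo, hi, size=d)                       # start at the (assumed) stationary distribution
    corpus = np.empty(T, dtype=int)
    for t in range(T):
        logits = V @ c
        p = np.exp(logits - logits.max())
        p /= p.sum()                                      # P[w | c_t], with softmax normalizer Z_{c_t}
        corpus[t] = rng.choice(n_words, p=p)
        # small-drift step, clipped to stay in the box (only approximately
        # uniform as a stationary distribution -- a simplification)
        c = np.clip(c + rng.uniform(-drift, drift, size=d), lo, hi)
    return V, corpus
```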
Explanation
Let \(\Pj(w,w')\) be the probability that \(w,w'\) appear consecutively at times \(t\) and \(t+1\), when \(c_t\sim U_{[-\rc{\sqrt d},\rc{\sqrt d}]^d}\) is drawn from the stationary distribution. Then the following hold.
Theorem 1: With high probability over the choice of the \(v_w\)'s,
\[\begin{align}
\forall w,w'\colon\quad \ln \Pj(w,w') &\approx \rc{2d}\ve{v_w+v_{w'}}^2 - 2\ln Z \pm o(1),\\
\forall w\colon\quad \ln \Pj(w) &\approx \rc{2d} \ve{v_w}^2 - \ln Z \pm o(1),\\
\therefore\quad \ln \fc{\Pj[w,w']}{\Pj[w]\Pj[w']} &\approx \rc d \an{v_w,v_{w'}} \pm o(1).
\end{align}\]
The left-hand side of the last line is exactly the PMI of \(w,w'\), so the theorem “explains” the first mystery: under the model, there exist low-dimensional vectors (the \(v_w\) themselves) whose inner products recover the PMI.
Proof idea:
Let \(c\) be the context vector at time \(t\) and \(c'\) be the vector at time \(t+1\).
- The main difficulty is that \(Z_c\) is intractable to compute. However, because the \(v_w\) are random (isotropic), \(Z_c\) concentrates around its mean, so we can approximate it by a constant \(Z\) (Theorem 2 in the paper).
- Because drift is small, we can make the approximation \(c'\approx c\). Then
\[\begin{align}
\Pj(w,w') &= \int_{c,c'} \Pj(w|c)\,\Pj(w'|c')\,\Pj(c,c')\,dc\,dc'\\
&\approx \rc{Z^2} \EE_{c} \exp(\an{v_w+v_{w'},c})\\
&\approx \rc{Z^2}\exp\pf{\ve{v_w+v_{w'}}^2}{2d}
\end{align}\]
using some calculus in the last step (exercise).
The calculation for \(\Pj(w)\) is even simpler.
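For the “exercise,” here is a heuristic version of the last step. Write \(u=v_w+v_{w'}\). Under the stationary distribution, \(\an{u,c}=\sum_i u_i c_i\) is a sum of \(d\) independent mean-zero terms, so for large \(d\) it is approximately Gaussian with variance \(\ve{u}^2\cdot \mathrm{Var}(c_i)\). Assuming the stationary distribution is normalized so that \(\mathrm{Var}(c_i)=\rc d\) (the normalization that matches the \(\rc{2d}\) factor in Theorem 1), the Gaussian moment generating function \(\EE e^X = e^{\tau^2/2}\) for \(X\sim N(0,\tau^2)\) gives
\[ \EE_c \exp(\an{u,c}) \approx \exp\pf{\ve{u}^2}{2d}.\]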
Algorithm
What algorithm does the theory suggest to estimate the \(v_w\)’s?
It suggests minimizing an objective as in GloVe. The weights \(f(X_{w,w'})\) are selected to compensate for the noise in \(X_{w,w'}\): when \(X_{w,w'}\) is larger on average, \(\ln X_{w,w'}\) has lower variance, so the weight is larger.
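To make this concrete: under the (purely illustrative) assumption that \(X_{w,w'}\) is roughly Poisson with mean \(\lambda\), the delta method gives \(\mathrm{Var}(\ln X_{w,w'}) \approx \mathrm{Var}(X_{w,w'})/\lambda^2 = 1/\lambda\), so the log count of a frequently co-occurring pair is less noisy and can safely be given a larger weight.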
(What improvement does it suggest?)
Followup
- Polysemy
- Weighted SVD (Yuanzhi Li)