(ALLMR16) Linear Algebraic Structure of Word Senses, with Applications to Polysemy
Posted: 2016-03-15 , Modified: 2016-03-15
Tags: nlp, word embeddings
Word vectors don’t just live in a low-dimensional space; each one is a sparse combination of a small number of atoms, plus noise.
\[v_w = Ax_w + n_w\] where \(|\Supp(x_w)|\le s\) (\(s=5\) works well), \(A\) is an overcomplete \(d\times k\) dictionary with \(k>d\) (\(k=2000\) works well), and \(n_w\) is noise. The columns \(A_{\bullet i}\) are discourse vectors (atoms). The atoms selected by \(x_w\) correspond to the different senses (meanings) of the word, rather than just different features of it.
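As a concrete illustration, here is a minimal sketch of this sparse-coding step, assuming `vectors` holds pretrained word embeddings (a random placeholder below); the solver and variable names are my choices, not the paper's.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

d = 300                      # embedding dimension
k = 2000                     # number of atoms of discourse (overcomplete: k > d)
s = 5                        # nonzeros per word: v_w ~ A x_w + noise, |Supp(x_w)| <= s

# Placeholder for real embeddings (rows = words); swap in GloVe/word2vec vectors.
vectors = np.random.randn(10000, d)

dl = MiniBatchDictionaryLearning(
    n_components=k,
    transform_algorithm="omp",        # orthogonal matching pursuit for the sparse codes
    transform_n_nonzero_coefs=s,
    batch_size=256,
    random_state=0,
)
X = dl.fit_transform(vectors)         # sparse codes x_w, shape (n_words, k)
A = dl.components_.T                  # dictionary A, shape (d, k); columns are atoms
```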
Do dictionary learning a second time (recursively/hierarchically!)1 with \(k'=200\) and sparsity \(s'=2\); this gives meta-discourse vectors.
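A hedged sketch of this second, hierarchical round, assuming it is run on the atoms \(A\) from the previous snippet (each atom treated as a data point); the estimator and names are illustrative choices, not the paper's setup.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# `A` is the d x k dictionary from the previous snippet; a random placeholder
# is used here so this snippet runs on its own.
d, k = 300, 2000
A = np.random.randn(d, k)

k_meta, s_meta = 200, 2               # k'=200 meta-atoms, sparsity s'=2
meta_dl = DictionaryLearning(
    n_components=k_meta,
    transform_algorithm="omp",
    transform_n_nonzero_coefs=s_meta,
    random_state=0,
)
codes = meta_dl.fit_transform(A.T)    # each atom written over <= 2 meta-atoms
A_meta = meta_dl.components_.T        # d x k' matrix of meta-discourse vectors
```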
For a polysemous word, given lists (by WordNet/humans) of sets of 8 words, each set representing a sense, the test is to pick out which candidate senses actually belong to the word (a "police lineup" against distractor senses).
Think of the polysemous word \(w\) as a merge of two words \(w_1,w_2\), one per sense. Let \(r=\fc{\Pj(w_2)}{\Pj(w_1)}\). The set of words that appear with both \(w_1\) and \(w_2\) is much smaller than the set of words that appear with exactly one of them.
Suppose each \(\chi\) appears with only one \(w_i\). Let \(T_i\) be the \(\chi\) appearing with \(w_i\). Then \[\begin{align} \chi&\in T_1\implies &PMI(\chi,w) &= PMI(\chi,w_1) - \ub{\log(1+r)}{\ka_1}\\ \chi&\in T_2\implies &PMI(\chi,w) &= PMI(\chi,w_2) - \ub{\log(1+\rc r)}{\ka_2}. \end{align}\] (Here we identify words with their context vectors.)
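A quick numeric sanity check of the first identity, using made-up probabilities rather than corpus counts; the specific numbers are arbitrary.

```python
import math

# Merge w1 and w2 into a single token w and verify that, for a context word chi
# that co-occurs only with w1:
#   PMI(chi, w) = PMI(chi, w1) - log(1 + r),  where  r = P(w2)/P(w1).
P_w1, P_w2 = 0.004, 0.001          # marginal probabilities of the two "senses"
P_chi = 0.01                        # marginal probability of the context word
P_chi_w1 = 0.0002                   # joint P(chi, w1); chi never appears with w2

r = P_w2 / P_w1
pmi_chi_w1 = math.log(P_chi_w1 / (P_chi * P_w1))

# After merging: P(w) = P(w1) + P(w2) and P(chi, w) = P(chi, w1).
P_w = P_w1 + P_w2
pmi_chi_w = math.log(P_chi_w1 / (P_chi * P_w))

assert abs(pmi_chi_w - (pmi_chi_w1 - math.log(1 + r))) < 1e-12
print(pmi_chi_w1, pmi_chi_w, math.log(1 + r))
```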
Claim: under the above assumptions,
we have \[\amin_z \sum_\chi \Pj(\chi,w) (PMI(\chi,w)-\an{\chi,z})^2\approx \sumo i2(1-\ka_i c_i)v_i.\] 2
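The left-hand side is just a weighted least-squares problem, so computing it looks like the sketch below; `V`, `pmi`, and `p` are assumed inputs (context vectors \(v_\chi\), values \(PMI(\chi,w)\), and weights \(\Pj(\chi,w)\)), and the toy data only demonstrates shapes.

```python
import numpy as np

def weighted_lstsq(V, pmi, p):
    """Return argmin_z  sum_chi p_chi * (pmi_chi - <v_chi, z>)^2."""
    w = np.sqrt(p)
    z, *_ = np.linalg.lstsq(w[:, None] * V, w * pmi, rcond=None)
    return z

# Toy usage with random data just to show the shapes involved.
rng = np.random.default_rng(0)
V = rng.standard_normal((1000, 300))   # context vectors, one row per chi
pmi = rng.standard_normal(1000)        # PMI(chi, w)
p = rng.random(1000)                   # weights P(chi, w)
z = weighted_lstsq(V, pmi, p)          # the claim: z is roughly a combination of v_1, v_2
print(z.shape)
```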
Note the logarithmic scaling: the corrections \(\ka_1=\log(1+r)\) and \(\ka_2=\log(1+\rc r)\) grow only logarithmically in the frequency ratio \(r\), so even senses much rarer than the dominant one still leave a detectable trace in the vector.