NLP 2016 Senior presentations
Posted: 2016-05-14 , Modified: 2016-05-14
Tags: nlp
NLP group meeting 5-14-16
Longitudinal analysis, with 5000 documents per person spanning years, drawn from private correspondence (more unfiltered). Use presidential letters from Adams, Jefferson, and Washington.
Use the Harvard Inquirer categories, which put words into hand-crafted categories (ex. “strong” contains audacity, battle, wealthy, charisma, clever, climax, further). There are 200 categories; the number of words per category varies, and words can appear in multiple categories.
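A minimal sketch of this style of lexicon scoring, with a hypothetical two-category lexicon standing in for the real Inquirer dictionaries:

```python
from collections import Counter

# Hypothetical mini-lexicon in the Harvard Inquirer style:
# each category maps to a set of member words.
CATEGORIES = {
    "strong":   {"audacity", "battle", "wealthy", "charisma", "clever", "climax"},
    "negative": {"desert", "defeat", "doubt", "loss"},
}

def category_counts(text):
    """Count how many tokens of a document fall in each category."""
    counts = Counter()
    for tok in text.lower().split():
        for cat, words in CATEGORIES.items():
            if tok in words:
                counts[cat] += 1   # a word may hit several categories
    return counts

print(category_counts("The battle was a defeat and a loss"))
# Counter({'negative': 2, 'strong': 1})
```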
Start with a specific event and analyze the way the subject responds. For example, use Washington’s early campaigns from the French and Indian War. Find keywords and phrases relating specifically to the event, then use these keywords to find other letters in the same category, as in the sketch below.
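One way to implement the keyword-retrieval step is TF-IDF similarity against a seed keyword list; the letters and keywords here are invented stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented mini-corpus; in practice, the full set of letters.
letters = [
    "The militia deserted after the defeat at the fort.",
    "I dined with the ambassador and discussed the treaty.",
]
seed = "militia desertion fort defeat"   # hypothetical event keywords

vec = TfidfVectorizer(stop_words="english")
doc_matrix = vec.fit_transform(letters)
scores = cosine_similarity(vec.transform([seed]), doc_matrix).ravel()
ranked = scores.argsort()[::-1]          # letters most related to the event first
print(ranked, scores[ranked])
```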
Differentiate between the attitude in these letters and overall attitude using a difference-of-means statistical test. There were differences in overstatement, negative, understatement, vice, and arousal (negative).
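The difference-of-means comparison could look like the following; the per-letter scores are random stand-ins, and Welch’s t-test is an assumption (the talk only said “difference of means”):

```python
import numpy as np
from scipy.stats import ttest_ind

# Per-letter category rates (e.g., fraction of "understatement" words);
# random stand-ins for the real scores.
rng = np.random.default_rng(0)
event_scores = rng.normal(0.05, 0.01, size=40)       # letters about the event
baseline_scores = rng.normal(0.03, 0.01, size=400)   # all other letters

t, p = ttest_ind(event_scores, baseline_scores, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.3g}")
```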
In the early French and Indian War, Washington had limited resources, Americans would desert, etc.
Use Lasswell categories for general event analysis:
Use a Bayesian classifier for all letters. Assign each letter to the category it correlates with most, provided the correlation is at least a certain threshold.
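A sketch of the thresholded classifier, using scikit-learn’s multinomial naive Bayes as a stand-in Bayesian classifier; the training letters, labels, and threshold are all hypothetical:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: letters already tagged with an event category.
train_texts = ["the militia deserted the fort", "congress debated the treaty"]
train_labels = ["war", "politics"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

probs = clf.predict_proba(["the fort fell and the men deserted"])[0]
best = np.argmax(probs)
THRESHOLD = 0.6  # assumed cutoff; the talk only said "a certain threshold"
label = clf.classes_[best] if probs[best] >= THRESHOLD else None
print(label)
```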
Washington and Adams react to power loss differently. For example, Washington scored higher in both the active and passive categories, and both understated and overstated more.
Vice put together with self-pronouns means self-doubt.
The goal was to find an interesting way to measure the resilience of a president in the face of tough events. I found some metrics which yielded useful, informative results.
Another example: Adams was pessimistic and powerless about not being able to go home; Jefferson was closely involved with the education of his kids.
How could this be used? Facebook posts, emails (Clinton, Enron, WikiLeaks). The Panama Papers?
Goal: create a system for machine comprehension. Previous work focused on reading questions for young children (MCTest), with very low accuracy on “why” questions. Most of these are “bag of words” models.
But we want to capture semantic and structural information. We use SAT questions, which are longer and more complex, and answering them requires understanding deeper information about the context. The passage and question may use different words (“angry” vs. “incensed”), etc.
We use a knowledge graph consisting of entities and relations \(\an{e_1,r,e_2}\). Entities are nodes, and relations are directed edges. We can ask queries \(\an{e_1,?,e_2}\) and \(\an{e_1,r,?}\).
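A toy triple store makes the query notation concrete; this is an illustration, not the system from the talk:

```python
class TripleStore:
    """Toy knowledge graph: triples (e1, r, e2) with simple wildcard queries."""
    def __init__(self):
        self.triples = set()

    def add(self, e1, r, e2):
        self.triples.add((e1, r, e2))

    def query(self, e1=None, r=None, e2=None):
        # None plays the role of "?" in <e1, ?, e2> or <e1, r, ?>
        return [t for t in self.triples
                if (e1 is None or t[0] == e1)
                and (r is None or t[1] == r)
                and (e2 is None or t[2] == e2)]

kg = TripleStore()
kg.add("dog", "likes", "sausage")
print(kg.query(e1="dog", r="likes"))   # [('dog', 'likes', 'sausage')]
```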
(Google’s knowledge graph is based on Freebase, 300 GB. NELL at Carnegie Mellon is trained to extract triples from the Internet.)
We focus on Subject-Verb-Direct object (ex. dog likes sausage) or copular relations (subject and description, ex. dog is green).
We used the Stanford dependency parser. We can do pronoun-antecedent resolution as well.
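A sketch of SVO-triple extraction, using spaCy in place of the Stanford parser the talk used:

```python
import spacy  # swapped in for the Stanford dependency parser

nlp = spacy.load("en_core_web_sm")

def svo_triples(text):
    """Extract (subject, verb, direct-object) triples from parsed sentences."""
    triples = []
    for tok in nlp(text):
        if tok.pos_ == "VERB":
            subs = [c for c in tok.children if c.dep_ == "nsubj"]
            objs = [c for c in tok.children if c.dep_ == "dobj"]
            for s in subs:
                for o in objs:
                    triples.append((s.lemma_, tok.lemma_, o.lemma_))
    return triples

print(svo_triples("The dog likes sausage."))  # [('dog', 'like', 'sausage')]
```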
Represent the knowledge graph as a 3-D array where each slice \(X_k\) is the adjacency matrix for relation type \(k\). Approach: find a rank-\(r\) approximation \[X_k\approx AR_kA^T\] (\(A\) is \(n\times r\), \(R_k\) is \(r\times r\), \(n\) is the number of entities) giving latent features. Use alternating least squares for a semantically smooth embedding. Answer questions using the factorization. (The slices are dependent because \(A\) is shared.)
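A minimal sketch of the shared-\(A\) factorization, using plain gradient descent rather than the alternating least squares the talk used; shapes follow the formula above:

```python
import numpy as np

def factorize(X, r, lr=0.005, iters=2000, seed=0):
    """Rank-r factorization X_k ~ A @ R_k @ A.T with a shared entity matrix A.

    X: array of shape (K, n, n), one adjacency slice per relation type.
    Returns A (n x r) and R (K x r x r).
    """
    rng = np.random.default_rng(seed)
    K, n, _ = X.shape
    A = rng.normal(scale=0.1, size=(n, r))
    R = rng.normal(scale=0.1, size=(K, r, r))
    for _ in range(iters):
        grad_A = np.zeros_like(A)
        for k in range(K):
            E = X[k] - A @ R[k] @ A.T                       # residual of slice k
            grad_A += -2 * (E @ A @ R[k].T + E.T @ A @ R[k])
            R[k] -= lr * (-2 * A.T @ E @ A)                 # update relation core
        A -= lr * grad_A                                    # A is shared across slices
    return A, R

# Toy graph: 2 relation types over 4 entities.
X = (np.random.default_rng(1).random((2, 4, 4)) < 0.3).astype(float)
A, R = factorize(X, r=2)
```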
Stories about a character make one of the slices dense; explanatory articles are a lot sparser. Most tensors are sparse.
(Is each slice factorized separately? Use RESCAL, which shares \(A\) across slices and factorizes them jointly.)
Use a neural network to predict answers. Embed the tensor and the vectorized query \(q\) as inputs. Train the network to recognize what information is relevant and output an answer choice.
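A sketch of such a network with scikit-learn’s MLPClassifier; all feature dimensions and data here are random stand-ins:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_questions, d_tensor, d_query = 300, 32, 32

# Stand-in features: a pooled tensor embedding concatenated with the query vector.
X = np.hstack([rng.normal(size=(n_questions, d_tensor)),
               rng.normal(size=(n_questions, d_query))])
y = rng.integers(0, 4, size=n_questions)   # index of the correct answer choice

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
net.fit(X, y)                              # the talk trained on 2/3 of the questions
print(net.predict(X[:1]))                  # predicted answer choice for one question
```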
The tensor model trains independently, while the NN needs to be trained on 2/3 of the questions. The tensor model is consistent but has a low ceiling.
The tensor model completes a triple rather than just giving an answer choice.
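Continuing the hypothetical factorize sketch above, completing \(\an{e_1,r,?}\) amounts to scoring every candidate tail entity:

```python
import numpy as np  # reuses A, R from the factorize sketch above

def complete(A, R, k, e1, top=3):
    """Rank candidate tails e2 for the query (e1, r_k, ?)."""
    scores = A[e1] @ R[k] @ A.T        # predicted strength of (e1, r_k, e2) for every e2
    return np.argsort(scores)[::-1][:top]

print(complete(A, R, k=0, e1=0))       # top-scoring entities completing the triple
```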
Can we use a mathematical approach to constructing the WordNet graph?
Current automated methods include merging or extending. Use multi-language translations (French WordNet), scrape Wiktionary, or use rule-based/automatic methods.
Use word embeddings. Optimize the squared-norm objective \[\min_{v\in \R^d, C\in \R} \sum_{(i,j)\in [V]^2} f(X_{ij}) \left(\ve{v_i+v_j}_2^2 + C-\log X_{ij}\right)^2.\] (This rests on a distributional assumption.) Word vectors have been used in solving analogies, named-entity recognition, and word similarity.
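A gradient-descent sketch of this squared-norm objective; the weighting function \(f\) is an assumption (a GloVe-style cap), since the talk didn’t specify it:

```python
import numpy as np

def train_sn(X, d=50, lr=1e-4, iters=500, xmax=100, alpha=0.75, seed=0):
    """Fit min sum_ij f(X_ij) (||v_i + v_j||^2 + C - log X_ij)^2 by gradient descent.

    X: dense co-occurrence count matrix (V x V); zero entries are skipped.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(scale=0.1, size=(X.shape[0], d))
    C = 0.0
    I, J = np.nonzero(X)                              # observed co-occurrences only
    logX = np.log(X[I, J])
    f = np.minimum(1.0, (X[I, J] / xmax) ** alpha)    # assumed GloVe-style weighting
    for _ in range(iters):
        s = v[I] + v[J]
        err = (s * s).sum(axis=1) + C - logX          # residual per pair
        g = 4 * (f * err)[:, None] * s                # gradient of squared residual w.r.t. v_i
        upd = np.zeros_like(v)
        np.add.at(upd, I, g)                          # accumulate over repeated indices
        np.add.at(upd, J, g)
        v -= lr * upd
        C -= lr * 2 * np.sum(f * err)
    return v, C
```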
Use the discourse model: a corpus is generated by a random walk over the discourse space \(\R^d\).
Arora et al. used a learned dictionary of atoms to represent the meanings of a word \(w\in V\), with a dictionary \(A\in \R^{K\times d}\) and sparsity constraint \(s\): \[\min_{A,\alpha} \sum_{w\in V} \ve{v_w - A^T\alpha_w}_2^2 \quad \text{s.t. } \ve{\alpha_w}_0\le s.\]
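Scikit-learn’s DictionaryLearning gives a quick sketch of this sparse coding step; the embedding matrix and the values of \(K\) and \(s\) here are stand-ins:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Stand-in embeddings; in practice these are the trained word vectors v_w.
vectors = np.random.default_rng(0).normal(size=(500, 50))

dl = DictionaryLearning(n_components=100,             # K atoms of discourse
                        transform_algorithm="omp",    # enforces sparse codes
                        transform_n_nonzero_coefs=5)  # sparsity constraint s
alphas = dl.fit_transform(vectors)                    # sparse codes, shape (|V|, K)
atoms = dl.components_                                # dictionary A in R^{K x d}
```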
Consider hypernyms. Detecting them with embeddings fails on real datasets because of word-relation sparsity and the similar-relations problem (difficulty discerning hypernyms from co-hypernyms).
There is a spatial distinction between hyponym-hypernym pairs and random pairs, but not between hyponym-hypernym and co-hypernym pairs (words sharing the same hypernym).
We can link words to synsets in the Princeton WordNet: given a word \(w\) in a foreign language, map it to the corresponding Princeton WordNet synsets.
Ex. The word “container” doesn’t exist in Dutch. (cf. Hofstadter, Surfaces and Essences.)