Ad verba per numeros

Tuesday, September 22, 2009, 09:36 PM
It's only Tuesday and I've already read (OK, skimmed) a bunch of really inspiring and enjoyable papers.

The first one I'd like to remark on is "Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict" by Monroe, Colaresi, and Quinn.

Entries in the burgeoning "text-as-data" movement are often accompanied by lists or visualizations of how word (or other lexical feature) usage differs across some pair or set of documents. These are intended either to establish some target semantic concept (like the content of partisan frames) to estimate word-specific measures that feed forward into another analysis (like locating parties in ideological space) or both. We discuss a variety of techniques for selecting words that capture partisan, or other, differences in political speech and for evaluating the relative importance of those words. We introduce and emphasize several new approaches based on Bayesian shrinkage and regularization. We illustrate the relative utility of these approaches with analyses of partisan, gender, and distributive speech in the U.S. Senate.

This paper discusses different ways to discover those terms that really define a party's position on a certain topic; starting from there, the authors review several approaches and point out their weaknesses. Many of these issues are highly relevant to other NLP and IR tasks, so I've selected some excerpts for your reflecting pleasure:

One approach, standard in the machine learning literature, is to treat this [finding partisan terms] as a classification problem. In our example, we would attempt to find the words (w) that significantly predict partisanship (p). A variety of established machine learning methods could be used [...]. These approaches would attempt to find some classifier function that mapped words to some unknown party label. The primary problem of this approach, for our purposes, is that it gets the data generation process backwards. Party is not plausibly a function of word choice. Word choice is (plausibly) a function of party.
A common response [...] in many natural language processing applications is to eliminate "function" or "stop" words that are deemed unlikely to contain meaning. [...] We note, however, the practice of stop word elimination has been found generally to create more problems than it solves, across natural language processing applications. Manning et al. (2008) observe: "The general trend ... over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever". They give particular emphasis to the problems of searching for phrases that might disappear or change meaning without stop words (e.g., "to be or not to be"). More to the point, this ad hoc solution diagnoses the problem incorrectly. Function words are not dominant in the partisan word lists here because they are function words, but because they are frequent. [...] Eliminating function words not only eliminates words inappropriately but it also elevates high-frequency non–stop words inappropriately.
Eliminating low-frequency words. Although this is a very basic statistical idea, it is commonly unacknowledged in simple feature selection and related ranking exercises. A common response is to set some frequency "threshold" for features to "qualify" for consideration. Generally, this simply removes the most problematic features without resolving the issue.
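The frequency problem the excerpt describes is easy to reproduce: ranking words by raw count differences rewards sheer volume, so the most frequent words (function words or not) float to the top. A toy illustration with invented counts:

```python
from collections import Counter

# Invented toy counts: "the" is frequent for both parties, while
# "liberty" is rare but strongly one-sided.
dem = Counter({"the": 1000, "health": 40, "liberty": 2})
rep = Counter({"the": 1040, "health": 10, "liberty": 12})

# Ranking by raw count difference puts the function word on top,
# purely because its counts are large enough for noise to dominate.
diff = {w: abs(dem[w] - rep[w]) for w in set(dem) | set(rep)}
top_by_diff = max(diff, key=diff.get)  # "the" wins by volume alone
```

This is exactly why naive lists of "most different" words end up full of high-frequency terms, and why dropping stop words only shifts the problem to the next tier of frequent words.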

Suffice it to say that Monroe et al. clearly prefer model-based approaches, and they provide a thorough description of their Bayesian one.
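For the curious, the estimator the authors advocate is a z-scored log-odds ratio with an informative Dirichlet prior. Here is a minimal sketch of that idea; the toy counts and the choice of prior scale (10% of the pooled counts) are mine, not theirs:

```python
import math
from collections import Counter

def log_odds_dirichlet(counts_a, counts_b, prior):
    """Z-scored log-odds ratio with an informative Dirichlet prior,
    in the spirit of Monroe et al. Large positive z leans toward
    group A, large negative toward group B; frequent but non-partisan
    words stay near zero."""
    alpha0 = sum(prior.values())
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    scores = {}
    for w, aw in prior.items():
        ya, yb = counts_a.get(w, 0), counts_b.get(w, 0)
        delta = (math.log((ya + aw) / (n_a + alpha0 - ya - aw))
                 - math.log((yb + aw) / (n_b + alpha0 - yb - aw)))
        var = 1.0 / (ya + aw) + 1.0 / (yb + aw)   # approximate variance
        scores[w] = delta / math.sqrt(var)
    return scores

# Toy example: the prior is the pooled corpus scaled down.
dem = Counter({"the": 50, "choice": 9, "life": 1})
rep = Counter({"the": 52, "choice": 1, "life": 10})
prior = {w: 0.1 * c for w, c in (dem + rep).items()}
z = log_odds_dirichlet(dem, rep, prior)
# "the" lands near zero despite being the most frequent word,
# while "choice" and "life" get strong opposite-sign scores.
```

Note how the prior does the work that stop-word lists and frequency thresholds fail to do: it shrinks estimates for all words, which tames the variance of both the very frequent and the very rare ones.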

Not totally unrelated to this, I think these two nifty pieces of software may be of interest to you: the Java Implementation of Naive Credal Classifier 2 by Corani and Zaffalon, and ReadMe: Software for Automated Content Analysis by Hopkins et al.

Two other interesting/intriguing papers: "Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology" by Grimmer and King, and "An Automated Information Extraction Tool for International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design" by King and Lowe. After reading the work by Monroe et al., I'd love to know what they would say about the approach by Grimmer and King in the first paper :)

We begin with a set of text documents of variable length. For each, we adopt the most common procedures for representing them quantitatively: we transform to lower case, remove punctuation, replace words with their stems, and drop words appearing in fewer than 1% or more than 99% of documents. For English documents, about 3,500 unique word stems usually remain in the entire corpora.
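That preprocessing recipe (lowercase, strip punctuation, stem, drop terms in fewer than 1% or more than 99% of documents) can be sketched in a few lines. The `stem` function below is a crude suffix-stripper of my own standing in for a real stemmer such as Porter's; everything else follows the quoted description:

```python
import re
from collections import Counter

def preprocess(docs, min_df=0.01, max_df=0.99):
    """Sketch of the pipeline Grimmer and King describe: lowercase,
    remove punctuation, stem, then drop terms whose document
    frequency falls outside [min_df, max_df]."""
    def stem(word):  # toy stand-in, not a real stemming algorithm
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    tokenized = [
        [stem(t) for t in re.findall(r"[a-z]+", doc.lower())]
        for doc in docs
    ]
    n = len(docs)
    df = Counter()                     # document frequency per stem
    for toks in tokenized:
        df.update(set(toks))
    vocab = {w for w, c in df.items() if min_df <= c / n <= max_df}
    return [[t for t in toks if t in vocab] for toks in tokenized]
```

Note that with the 99% cap, a word occurring in every document (e.g., "the" in most English corpora) is discarded automatically, which is the authors' frequency-based answer to the stop-word question discussed above.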

Should you wonder about my modest opinion: I see no problem with using thresholds; most of the time they work fine (after you tune them, of course). And, by the way, word independence (required by naive Bayes approaches) is just a different working assumption ;)

The last paper I'd like to mention is "Using cocitation information to estimate political orientation in web documents" by Miles Efron.

This paper introduces a simple method for estimating cultural orientation, the affiliation of online entities in a polarized field of discourse. In particular, cocitation information is used to estimate the political orientation of hypertext documents. A type of cultural orientation, the political orientation of a document is the degree to which it participates in traditionally left- or right-wing beliefs. Estimating documents’ political orientation is of interest for personalized information retrieval and recommender systems. In its application to politics, the method uses a simple probabilistic model to estimate the strength of association between a document and left- and right-wing communities. The model estimates the likelihood of cocitation between a document of interest and a small number of documents of known orientation. The model is tested on three sets of data, 695 partisan web documents, 162 political weblogs, and 198 nonpartisan documents. Accuracy above 90% is obtained from the cocitation model, outperforming lexically based classifiers at statistically significant levels.

This paper is extremely interesting. The main idea is that "a man is known by the company he keeps": if your website is frequently cocited with left- or right-wing websites, then it can be placed on the left-right political spectrum accordingly.
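The "company he keeps" intuition can be caricatured in a couple of lines. This is a deliberate simplification of my own, not Efron's actual probabilistic model: just a smoothed proportion of cocitations with seed pages of known orientation.

```python
def orientation(cocite_left, cocite_right, smooth=1.0):
    """Toy estimate of P(left-leaning | cocitations): the smoothed
    fraction of a page's cocitations that involve left-wing seed
    pages. With no evidence at all it backs off to 0.5."""
    left = cocite_left + smooth
    right = cocite_right + smooth
    return left / (left + right)
```

Even this caricature makes the appeal of the method clear: it needs no lexical features at all, which is presumably why it beats the lexically based classifiers on partisan documents.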

This reminds me of an extremely upsetting piece of research disclosed very recently: the so-called Gaydar project at MIT, which aims to determine whether a user is gay based on his or her Facebook contacts. IMHO, these guys have totally crossed the line.
