Ad verba per numeros
Tuesday, January 11, 2011, 06:51 PM
Therefore, I argued that using n-grams could be a more robust approach; besides, the model could later be retrained on different data once we are sure the tweets are actually written in a given language.

So, first of all, what's an n-gram? A subsequence of n successive characters extracted from a given text string. For example, from the previous tweet we'd obtain the following 3-grams:
@justinbieber omg Justin bieber ur amazing lol : )
@ju jus ust sti ... lol ol l : : )
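Extracting such n-grams is straightforward. Here is a minimal sketch in PHP (a hypothetical helper of my own, not the downloadable code from this post), assuming UTF-8 input:

```php
<?php
// Hypothetical helper (not the post's actual code): extract all character
// n-grams from a string, multibyte-safe so accented letters count as
// single characters.
function ngrams($text, $n = 3) {
    $grams = array();
    $len = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i <= $len - $n; $i++) {
        $grams[] = mb_substr($text, $i, $n, 'UTF-8');
    }
    return $grams;
}

print_r(ngrams('@justinbieber omg Justin bieber ur amazing lol : )', 3));
// => @ju, jus, ust, sti, tin, inb, ...
```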
N-grams can be obtained from texts of any length, so the underlying idea is to collect a list of n-grams (ideally with their relative frequencies or, even better, their usage probabilities) from a collection of documents. The collection should be similar to the documents you intend to identify; that is, if you are going to classify tweets you shouldn't train on Shakespeare's works. In practice, however, you will probably use whatever documents you can find (for this post I've used the text of "The Universal Declaration of Human Rights").

Then, for any document you want to classify, you just need to obtain a similar n-gram vector and compute the similarity between the two (e.g. cosine, Jaccard, Dice, etc.). Needless to say, when the document to classify is very short (such as a tweet), most of the n-grams appearing within it will be unique, and awkward results can be obtained. When performing language identification on such short texts it's much better to simply count how many n-grams from the short text appear in each language model and choose the language with the largest coverage (a sketch of this approach appears after the classification results below).

For instance, let's take the following short texts:
(German, 48 4-grams) Als er erwachte, war der Dinosaurier immer noch da.
(Galician, 44 4-grams) Cando espertou, o dinosauro aínda estaba alí.
(Spanish, 51 4-grams) Cuando despertó, el dinosaurio todavía estaba allí.
(Basque, 43 4-grams) Esnatu zenean, dinosauroa han zegoen oraindik.
(Catalan, 46 4-grams) Quan va despertar, el dinosaure encara era allà.
(English, 43 4-grams) When [s]he awoke, the dinosaur was still there.
Using the model I've built, each of the texts has the following significant intersections:
Als er erwachte, war der Dinosaurier immer noch da. => German, 27 common 4-grams
Cando espertou, o dinosauro aínda estaba alí. => Portuguese, 18 common 4-grams
Cando espertou, o dinosauro aínda estaba alí. => Galician, 17 common 4-grams
Cuando despertó, el dinosaurio todavía estaba allí. => Spanish, 21 common 4-grams
Cuando despertó, el dinosaurio todavía estaba allí. => Asturian, 20 common 4-grams
Esnatu zenean, dinosauroa han zegoen oraindik. => Basque, 17 common 4-grams
Quan va despertar, el dinosaure encara era allà. => Catalan, 21 common 4-grams
Quan va despertar, el dinosaure encara era allà. => Spanish, 20 common 4-grams
Quan va despertar, el dinosaure encara era allà. => Asturian, 20 common 4-grams
When [s]he awoke, the dinosaur was still there. => English, 15 common 4-grams
If we choose the language with the largest intersection, each text is classified as follows:
Als er erwachte, war der Dinosaurier immer noch da. => German, Correct!
Cando espertou, o dinosauro aínda estaba alí. => Portuguese, Incorrect, but a near miss
Cuando despertó, el dinosaurio todavía estaba allí. => Spanish, Correct!
Esnatu zenean, dinosauroa han zegoen oraindik. => Basque, Correct!
Quan va despertar, el dinosaure encara era allà. => Catalan, Correct!
When [s]he awoke, the dinosaur was still there. => English, Correct!
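Here is the coverage-based scoring sketched in PHP, building on the ngrams() helper above. To be clear, this is a hypothetical rewrite, not the code you can download from this post: the function names, the use of unique n-grams for membership testing, and the udhr_*.txt training file names are all assumptions of mine.

```php
<?php
// Hypothetical sketch of coverage-based language identification for short
// texts; assumes the ngrams() helper defined earlier.

// One model per language: the set of unique n-grams in its training text.
// array_flip turns the grams into array keys, giving O(1) lookups via isset().
function build_model($trainingText, $n = 4) {
    return array_flip(array_unique(ngrams($trainingText, $n)));
}

// Coverage: how many n-grams of the short text appear in the model.
function coverage($text, $model, $n = 4) {
    $hits = 0;
    foreach (ngrams($text, $n) as $gram) {
        if (isset($model[$gram])) {
            $hits++;
        }
    }
    return $hits;
}

// Choose the language whose model covers the most n-grams of the text.
function identify($text, $models, $n = 4) {
    $best = null;
    $bestHits = -1;
    foreach ($models as $language => $model) {
        $hits = coverage($text, $model, $n);
        if ($hits > $bestHits) {
            $bestHits = $hits;
            $best = $language;
        }
    }
    return $best;
}

// Train on whatever samples are at hand, e.g. the Universal Declaration
// of Human Rights in each language (file names are assumptions).
$models = array(
    'German'  => build_model(file_get_contents('udhr_de.txt')),
    'Spanish' => build_model(file_get_contents('udhr_es.txt')),
    'English' => build_model(file_get_contents('udhr_en.txt')),
);
echo identify('Als er erwachte, war der Dinosaurier immer noch da.', $models);
// => German
```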
Another advantage of n-gram models is that they degrade gracefully. For instance, classifying a short text written in Galician as Portuguese is rather acceptable. Or let's take this text:
"Hrvatski jezik skupni je naziv za standardni jezik Hrvata, i za skup narjecja i govora kojima govore ili su nekada govorili Hrvati."It's actually Croatian, but since I did not train my system on Croatian samples it's classified as Serbian which, again, is reasonable.In addition to this (hopefully) explanatory post, I've developed a bit of source code. You can try the demo and download the source code and data files (it's PHP, so proceed at your discretion).As usual, if you want to discuss something on this post, just tweet me at @pfcdgayo