PhD Dissertation

In June 2005 I finished my PhD dissertation, entitled "blindLight: Una nueva técnica para procesamiento de texto no estructurado mediante vectores de n-gramas de longitud variable con aplicación a diversas tareas de tratamiento de lenguaje natural", which describes (in Spanish) a new statistical NLP method based on character n-grams, with applications to clustering, IR, keyphrase extraction and text summarization.

My dissertation defense took place on December 16th, 2005. The members of the committee were José Antonio López Brugos (chair) and Jose Emilio Labra Gayo, both from the University of Oviedo, Julio Gonzalo from UNED, Elena Jurado from UEX and José Manuel Gutiérrez from the University of Cantabria.

From left to right: Elena Jurado, José Manuel Gutiérrez, me, Jose Emilio Labra, Darío Álvarez (thesis advisor), Julio Gonzalo and José Antonio López Brugos.

Abstract

It is possible to automatically transform texts written in any Western language into variable-length n-gram vectors which preserve some of the semantics of the source texts. Such vectors can transform the primary information, be compared, and even be combined with each other, thereby highlighting much of the semantics of the original document.

N-grams have been frequently used to perform different natural language processing tasks. Such methods show many features in common: (1) documents are represented using a vector space where n-grams are taken as coordinates and n-gram frequencies within documents as vector weights, (2) many of these techniques require a background which plays a role similar to that of lists of stop words and (3) the cosine similarity is normally used to compare documents to each other and documents to queries.
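As a rough illustration of the "classical" scheme just described (not code from the dissertation), the following Python sketch builds character n-gram vectors weighted by relative frequency and compares them with cosine similarity. The function names, the default n-gram size of 3 and the lowercasing step are assumptions made for this example, and the background of item (2) is left out for brevity.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_vector(text, n=3):
    """Map each n-gram to its relative frequency within the text.

    Lowercasing and n=3 are illustrative choices, not prescribed by the source.
    """
    counts = Counter(char_ngrams(text.lower(), n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    document = ngram_vector("Automatic text summarization with n-grams")
    query = ngram_vector("text summarization")
    print(cosine_similarity(document, query))
```

Storing the vectors as dictionaries keeps them sparse, so only the n-grams that actually occur in a text take up space.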

blindLight is a new approach, proposed by this researcher, related to such "classical" methods but with two major changes: (1) n-gram relative frequencies within documents are no longer used as vector weights and are replaced by n-gram significances, and (2) cosine similarity is abandoned in favor of a new measure inspired by sequence alignment techniques, although less computationally expensive. The new proposal can be used to perform automatic document clustering and categorization, information retrieval, keyphrase extraction and automatic summarization. Such tasks are essential tools to fight "information overload" and improve the user experience.

Downloads

You can download the dissertation and the slides I used for the defense. If you have any problems downloading these files, just email me.

I've also split the PDF document into smaller files for your convenience:

You may also find these papers of interest.