Blog < Daniel Gayo-Avello @ the University of Oviedo

Information extraction and Google Squared

HOT!, NLP Techniques, Information Retrieval, Services and Utilities, Hype-Buzz
Thursday, June 4, 2009, 11:18 AM

Let's start with a short definition from Wikipedia (yeah, I know, real men don't cite encyclopedias):

Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured machine-readable documents.

In other words, the goal of information extraction is to obtain table schemas from raw data and then complete the records in such tables.
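To make that a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption of mine for illustration: the sentences are made up, and the single hand-written pattern ("The X of Y is Z.") is far simpler than anything a real extraction system would rely on. It only shows the two steps mentioned above: discovering the columns (the schema) and filling in the records.

```python
import re

# Hypothetical raw, unstructured input; the sentences are made up for illustration.
RAW_TEXT = [
    "The capital of France is Paris.",
    "The population of France is 64 million.",
    "The capital of Spain is Madrid.",
    "Nothing to extract from this sentence.",
]

# One hand-written pattern: "The <attribute> of <entity> is <value>."
PATTERN = re.compile(
    r"^The (?P<attribute>\w+) of (?P<entity>[\w ]+?) is (?P<value>.+?)\.$"
)


def extract_table(sentences):
    """Return (schema, records).

    schema is the sorted list of attributes seen in the text;
    records maps each entity to a dict of attribute -> value.
    """
    records = {}
    for sentence in sentences:
        match = PATTERN.match(sentence)
        if match is None:
            continue  # the sentence does not fit the pattern; skip it
        row = records.setdefault(match["entity"], {})
        row[match["attribute"]] = match["value"]
    schema = sorted({attr for row in records.values() for attr in row})
    return schema, records


schema, records = extract_table(RAW_TEXT)
print(["entity"] + schema)
for entity, row in records.items():
    # Cells with no extracted value stay empty (None).
    print([entity] + [row.get(attr) for attr in schema])
```

Note that the row for Spain ends up with an empty population cell: nothing in the text supports it, so the table is left incomplete rather than filled with a guess.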

Needless to say, information extraction is a really tough problem and, to the best of my knowledge, there is no publicly available application other than the recent Google Squared.

I suppose that these days most people (i.e. bloggers) are jumping for joy over this new tool (which, I confess, is pretty amazing). However, I would like to point to an academic project that I learned about 2 years ago: the Proteus Project at New York University.

Two years ago I had the opportunity to attend a talk by Satoshi Sekine, and he showed us a piece of software doing mostly the same as Google Squared. Obviously, I was impressed and excited back then, which is why I'm not so excited today about Squared.

Anyway, it is a really interesting tool, still in its early stages, and something I really wanted to try.

One feature I miss in the current implementation is the possibility of adding to or correcting the actual cells. Perhaps this way the Googlers behind Squared could enrich the datasets, and the system could learn new patterns/rules to extract new information.
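Just to illustrate what "learning new patterns from a corrected cell" could look like, here is a minimal Python sketch. It is loosely in the spirit of bootstrapping patterns from known pairs, and it is emphatically not a description of how Squared works (I have no idea). The corpus, the function names `induce_patterns` and `apply_patterns`, and the single seed pair are all hypothetical.

```python
import re

# Hypothetical corpus; the sentences are invented for illustration.
CORPUS = [
    "Madrid is the capital of Spain.",
    "Paris is the capital of France.",
    "Lisbon is the capital of Portugal.",
    "Oviedo is a city in Spain.",
]


def induce_patterns(seed_pairs, corpus):
    """Turn each sentence containing a known (entity, value) pair into a regex
    by replacing the two known strings with named capture groups."""
    patterns = set()
    for entity, value in seed_pairs:
        for sentence in corpus:
            if entity in sentence and value in sentence:
                template = re.escape(sentence)
                template = template.replace(
                    re.escape(entity), r"(?P<entity>\w+)", 1
                )
                template = template.replace(
                    re.escape(value), r"(?P<value>\w+)", 1
                )
                patterns.add("^" + template + "$")
    return patterns


def apply_patterns(patterns, corpus):
    """Run every induced pattern over the corpus and collect the pairs found."""
    found = set()
    for pattern in patterns:
        for sentence in corpus:
            match = re.match(pattern, sentence)
            if match:
                found.add((match["entity"], match["value"]))
    return found


# A user fills in (or corrects) one cell: the capital of Spain is Madrid.
seed = [("Spain", "Madrid")]
patterns = induce_patterns(seed, CORPUS)
new_pairs = apply_patterns(patterns, CORPUS) - set(seed)
print(sorted(new_pairs))
```

With this toy corpus, the one corrected cell yields a pattern that also recovers the capitals of France and Portugal. A real system would of course need to score patterns and filter out the noisy ones, which this sketch does not attempt.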

Additionally, I would also like to point to a handful of papers that are (in my opinion) closely related to Google Squared and that, I think, could be of interest to you:

S. Brin, "Extracting Patterns and Relations from the World Wide Web," Selected papers from the International Workshop on The World Wide Web and Databases, Springer-Verlag, 1999, pp. 172-183.
M. Pasca, "Acquisition of categorized named entities for web search," Proceedings of the thirteenth ACM international conference on Information and knowledge management, Washington, D.C., USA: ACM, 2004, pp. 137-145.
M. Pasca, "Organizing and searching the world wide web of facts--step two: harnessing the wisdom of the crowds," Proceedings of the 16th international conference on World Wide Web, 2007, pp. 101-110.
M. Pasca, "Weakly-supervised discovery of named entities using web search queries," Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 2007, pp. 683-690.
M. Pasca and B. Van Durme, "What you seek is what you get: Extraction of class attributes from query logs," Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07), 2007, pp. 2832-2837.

Ad verba per numeros