Ad verba per numeros

Let's start with a short definition from Wikipedia (yeah, I know, real men don't cite encyclopedias):

Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured machine-readable documents.

In other words, the goal of information extraction is to obtain table schemas from raw data and then complete the records in such tables.
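To make that idea concrete, here is a toy sketch in Python of what "filling a table from raw text" can look like. It assumes one hand-written pattern ("X was born in YEAR in PLACE"); the names `PATTERN` and `extract_records` and the sample sentences are made up for illustration. Real systems such as Google Squared surely do something far more sophisticated, so take this only as an illustration of the input and the output, not of how they work.

```python
import re

# Hypothetical hand-written pattern for this toy example:
# "<Name> was born in <year> in <place>"
PROPER_NOUN = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
PATTERN = re.compile(
    rf"(?P<name>{PROPER_NOUN}) was born in (?P<year>\d{{4}}) in (?P<place>{PROPER_NOUN})"
)

def extract_records(text):
    """Return one dict (a table row) per pattern match in the raw text."""
    return [match.groupdict() for match in PATTERN.finditer(text)]

raw = (
    "Alan Turing was born in 1912 in London. He later worked at Bletchley Park. "
    "Grace Hopper was born in 1906 in New York."
)

rows = extract_records(raw)

# The "schema" here is simply the set of column names found in the rows.
schema = sorted({column for row in rows for column in row})

print(schema)
for row in rows:
    print(row)
```

The sentence about Bletchley Park is silently ignored because it does not fit the pattern, which hints at why the general problem is so hard: every new kind of fact needs its own pattern, or a system able to learn such patterns by itself.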

Needless to say, information extraction is a really tough problem and, to the best of my knowledge, there is no publicly available application for it other than the recent Google Squared.

I suppose that these days most people (i.e. bloggers) will be joyfully jumping around this new tool (which, I confess, is pretty amazing). However, I would like to point to an academic project that I learned about two years ago: the Proteus Project at New York University.

As I've said, I had the opportunity to attend a talk by Satoshi Sekine, where he showed us a piece of software doing much the same as Google Squared. Obviously, I was impressed and excited back then, which is why I'm not so excited about Squared today :)

Anyway, it is a really interesting tool, still in its early stages, and something I really wanted to try.

One feature I miss in the current implementation is the ability to add or correct the actual cells. Perhaps this way the Googlers behind Squared could enrich the datasets and the system could learn new patterns/rules to extract new information.

Additionally, I would like to point to a bunch of papers that are (in my opinion) closely related to Google Squared and that, I think, could be of interest to you:


