Ad verba per numeros
Research
This blog is not abandoned, but I've been extremely busy these last months. The output of that laborious work has been a number of papers (two of them coauthored with Brenes). So, just in case any of you are interested:
A survey on session detection methods in query logs and a proposal for future evaluation, Information Sciences (2009), Volume 179, Issue 12, pp. 1822-1843. Author's version
Tuesday, March 17, 2009, 08:29 AM
Search engine logs provide highly detailed insight into users' interactions. Hence, they are both extremely useful and sensitive. The datasets publicly available to scholars are, unfortunately, too few, too dated and too small. [...] Even worse, the large query log disclosed by AOL in 2006 caused more harm than good because of a big privacy flaw. In this paper the author provides an overview of the possible applications of query logs, the privacy concerns researchers must face when working on such datasets, and several ways in which query logs can be easily sanitized. One such measure consists of segmenting the logs into short topical sessions. Therefore, the author offers a comprehensive survey of session detection methods, as well as a thorough description of a new evaluation framework with performance results for each of the different methods. Additionally, a new, simple, yet better-performing session detection method is proposed. It is a heuristic-based technique that relies on a geometric interpretation of both the time gap between queries and the similarity between them in order to flag a topic shift.
Stratified Analysis of AOL Query Log, Information Sciences (2009), Volume 179, Issue 12, pp. 1844-1858. (Coauthored with Brenes). Authors' version
Characterizing users' intent and behaviour while using an information retrieval tool (e.g. a search engine) is a key question in web research, as it holds the keys to knowing how users interact, what they expect, and how we can provide them with information in the most beneficial way. Previous research has focused on identifying the average characteristics of user interactions. This paper proposes a stratified method for analyzing query logs that groups queries and sessions according to their hit frequency and analyzes the characteristics of each group in order to find out how representative the average values are. Findings show that behaviours typically associated with the average user do not fit most of the aforementioned groups.
Survey and evaluation of query intent detection methods. WSCD'09 (coauthored with Brenes and Kilian). Authors' version
User interactions with search engines reveal three main underlying intents, namely navigational, informational, and transactional. By providing more accurate results depending on such query intents, the performance of search engines can be greatly improved. Therefore, query classification has been an active research topic in recent years. However, while query topic classification has merited a dedicated bakeoff, no evaluation campaign has been devoted to the study of automatic query intent detection. In this paper some of the available query intent detection techniques are reviewed, an evaluation framework is proposed, and it is used to compare those methods in order to shed light on their relative performance and drawbacks. As will be shown, manually prepared gold-standard files are much needed, and traditional pooling is not the most feasible evaluation method. In addition, future lines of work in both query intent detection and its evaluation are proposed.
Survey and evaluation of query intent detection methods
Daniel Gayo Avello, Kilian Pérez-González, David J. Brenes
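For readers curious about the session detection heuristic mentioned in the first abstract, here is a minimal sketch of how a geometric combination of time gap and query similarity might flag a topic shift. It is only one plausible reading of that description, not the paper's exact formulation: the 24-hour cap, the character-trigram Jaccard similarity, the unit-circle test, and all function names are my own assumptions for illustration.

```python
from datetime import datetime, timedelta
from math import hypot

# Assumption: gaps of 24 hours or more contribute no temporal closeness.
MAX_GAP = timedelta(hours=24)


def trigrams(text):
    """Return the set of character trigrams of a padded, lowercased query."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard similarity of character trigrams (an assumed similarity measure)."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def same_session(prev, curr):
    """Decide whether two consecutive (timestamp, query) entries share a session.

    Temporal closeness and lexical similarity, both in [0, 1], are treated as
    the coordinates of a point. The session continues when that point lies on
    or outside the unit circle (an assumed decision rule).
    """
    (prev_time, prev_query), (curr_time, curr_query) = prev, curr
    gap = curr_time - prev_time
    time_closeness = max(0.0, 1.0 - gap / MAX_GAP)
    lexical = similarity(prev_query, curr_query)
    return hypot(time_closeness, lexical) >= 1.0


def split_sessions(log):
    """Split one user's chronologically sorted log into topical sessions."""
    sessions = []
    for entry in log:
        if sessions and same_session(sessions[-1][-1], entry):
            sessions[-1].append(entry)
        else:
            sessions.append([entry])
    return sessions
```

Under these assumptions, a slightly reworded query issued a couple of minutes later stays in the same session, an unrelated query issued more than a day later starts a new one, and an identical query always continues the session no matter how much time has passed.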