Ad verba per numeros

Saturday, January 15, 2011, 12:06 AM
Update (January 18, 2011): A few bloggers have published posts on the topic: José López Ponce (in Spanish), Dorjival Silva (in Portuguese), Aymeric Pontier (in French), Dominique Desaunay (in French), and iriospark (in Italian).

Update (January 17, 2011): It also appeared in the print version of Le Monde (PDF).

Update (January 15, 2011): Le Monde (French newspaper) has featured this study.

More than a year ago I wrote in critical terms about MIT's Gaydar project, the aim of which was to determine whether or not a Facebook user is gay from his/her contacts. A paper reporting the results was later published and I cannot help but recommend it.

Needless to say, I found the study disturbing but, even more disturbingly, in a short time I was up to my neck in very similar work. My original purposes were two-fold: to find close communities in Twitter, and to find interesting users to recommend. Preliminary work was done and, at that point, I found that detailed demographic information about users could be very valuable (to test the quality of the detected communities and to improve the recommendations). Facebook already knows that, but Twitter profiles are not that detailed.

Thus, I started working on inferring users' attributes from the known attributes of their neighbors. After all, I already knew the sex, age, and location for a number of users (see the Appendix of this paper). Could it be possible to determine those values for unlabeled users? Reluctantly, I also turned to other attributes which are generally considered sensitive: political and religious beliefs, sexual orientation and race/ethnicity.

As you probably know, Facebook asks for all of them (except for race/ethnicity) and many people provide them. I'm not much of a Facebook user so I don't care a lot about the information in my friends' profiles but, because of this study, I checked several Facebook profiles: a number of them include their political and religious beliefs and a few state what they are "interested in" (i.e. Facebook's euphemism for sexual orientation).

I was neither really shocked nor worried about this but then I wondered: what if I had something to conceal (from my friends, my family, my employer, or, gasp!, my government)? Guilt-by-association algorithms are becoming quite popular, and the so-called "War on Terror" is driving western democracies to take previously unseen measures against their own citizens' privacy and liberties. As someone with a near 100% chance of being subjected to "random" security checks in airports (still don't know why), I could not help being a little touched by the matter.

Thus, I prepared a labeled Twitter dataset by applying pattern matching to user bios for all of the aforementioned attributes. Yes, even the sensitive ones, and yes, I felt a bit awkward.
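To give an idea of what labeling bios by pattern matching might look like, here is a minimal sketch in Python for the least sensitive attribute, age. The regular expressions and the function names (`extract_age`, `label_bios`) are hypothetical and chosen only for illustration; the patterns actually used in the study are not listed in this post.

```python
import re

# Hypothetical patterns, for illustration only: they are NOT the ones
# used in the study, which are not given in this post.
AGE_PATTERNS = [
    re.compile(r"\b(\d{2})\s*(?:years?\s*old|y/?o)\b", re.IGNORECASE),
    re.compile(r"\bage[:\s]+(\d{2})\b", re.IGNORECASE),
]


def extract_age(bio):
    """Return the two-digit age stated in a bio, or None if nothing matches."""
    for pattern in AGE_PATTERNS:
        match = pattern.search(bio)
        if match:
            return int(match.group(1))
    return None


def label_bios(bios):
    """Map user -> age, keeping only users whose bio states an age."""
    labels = {}
    for user, bio in bios.items():
        age = extract_age(bio)
        if age is not None:
            labels[user] = age
    return labels
```

Bios that match no pattern are simply left unlabeled, so a scheme like this would yield a small labeled sample and a large unlabeled remainder.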

Then, I adapted the work I was doing on community detection/user recommendation to perform user profiling and applied the algorithm to my Twitter user graph (I mentioned it in the past: 1.8M English-speaking users) using 80% of the labeled users. Then, label assignments were checked against the remaining 20% and against users appearing in the WeFollow user directory.
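The evaluation scheme can be illustrated with a small Python sketch. Note that this is not the algorithm from the study (which is described in the preprint, not in this post); it swaps in a much simpler majority vote among labeled neighbors, only to show how an 80/20 hold-out check works. The graph representation (a dict mapping each user to a list of neighbors) and all function names are assumptions.

```python
import random
from collections import Counter


def split_labels(labels, train_fraction=0.8, seed=0):
    """Randomly split labeled users into a training part and a held-out part."""
    users = sorted(labels)
    random.Random(seed).shuffle(users)
    cut = int(len(users) * train_fraction)
    train = {u: labels[u] for u in users[:cut]}
    held_out = {u: labels[u] for u in users[cut:]}
    return train, held_out


def predict(graph, known, user):
    """Return the most common label among the user's labeled neighbors.

    Stand-in technique (majority vote), not the study's algorithm.
    Returns None when no neighbor has a known label.
    """
    votes = Counter(known[n] for n in graph.get(user, ()) if n in known)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def precision(graph, train, held_out):
    """Fraction of assigned labels that agree with the held-out truth."""
    assigned = correct = 0
    for user, truth in held_out.items():
        guess = predict(graph, train, user)
        if guess is None:
            continue  # users left unlabeled do not count against precision
        assigned += 1
        correct += guess == truth
    return correct / assigned if assigned else 0.0
```

Precision here only counts the users that actually received a label, which is why a method can be very precise while still leaving many users unlabeled.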

The results? Rather shocking.

Detection of religious and political beliefs, sexual orientation, and race/ethnicity achieved above 95% precision. Sex and age achieved poorer results but, still, they were much more precise than a random classifier.

The implications are IMHO important (and a little scary): simple algorithms can be used to label people, and the most promising assignments can be manually checked and, hence, used to bootstrap the next iteration. Besides, we the users are doing most of the work by talking about ourselves; we are providing the labels, happily, for free, for anyone.

Maybe you are aware of this and don't mind discussing your beliefs and personal choices; fine. But by doing that, you put at risk those of your friends and acquaintances who conceal such information.

Thus, after visiting the dark side, I turned to studying active measures users can adopt to avoid privacy risks due to data mining. At this moment I only have an outline but, hopefully, it can be developed into a full-fledged prototype during 2011.

What's the moral of this? The old saying "You are known by the company you keep" is absolutely true, so don't tell anybody who your friends are.

By the way, should you be interested in the full details of the study you can check this preprint:

"All liaisons are dangerous when all your friends are known to us"

Online Social Networks (OSNs) are used by millions of users worldwide. Academically speaking, there is little doubt about the usefulness of demographic studies conducted on OSNs and, hence, methods to label unknown users from small labeled samples are very useful. However, from the general public point of view, this can be a serious privacy concern. Thus, both topics are tackled in this paper: First, a new algorithm to perform user profiling in social networks is described, and its performance is reported and discussed. Secondly, the experiments –conducted on information usually considered sensitive– reveal that by just publicizing one’s contacts privacy is at risk and, thus, measures to minimize privacy leaks due to social graph data mining are outlined.

As always I'd be happy to hear your comments. Tweet me at @pfcdgayo.
