Ad verba per numeros
Wednesday, June 27, 2012, 08:18 AM
After writing that I decided to be a bit more constructive, try to see the glass half full instead of half empty, and improve the advice I had provided there (especially regarding baselines).
Hence, I've prepared a new paper where I conduct a meta-analysis of Twitter electoral predictions (note: only those made in scholarly papers) to reach some conclusions:
- With regard to predictions based on raw counts:
  - They depend too heavily on arbitrary decisions, such as which parties or candidates to consider or which period to collect data from.
  - Their performance is too unstable and strongly dependent on such parameterizations.
  - Considering the reported results as a whole, it seems plausible that positive results could have been due to chance or even to unintentional data dredging due to post hoc analysis.
- With regard to predictions based on sentiment analysis:
  - The impact of sentiment analysis on Twitter-based predictions is unclear. The studies applying this technique are fewer than those counting tweets, and the picture they convey is confusing, to say the least.
  - However, given that even naïve sentiment analysis seems to outperform a reasonable baseline, further research along that line is clearly needed.
- Both approaches share a number of weaknesses:
  - All of them are post hoc analyses.
  - Proposed baselines are too simplistic.
  - Sentiment analysis is applied naïvely: commonly used methods are only slightly better than random classifiers and fail to catch the subtleties of political discourse.
  - All tweets are assumed to be trustworthy, which is not the case.
  - Demographic bias is neglected, even though it is well known that social media users are not a random sample of the population.
  - Self-selection bias is also ignored, although it is well known that supporters are much more vocal and responsible for most of the content.
- Period and method of collection: i.e., the dates when tweets were collected, and the parameterization used to collect them.
- Data cleansing measures:
  - Purity: i.e., to guarantee that only tweets from prospective voters are used to make the prediction.
  - Debiasing: i.e., to guarantee that any demographic bias in the Twitter user base is removed.
  - Denoising: i.e., to remove tweets not dealing with voter opinions (e.g. spam or disinformation) or even users not corresponding to actual prospective voters (e.g. spammers, robots, or propagandists).
- Prediction method and its nature:
  - The method to infer voting intentions from tweets.
  - The nature of the inference: i.e., whether the method predicts individual votes or aggregated vote rates.
  - The nature of the prediction: i.e., whether the method predicts just a winner or vote rates for each candidate.
  - Granularity: i.e., the level at which the prediction is made (e.g. district, state, or national).
- Performance evaluation: i.e., the way in which the prediction is compared with the actual outcome of the election.
Finally, what would be an appropriate way to evaluate performance?
Certainly, MAE (Mean Absolute Error) is commonly applied, but this measure is not comparable from one election to another. A baseline is therefore needed for each election, so that the MAE of the system can be compared against the MAE of the baseline.
What would be that baseline?
I propose using the results of the immediately prior election as the prediction; that is, assuming that the same results will be obtained again.
Certainly, this has its issues: e.g., new parties running for election, or coalitions created or dismantled between elections. Still, it is simple and can provide an intuitive hint about how "hard" or "easy" an election is to predict.
Such a baseline was used to determine the performance of each prediction made to date.
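To make the comparison concrete, here is a minimal sketch in Python. The vote shares are made-up numbers for illustration only, not results from the paper, and the function name `mean_absolute_error` is my own. It also assumes that a party missing from a prediction (e.g. a new party that did not run in the prior election) is predicted at 0%, which is just one possible way to handle that case.

```python
def mean_absolute_error(predicted, actual):
    """MAE in percentage points, averaged over the parties in the actual result.

    Assumption: a party absent from `predicted` counts as a 0% prediction.
    """
    errors = [abs(predicted.get(party, 0.0) - share)
              for party, share in actual.items()]
    return sum(errors) / len(errors)


# Hypothetical vote shares (percent); illustrative numbers only.
previous_election = {"A": 45.0, "B": 40.0, "C": 15.0}   # used as the baseline
twitter_prediction = {"A": 50.0, "B": 30.0, "C": 20.0}  # system under evaluation
actual_result = {"A": 47.0, "B": 41.0, "C": 12.0}

baseline_mae = mean_absolute_error(previous_election, actual_result)
system_mae = mean_absolute_error(twitter_prediction, actual_result)

# The system is only interesting if it beats "same results as last time".
beats_baseline = system_mae < baseline_mae

print(f"baseline MAE: {baseline_mae:.2f}")
print(f"system MAE:   {system_mae:.2f}")
print(f"beats baseline: {beats_baseline}")
```

In this toy election the baseline is off by 2 points on average while the Twitter-based prediction is off by more than 7, so an MAE that might look acceptable in isolation turns out to be worse than simply repeating the previous results.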
And that's all! You can find the paper on arXiv, and you can send me your comments on Twitter (@PFCdgayo).