We learn from our mistakes: how to make better predictions from tweets
Social media is viewed as a potential goldmine of information. The key is to work out how to mine this abundant source of public sentiment.
But we got it wrong with Australia's same-sex marriage survey, and here's why.
We crunched the numbers
We carefully sampled the sentiment of 458,565 anonymised Australian tweets that made reference to same-sex marriage. We found 72% overall support for Yes. This figure was averaged over the whole month of October.
But we noticed that some Twitter accounts had sent more than 1,000 tweets related to same-sex marriage. The number of unique users was down to just 207,287.
It seemed wise to minimise the influence of these bulk tweets because by the time they were sent, many of the votes had already been cast. Discounting the influence of the bulk tweets brought Yes support down to 57%.
After adjusting by a further 8 percentage points for the under-representation of the over-55 demographic in the Twitter sample, we concluded that overall support for Yes was 49%.
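To see why bulk tweeting matters, consider the difference between counting every tweet and counting every user once. The sketch below is only an illustration under stated assumptions: the function names, the "more than half of a user's tweets" rule and the toy data are all hypothetical, and this is not the Lab's actual method.

```python
from collections import defaultdict


def support_per_tweet(tweets):
    """Share of tweets labelled 'yes', treating all tweets as equal."""
    yes = sum(1 for _user, label in tweets if label == "yes")
    return yes / len(tweets)


def support_per_user(tweets):
    """Share of users leaning 'yes', where each user counts only once.

    Assumption (ours, for illustration): a user leans 'yes' if more
    than half of their tweets are labelled 'yes'.
    """
    by_user = defaultdict(list)
    for user, label in tweets:
        by_user[user].append(1 if label == "yes" else 0)
    leaning = [1 if sum(v) / len(v) > 0.5 else 0 for v in by_user.values()]
    return sum(leaning) / len(leaning)


# Hypothetical toy data: user "a" is a bulk tweeter, the rest tweet once.
toy = [("a", "yes")] * 6 + [("b", "no"), ("c", "no"), ("d", "yes")]

per_tweet = support_per_tweet(toy)  # 7 of 9 tweets are 'yes'
per_user = support_per_user(toy)    # 2 of 4 users lean 'yes'
print(f"per tweet: {per_tweet:.0%}, per user: {per_user:.0%}")
```

In this toy sample a single prolific account is enough to move the apparent level of support a long way, which is the kind of gap we saw between our 72% and 57% figures.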
With the benefit of hindsight
In previous successful trials we had assumed that all tweets are equal. If we had made the same assumption in this trial and done everything else the same, then, re-crunching the numbers, our prediction for Yes would have been 59.08%, which is close to the official result of 61.6%.
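The size of the miss under each assumption can be checked with simple arithmetic. The snippet below uses only the figures quoted in this article; the variable names and labels are ours, chosen for illustration.

```python
# Figures quoted in this article, in percent support for Yes.
official_yes = 61.6

predictions = {
    "published (bulk tweets discounted)": 49.0,
    "re-run (all tweets equal)": 59.08,
}

# Absolute error of each prediction, in percentage points.
errors = {
    name: round(abs(value - official_yes), 2)
    for name, value in predictions.items()
}

for name, err in errors.items():
    print(f"{name}: off by {err} percentage points")
```

On these numbers, the published prediction missed by 12.6 percentage points, while the all-tweets-equal re-run would have missed by 2.52.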
We made the incorrect assumption that the bulk tweeting would not be influential because the voting was spread across several weeks.
In our previous article we acknowledged the influence of bulk tweeting. We said that campaign tweets would have influenced public opinion to some degree, but we expected that influence to be much smaller than it turned out to be.
So there are lessons to be learned from this for any future analysis.
So far we've talked mainly about when we were wrong and why. But what about those times when the Big Data and Smart Analytics Lab got it right?
The Lab correctly predicted no fewer than 48 out of 50 US state elections held at the same time as the 2016 presidential election, which we also correctly called.
We called the Coalition's win in the 2016 Australian federal election. And our method gave a clear indication that "Brexit" would prevail over "Bremain", contrary to the polling before Britain's referendum on European Union membership.
In all of these cases, we were sampling the social media sentiment leading up to a specific election day when all would be decided. The election result is a snapshot of how the voters feel on that day.
With the same-sex marriage survey, the voting was spread across several weeks, making it difficult to know what proportion of the vote took place on a particular day or even week.
Even with this uncertainty, it is possible to make reasonably accurate predictions, provided that the underlying assumptions are correct, such as all tweets being equally influential.
Twitter isn't the only source
With 328 million active users worldwide, and many more inactive users who nonetheless read the tweets of others, Twitter is an excellent source of information on people's views and intentions.
But it is good to have multiple sources of data when doing big data analytics.
In diverse projects, ranging from tourist satisfaction to environmental changes, the Big Data and Smart Analytics Lab uses combinations of Twitter, Flickr, Instagram, public Facebook pages, and even the Chinese social media platform Weibo. It is all grist to the mill.
Facebook is by far the dominant social media channel in the world. Only public pages are accessed by our analytics. But with two billion users and growing, we still have plenty of data to work with.
Twitter has evolved into a more news- and opinion-oriented channel, with people sharing newsworthy items with like-minded others. Celebrities and politicians use it as a direct channel to their audience, bypassing the established media channels altogether.
Brevity of tweets was enforced by a 140-character limit until recently, when the length restriction was doubled to 280 characters. The extra characters make tweets an even richer source of information for data mining.
The power of social media
The fact remains that people say things on social media that they would not say out loud. Many trolls and hecklers in the online world turn out to be mild-mannered individuals in the real world. It can be surprising.
Whose opinion is more interesting to the analyst? Is it the social persona who has responsibilities to the community and is generally polite? Or is it the private persona who only vents their true feelings to their closest confidants and on social media?
Both are interesting, but arguably it is the latter whose opinions determine the outcome of social issues.
The lesson to be learned from our error with the same-sex marriage survey is that every social media post counts. Social media is indeed a powerfully democratising force.