Big Data, Big Prediction? – Looking through the Predictive Window into the Future
Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and group behavior in online communities and social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics and its application to Social CRM. He's a regular blogger on the Lithosphere's Building Community blog and previously wrote in the Analytic Science blog. You can follow him on Twitter or Google+.
Last time I said I was going to write about the big data processing pipeline. However, I decided to put that off until later, mainly because a couple of weeks ago I was interviewed by USA Today on whether social media sentiment can predict an election outcome. As it turns out, the raw Twitter sentiment data from Attensity only predicted the election outcome about half right. So I thought it was timely to comment on the predictive power of big data.
Analysis of Sentiment Data for Each Presidential Candidate
During this interview, I pulled some data from our own social media monitoring (SMM) platform and did a simple analysis of the public sentiment data on the social web (which includes blogs, Twitter, forums, and news) for the top four Republican candidates (Mitt Romney, Newt Gingrich, Rick Santorum, and Ron Paul) and the incumbent candidate, President Barack Obama.
- Our platform estimated the entity level sentiment for each mention of the candidate.
- Our SMM platform automatically aggregated this raw sentiment data by day, giving us daily totals of positive, neutral, and negative mentions.
- I looked at the daily sentiment variation for each candidate over the last 6 months and determined the window over which the sentiment is stable and therefore predictive. I found this window to be about 1.5 to 2 weeks, which means I can only use about two weeks of sentiment data for prediction. Using more data not only doesn’t help; it may be counterproductive, since data from outside the window can actually reduce your prediction accuracy. More data is not always a good thing!
- I computed the net sentiment by taking the positive mentions minus the negative mentions for each candidate over the two week period. It can be very misleading to examine only the positive sentiment since that is a biased and incomplete reflection of the public sentiment, so we must take into account the negative sentiment in our analysis.
- Likewise, we should probably take advantage of the large amount of neutral sentiment data on the social web too. There are many neutral mentions we haven’t used yet. To make use of this data, I simply weighted each neutral mention by 1/10 and added the result to the net sentiment computed above.
- The result is normalized to 100% and displayed via a pie chart (Figure 1).
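The steps above can be sketched in a few lines of Python. The mention counts below are made-up placeholders (the post doesn't publish the raw numbers), and the 1/10 neutral weight is the one described in the list:

```python
# Sketch of the aggregation steps above, with hypothetical mention counts.
NEUTRAL_WEIGHT = 0.1  # each neutral mention counts as 1/10 of a positive one

# Hypothetical totals over the two-week predictive window.
mentions = {
    "Romney":   {"pos": 5200, "neu": 9100,  "neg": 4100},
    "Gingrich": {"pos": 3100, "neu": 7800,  "neg": 3600},
    "Santorum": {"pos": 2800, "neu": 6200,  "neg": 2500},
    "Paul":     {"pos": 2400, "neu": 5400,  "neg": 1900},
    "Obama":    {"pos": 9800, "neu": 15300, "neg": 7200},
}

def weighted_net(m):
    # Net sentiment = positive minus negative, plus down-weighted neutrals.
    return m["pos"] - m["neg"] + NEUTRAL_WEIGHT * m["neu"]

scores = {name: weighted_net(m) for name, m in mentions.items()}
total = sum(scores.values())
shares = {name: 100 * s / total for name, s in scores.items()}  # normalize to 100%

for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:9s} {share:5.1f}%")
```

The normalized shares are what the pie chart in Figure 1 would display.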
I was told that these data (now 2 weeks old, which makes them irrelevant today) actually lined up with the Gallup Poll nicely. I was so excited that I went and looked up the Gallup data (see Figure 2), even though I hadn’t been following the election closely.
To objectively quantify how well my analysis is able to predict the Gallup poll, I computed the correlation coefficient between my prediction and the Gallup data. To my great surprise, my simple analysis yields a predictive correlation coefficient of 0.965. Since variance explained is the square of the correlation coefficient, this analysis is able to explain about 93.1% of the variance in the Gallup data. This is a superb result even though the model is overly simplistic.
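To make the arithmetic concrete, here is a minimal Pearson correlation computation. The candidate shares below are illustrative placeholders, not the actual prediction or Gallup numbers; only the relationship r² = variance explained is taken from the post:

```python
# Pearson correlation and variance explained, computed from scratch.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative placeholder shares: predicted vs. poll, one entry per candidate.
predicted = [34.0, 16.0, 13.0, 11.0, 26.0]
poll      = [32.0, 16.0, 14.0, 10.0, 28.0]

r = pearson_r(predicted, poll)
print(f"r = {r:.3f}, variance explained = {100 * r * r:.1f}%")
# With the post's r = 0.965, r squared is about 0.931, i.e. ~93.1% of variance.
```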
But what does this really mean? Can social media sentiment really predict election outcome?
The Right Question to Ask: What is the Predictive Window?
Social media sentiment can definitely be used to predict election outcomes! Studies have shown that Twitter mood data can even predict the stock market. However, whether social media data can predict election outcome is actually the wrong question to ask.
The important question to ask when doing any predictive analysis is “how far into the future is the prediction valid?” We call the period over which the prediction is still fairly accurate the “predictive window.”
Let’s look at a more familiar example of weather prediction. You can certainly use data collected today from all the meteorological instruments out there to predict the weather, but the prediction is only accurate for a short period of time, typically a few days. So the predictive window of your meteorological data is a few days. You could try to use this data to predict the weather one month from now, but it just wouldn’t be accurate. In fact, it’s so inaccurate that you might as well take a random guess. So trying to predict anything beyond the predictive window of your data is, pretty much, useless.
In our example, even though social media sentiment can be used to predict election outcome, it can’t predict with any accuracy beyond a window of 1.5 weeks. If the election takes place a day after the sentiment data were collected, then it is possible to predict the election outcome after some serious human analysis. However, if the election takes place 2 weeks after, then these data would no longer be able to predict the outcome with any accuracy, no matter how much we analyze it. This kind of behavior is very common in non-stationary systems.
You may ask “why?” The reason is that sentiment is a point-in-time measure. It can change rapidly from day to day. I may have loved candidate #1 (say Obama) yesterday, and tweeted about it, but after I watch his debate today, my sentiment may change or reverse completely. So my tweet from yesterday is completely irrelevant with respect to my candidate preference today.
So the important question is not “whether social media data can predict election outcome?” It definitely can. The right question to ask is “how long is the predictive window?” For something that changes very quickly like the financial market, the predictive window will be very short. For things that do not change as fast, the predictive window will be longer. For social media sentiment data, the predictive window for election forecasting is about 1.5 to 2 weeks. If you want to be conservative, you can use 1 week.
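One way to estimate a predictive window from data (a hedged sketch, not necessarily the method used for the figures above) is to compute the autocorrelation of the daily net-sentiment series and find the lag at which it decays below some threshold. The synthetic series and the 0.5 threshold here are both assumptions for illustration:

```python
# Estimating a predictive window from the autocorrelation of a daily series.
import random

random.seed(42)

# Synthetic daily net-sentiment series: an AR(1) process with slow drift,
# standing in for real daily sentiment data.
days = 180
series, x = [], 0.0
for _ in range(days):
    x = 0.9 * x + random.gauss(0, 1)  # 0.9 coefficient gives ~week-scale memory
    series.append(x)

def autocorr(s, lag):
    # Sample autocorrelation at a given lag.
    n = len(s)
    m = sum(s) / n
    var = sum((v - m) ** 2 for v in s)
    return sum((s[i] - m) * (s[i + lag] - m) for i in range(n - lag)) / var

# Take the predictive window to be the first lag where the series has lost
# half its self-similarity (autocorrelation drops below 0.5).
window = next(lag for lag in range(1, days) if autocorr(series, lag) < 0.5)
print(f"estimated predictive window: ~{window} days")
```

For a rapidly changing system like the financial market, the autocorrelation decays quickly and the estimated window shrinks; for slowly changing systems it stretches out, matching the intuition in the paragraph above.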
When you are doing any predictive analytics, you are really trying to peek into the future through the predictive window of your data. If you try to look outside of this window, your future will look very blurry; so blurry that you can’t make anything out of it with any certainty.
Even within this predictive window, your view is still limited by the power of your statistical model and the noise inherent in the data. Because of this, you often can’t see very far into the future. Although you can sometimes stretch this window a little by using more powerful statistical and machine learning methods, doing so is often impractical and offers diminishing returns.
The scientific community has been interested in prediction ever since the scientific method was developed. However, predictive analytics has proven to be a very challenging subject in mathematical statistics and probability theory. It is not a problem that can be addressed simply with more advanced technology. That means it doesn’t matter how much computing power you have (even with quantum computers and holographic storage), there are theoretical limits to what you can, and cannot, predict.
Next time we’ll talk about why this predictive window is so important. And I will use our Super Tuesday sentiment data as an example to illustrate how we can improve the visibility within this prediction window.
BTW, I’ll be teaching at the Rotman Executive CRM program again this year with Paul Greenberg and Ray Wang. So I’ll be on the University of Toronto campus April 17-19. Alright, see you next time. Stay tuned for more on election campaign analytics...