Gamification
Data Science

## Learning the Science of Prediction: How do You Know Your Influence Score is Correct – Part 1

This post is the first of a two-part article addressing the question: How do you know if your influence score is correct? Today, I won’t actually answer this question, but I will show you a step-by-step procedure that we will use next time to address it.

In my last article, I wrote about the missing link of influence. We talked about the fact that nobody actually has any data on influence (i.e. data that explicitly says who actually influenced whom, when, where, how, etc.). Even influence vendors don’t have this measured influence data. The only measured data they have is people’s social media activities, which they didn’t actually measure themselves, but instead collected from the respective social media platforms.

All influence scores are therefore computed from users’ social activity data based on some models and algorithms of how influence works. However, anyone can create these models and algorithms. So who is right, and who has the best model? More importantly, how can we tell and be sure your influence score is correct? In other words, how can we validate the accuracy of the models that influence vendors use to compute people’s influence scores?

To illustrate how statistical validation works, I will use a simpler and more tangible example, where we are trying to predict the stock price of a company, let’s say Apple.

Build a Predictive Model of the Stock Market

First, we need to build a model (or an algorithm) which takes various input data about Apple that might be predictive of its stock price. We can pick any data that we feel could potentially affect Apple’s stock price in any way as an input. For example:

1. Sales data: units shipped for Apple devices, earnings data from different business units of Apple (e.g. iTunes Store, Apple Store, smartphones, tablets, laptops, etc.). Obviously the stock price should reflect how well the company does in terms of sales.
2. Fundamental company data: management, debts, liabilities, cash flow, etc. Various ratios that tell you different aspects of the financial health of the company may be good predictors of its stock price.
3. Social data: share of voice and sentiments about Apple products and services (e.g. iCloud, etc.). Perhaps social media is indicative of public sentiment towards Apple and can therefore predict its trading volume, and in turn the price.
4. Competitor data: all the above data from different competitors of Apple (e.g. Google, Dell, HP, RIM, etc.). Maybe Apple’s stock price will be anti-correlated with the performance of its competitors.
5. Industry and market data: international and national economic indicators, such as GDP growth rates, inflation, interest rates, exchange rates, productivity, energy prices, various market indices (e.g. S&P 500, Dow Jones, etc.), and any industry-wide data on the technology sector, personal computing, and/or mobile phones. Apple will certainly be subject to the same market forces that affect the industry, so maybe its price will follow the industry trend to some extent.
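To make the idea of "a model that takes input data" concrete, here is a minimal sketch in Python. It assumes the simplest possible form, a weighted sum of one signal per category above. The feature names, weights, and numbers are all made-up placeholders for illustration, not real Apple data and not a recommended model.

```python
# Hypothetical input signals for one day, one per category above.
# Every name and number here is a placeholder, not real data.
features = {
    "units_shipped_millions": 40.0,  # sales data
    "debt_to_equity": 1.5,           # fundamental company data
    "social_sentiment": 0.25,        # social data
    "competitor_index": 0.5,         # competitor data
    "market_index_change": 0.01,     # industry and market data
}

# Hypothetical weights. In practice these would be fitted to historical data.
weights = {
    "units_shipped_millions": 2.0,
    "debt_to_equity": -4.0,
    "social_sentiment": 8.0,
    "competitor_index": -10.0,       # negative: anti-correlated with competitors
    "market_index_change": 100.0,
}


def predict_price(features, weights, bias):
    """Combine the input signals into one predicted price (a weighted sum)."""
    return bias + sum(weights[name] * value for name, value in features.items())


predicted_price = predict_price(features, weights, bias=50.0)
print(f"Predicted price: {predicted_price:.2f}")
```

A real model could be far more elaborate than a weighted sum, but it keeps the same shape: input data goes in, and a predicted price comes out.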

How do You Validate the Model You’ve Built?

The important point is that regardless of how much data we put into the model, and how complex and brilliant the model might be in combining these data, the final test of whether the model actually works is whether it can predict the real stock price of Apple. How good a model is has nothing to do with its complexity or how much data it takes into consideration. If it doesn’t predict accurately, the model is no good, regardless of how logical or scientifically sound it is. So prediction accuracy offers an objective and empirical way to validate any statistical model.

There are three requirements to validate any statistical model or algorithm:

1. We need a model or algorithm that computes some predicted outcome (e.g. stock price of a company, weather in SF tomorrow, earthquake, or someone’s influence)
2. An independent measure of the outcome that the model is trying to predict
3. A measure that compares and quantifies how closely the predicted outcome matches the independently measured outcome

The most important of these is #2: having an independent measure of the outcome. It is pretty obvious if you think about it. To validate if your model can accurately predict the stock price for Apple, you must have the actual stock price of Apple, so you can compare the prediction against the actual stock price.
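The three requirements can be sketched in a few lines of Python. This is only an illustration: the predicted and actual prices are made-up numbers, and mean absolute error and root mean squared error are two common choices for requirement #3, not the only ones.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average size of the prediction error, in the same units as the outcome."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mean_squared_error(predicted, actual):
    """Like the mean absolute error, but penalizes large misses more heavily."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Requirement 1: the model's predicted outcomes (made-up numbers).
predicted_prices = [101.0, 103.5, 99.0, 104.0]
# Requirement 2: the independently measured outcomes (made-up numbers).
actual_prices = [100.0, 104.0, 98.0, 106.0]

# Requirement 3: a measure of how closely the two match.
mae = mean_absolute_error(predicted_prices, actual_prices)
rmse = root_mean_squared_error(predicted_prices, actual_prices)
print(f"MAE: {mae}, RMSE: {rmse}")
```

The smaller these numbers, the closer the predictions are to the measured outcome. Without the second list there is nothing to compare against, and no metric can be computed at all.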

What does “Independently Measured” Mean and Why is it so Important?

Many people don’t understand what it means to be “independent.” To be independently measured means the measured outcome is completely independent of the model. In the example of predicting Apple’s stock price, it means you cannot use any of the actual stock price data as part of the input to the model. If you use any actual stock price data as input to a model that is trying to predict the stock price, then it’s obvious that the model would predict the stock price very well, because the model would already have information about the actual stock price. So the actual stock price that you thought you measured independently is no longer truly independent of the model.

Hence the fact that this model is able to predict Apple’s stock price well is meaningless, because it didn’t actually predict anything. After all, it already has the actual stock price that it is trying to predict. This model is basically cheating because it’s based on circular reasoning.
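Here is a small Python sketch of this circularity. Everything in it is an assumption made for illustration: the "actual prices" are synthetic random numbers, and the model is a simple straight-line fit. One model is given an input that has nothing to do with the price; the other is given the actual price itself as its input.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(xs, ys, slope, intercept):
    """Average absolute gap between the fitted line and the measured outcome."""
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(ys)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic "actual stock prices" (random numbers, not real data).
actual = [100.0 + rng.gauss(0.0, 5.0) for _ in range(50)]

# An input that is independent of the price (here, pure noise).
independent_input = [rng.gauss(0.0, 1.0) for _ in range(50)]

# A "leaky" input: the actual price itself, fed back into the model.
leaky_input = list(actual)

independent_error = mean_abs_error(
    independent_input, actual, *fit_line(independent_input, actual))
leaky_error = mean_abs_error(leaky_input, actual, *fit_line(leaky_input, actual))

print(f"Error with independent input: {independent_error:.3f}")
print(f"Error with leaky input:       {leaky_error:.3f}")
```

The leaky model's error is essentially zero, but that says nothing about its predictive power. It simply handed back the number it was given.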

Conclusion

Today we illustrated the predictive validation framework through an example of predicting Apple’s stock price. This predictive validation framework is very general and can be used to validate any model (or algorithm).

To properly validate a model (any model), we must be able to compare the model’s predicted outcome with an independent measure of the outcome. Here, the outcome can be literally anything (e.g. stock price, influence, weather, earthquake, etc.). I’d like to re-emphasize the importance of having an independent measure that is truly independent of the model. That means you cannot use this measure anywhere in your model. Otherwise, the validation procedure will be confounded by circular reasoning.

Alright, now you know how to validate any model. Next time, we shall apply this framework to analyze the models that influence vendors use to score people’s influence, and we will be able to answer the question posed at the beginning of this post: How do you know if your influence score is correct?

Stay tuned... Have a warm and relaxing Thanksgiving... And see you next time.

Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.

Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.


Great read!  The other hard part of the process is defining what counts as a type 1 or type 2 error when analyzing the data.  Yes, you may get a model that accurately predicts, in this case, the price of a stock.  However, it may have been measuring information that truly doesn't provide conclusive evidence.  On the other hand, you may have the right data, but how you draw the conclusion is skewing the result.  Thanks again for the article Michael!


Hello AaronEllsworth,

Thank you for taking the time to comment.

You are absolutely right. Analyzing the error is just as important as analyzing the model and its prediction. In fact, I think of them as complementary. You cannot really understand the model unless you understand what kind of error it’s making. Unfortunately, many folks don't care enough about statistics to learn how to analyze the model, which is why I wanted to write this post and emphasize the importance of validation.

For those of you who don’t know, Type I errors are basically false positives (i.e. the model wrongly predicts something interesting that is actually not there), and Type II errors are basically false negatives (i.e. the model says there isn't anything interesting, but there actually is something). There are more rigorous definitions of what these errors mean in terms of classical hypothesis testing.
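As a rough illustration, here is how the two kinds of error could be counted in Python for a model that makes yes/no predictions. The predictions and outcomes below are made-up values, and the function name is just one I picked for this sketch.

```python
def count_errors(predicted, actual):
    """Return (false positives, false negatives) for yes/no predictions."""
    # Type I error: the model said "yes" but the measured outcome was "no".
    false_positives = sum(1 for p, a in zip(predicted, actual) if p and not a)
    # Type II error: the model said "no" but the measured outcome was "yes".
    false_negatives = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return false_positives, false_negatives


# Made-up example: did the model predict "something interesting", and was it there?
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]

type_1, type_2 = count_errors(predicted, actual)
print(f"Type I errors (false positives): {type_1}")
print(f"Type II errors (false negatives): {type_2}")
```

Note that this also requires an independently measured outcome (the `actual` list). Without it, neither kind of error can be counted.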

Glad to see someone who knows stats joining the discussion here.

Thanks for the comment and I hope to see you again in the future.

Hello Mike,

I started working on influence analysis in social networks and found your articles very inspiring, informative and useful. I'm reading all of your articles, and thanks for all these discussions.

It's true that testing a model is always an important part of any product development. People have tried to use other available scores as a benchmark and test accordingly, which should not be the case. As you discussed earlier, and as we can also observe, all of those available sources have their own limitations or negatives.

To avoid this, we should aim for a testing or verification procedure which is independent of any model, and its results should be intuitive for human beings. Because this is what our ultimate target is {to model human thoughts :-) }.

I would like to know if you have also discussed somewhere the algorithms or methods from Machine Learning which we can consider for better modelling. If not, are there any plans to at least give an overview of them?

Thanks,

Nitin


Hello Nitin,

Thank you for your nice comment and the inquiry. I'm glad to hear that you find my work inspiring and useful.

If you are interested specifically in influence analysis on social networks, you can find all my writings under the label "." The earliest set of writings on this topic has also been compiled into a chapter: My Chapter on Influencers. Some of the posts in there talk about the model and algorithm I used for computing or estimating people's potential to influence. These later articles will go into Chapter 2 someday.

I've given talks at conferences and meetings on how I validate my influence algorithm using an independent data source that we have, but I have not written on that topic yet. However, I do plan to write about it later.

Alright, not sure if this helps, but at least I do plan to write about our validation mechanism later.

Thanks again for your interest, and I hope to see you again on my blog.