Data Science

The 2nd Fallacy of Big Data - Information ≠ Insights

Since we digressed into the topic of influence over the past month, it’s time to return to big data and talk about another big data fallacy.


In my previous Big Data posts, we discussed the data-information inequality (a.k.a. the Big Data Fallacy): information << data. We talked about what it is, how to quantify it, and why it is the way it is. We delved pretty deeply into some nontrivial concepts and statistical properties of big data, so the discussion got a little mathematical. If you like the technical details, have a quick read of the following posts:

  1. The Big Data Fallacy: Data ≠ Information
  2. How Much Information is in Your Big Data and How Can You Measure It?
  3. Why is there so Much Statistical Redundancy in Big Data?


Today, I want to talk about the second fallacy of big data and discuss the distinction between information and insights. I promise I won’t go too deep into the statistics. But before I begin, I want to tie up a few loose ends concerning the statistical redundancy in big data.


Statistical Redundancy is not Bad

Although redundancy limits the amount of information that can be extracted from any data set, it is not inherently bad, because the redundancy in any data set is a direct reflection of the correlations that exist in nature (see Why is there so Much Statistical Redundancy in Big Data?). So you shouldn’t try to remove redundancy from your data. If you do, your data will no longer accurately reflect reality, and the information you extract from it will no longer be useful.


For example, retweets create a lot of redundancy in Twitter’s data. They inflate the data volume tremendously and turn Twitter into a big data company, whereas the actual information in Twitter’s data is several orders of magnitude smaller. But the redundancy created by retweets reflects the reality that people like some content more than other content. If you remove all retweets, you reduce the redundancy in Twitter’s data: the data volume shrinks, and the gap between data volume and information volume narrows. But then you can no longer see which content people like more. In fact, you would conclude that all tweets are equal, which is not interesting, not useful, and totally incorrect.
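The retweet example can be sketched in a few lines of Python. This is only a toy illustration: the `timeline` list and its contents are made up, and a real pipeline would identify retweets by ID rather than by matching text.

```python
from collections import Counter

# Hypothetical toy timeline: each entry is the text of a tweet or of a retweet of it.
timeline = [
    "launch day!",
    "launch day!",      # retweet
    "launch day!",      # retweet
    "launch day!",      # retweet
    "lunch was ok",
    "new blog post",
    "new blog post",    # retweet
]

# With retweets kept, the counts show which content people liked more.
with_retweets = Counter(timeline)

# With retweets removed, every distinct tweet appears exactly once.
without_retweets = Counter(set(timeline))

print(with_retweets.most_common())
print(without_retweets.most_common())
```

After deduplication the data is smaller, but every count is 1, so the popularity signal that the redundancy carried is gone.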


Statistical redundancy in big data is neither good nor bad. It inflates the data volume significantly without increasing the actual information content. However, it is also a direct reflection of the way things operate in nature and should not be eliminated. It is simply an intrinsic property of all data, including big data. We just have to understand it and live with it.
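One rough way to see volume growing without information growing is to compare raw size with compressed size. The sketch below uses Python's standard `zlib` module. Compressed size is only a crude proxy for information content, not the entropy measure discussed in the earlier posts, and the message and repeat count are arbitrary choices for illustration.

```python
import zlib

# Hypothetical message and repeat count, chosen only for illustration.
message = b"Big data is not the same thing as big information. "
repeats = 1000

original = message
redundant = message * repeats  # same content, repeated many times

def compressed_size(data: bytes) -> int:
    """Size in bytes after DEFLATE compression: a rough proxy for information content."""
    return len(zlib.compress(data, 9))

raw_growth = len(redundant) / len(original)
compressed_growth = compressed_size(redundant) / compressed_size(original)

print(f"raw volume grew        {raw_growth:.0f}x")
print(f"compressed volume grew {compressed_growth:.1f}x")
```

The raw volume grows by the full repeat factor, while the compressed volume grows far less, because the repeated copies add almost nothing new.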


The Second Fallacy of Big Data: Insight << Information

OK, now we are ready to discuss the second fallacy of big data. The promise of big data is that one could extract lots of information and uncover valuable insights from it. With the data-information inequality, we learned that the total amount of information we can extract from big data is actually much smaller than the raw data volume. Now, the question is what about valuable insights?


Insights are information, but not all information provides insights. There are three criteria for information to provide valuable insights:

  1. Interpretable
  2. Relevant: this criterion significantly restricts the amount of valuable insights we can derive from big data (see If 99.99% of Big Data is Irrelevant, Why Do We Need It?)
  3. Novel


If a piece of information fails any one of these criteria, it cannot be a valuable insight. Applied one after another, these three criteria restrict insights to a tiny subset of the extractable information. Out of a thousand bits of information we extract from big data, we’d be lucky if just one bit is a valuable insight. So in general, the second fallacy of big data is: insight << information.


This can be combined with the data-information inequality (a.k.a. the first fallacy of big data): information << data.


So both big data fallacies can be summarized in a single inequality relationship: insight << information << data.


So even with big data, the probability of finding valuable insights in it is still abysmal. This may sound disappointing, but believe it or not, these big data fallacies are actually strong arguments for why we need big data. We just have to look at this inequality from the other side.


Since the amount of valuable insight we can derive from big data is so tiny, we need to collect even more data to increase our chance of finding it. If 1% of the human population were geniuses, you would be much more likely to find one in a random sample of more than 100 people than in a smaller one. Unfortunately, the probability of discovering an insight is much smaller than 1%. That’s why we need petabytes of data and powerful analytics to have any hope of finding that million-dollar insight.
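The genius analogy can be made concrete with a small calculation. This is a minimal sketch, assuming each item examined is an independent draw with the same hit rate p, so the chance of at least one hit in n draws is 1 - (1 - p)^n. The function names and the one-in-a-million insight rate are illustrative assumptions, not figures from the discussion above.

```python
import math

def chance_of_at_least_one(p: float, n: int) -> float:
    """Probability that n independent draws contain at least one hit, each with rate p."""
    return 1.0 - (1.0 - p) ** n

def sample_size_needed(p: float, target: float) -> int:
    """Smallest n for which the chance of at least one hit reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

genius_rate = 0.01     # the 1% figure from the text
insight_rate = 1e-6    # hypothetical "much smaller than 1%" rate, for illustration only

for n in (10, 100, 1000):
    print(f"sample of {n:>4}: {chance_of_at_least_one(genius_rate, n):.1%} chance of a genius")

print("items needed for a 95% chance of one insight:",
      sample_size_needed(insight_rate, 0.95))
```

Under these assumptions, a sample of 100 at a 1% rate gives only about a 63% chance of containing a genius, and the sample needed for a fixed level of confidence grows roughly in proportion to 1/p as the hit rate shrinks.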



To recap, we first clarified that statistical redundancy is an intrinsic property of all data, big data included. Even though it limits the amount of information we can extract from big data, it is not bad, and we shouldn’t try to remove it. Moreover, statistical redundancy reflects the reality of what we are measuring.


The second big data fallacy is the belief that big data will give us a lot of valuable insights. This is not true, because insight << information << data. Insights are information, but information must satisfy three criteria to provide valuable insights:

  1. Interpretability
  2. Relevance
  3. Novelty


These criteria imply that insights are a much smaller subset of information. Although big data cannot guarantee the revelation of many insights, increasing the data volume does increase the odds of finding them.


Next time we will examine these three criteria more carefully, so we know where to look within big data to find insights. In the meantime, let’s have some open discussion about the path from data to information to insights. If you have any inspirational stories about how you discovered insights from data, feel free to share them here.


Stay tuned till next time...



Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.


Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.


  • Big Data

Hi Michael,

As always, your post is simple and precise, and so rigorous :-) Thanks a lot for this work.




Hello Raphaelle,


Thank you for the nice compliment.

I hope it is as helpful as it is enjoyable.


Hope to see you next time.

Hi Michael,


Post is simple and clear to understand.

I feel this theory of yours is obvious for data of any size, be it small or big. The raw data is always larger than the meaningful information derived from it, and the meaningful insights derived from that information are fewer still.



Hello Aravind,


Thank you for the compliment.


However, I don't think this is a theory per se. It's simply an observation of fact based on many years of working with data. It's a simple inequality, but it does have quite a bit of depth to it, especially the 1st fallacy of big data. If you are interested in the deeper mathematics behind it, I recommend you take a look at 2 of my earlier posts on this subject.


  1. How Much Information is in Your Big Data and How Can You Measure It?
  2. Why is there so Much Statistical Redundancy in Big Data?


There are plenty of links in the above posts if you'd like to learn more about information theory.


OK, thanks again for taking the time to comment. 

See you next time.



Michael, I believe that the missing ingredient is context, which is typically in the form of data, but it needs to be derived and correlated. So, data + context (or metadata) = information. And information + context (or metainformation) = insight. Big Data not only needs to get bigger in order to capture more data that is potentially context for other data, but it also needs to get "deeper" in its analytics capabilities, in both heuristics and visualization, because the most valuable context is usually not what's most obviously relevant.


As the phrase goes, "Content (or data) may be king, but context is God." :-)


Happy holidays and best wishes for the new year!


 Hello Lawrence,


Thank you for the comment, and I apologize for the late reply due to the holiday.


That is a good point. Context is extremely important. In fact, I’ve already planned a post devoted to that subject. As you said, context is just other data (i.e. metadata) that help you understand the data. So you will need to extract the information from those data as well in order for them to help you interpret and understand other data too.


Anyway, in a later post, I will define what context is and when and where it is useful.

That is, when is context god, and when is it the devil.  ;-)  So stay tuned...


Just a clarification: the kind of information that I am talking about here is entropic information in the information-theoretic sense. People often speak of information loosely, but there is a rigorous mathematical definition of what information is. It is completely independent of context.


More precisely, if there are contextual data in the data set, the entropic information will encompass it all. If the contextual data and information are not within the same data set, then context is no longer a well-defined concept. You are basically adding information from your brain (i.e. your knowledge, experience, etc.) and mixing it with the data set that you are looking at. In that case, information becomes an interpretation. Depending on what contextual information you have beyond what is within the data set, you may interpret the data differently and gain different information and insights from the same data set. Interpretation is thus subjective. Moreover, any data set can have an infinite number of interpretations.


If you are still unclear about the information theoretic view of data and information, I suggest that you check out the discussions following this post: The Big Data Fallacy: Data ≠ Information.


Alright, I hope you have a wonderful Christmas and a happy new year.

See you around next time.


