Data Science

The Key to Insight Discovery: Where to Look in Big Data to Find Insights

Welcome back, I hope you all had a wonderful new year. Let's pick up where I left off last year.


Last time we talked about the second fallacy of big data: insights << information. This inequality arises because information must meet three criteria before it can provide valuable insights. The information must be:

  1. Interpretable
  2. Relevant
  3. Novel




Today we will examine these criteria to get a deeper understanding of what they really mean. Since these criteria narrow the location where insights are found to a tiny subset of the extractable information from big data, a clear understanding of these criteria will help us discover insights from big data.


The Interpretability of Data and Information

The first criterion is that an insight must be an interpretable piece of information. Because big data contains so much unstructured data, spanning different media and data types, a substantial amount of that data and information is actually not interpretable.


Take, for example, the sequence of numbers 123, 54, 246, 15, and 203. What do these numbers mean?


They could be the money you won at the poker table over the last five weeks; they could be color-channel values of five pixels in an image; or they could be the locations of gene loci where single base-pair mutations occurred. Without more information and metadata, there is no way to interpret what these numbers mean. The sequence does contain some information, but we just won’t know what that information represents. Since this data and the information it contains are not interpretable, they cannot provide any insight.
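To make the point about metadata concrete, here is a minimal Python sketch. The `CONTEXTS` table and the `interpret` function are hypothetical names made up for this illustration; they only show that the same five numbers stay uninterpretable until some metadata says what they measure.

```python
RAW = [123, 54, 246, 15, 203]

# Hypothetical metadata records (assumptions for illustration only).
# Each one gives the very same numbers a different meaning.
CONTEXTS = {
    "poker": {"label": "weekly poker winnings", "unit": "USD"},
    "pixels": {"label": "pixel channel values", "unit": "intensity (0-255)"},
    "genome": {"label": "mutation loci", "unit": "base-pair position"},
}


def interpret(values, context=None):
    """Describe the values, or return None when no metadata is available."""
    meta = CONTEXTS.get(context)
    if meta is None:
        return None  # the numbers carry information, but we cannot say what it means
    return f"{meta['label']} in {meta['unit']}: {values}"


print(interpret(RAW))            # no metadata, so no interpretation
print(interpret(RAW, "poker"))   # same numbers, now readable as money
print(interpret(RAW, "genome"))  # same numbers, now readable as positions
```

The raw list never changes between calls; only the metadata does, and that is what turns the numbers into something a person can read.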


Since insights must be interpretable, they lie within the interpretable part of your big data. And the interpretable information will always be a smaller subset of the total extractable information.


Easy, right? But don’t get excited too early. Interpretable information is not always useful. And if it’s not useful, it won’t have much value.


The Subjectivity of Relevance

This brings us to the second criterion. That is, the information must be relevant (i.e. useful) to provide valuable insights.


From criterion 1, we know that if you are looking for valuable insights, you must look within the interpretable information. Now, because you can interpret this information, you can always tell whether it is useful to you or not. But what’s useful is subjective, because relevance is in the eyes of the beholder. Information that is relevant and useful to me may be completely irrelevant to you, and vice versa.


This is what Edward Ng, a renowned mathematician and astrophysicist, means when he says, “One man’s signal is another man’s noise.” Furthermore, relevance is not only subjective; it is also contextual. What is relevant to a person may change from one context to another. If I’m visiting NYC next week, then the weather in New York will suddenly become very relevant to me. But after I return to SF, the same information will become irrelevant.


The important point here is that both signal and noise are subsets of the interpretable information extracted from your big data. Noise is the irrelevant information that you don’t want, and it is typically the much larger subset (see “If 99.99% of Big Data is Irrelevant, Why Do We Need It?”). Signal is the relevant information that you want, and it is usually a very tiny subset. Since any insight that is valuable to a person must also be relevant to that person, these insights form an even smaller subset within the relevant information (i.e. the signal), which is already a tiny subset of the interpretable information.


Remember, the amount of data and the amount of information are both absolute quantities. Both are objectively quantifiable by information theory. Contrary to common belief, information is NOT subjective. Information is NOT in the eyes of the beholder, but relevant information is. Subjectivity only enters when relevance and utility are concerned.
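As a rough illustration of this distinction, the sketch below computes Shannon entropy, one standard information-theoretic measure, for a made-up sequence of weather observations, and then applies a separate relevance check for two readers. The data, the `is_relevant` function, and the interest sets are all assumptions invented for this example.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Average information per symbol, in bits, of the empirical distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_relevant(topic, interests):
    """Relevance depends on who is asking (hypothetical helper)."""
    return topic in interests


# Made-up observations for illustration.
nyc_weather = ["sun", "rain", "sun", "fog", "rain", "sun", "sun", "rain"]

# The amount of information is the same no matter who measures it.
bits = shannon_entropy(nyc_weather)
print(f"{bits:.3f} bits per observation")

# Relevance, in contrast, changes with the observer and the context.
traveler_interests = {"nyc_weather"}   # visiting NYC next week
resident_interests = {"sf_weather"}    # staying in SF
print(is_relevant("nyc_weather", traveler_interests))
print(is_relevant("nyc_weather", resident_interests))
```

The entropy function takes no argument describing the reader, while the relevance check cannot be evaluated without one. That asymmetry is the whole point of the paragraph above.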


The Scarcity of Novelty

The third criterion for information to be insightful is that it must be novel. That means it must provide some new knowledge that you don’t already know.


Clearly this criterion is also subjective. The things I know are very different from what you know, so what is insightful to me may be old information to you, and vice versa. Part of this subjectivity is inherited from the subjectivity of relevance. If some information is irrelevant to you, then most likely you won’t know about it, so when you learn it, it will be new. Information that is irrelevant to you is more likely to be novel to you. But you probably wouldn’t care because it’s irrelevant, so even if it is novel it’s of no value to you.


It appears we must look within the relevant information (i.e. the signal) extracted from big data in order to uncover valuable insights. However, if you do this, most of the information you find will not be new. Once an insight is found, it’s no longer new and insightful the next time you find it. Therefore, as we learn and accumulate knowledge from big data, insights become harder to discover. The valuable insight that everyone wants is a tiny and shrinking subset of the relevant information (i.e. the signal).


Although there are many uses of big data purely for discovery purposes, most of these are in scientific research and academia. In business, data is often used to address very specific problems or decisions, so business analysts are usually looking for something very specific. Consequently, the relevance criterion will be very stringent, and the signal (i.e. the relevant information) will be constrained to a very tiny subset. Under such a restrictive scope of relevance, it is very hard to discover information that you don’t already know.


You Don’t Know What You Don’t Know

If you recall from the last section, it’s much easier to find new knowledge when you look within the irrelevant information (i.e. the noise), because that’s the information you most likely won’t already know.


The irrelevant nature of this information also means you probably won’t find anything useful or valuable most of the time, but that is a risk you have to take if you are trying to uncover insights that are truly novel. However, every so often you will find something that relates to some relevant information in ways you didn’t know before, and an insight is discovered. This apparently irrelevant information was really relevant after all. We only thought it was irrelevant because we didn’t know about its relationship to the relevant information. We simply didn’t know what we didn’t know.


Therefore, the key to uncovering insights is that you must sometimes look beyond the boundary of the relevant information, even when the value and utility are unclear. This is the same reason why innovation typically occurs at the boundary between several disciplines of knowledge. This is risky business, because there is no guarantee that you will ever find anything useful or valuable. As you venture into the irrelevant information, you often won’t know what you are looking for or what you will find. However, this type of exploratory data analysis (EDA) is crucial to discovering insights. We will devote another article to EDA in the future.



Today, we examined the three criteria necessary for information to provide valuable insights.


  1. The information must be interpretable: Big data actually contains a lot of unstructured data, as well as information within rich media, that is uninterpretable
  2. It must be relevant: Keep in mind that relevance (signal) and irrelevance (noise) are subjective and contextual, yet both are subsets of the same interpretable information
  3. It must provide something novel: In most business contexts, the relevance constraint is very stringent, and under such a tight constraint, novelty is very scarce
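The three criteria above can be read as successive filters. The Python sketch below is only a toy model: the `Item` class, the `funnel` function, and all of the data are hypothetical, and a boolean flag stands in for interpretability. It shows each stage shrinking the candidate set, novelty disappearing once a finding becomes known, and a wider scope of relevance surfacing something new.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    fact: str
    has_metadata: bool  # stand-in for "interpretable"
    topic: str


def funnel(items, interests, known):
    """Apply the three criteria in order and return every stage."""
    interpretable = [i for i in items if i.has_metadata]
    relevant = [i for i in interpretable if i.topic in interests]
    novel = [i for i in relevant if i.fact not in known]
    return interpretable, relevant, novel


# Toy data, invented for illustration.
items = [
    Item("123, 54, 246, 15, 203", False, "unknown"),
    Item("fact A", True, "sales"),
    Item("fact B", True, "sales"),
    Item("fact C", True, "support"),
    Item("fact D", True, "weather"),
]
interests = {"sales"}   # a narrow, business-style scope of relevance
known = {"fact A"}      # what the analyst already knows

interpretable, relevant, novel = funnel(items, interests, known)
print(len(interpretable), len(relevant), len(novel))

# Once found, an insight is no longer novel.
known |= {i.fact for i in novel}
_, _, novel_again = funnel(items, interests, known)
print(len(novel_again))

# Widening the scope into what was treated as noise yields new candidates.
_, _, explored = funnel(items, interests | {"support"}, known)
print([i.fact for i in explored])
```

In this toy model, widening `interests` is the only way to get new candidates after the first pass, although nothing guarantees that those candidates will turn out to be useful.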


The key to insight discovery is to understand that it is much easier to find new information that you don’t already know when you venture into the irrelevant information (or what you thought was irrelevant). You may not find anything useful or valuable (due to irrelevance), but when you do, it will be novel and insightful. So my advice to all of you is to be more exploratory when analyzing your big data, because that is how you discover insights and innovate.


OK, hopefully this post is insightful. Meanwhile, let me know what you want to hear next time, or we can further the discussion of this topic below.



Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.


Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.



Great piece Michael, you've inspired me to see that thinslicing connects the data with the behaviour that creates it :-)


Hello Stuart,


Thank you for the comment and glad to hear that this is an inspiration to you.


This is one of those aha moments that I had 10+ years ago when I was analyzing tons of brain data (fMRI and electrophysiology). I just like to share it with my friends in the industry. So I'm glad that it still has the power to inspire after so many years.


Feel free to discuss thinslicing, as I don't think most people know what it is.


Thanks for the comment and see you next time.


Thanks Michael, perhaps I should say what I don't mean by thinslicing as a strategic tool, which is summed up nicely in these two paragraphs written by Bob Thompson on the CustomerThink community:


"Despite our best efforts to collect and analyze data, good business decisions will always include elements of judgment, intuition or just plain luck. Many day-to-day decisions are made with little or no thought, because the option selected just seems "right." Gut-feel decisions might be examples of what Malcolm Gladwell called "thin-slicing" in his provocative 2005 bestseller Blink.


"However, the best decision can sometimes be counter-intuitive. For example, the financial services firm Assurant Solutions wanted to improve its "save" rate on customers calling in to cancel their protection insurance. The industry's conventional wisdom, which resulted in 15-16% retention rates, was to focus on reducing wait time to boost customer satisfaction. But data analysis found a solution that tripled the retention rate: matching customer service reps with customers based on rapport and affinity."


What I mean is the approach to data as you outline above which I categorize as thinslicing, coupled with the way consumers make purchasing decisions - which like good business "will always include elements of judgment, intuition or just plain luck".


In other words, by thinslicing I mean not the use of intuition to make decisions, but the strategy of connecting the means of analyzing the data with the way the data is created, to achieve better results. I would therefore be interested to hear if you have ever come across any such examples.


Hello Stuart,


Thank you for continuing the conversation. I apologize for the late reply. I’ve been away from the office until now.


I read Malcolm Gladwell’s Blink, and am quite familiar with his theory of thin-slicing. As I understand it, thin-slicing refers to people’s extraordinary ability to find patterns and outliers in a sea of data based on a very narrow window of experience (i.e. the thin slice). People have developed this skill over the evolutionary time scale, because in nearly all situations, we will never have ALL the data necessary to back our decisions. Most people simply pick some thin slice of data (i.e. data that is available or easy to obtain) from all the relevant data out there (which most people can’t access) to help them make a timely decision. So I see thin-slicing as one way that the majority of the population makes decisions.


Insight discovery certainly requires a lot of judgment. First of all, data scientists must decide which part of the big data to focus on: what is relevant (i.e. signal) and what is irrelevant (i.e. noise). They must decide what models to use, and what techniques and computation to employ. And if they want to venture out into the irrelevant data, they must decide which part to focus on. Remember, for any specific problem, most of the big data (probably >99.99%) is irrelevant, so the space of irrelevant data is huge. A systematic brute-force approach is likely to be futile, or will simply take too long to have any value. Therefore, analysts and data scientists must thin-slice when they are trying to extract information and find insights from big data.


However, thin-slicing alone is not sufficient, and that is where exploratory data analysis (EDA) comes in. As mentioned in the post, I will devote a post to this important topic next. Nevertheless, it is important to always consider how the data is generated, captured, pre-processed, stored, cleaned, etc. before analyzing it. Good data scientists must know everything that happens to the data, from its creation all the way to the point where they get their hands on it. This is actually a pretty standard practice for hardcore financial/business analysts. Not only do you need to be “connecting the means of analyzing the data with the way the data is created,” you must also know everything that happens to the data along the way, until the data reaches you (or the analyst). Only then can you be certain that your analysis is not biased or confounded by something that happened before you got your hands on the data. In statistical terms, only then can you know the confidence interval of your result.


Ok, I hope this discussion is helpful.

I will write about EDA next, so please stay tuned, and see you next time.


Thanks Michael, I take your point that: "Good data scientists must know everything that happens to the data, from its creation all the way to the point where they get their hands on it." Certainly makes sense. And now looking forward to your next post on exploratory data analysis!

Mike, thanks for the very interesting and insightful thoughts. Looking forward to your thoughts on data in motion and event processing. I work on these technologies, and I think people have yet to understand that the real and quick value will come from data in motion, followed by analysis of data at rest.

