The Key to Insight Discovery: Where to Look in Big Data to Find Insights
Welcome back, I hope you all had a wonderful new year. Let's pick up where I left off last year.
Last time we talked about the second fallacy of big data: insights << information. This inequality arises because information must meet three criteria to provide valuable insights. The information must be interpretable, relevant, and novel.
Today we will examine these criteria to get a deeper understanding of what they really mean. Since these criteria narrow the location where insights are found to a tiny subset of the extractable information from big data, a clear understanding of these criteria will help us discover insights from big data.
The Interpretability of Data and Information
The first criterion is that an insight must be an interpretable piece of information. Since big data contains so much unstructured data, as well as different media and data types, there is actually a substantial amount of data and information that is not interpretable.
Take for example, the sequence of numbers: 123, 54, 246, 15, and 203. What do these numbers mean?
It could be the money you won at the poker table over the last five weeks; it could be the RGB value of 5 pixels in an image; or it could be the location of a gene locus where a single base-pair mutation occurred. Without more information and metadata, there is no way to interpret what these numbers mean. The sequence does contain some information, but we don’t know what that information represents. Since this data and the information it contains are not interpretable, they cannot provide any insight.
Since insights must be interpretable, they lie within the interpretable part of your big data. And the interpretable information will always be a smaller subset of the total extractable information.
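To make the point concrete, here is a minimal Python sketch. The `Metadata` class, the `interpret` function, and the two example descriptions are hypothetical names introduced only for illustration. The same five numbers read differently depending on the metadata attached, and cannot be read at all without it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# The raw sequence from the example above.
VALUES = [123, 54, 246, 15, 203]


@dataclass
class Metadata:
    """Hypothetical metadata record: what the numbers describe, and in what unit."""
    description: str
    unit: str


def interpret(values: Sequence[int], meta: Optional[Metadata]) -> str:
    """Return a readable interpretation, or flag the data as uninterpretable."""
    if meta is None:
        return "uninterpretable: no metadata"
    readings = ", ".join(f"{v} {meta.unit}" for v in values)
    return f"{meta.description}: {readings}"


# Two assumed contexts for the very same numbers.
poker = Metadata(description="weekly poker winnings", unit="USD")
gene = Metadata(description="mutation positions on a gene", unit="bp")

print(interpret(VALUES, None))
print(interpret(VALUES, poker))
print(interpret(VALUES, gene))
```

The numbers never change; only the metadata does. That is why interpretability is a property of the data together with its context, not of the raw values alone.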
Easy, right? But don’t get excited too early. Interpretable information is not always useful. And if it’s not useful, it won’t have much value.
The Subjectivity of Relevance
This brings us to the second criterion. That is, the information must be relevant (i.e. useful) to provide valuable insights.
From criterion 1, we know that if you are looking for valuable insights, you must look within the interpretable information. Now, because you can interpret this information, you can always tell whether it is useful to you or not. But what’s useful is subjective, because relevance is in the eye of the beholder. Information that is relevant and useful to me may be completely irrelevant to you, and vice versa.
This is what Edward Ng, a renowned mathematician and astrophysicist, means when he says “One man’s signal is another man’s noise.” Furthermore, relevance is not only subjective, it is also contextual. What is relevant to a person may change from one context to another. If I’m visiting NYC next week, then the weather in NY will suddenly become very relevant to me. But after I return to SF, the same information will become irrelevant.
The important point here is that both signal and noise are subsets of the interpretable information extracted from your big data. Noise is the irrelevant information that you don’t want, and it is typically a much larger subset (see If 99.99% of Big Data is Irrelevant, Why Do We Need It?). Signal is the relevant information that you want, and it is usually a very tiny subset. Since any valuable insight to a person must also be relevant to him, these insights are an even smaller subset within the relevant information (i.e. signals), which is already a tiny subset of the interpretable information.
Remember, the amount of data and the amount of information are both absolute quantities. Both are objectively quantifiable by information theory. Contrary to common belief, information is NOT subjective. Information is NOT in the eye of the beholder, but relevant information is. Subjectivity only enters when relevance and utility are concerned.
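As a rough illustration of "objectively quantifiable," the sketch below computes the empirical Shannon entropy of a sequence in Python. This is only one simple estimate of information content, chosen here for illustration; the function name `empirical_entropy_bits` is an assumption of this sketch. Notice that the function takes no argument describing who is looking at the data.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def empirical_entropy_bits(values: Sequence[Hashable]) -> float:
    """Shannon entropy (bits per symbol) of the observed symbol frequencies."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


values = [123, 54, 246, 15, 203]
h = empirical_entropy_bits(values)

# Five distinct, equally frequent symbols give log2(5) bits per symbol.
print(f"{h:.3f} bits per symbol")
```

The result is the same for every observer. Whether those bits are signal or noise is a separate, subjective question that depends on the observer and the context.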
The Scarcity of Novelty
The third criterion for information to be insightful is that it must be novel. That means it must provide some new knowledge that you don’t already know.
Clearly this criterion is also subjective. The things I know are very different from what you know, so what is insightful to me may be old information to you, and vice versa. Part of this subjectivity is inherited from the subjectivity of relevance. If some information is irrelevant to you, then most likely you won’t know about it, so when you learn it, it will be new. Information that is irrelevant to you is more likely to be novel to you. But you probably wouldn’t care because it’s irrelevant, so even if it is novel it’s of no value to you.
It appears we must look within the relevant information (i.e. the signals) extracted from big data in order to uncover the valuable insights. However, if you do this, most of the information you find will not be new. Once an insight is found, it’s no longer new and insightful the next time you find it. Therefore, as we learn and accumulate knowledge from big data, insights become harder to discover. The valuable insight that everyone wants is a tiny and shrinking subset of the relevant information (i.e. the signal).
Although there are many uses of big data purely for discovery purposes, most of these are in scientific research and academia. In business, data is often used to address very specific problems or decisions, so business analysts are usually looking for something very specific. Consequently, the relevance criterion will be very stringent, and the signal (i.e. the relevant information) will be constrained to a very tiny subset. Under such a restrictive scope of relevance, it is very hard to discover information that you don’t already know.
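The nesting of the three criteria can be sketched with plain Python sets. Everything here is a toy model under stated assumptions: the example facts, the `find_insights` function, and the `relevant_to_me` predicate are all hypothetical and stand in for whatever your own analysis extracts.

```python
from typing import Callable, Set


def find_insights(
    interpretable: Set[str],
    is_relevant: Callable[[str], bool],
    known: Set[str],
) -> Set[str]:
    """Insights = interpretable information that is relevant and not yet known."""
    signal = {item for item in interpretable if is_relevant(item)}
    return signal - known


# Hypothetical interpretable facts extracted from some data set.
interpretable = {
    "churn rose 5% in Q4",
    "login issues top the complaint list",
    "weather in NYC next week",
    "office plants were watered",
}

# Relevance is subjective: this predicate encodes one analyst's scope.
my_topics = ("churn", "complaint")


def relevant_to_me(item: str) -> bool:
    return any(topic in item for topic in my_topics)


known: Set[str] = set()

first_pass = find_insights(interpretable, relevant_to_me, known)
known |= first_pass  # once found, an insight becomes existing knowledge

second_pass = find_insights(interpretable, relevant_to_me, known)

print(len(first_pass), len(second_pass))
```

The second pass over the same signal returns nothing new, which is the shrinking-subset effect described above: the more you already know, the fewer insights the relevant information can still yield.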
You Don’t Know What You Don’t Know
If you recall from the last section, it’s much easier to find new knowledge when you look within the irrelevant information (i.e. the noise), because that’s the information you most likely won’t already know.
The irrelevant nature of this information also means you probably won’t find anything useful or valuable most of the time, but that is a risk you have to take if you are trying to uncover insights that are truly novel. However, every so often you will find something that relates to some relevant information in ways you didn’t know before, and an insight is discovered. This apparently irrelevant information was really relevant after all. We only thought it was irrelevant because we didn’t know about its relationship to the relevant information. We simply didn’t know what we didn’t know.
Therefore, the key to uncovering insights is that you must sometimes look beyond the boundary of the relevant information, even when the value and utility are unclear. This is the same reason why innovation typically occurs at the boundary between several disciplines of knowledge. This is risky business, because there is no guarantee that you will ever find anything useful or valuable. As you venture into the irrelevant information, you often won’t know what you are looking for or what you will find. However, this type of exploratory data analysis (EDA) is crucial to discovering insights. We will devote another article to EDA in the future.
Today, we examined the three criteria necessary for information to provide valuable insights.
- The information must be interpretable: Big data actually contains a lot of unstructured data, and information within rich media, that is uninterpretable
- It must be relevant: Keep in mind that relevance (signal) and irrelevance (noise) are subjective and contextual, yet both are subsets of the same interpretable information
- It must provide something novel: In most business contexts, the relevance constraint is very stringent. And under such a tight constraint of relevance, novelty is very scarce
The key to insight discovery is to understand that it is much easier to find new information that you don’t already know when you venture into the irrelevant information (or what you thought was irrelevant). You may not find anything useful or valuable (due to irrelevance), but when you do, it will be novel and insightful. So my advice to all of you is to be more exploratory when analyzing your big data. Because that is how you discover insights and innovate.
OK, hopefully this post is insightful. Meanwhile, let me know what you want to hear next time, or we can further the discussion of this topic below.
Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.