Data Science

The Big Data Fallacy: Data ≠ Information

I know; it's been a while since I blogged. This past weekend, I actually received a tweet asking "are you dead?" It's a little dramatic, but fair. I've been kind of dead on social media for about 4 months, even though I am still alive in person. I've just been too busy with other work, which I'll tell you about later, but that's not an excuse for not writing.


To prove that I’m not dead, I will restart blogging again! Today, I want to pick up where I left off with my mini-series on big data, so let me jump right in because I want to make up for lost time.


Data ≠ Information

Today, we are going to talk about data and information and the difference between them. Although they are different, many people speak of them as if they are synonymous, which is almost never true. However, the difference between data and information is quite subtle, so let’s try to understand it.


Data is simply a record of events that took place. It is the raw record that describes what happened, when, where, how, who was involved, etc. Well, isn't that informative? Yes, it is!


Data does give you information. However, the fallacy of big data is that more data doesn't mean you will get proportionately more information. In fact, the more data you have, the less information you gain as a proportion of the data. That means the information you can extract from any big data set yields asymptotically diminishing returns as the data volume increases. This may seem counterintuitive, but it is true. Let's see if we can clarify this with a few examples.


Example 1: data backups and copies

If you look inside your computer, you will find thousands of files you've created over the years. Whether they are pictures you took, emails you sent, or blogs you wrote, they contain a certain amount of information. These files are stored as data on your hard drive, which takes up a certain amount of storage.


Now, if you are as paranoid as I am, you probably back up your hard drive regularly. Think about what happens when you back up your hard drive for the first time. In terms of data, you've just doubled the amount of data you have. If you had 50 GB of data on your hard drive, you would have 100 GB after the backup. But will you have twice the information after the backup? Certainly not! In fact, you gain NO additional information from this operation, because the information in the backup is exactly the same as the information in the original drive.


This happens at the file level too. Each of the thousands of files in your computer contains some fixed amount of information. If you made 100 copies of a file, you would increase the amount of data on your hard drive by 100x the size of the original file. Yet the amount of information you gain is zero.


Although our personal data is not big data by any means, this example illustrates the subtle difference between data and information, and they are definitely not the same animal. Now let’s look at another example involving bigger data.


Example 2: airport surveillance video logs

Firstly, video files are already pretty big. Secondly, closed-circuit television (CCTV) monitoring systems in an airport are on 24/7, and high-definition (HD) devices increase the data volume further. Moreover, there are hundreds and probably thousands of security cameras all over the airport. So as you can see, the video logs created by all these surveillance cameras would probably qualify as big data.


Now, what happens when we double the number of camera installations? In terms of data volume, you will again get 2x the data. But will you get 2x the information? Probably not! Many of the cameras are probably seeing the same thing, perhaps from a slightly different angle, sweeping different areas at slightly different times. In terms of information content, we almost never get 2x. Furthermore, as the number of cameras continues to increase, the chance of information overlap also increases. That is why, as data volume increases, information always yields diminishing returns: more and more of it is redundant.


A simple inequality characterizes this property: information ≤ data. So information is not data; it's only the non-redundant portion of the data. That is why when we copy data we gain no information even though the data volume increases: the copied data is redundant.


Example 3: updates on multiple social channels

What about social big data, like tweets, updates, and shares? If we tweet twice as often, Twitter is definitely getting 2x more data from us. But will Twitter get 2x the information? That depends on what we tweet. If there is absolutely zero redundancy among all our tweets, then Twitter will have 2x the information. But that practically never happens. Let's think about why.


First of all, we retweet each other, so many tweets are redundant due to retweeting. Even if we exclude retweets, the chance that we coincidentally tweet about the same content is actually quite high, because there are so many tweeters out there. Although the precise wording of each tweet may differ, the redundancy among all the tweets sharing the same web content (whether it's a blog post, a cool video, or a news article) is very high. Finally, our interests and tastes in content are not random; they remain fairly consistent over time. Since our tweets tend to reflect our interests and tastes, even apparently unrelated tweets from the same user will have some redundancy, because the tweeter's interests and tastes are the same.


Clearly, even if we tweet twice as often, Twitter is not going to get 2x the information, because there is so much redundancy among our tweets (likewise with updates and shares on other social channels). Furthermore, we often co-syndicate content across multiple social channels. Since this is merely duplicate content, it doesn't give us any extra information about the user.



We’ve seen three examples that illustrate the subtle difference between data and information. Although data does give rise to information, they are not the same. Information is only the non-redundant parts of the data. Since most data, regardless of how it is generated, has lots of built-in redundancy, the information we can extract from any data set is typically a tiny fraction of the data’s sheer volume.


I refer to this property as the data-information inequality: information ≤ data. And in nearly all realistic data sets (especially big data), the amount of information one can extract from the data is always much less than the data volume: information << data.



As a data scientist, I certainly recognize the importance of big data. Moreover, data is the foundation and key to most of my research. But even as a data scientist, I must confess that the value of big data is really overrated, because the value of big data lies in the information it can provide. And information is only the non-redundant portion of the data, which is a tiny and diminishing fraction of the overall data volume.



Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the dynamics of communities + social networks.


Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.



Trying to equate "data" with "information" is not really a relevant discussion. Information is all in the eye of the beholder. In the social example you gave, maybe I am interested in knowing how users retweet. Not sure why, but who knows, maybe somebody cares about this, and for them this is information. Same with the multiple surveillance camera scenario. The multiple angles may contain valuable "information" to someone, but on a typical day it is just duplicate data.

I think we are going through a cycle where capturing and trying to make sense of vast volumes of data (social data, sensor data, etc.) is becoming more economical and somewhat mainstream with respect to technology and tools. However, this is cyclic, I believe; at some point businesses will realize that maybe they are getting diminishing returns on all this data they are capturing and storing. For example, do I really care what I tweeted 20 years ago (20 years from now)? I probably will never have the time to go back and look at it, and I am not sure it is valuable to any marketing person (but who knows).

There is definitely gold to be mined in many data sets that now go untapped, and technologies like Hadoop, BigQuery, and Storm, to name a few, are good tools to use, but I have to agree that not everything fits into the Big Data tent either.

There has been a lot of hype around Big Data these days, and I see a lot of people trying to shoehorn problems into Hadoop that have no reason being there, other than it being the cool thing to do. You could do the data crunching in easier ways, for example. However, the tool sets are expanding to give developers, scientists, and business people more options when deciding how to store and analyze their data.

When thinking of Big Data, first ask yourself the following questions:

1) How much data do I want to capture and store (do you need to persist detailed records/data)?
2) How fast is this data being created (velocity)?
3) How long do I want to keep it (forever)?
4) How long am I willing to wait to get "information" when I run my analysis (batch/hourly/daily or realtime)?

This might help you determine in which of the particular emerging Big Data technology buckets your problem best fits.


Hello Sam,


First, thank you for commenting on my blog.

What you are describing, information that is in the eye of the beholder, is actually the difference between signal (information that you want) and noise (information that you DON'T want) rather than the distinction between data and information. Unlike signal and noise, which are rather subjective because they depend on the problem you are trying to solve and the models you use, information actually has a very rigorous definition in statistics.

The information I am talking about is entropic information (a.k.a. Shannon information) in statistics. This concept of information was first defined by Claude Shannon in his groundbreaking information theory, which revolutionized the entire field of communication and enabled the wireless communication that we now take for granted. This entropic information relates to information in the layman's sense in that it is an absolute upper bound on the amount of information that people can extract from the data. So regardless of whether it is useful info (signal) or useless info (noise), which is in the eye of the beholder, the entropic info of a data set is the maximum amount of information anyone can possibly extract from the data set with any method or technology.
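For the curious, here is a minimal Python sketch of Shannon's entropy formula, H = -Σ p·log2(p), applied to the empirical character distribution of a string (the example strings are contrived to make the point):

```python
from collections import Counter
from math import log2

def entropy_bits_per_char(text: str) -> float:
    """Empirical Shannon entropy of the character distribution, in bits."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in counts.values())

uniform = "abcdefgh" * 100   # 8 equally likely symbols -> 3 bits per char
skewed  = "aaaaaaab" * 100   # mostly 'a' -> far less than 3 bits per char

print(entropy_bits_per_char(uniform))  # 3.0
print(entropy_bits_per_char(skewed))   # ~0.54
```

Both strings occupy the same number of bytes, but the skewed (redundant) one carries far less information per character: that gap is exactly the data-information inequality at work.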


So information (in statistics) is definitely not in the eye of the beholder. Given a data set, the entropic information can be computed and has a well-defined value. But signal (as opposed to noise, i.e., the relevant and useful information) is all in the eye of the beholder (see If 99.99% of Big Data is Irrelevant, Why Do We Need It?). And that is even smaller than the absolute entropic information we are talking about in this article. So the data-information inequality should say:


signal (the relevant and useful information) << information << data

What you said about the diminishing return on data will definitely happen even if it is not due to data aging, because the more data you collect, the less information (in the statistical sense) you gain: the more data you have, the greater the probability of redundancy.


BTW, information theory is very powerful and can help you extract information and signal (i.e., the relevant information) from big data and make sense of it. But most people who focus only on the technology, such as the SMAQ stack and the tools you mentioned, just don't know enough about it.


There is definitely much hype about big data; that is why I started this mini-series. And since I do a lot of research and statistical modeling with big data, there will be more to come.

I hope you will enjoy the rest of this mini-series. See you next time.

It's good that you've drawn attention to the fact that data isn't worth much by itself, but I think it's very difficult to gauge the value of raw data and decide what to keep or discard.


An example arises within one of your own cases: you might not see retweets as valuable because the content is mostly duplicated data; however, it might be important if you wanted to decide what time of day is best to share a product announcement to maximize its virality.


One of the biggest benefits of Big Data technologies is that we can store first and ask questions later.

"data >> information" is a really cool thought, and I love the idea of a law of diminishing returns with data size.



But I suppose the size of the data might in itself be a valuable piece of information.


Here is why ...


1.  By counting Tweets, I can find out how many people something is affecting, and who they are.


2.  By using more data in estimating trends, I may be getting more accurate estimates.  When I have 2x data, I am performing a coin toss experiment 2x times and consequently getting a more accurate reading of the statistical parameters.


With that caveat, I agree with you wholeheartedly!  :^)


Cohan (Aiaioo Labs)

Cohan Sujay Carlos


2.  By using more data in estimating trends, I may be getting more accurate estimates.  When I have 2x data, I am performing a coin toss experiment 2x times and consequently getting a more accurate reading of the statistical parameters.





I might argue that your point #2 actually cuts the other way and confirms the OP. Think about it this way: if you flip a coin 10 times and measure heads and tails, you might finish with 6 heads and 4 tails (a 3 to 2 ratio). Obviously, if you flip it 1,000 more times (more data, as you note), you will begin to approach a 1 to 1 ratio. Your information will indeed improve for a while, but the gains diminish as the number of flips grows. So, would flipping the coin a billion times provide much more information than flipping it one million times? It may provide a minute amount of information (and maybe in some cases that minutiae is worthwhile, but not really in this example). As you approach the limit of the 1 to 1 ratio, information << data.
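To put rough numbers on this: the standard error of the estimated heads probability shrinks like 1/sqrt(n), so each 100x increase in flips buys only a 10x increase in precision. A quick simulation (seeded for reproducibility):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def estimate_heads_prob(n_flips: int) -> float:
    """Estimate P(heads) of a fair coin from n_flips simulated tosses."""
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# Standard error of the estimate is sqrt(p*(1-p)/n) = sqrt(0.25/n):
# 100x more data gives only a 10x more precise estimate.
for n in (100, 10_000, 1_000_000):
    se = (0.25 / n) ** 0.5
    print(n, round(estimate_heads_prob(n), 4), se)
```

Going from ten thousand to a million flips multiplies the data by 100 but shaves the standard error only from 0.005 to 0.0005: diminishing returns in action.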

More data also means more low-quality data if it is gathered ad hoc. It matters whether the data comes from a controlled process: modern jet engines produce big-data quantities on long flights, but it is all consistent, while capturing social media conversations means the data is extremely messy, inconsistent, and full of, frankly, questionable data points.


Great article. Thanks for posting it! I've started following you on Twitter!


Hello Jon Davey,


Thank you for the comment. 


You are totally right that one of the advantages of big data technology is that you can store first and ask questions later. Here are a couple of earlier posts I wrote that address your comment:

     1. If 99.99% of Big Data is Irrelevant, Why Do We Need It?

     2. Searching and Filtering Big Data: The 2 Sides of the “Relevance” Coin


To answer your question about retweets: first, redundancy or redundant data is NOT the same as irrelevant data. And redundancy is also NOT the same as exact copies or duplicates. In statistics and information theory, there is a very precise and rigorous definition of what redundancy is. So when I say retweets contain redundant data and therefore do not contribute additional information, it doesn't mean they are not valuable.


In terms of Twitter data, no two retweets are exactly the same: even if the textual portion is identical, the tweeter is different, and they retweeted the tweet at different times. These are the NON-redundant portions of the data, so we can learn something and gain information from them, and they do contribute to the overall information content of the data set. But the text that is identical across retweets from different users at different times, locations, etc., is the redundant portion of the data set, and we will not gain more information from those parts.


See the difference? Even retweets that look exactly the same have some non-redundant portions (in addition to the redundant portions). Only the non-redundant portions contribute to the overall information content of the data set; the redundant parts don't add information.


All data has built-in redundancy. The mere fact that we are tweeting in English means the data will have SOME redundancy, because we use the same common words and the same English alphabet. But not ALL of it is redundant. Again, if you see my response to Sam (above), you will understand precisely what I mean by redundancy: statistical, entropic redundancy, not mere duplication. To be rigorous, even exact copies of data have some non-redundant parts, from the mere fact that they are in two different locations. I will think about how to illustrate this abstract concept of entropic redundancy and Shannon information in a later article.


But in the meantime, I hope this clears things up.

Hope to see you next time.



Hello jscranton and Cohan Sujay Carlos,


First of all, thanks for the comment and discussion.

Now, let me try to address both of your questions and discussions together here since they are related.


So the question is: does more data lead to a more accurate statistical estimate? Despite the common belief, the surprising reality is that this is NOT always true.


It really depends on your statistical model and your data. If you are merely computing simple statistics, such as an average, then it is USUALLY true. But not ALWAYS, because when we compute the arithmetic mean by averaging, we usually assume the underlying distribution of the data is Gaussian, or at least unimodal. These assumptions may not hold.


A classic example involves a group of scientists trying to estimate the average height of a certain family of cactus in the desert. They had different groups collect data from deserts all over the world and naively computed the average, which led to a completely wrong answer. The reason is the bimodality of the cactus data. It turns out that the cacti measured in Africa were all very tall: about 7-8 ft, with a standard deviation of 1 ft. But the ones measured in South America were very short: about 1 ft tall, with a standard deviation of about 0.5 ft. The average they got was about 4 ft +/- 2 ft. But there is not a single cactus from that family anywhere in the world that is roughly 4 ft tall. They are either ~1 ft or ~7.5 ft tall, never 4 ft. A very wrong answer.
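Here is a quick numerical sketch of that cactus scenario (the heights below are made up for illustration), showing how the mean of a bimodal data set describes no actual observation:

```python
import statistics

# Hypothetical cactus heights in ft: a tall African population and a
# short South American population -- a clearly bimodal data set.
africa = [7.0, 7.5, 8.0, 7.2, 7.8, 8.1, 7.4]
south_america = [0.8, 1.0, 1.2, 0.9, 1.1, 1.3, 0.7]
heights = africa + south_america

mean = statistics.mean(heights)
print(round(mean, 2))  # ~4.29 ft -- a height no cactus in the sample has

# The mean sits in a gap of the distribution: every single observation
# is far from it, so "average height" is a misleading summary here.
print(min(abs(h - mean) for h in heights))
```

Adding more data from either desert only makes the estimate of this meaningless "average" more precise; it never makes it meaningful.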


There are many more counterexamples I could give, but I want to point out other situations where more data is not always better.


More data doesn't mean a better estimate, because it also depends on how complex your statistical model is. If you use nonlinear models, then more data can sometimes hurt your estimate, because you may actually introduce more noise into the system by using more data. And if your model nonlinearly amplifies that noise, then more data actually gives you a worse estimate. I did an analysis on our social media data trying to predict an election outcome, and I had to discard unstable data to reduce noise in order to get a more accurate estimate. If you are interested, please check out the following blog post AND the discussion:

     Big Data, Big Prediction? – Looking through the Predictive Window into the Future


Finally, I want to point out again that the size of the data is merely one bit of implicit information; it is included in the estimate of the data-information inequality. The entropic information I am talking about here (see my reply to Sam and my reply to Jon Davey above) is a statistical upper limit on how many questions you can answer with a data set of a certain size. That information includes how big the data set is, how many people are retweeting whom, etc., which is useful, as well as information that is not as useful, such as the fact that 3,057 people tweeted using one word, or even less useful information. As I said in my reply to Sam, regardless of whether the information is useful or not, it is information. I call relevant info signal, and irrelevant info noise. But both are information that is extractable from the data set.


OK, I think this reply is getting long, so I hope I've clarified some of the more subtle and abstract concepts that may be less familiar. But the main point of the post is simply that data is NOT the same as information, and the amount of information you can derive from any data set is always much smaller than the data set itself. If you get that much, I'm happy. I can dive deeper into the more technical topics later.



Hello Peter Mancini


Thank you for commenting and following me on Twitter.


You are absolutely right. Social media is very messy and that is definitely a challenge that I have to deal with every day. That is why before I analyze any social media data, I always have to preprocess it to make sure that I am analyzing the relevant data to maximize the accuracy of the result. That is why search and filtering technology is so important.


And good point about consistency. That is precisely the reason I had to discard old data that would have introduced more noise into my election forecast model and ultimately reduced the accuracy of the forecast (see Big Data, Big Prediction? – Looking through the Predictive Window into the Future and the discussion under that article).


We are definitely thinking along the same lines here.

Thanks again, and see you next time. In the meantime, feel free to check out the other big data articles!




What other Big Data fallacies do you see? Perennial favorites from client work include:

  1. Because technologists can collect at increasingly lower unit cost, businesses should fund the speculative accumulation of more data en masse (i.e., covering the largest scope possible, and not merely to establish a defined research set);
  2. Enabled by enough observations, each comprised of enough data elements, the business (i.e., a proxy for that intelligence capable of converting 'Data' into valuable 'Information') will divine the best analytics ex post facto;
  3. Datum combined with like datum and transformed into information are (always) more valuable;
  4. Since today's technologies only produce data of type X, the fact that a decision-making need actually requires analyses of the tuple (X, Y, Z) means we should collect all the X we can and make the best of it;
  5. Within the confines of a single organization's decision-making, one needs similar volumes of Big Data for each information quantum.

As you no doubt sense, the above follow from an applied decision-making (e.g., decision-led, information-purposed, data-lagged) perspective. Where do your favorite fallacies tend to arise? Is that different from where the most common (or the most audacious) arise?


And I wonder if you see the same fallacies at work in analytic domains fundamentally involving data reduction (e.g., segmentation) so as to characterize across independent actors (e.g., "buying groups") as in those involving characterizing the behavior of each buying group (e.g., longitudinal models).


Kind regards,



Nice article and follow up discussion. Thanks.


How would your thoughts on big data change if one was interested in the temporal component, especially if the information had multiple time-scales? I would imagine that big data is essential to study long term evolution...  Perhaps it isn't relevant when studying social media? Or perhaps for the type of social media measures of interest, short-term stationarity assumption is sufficient? Thanks.


Hello KP,


Thanks for the comment.


Your list is very good, and I've indirectly talked about a few of these in earlier posts and/or the discussions that follow. I will just elaborate a little.

  1. Just because you can collect more data doesn't mean it makes business sense to collect it. Feel free to check out
    If 99.99% of Big Data is Irrelevant, Why Do We Need It?
  2. Although data is the enabler of analytics, analytics and data are completely separate. People develop new analytics and algorithms on fake data all the time.
  3. There are infinitely many ways to combine two numbers via mathematical transformation; only a very tiny (I mean really, really tiny) fraction of those transformations are meaningful, and an even tinier subset of those are valuable.
  4. It depends on whether there is mutual information between X and Y, and between X and Z. You can compute the mutual information I(X;Y) and I(X;Z), and if these are big enough, then maybe you can derive a statistical model to infer Y from X, and likewise Z from X. That is still a non-trivial step. If I(X;Y) and I(X;Z) are too small, then there is no way to extract any meaningful information about Y and Z, and you are out of luck.
  5. This is the topic of discussion of this post. As your data volume grows, each additional information quantum requires more and more data. Conversely, the information gained from big data is a diminishing return.
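For point 4, here is a minimal sketch of how mutual information can be computed from a joint distribution, I(X;Y) = Σ p(x,y)·log2(p(x,y)/(p(x)p(y))); the probabilities below are made up for illustration:

```python
from math import log2

# Hypothetical joint distribution p(x, y) over two binary variables.
# X and Y are correlated here, so I(X;Y) > 0.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def mutual_information(p_xy):
    """I(X;Y) in bits, from a dict mapping (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), p in p_xy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in p_xy.items() if p > 0)

print(mutual_information(p_xy))  # ~0.278 bits: X tells you something about Y

# Sanity check: I(X;Y) = 0 when p(x,y) = p(x)p(y), i.e., independence.
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(indep))  # 0.0: X tells you nothing about Y
```

If the computed value is near zero, no model, however clever, can infer Y from X; that's the "out of luck" case in point 4.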


There are many fallacies around analytics too. Being a mathematician/statistician, there are just too many for me to write about, and most would probably bore you. In general, most of these are really just misconceptions about data and inexperience with advanced statistics and machine learning. So they can definitely be considered Big Data fallacies, but they are not really specific to big data per se.


For example, one common misconception about data is that more data means a better and more accurate statistical estimate. That is not always true. I've discussed this at length, with examples, in my reply to jscranton and Cohan Sujay Carlos above. But there are literally too many of these; maybe I will write about a few more in later posts.


Thanks again for sharing your list of fallacies. 

And I hope to see you again.




Hello Esp (and KP, this might be relevant for you too)


Thank you for the nice comment and glad you like the discussion.


Great question. The temporal dimension is definitely an important aspect of data. In fact, temporal variation, such as trends and seasonality, is one of the most common ways people look at data. However, data is data, whether it is time series data or not, so the data-information inequality still holds. In fact, one of the reasons data has so much redundancy is that information about a system typically has multiple time scales, and these time scales are usually slower than our measurement time scale. As a result, we create a lot of temporal correlation, or serial correlation, in the data, which shows up as statistical redundancy.
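To illustrate the serial-correlation point with a toy example: sample a slowly varying signal much faster than it changes, and consecutive samples become almost perfectly correlated, i.e., highly redundant:

```python
import math

# A slow signal (period 200 samples) measured at every time step:
# the measurement time scale is much faster than the signal's time scale.
signal = [math.sin(2 * math.pi * t / 200) for t in range(1000)]

def lag1_autocorr(xs):
    """Lag-1 serial correlation: how much each sample echoes the previous one."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[t] - mean) * (xs[t + 1] - mean) for t in range(n - 1))
    return cov / var

# Close to 1: each new sample carries almost no new information.
print(lag1_autocorr(signal))
```

A lag-1 correlation near 1 means each additional sample is nearly a copy of the previous one: lots of extra data, very little extra information.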


Big data is not necessarily for studying long-term evolution. It can be used over a short time but with a large population, as I did when predicting election outcomes. However, it is true that sometimes a short-term stationarity assumption is sufficient, and that depends on what problem you are trying to address. But that has nothing to do with how big your data set is. Even over a short time, you can still get a huge amount of data if your sampling or measurement frequency is high enough. Maybe you can take a look at an earlier post: Why is Big Data So Big?


Anyway, I hope this will give you some perspective.

I will discuss the issue about temporal correlation in a later post. So I hope to see you then!


Dr. Wu, Thank you for the prompt response. Analysis of social media is an interesting domain, and there's much to be learned. -- Esp


Hello Esp,


No problem.

Yes, social analytics is definitely a very interesting area. I am learning myself too.

I simply bring to the table some of my more rigorous background in math and stats, which seems to be missing in most of the social analytics discussions.


Thank you a lot, Mike, for your post, and welcome back! Most e-commerce businesses today only have access to transactional data (purchase history, contact details, delivery address). What do you think will happen if they add social features to their e-commerce web sites (social shopping) to enable user engagement and social interactions with products? No doubt this will increase the amount of data (Big Data), but do you think that more data (in terms of social data) will generate more revenue?


Hello Hans,


Thank you for continuing to be a loyal reader, and thanks for asking this excellent question.


The short answer is yes. Big data in the context you describe definitely has the potential to drive more revenue. But just having big data alone, that is, simply capturing the social interactions and conversations on the e-commerce site, is not sufficient. It depends on what you do with the data.


I can immediately think of several ways to use the combination of transaction + social interaction data that are extremely valuable. But there are probably a lot more:


  1. Business can figure out what kind of social interaction drives purchase: Since e-commerce now have both the transaction + social interaction data, they can definitely do a simple linear regression to figure out which social interaction, or what aspects of a social interaction drives purchase. Then they can strategically drive those interactions, which is most effective at driving purchase.
  2. They can predict market interest and demand: Since e-commerce would have data on how much conversation took place around each of their product (from the social interaction), they can use this data to plan and adjust their supply chain to meet the market demand.  Ultimately, this leads to better fulfillment and faster delivery of product/service to the customer.
  3. They can use the data to understand why customer purchase a product to help future product development: Since e-commerce also have access to the product specs data, they can combine this data with the social conversation data. Specifically, they can do a categorical regression or logistics regression to figure out what features or aspects of their product do consumer like and/or dislike, liked enough to purchase, etc. Then they can use these results to help them design and develop future product that consumers want, and want enough to buy.
  4. They can use conversation data to optimize their SEO: By analyzing the social conversation around their products, they can figure out the linguistics profile that people use to talk about their products. This can be done via word histogram analysis, frequent word analysis (minus all the stop words and product entities), and other topic modeling analysis. Then they can use these result to purchase adwords that can optimally drive traffic to the e-commerce site.
  5. They can use the data to segment their target audience: If the transaction data in their CRM includes the customer’s demographic data, then they can use SSO (single sign-on) to match users who participate in social conversation with their customer ID. This would enable e-commerce to segment their customers to understand their target audience. Then they can use this result to plan future marketing campaign and target their most valuable customers.
  6. They can also use the social components of their e-commerce site to support their customers and drive WOM: This is not specific to the use of big data, but if you enable customers to support each other on your e-commerce site, you can also cut support costs through call deflection. And it certainly generates more data, which can be used by consumers as well as the company. Moreover, users helping each other through peer-to-peer support often leads to very effective WOM recommendations. You can use the data that’s generated to predict which products consumers recommend, which products tend to require more support, etc. This can help you staff accordingly to avoid poor service and a bad user experience, and it can also help you drive effective WOM campaigns.
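As a small illustration of the frequent-word analysis mentioned in item 4, here is a minimal sketch; the conversation snippets and tiny stop-word list are made up for the example (a real analysis would use a fuller stop-word list and the product entities from your catalog):

```python
from collections import Counter
import re

# Hypothetical social-conversation snippets about a product
conversations = [
    "The battery life on this camera is amazing",
    "Amazing zoom, but the battery drains fast",
    "Great camera for the price, battery could be better",
]

# A tiny stop-word list, just for illustration
stop_words = {"the", "on", "this", "is", "but", "for", "a", "be", "could"}

tokens = []
for text in conversations:
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in stop_words:
            tokens.append(word)

# The most frequent terms approximate the language customers actually use,
# which can inform which adwords to purchase
print(Counter(tokens).most_common(3))
```

On this toy data, "battery" surfaces as the dominant term of conversation, even though the merchant might have expected "zoom" or "price" to lead.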


There are probably more, but this is a really great question, Hans.

Thank you for asking.

And I hope to see you again next time.


I really enjoyed the post and the discussion. Will keep coming back!


I think your argument works well in a "big data versus very big data" scenario, but our typical scenario is more one of "some data versus big data," and here I would disagree with the point you are making.


You are making a point about the diminishing returns of data, but I think this only holds in cases where your problem has long been fixed and you **already have a huge amount of data**. Take your camera example: in some cases you have 100 cameras and get zero information because none of them shows you the angle that you want. Then you add the 101st camera and BOOM, you get the information that you wanted: a huge increase in returns.


I agree that the probability of this gets lower as the number of cameras grows, but only after you have many millions of cameras... Another way to put it: some things cannot be measured until you reach a critical mass of data. In this case your inequality holds horribly true, because basically information = 0 until you reach a sufficient amount of data; then it makes a jump that may be worth millions in revenue!


Hugo Zaragoza
Director of Data Science

Hello Hugo,


Thank you for your interest and the nice comment.

I’ve edited it only for formatting, since the paragraph linefeeds did not appear to show.


I believe there is some misconception about what information really is. If you have been following the discussions (I know there are quite a few), you might have missed my reply to Sam that describes how information is defined in information theory. That is the more rigorous and objective way to look at information and data. I really recommend you take a look at that reply, and also an earlier blog post I wrote that distinguishes the information you want from the information you don’t want: If 99.99% of Big Data is Irrelevant, Why Do We Need It?


So let me reiterate: information has nothing to do with whether it is relevant to you or not. That is the distinction between signal (information that you want) and noise (information that you don’t want; if you don’t get this, please see: If 99.99% of Big Data is Irrelevant, Why Do We Need It?). That is, the amount of information in data has nothing to do with whether it is information you want. Even if the data contains nothing you want, that doesn’t mean it has no information, because someone else may find some useful information in it. And even if the data contains no information that anyone wants today, it still doesn’t mean it has no information, because someone in the future may find something useful in it.


A data set will almost always contain some information, whether it is the information you want or not. It may be a very small amount of information (1 bit), or a lot. And that amount of information is an absolute quantity, measurable and computable from statistics and information theory. It is the absolute maximum amount of information that you or anyone can extract from the data by any means possible. This is sometimes referred to as the entropic information.
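To make entropic information concrete, here is a minimal sketch (my own illustration, with a made-up byte string) of the empirical Shannon entropy of data. Appending an exact backup copy doubles the data volume but leaves the per-byte entropy unchanged, echoing the backup example from the original post:

```python
import math
from collections import Counter

def entropy_per_byte(data: bytes) -> float:
    """Empirical Shannon entropy in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

original = b"hello world, this is my hard drive"
backup = original + original  # twice the data volume

# Per-byte entropy is identical: the backup adds volume but no information
print(entropy_per_byte(original), entropy_per_byte(backup))
```

Doubling the data leaves every symbol frequency ratio unchanged, so the entropy per byte, and hence the distinct information the data can yield, is exactly the same.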


Now, back to your argument concerning the cameras. If you have 100 cameras and none of them has the angle you want, the data just didn’t contain your signal (the info you want); to you it is all just noise (info you don’t want). Then you add one camera and you get the info you want, your signal (which could very well be someone else’s noise), and that info is worth $1M to you. Great! So you’ve extracted $1M of value from the data recorded by 101 cameras. Now what if your boss asks you to derive $2M of value? Will 202 cameras be sufficient to help you derive that $2M? I doubt it. Even if you got lucky, or you are super smart and were able to derive $1M of value from the data in just one camera or from some data set of fixed size, ask yourself: how much more data would you need to derive $2M of value? That is what I mean when I say diminishing return.


So the law of diminishing information return on data volume still holds. In fact, it will always hold for any data set, because it is a proven theorem in statistics. That means it is true under the rigor of mathematical logic. It is as true as the Pythagorean Theorem, in that it is absolute. It simply holds even more strongly in cases where the problem has long been fixed, as you described.
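This is not the theorem itself, but a quick simulation can give the intuition: drawing repeatedly from a fixed set of possible events, the fraction of draws that reveal anything new shrinks as the data volume grows (the world size and checkpoints here are arbitrary choices for the sketch):

```python
import random

random.seed(42)

# Draws from a fixed "world" of 1,000 possible events: as data volume
# grows, the share of draws that reveal something new keeps shrinking.
world = range(1000)
seen = set()
checkpoints = [100, 1_000, 10_000, 100_000]
draws = 0
for target in checkpoints:
    while draws < target:
        seen.add(random.choice(world))
        draws += 1
    print(f"{draws:>6} data points -> {len(seen):>4} distinct events "
          f"({len(seen) / draws:.3f} per data point)")
```

Early on, nearly every data point teaches you something new; by the last checkpoint, the novelty per data point has collapsed by roughly two orders of magnitude, which is the diminishing-return shape the post describes.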


I agree that some things cannot be measured until you reach a certain critical mass, but that quantity is signal, NOT information. Noise is information too; it’s just information that you don’t want, so it has no value to you, or is irrelevant to the problem you are trying to address. But it is still information, and it may be valuable to your competitor or some random person at some random time. That information is in the data set whether you like it or not.


The discussion of what is signal and what is noise is no doubt subjective. And for a long time, mathematicians and statisticians believed that information was also a somewhat subjective concept. But that is NOT true anymore, not after Claude Shannon invented information theory, which objectified the computation and discussion of data and information and made it as rigorous as other disciplines in science and engineering. That is a great triumph of the human mind, so we should learn it and move the field forward.


From the discussion and the questions that arose, it appears that a lot of people still don’t understand what information really means. So I think I will write a bit more about that next time. In the meantime, I hope this clarifies some of the confusion, so you can understand this post more fully.

I certainly hope to see you again next time with more great in-depth discussions.


