Science of Social blog

How Much Information is in Your Big Data and How Can You Measure It?

By MikeW


by Lithium Guru ‎09-18-2012 09:30 AM - edited ‎02-01-2013 11:57 AM

Last time, I explained the subtle distinctions between data and information. I also looked at 3 examples illustrating the data-information inequality, which in most realistic scenarios (especially for big data) is: information << data.

 

Since any data set is only as valuable as the information and insights we can extract from it, it is easy to understand why people often exaggerate the value of big data. Naively, most people simply equate the amount of data with the amount of information, hence big data would mean lots of information and therefore be extremely valuable. However, this naive assumption is wrong! The data-information inequality tells us that information (the valuable stuff) is usually much less than the sheer volume of big data.

 

Today, we will take a more quantitative look at the data-information inequality. So I’d recommend doing a quick read of the prequel to this post if you missed it last time.

 

Quantifying Information

If you stare at the data-information inequality long enough, you may start to wonder how we measure information. For this inequality to hold, we must be able to quantify the amount of data as well as the amount of information and compare them. It is pretty straightforward to measure the amount of data you have, because that is just the storage volume of the data. If you are really dealing with big data, then the data would be on the order of hundreds or thousands of terabytes. But how can you quantify how much actual “information” is in the data you have?

 

Let me first clarify that the amount of information in a data set has nothing to do with whether that information is what you want. Saying a data set has no information when it doesn’t contain exactly the information you need is a little like saying “I can’t find my fish in the ocean, so there is no fish in the ocean.” That would be a very self-centered and narrow view of information, and frankly, plain wrong. So just because a data set doesn’t have the information you want, it doesn’t mean this data set has no information.

 

The key to measuring information lies in the redundant nature of most data sets, because information is only the non-redundant portions of the data. So if we can remove ALL the redundancy in a data set, then what’s left should be the non-redundant portion of the data (i.e. information). The challenge, of course, is how do we identify ALL the redundancy within a data set?

 

This is a nontrivial problem, because statistical redundancy (a.k.a. entropic redundancy) is not the same as duplicate data. Duplicate data is only the most obvious kind of redundancy; there are literally infinitely many ways that data can be redundant. For example, the fact that I am Asian and the fact that I have black hair are statistically redundant, because one is not completely independent of the other. There are many ways that statistical redundancy can arise in data, some of them far from obvious, some downright cryptic. Do you see the challenge?

 

Information and Data Compression

Recall from my previous post that any data (especially big data) has lots of built-in redundancy. Compression algorithms leverage this redundancy to compress the data. A lossless compression algorithm works by removing the redundancy and storing only the non-redundant parts (which are usually much smaller), so the size of the compressed data file is reduced. However, because the compressed file retains all the non-redundant data, the redundant parts can be reconstructed from it, recovering the original data exactly in its uncompressed form.
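As a quick illustration of that lossless round trip, here is a minimal Python sketch using the standard-library zlib module (my choice of tool for illustration, not something the post prescribes): highly redundant data shrinks dramatically when compressed, and decompression recovers every byte of the original.

import zlib

# Highly redundant data: the same 10-byte phrase repeated 10,000 times.
original = b"blue sky, " * 10_000          # 100,000 bytes of raw data

compressed = zlib.compress(original, 9)    # remove the redundancy
restored = zlib.decompress(compressed)     # reconstruct it

print(len(original))    # 100000 bytes of data
print(len(compressed))  # a few hundred bytes: the non-redundant part
assert restored == original                # lossless: nothing was lost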

 

If you have an image (say 1000 x 1000 pixels) of a blue sky (Figure 1a), how much data is it? Well, you can download Figure 1a and let your operating system tell you its file size, or you can reason with me here. This image has 1000 x 1000 = 1 million pixels, each with 3 color channels (red, green, and blue). Since each channel is 8 bits deep (i.e. 1 byte), that is a total of 3 million bytes of data. Since 1 MB = 1024 x 1024 = 1,048,576 bytes, this image would be 3,000,000 ÷ 1,048,576 = 2.86 MB if it were stored as a raw bitmap. However, it compresses very well (via PNG compression) to a fairly small file size of 940 kB (Figure 1b), because there is a lot of redundancy from one pixel to the next (i.e. most of the pixels are blue).

 

Figure 1. (a) Blue sky image, raw bitmap; (b) blue sky image, PNG compressed; (c) New York City scene, raw bitmap; (d) New York City scene, PNG compressed.

 

Now, if you have another image (same dimensions) of a city scene that is very complex, with a lot of buildings and people in it (Figure 1c), how much data does it contain? Since this image is exactly the same size as Figure 1a, it contains exactly 2.86 MB of data. Don’t believe it? Download Figures 1a and 1c and see what your operating system tells you. Now, since Figure 1c has less redundancy from one pixel to the next (i.e. the pixels have many different colors), it won’t compress as well (via PNG). Consequently, the compressed size of the city scene is 1.73 MB (Figure 1d), much larger than that of the blue sky. This reflects the fact that the city scene actually contains more information (less redundant data) than the blue sky.
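If you want to reproduce the qualitative effect without the actual figures, here is a rough sketch that uses zlib on synthetic pixel buffers as a stand-in for PNG on real photos. The numbers above (2.86 MB raw, 940 kB, 1.73 MB) come from the actual images; random noise is just an extreme stand-in for a busy city scene.

import random
import zlib

# 1000 x 1000 pixels, 3 bytes (R, G, B) each = 3,000,000 bytes of raw data,
# i.e. 3,000,000 / 1,048,576 ≈ 2.86 MB for either image.
WIDTH = HEIGHT = 1000

# "Blue sky": every pixel is the same blue, so the data is highly redundant.
sky = bytes([70, 130, 220]) * (WIDTH * HEIGHT)

# "City scene" stand-in: pixel values vary wildly, so there is little redundancy.
random.seed(0)
city = bytes(random.randrange(256) for _ in range(WIDTH * HEIGHT * 3))

for name, raw in [("sky ", sky), ("city", city)]:
    packed = zlib.compress(raw, 9)
    print(f"{name}: raw {len(raw):,} bytes -> compressed {len(packed):,} bytes")

# The sky shrinks to a tiny fraction of its raw size; the noisy "city"
# barely compresses at all, because it simply contains more information.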

 

As you can see, the compressed size of an image is a pretty good proxy for the amount of information in the image. The amazing fact is that this principle applies to all data (not just image data). Even though this is only a proxy, it is often good enough to estimate the amount of information in a data set. Since most compression algorithms are not optimal, the compressed file size is actually a conservative estimate of the extractable information within any data set.

 

So to measure how much information you can derive from your big data, you can simply take your most powerful lossless compression algorithm and compress your big data down to its smallest possible size. You can even try several lossless algorithms, since different compression algorithms exploit different kinds of redundancy, and keep the smallest result. Because a lossless algorithm will not lose any information, the compressed size is the maximum amount of information you can possibly extract from your big data. Isn’t that cool?
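To make the procedure concrete, here is a minimal sketch, assuming nothing beyond Python's standard-library zlib, bz2, and lzma compressors and a hypothetical data file name of your choosing: compress the data with several lossless algorithms and take the smallest result as the upper bound on extractable information. For genuinely big data you would compress in streamed chunks rather than reading everything into memory.

import bz2
import lzma
import zlib

def max_extractable_information(data: bytes) -> int:
    """Upper bound (in bytes) on the information in `data`: the smallest
    size produced by several lossless compressors. The true information
    content could be lower still, since no compressor is optimal."""
    return min(
        len(zlib.compress(data, 9)),
        len(bz2.compress(data, 9)),
        len(lzma.compress(data, preset=9)),
    )

with open("my_big_data.bin", "rb") as f:   # hypothetical data file
    blob = f.read()

print(f"{len(blob):,} bytes of data")
print(f"{max_extractable_information(blob):,} bytes of information, at most")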

 

I bet you wouldn’t have thought of using compression algorithms to measure the amount of information in a data set. I certainly wouldn’t have. That is the genius of Claude Shannon, who invented information theory, which enabled the entire mobile industry. And he did it by noticing this bizarre connection between compression and information content.

 

An Intuition for the Data-Information Inequality

Now that we can use compressed size to measure the amount of information that is extractable from a data set, let’s try to get some intuition about this measure. If I have 1 kB of information, how much information (not data) is that? We have a pretty good intuition about how much data 1 kB is. It’s not much: since each text character is 1 byte, 1 kB is only 1024 characters, roughly 150+ words, or one paragraph of text. But we have little intuition for how much 1 kB of information is. Remember: data ≠ information.

 

Surprisingly, 1 kB of information is actually quite a bit of information, literally, because a single piece of information is one independent fact, i.e. 1 bit of information. So 1 kB = (1024 bytes) x (8 bits/byte) = 8192 bits of information. That is actually a lot. I don’t even know if I can write down 8192 facts about myself, let alone independent facts (i.e. facts that can’t be inferred from other facts). For example, my birthday and my age are not independent facts, even though they are two different facts about me (clearly, one can infer my age from my birthday).

 

If you have 1 kB of information (not data) that is supposed to help you make a binary decision between buying an iPhone and an Android (see Big Data Analytics: Reducing Zettabytes of Data Down to a Few Bits), then that 1 kB may be 8192 independent facts about these smartphones or 8192 independent comparisons of their features. I bet you can’t even name 100 independent reasons why you got your smartphone. So don’t underestimate a mere 1 kB of information.

 

Conclusion

Is big data valuable? Yes, definitely. But its value is highly exaggerated, because the amount of entropic redundancy in data is huge, so the amount of information we can extract from it is comparatively tiny. We can measure the total amount of extractable information (through any means) via compression, and that information is usually many orders of magnitude smaller than the raw data. This is not a trivial connection, and it is one that has enlightened even the greatest minds of this century.

 

However, information is measured on a totally different scale than data, even though it is measured in the same units (i.e. bits, bytes, kB, MB, etc.). For example, 1 kB of data is nothing in terms of modern-day storage, but 1 kB of pure information is actually a lot, more than most human brains can handle. The sheer volume of raw data needed to derive 1 kB of information may be up in the range of gigabytes. The difference is like a pound of steel vs. a pound of cotton: they weigh the same (both 1 lb), but it takes a lot more cotton (by volume) to weigh 1 lb.

 

Alright, I hope this post gives you a little more appreciation for the depth behind the data-information inequality: information << data. Yet it is simple and elegant, like Einstein’s famous mass-energy equation, E = mc². Next time we’ll try to understand why there is so much redundancy in data (any data). So stay tuned!

  


 

Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.

 

Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.

 

Comments
by Larry Irons(anon) on ‎09-18-2012 01:51 PM

Hi Michael,

 

I was following this series with some degree of puzzlement until this post. Once I read it I suddenly remembered why your approach seems to miss some important aspects of the distinction between information and data, even though it may in fact be an accurate depiction of the quantitative relationship between the two constructs. I pulled my copy of Michael Dertouzos and Joel Moses' edited collection, "The Computer Age: A Twenty-Year View" from 1980 off the shelf and located Daniel Bell's essay, The Information Society. In it Bell asserts the following:

 

"However true it may be as a statistical concept that information is a quantity, in it broadest sense - to distinquish between information and fabrication - information is a pattern or design that rearranges data for instrumental purposes, while knowledge is the set of reasoned judgements that evaluates the adequacy of the pattern for the purposes for which information is designed. Information is thus pattern recognition subject to reorganization by the knower, in accordance with specified purposes . What is common to this and all intellectual enterprises is the concept of relevant structure" (p. 171).

 

In other words, information is more than a statistical quantity even though one can precisely characterize it as such.
 Bowker and Star probably characterized this paradoxical quality of information best in "Sorting Things Out" when they noted that, "One person's noise may be another's signal or two people may agree to attend to something, but it is the tension between contexts that actually creates representation...This multiplicity is primary, not accidental nor incidental" (p. 291).

 

Anyway, I just thought I'd throw in a somewhat different angle on the dialogue you've started.

by Lithium Guru ‎09-18-2012 10:30 PM - edited ‎09-18-2012 10:44 PM

Hello Larry,

 

Thank you for the comment + all the retweets.

 

I really appreciate you taking the time to offer a somewhat different angle on this dialogue. But, believe it or not, it is actually the same angle. I just haven’t had the time and space to fully explain how data, information, interpretation, signal, and noise all work together yet. But it will come. Eventually the theory will present itself, beautifully and elegantly, and it will clear up the confusion you may have now.

 

But let me attempt to clarify the missing link between these different points of view. I believe the problem is that information is an overused term. There are actually many different kinds of information. First, the entropic information (the kind of information I am talking about here) is ALL the information that anyone can possibly get out of a data set. A subset of this is interpretable information, which humans can interpret and understand; the rest is uninterpretable information, which humans can’t understand.

 

An example of uninterpretable data: 10 20 25 38 54 48 23 47 63 85 24 39 59 72. Do you know what these numbers mean? I doubt it. I don’t know either. They could represent the number of retweets I got each day, or the number of red cars I saw each week, or whatever. They are data, and they do contain some information, but we can’t interpret the information in this little data set.

 

The information that Bell talks about is only the interpretable information. Within the interpretable information, there is signal (the relevant information, the info you want) and there is noise (the irrelevant information, the info you don’t want). All of these are information, and they are subsets of the total amount of extractable information (i.e. the entropic information).

 

Interpretable information is subject to interpretation by individuals, and hence “One person’s noise may be another’s signal,” as you mentioned. That is a classic concept that is very well known, not novel at all. Moreover, I already talked about it in a previous post (see If 99.99% of Big Data is Irrelevant, Why Do We Need It?). Entropic information includes all the possible contexts one can switch to and all the different interpretations a data set can have. This may seem like an infinite set, but it is not. Entropic information is well defined and quantifiable.

 

So you see, Bell’s angle is really the same angle. He’s just talking about an even smaller subset of the entropic information that I am talking about. The problem is that getting from data to the actually useful information that can help people make decisions requires many data reduction steps. The result of each step is a smaller subset of the original data. The useful data at the end is usually not that big at all.

 

So really, the picture looks like:

 

Data >> [ entropic information (total extractable information) ]

            = [ uninterpretable info + interpretable info ] 

          >> [ interpretable info ]

            = [ signal (relevant info) + noise (irrelevant info) ]

          >> signal (relevant info, info you want, info you need, context specific, etc.)

 

And only the signal (the relevant info) is context dependent, because what is relevant to me may be very different from what's relevant to you. Moreover, what's relevant to me yesterday may be very different from what will be relevant to me tomorrow, because between yesterday and tomorrow the context can change, even within the same person.

 

See the bigger picture? I certainly hope I clarify more than I confuse.

If this is still unclear, please let me know and I’m happy to discuss in greater depth.

Thanks for the conversation and I hope to see you again next time.

 

by Kane See(anon) on ‎09-20-2012 11:57 AM - last edited on ‎09-20-2012 05:37 PM by Lithium Guru

This is a great article! 


And I would agree that the value of data is highly exaggerated. Who can really make any sense of so much data anyway? 


I believe the real value is in the information that is extracted, condensed and sometimes even augmented through analysis from the Big Data. It's something that my company tries to do through record linkage, entity consolidation and social network analysis. 


I think it's inevitable that as we encounter a greater amount of web-scale data, companies will need a way to make sense of it all without going through each of its billions of records and data points individually.

 

by Jan-Kees(anon) on ‎09-20-2012 03:20 PM - last edited on ‎09-20-2012 05:40 PM by Lithium Guru

Great article and series.

 

Ultimately it is about finding signals. Something that humans have been great at for eons.

 

I think in our modern lives we are just struggling with our own newly created environment of data, data everywhere. The San men of southern Africa have no trouble reading very subtle 'signals' to help them catch that one rare prey animal, while the women can trace edible roots by following other kinds of signals, and in both cases these cues are so subtle that we from the West cannot notice them. Most of us would die within a very short time if left on our own in the Kalahari.

 

The interesting part is that we create our own Kalahari, one of data, which unlike the desert is always in a state of flux. Our activities create our own noise; where are the signals, and how do we read them? The human brain evolved powerful signal-to-noise filtering using visual and auditory cues, against a backdrop of olfactory cues (pheromones travel reasonable distances) plus touch and taste at close range. So how do we navigate a data world constructed by ourselves?

 

Of the senses, most of the work is left to vision, maybe some to hearing, but the other senses are not served by data signals in any way. Now add that the visual cortex is actually not stimulated much at all: maybe a table or data visualization with 400 data points at a time. We read and process text and numbers not in the visual cortex but in a very small section of the brain, and in that section most processing happens sequentially.

 

So is it thinkable that we can get the brain to process data in a fashion as parallel as the processing of the visual impressions that come to our kinsmen walking the Kalahari desert?

 

by Lithium Guru ‎09-20-2012 05:36 PM - edited ‎09-20-2012 05:37 PM

Hello Kane,

 

First, glad you like this post, and thank you for taking the time to comment on my blog. I only edited it to put in some line feeds.

You are right on. In later posts, I will try to cover some of the most common ways to make sense of large amounts of web-scale data.

 

The key is really just data reduction. When you do any kind of aggregation, you are taking a large amount of data and reducing it down to a small summary. For example, when you count the number of entries in your data set, you are taking a lot of numbers and producing one number. That number may be very big, in the billions, but it is still one number. The same goes for averaging or computing other statistics from the data. It is through these data reductions that you interpret and get information from the data you have.
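Here is a tiny Python sketch of that reduction, with made-up numbers standing in for a real data set: a million raw values collapse into a handful of summary statistics that a human can actually interpret.

import random
import statistics

# Stand-in for a large raw data set: one numeric value per record.
random.seed(42)
records = [random.gauss(100, 15) for _ in range(1_000_000)]

# Data reduction: a million values collapse into a few summary numbers.
summary = {
    "count": len(records),
    "mean": statistics.fmean(records),
    "stdev": statistics.pstdev(records),
    "min": min(records),
    "max": max(records),
}
print(summary)  # megabytes of raw numbers reduced to five interpretable figures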

 

Raw data, especially big data, is really not for human consumption unless it is digested by analysis, models, and visualization.


I hope you will share some insights from your company when I get to the later posts on data reduction.

Thanks again for commenting here.

And I hope to see you again on Lithosphere.

 

by Lithium Guru on ‎09-20-2012 06:01 PM

Hello Jan-Kees

 

Thank you for the comment, I only edited it to put in some line feeds to make it more readable.

 

Thank you for pointing out that our brain evolved to process data and pick out signals from the noise. In fact, I was a biophysicist during my PhD days, and my thesis focused on modeling how the human brain processes visual information. So I can definitely appreciate your comment.

 

Even though our brain did not evolve to process the digital data that we create, it is still a superb nonlinear processor for data that reflects the statistics of nature. But we are very bad at picking out signals that do not reflect these natural statistics. For example, two white noise recordings will sound very similar to you, even though there is more difference between them than between two different sentences (e.g. "Hello there" vs. "How are you").

 

Because raw data is so indigestible, we often need analysis, data reduction, and data visualization to help us understand it. As data visualization becomes more powerful, it is definitely thinkable that we will learn to process data in a more parallel way through visual inspection.

 

The AlloSphere is a project at UC Santa Barbara that maps data to visual as well as auditory patterns, so we can not only see but also hear the data. I believe these types of data exploration tools will help us understand complex data much faster. And they will enable us to pick out signals much better and much faster from more complex data.

 

Anyway, thanks for the comment and see you around next time.

 

by Oleg Okun(anon) on ‎09-24-2012 01:03 PM - last edited on ‎09-24-2012 08:23 PM by Lithium Guru

However, one shouldn't fall into the other extreme of assuming that Small Data is enough for practical purposes, as Small Data means a shortage of information/knowledge.

 

The business value of data is not determined by its size, but rather by the amount of useful information contained in it and the ease with which this information can be extracted.

 

The compression rate alone (i.e. without other, possibly unknown factors) might not provide the complete picture of the information embedded in data. For instance, let us compare two images of the same scene: one without any noise (i.e. absolutely clean) and the other with heavy noise. In the first case, the compression rate is likely to be higher than in the second case, given that the same compression algorithm was used in both cases. However, does that mean the second image is more "informative"?

 

by Lithium Guru on ‎09-24-2012 09:10 PM

Hello Oleg,

 

Thank you for posting your comment here. Much appreciated.

 

You are right on. We definitely don't want to convey that small data is sufficient, and that was not my intent.

As you said, the value of data is not determined by size, but by the amount of useful info in the data. And all I want to emphasize is that information is not proportional to data. In fact, there is always a diminishing return. I like the fact that you pointed out that the ease of information extraction is also important. That is often overlooked.

 

Having the information is not sufficient; it must be easily extractable in order to be useful.

However, I would argue that the noise added to an image is information too. It is just not the information you want, because it's not valuable to you. But I can see that camera or CCD sensor engineers would use that noise to learn about its properties and develop signal processing algorithms to correct for their CCD sensors. It's the classic saying that one man's noise is another man's signal. You can take a look at an earlier post that discusses this issue: If 99.99% of Big Data is Irrelevant, Why Do We Need It?

But thanks for the great comment. Your point is excellent.
Thanks and see you again next time.

 

by Sushanta Pokharel(anon) on ‎09-24-2012 10:35 PM

I absolutely agree with your post, especially the part about the information in big data always being less than the data present. As you have pointed out, there is a lot of redundancy there.

However, you are only considering one aspect of big data technologies: the amount of information stored. In fact, many of the big data technologies, like Bigtable and HBase, promote high redundancy. This is done in order to reduce information retrieval time, which is another big aspect of big data in my opinion.

Consider that in relational database design, removing as much redundancy as possible to maintain system integrity and maintainability is a major focus (normalization). This also means that, in general, processing time increases as the amount of information increases. Many of the big data technologies favour precomputing results at a certain frequency and storing them. Of course, this also means an increase in redundancy, as the precomputed results do not increase the absolute amount of information present in the system; they just save you from computing it every time.

You will have to agree that as new data is put in, the amount of information generally increases, although disproportionately. Consider that even if the same user gets added to Facebook twice, information like how many accounts a user has changes.

Of course, I am assuming that by 'big data' we are talking about current big data technologies and their uses here.

by Lithium Guru on ‎09-25-2012 02:07 PM

Hello Sushanta,

 

Thank you for taking the time to post your comment here. Greatly appreciate it.

 

This is an excellent point!


You are absolutely right about the apparent inverse relationship between redundancy and information retrieval time. In fact, I was just thinking about writing a post to clarify that redundancy is not necessarily a bad thing. As you pointed out, increased redundancy can reduce the info-retrieval time.

The redundancy in data is actually a reflection of the correlations that exist in nature. If we actually removed all redundancy, the data wouldn't reflect reality anymore, and the information we extract from it might become less useful. So I'd like to re-emphasize that I am not saying redundancy is bad. But it does limit the amount of information you can extract from the data. It is just an intrinsic property of data and information that we need to understand in order to make good judgments about it and make good use of it.

As you pointed out, adding new data generally increases information, although the increase is disproportionate. In most realistic scenarios, there is always a diminishing return on the amount of data (see The Big Data Fallacy: Data ≠ Information).

Thank you again for such a great comment.

Hope to see you again next time.

 

by Mustafa M(anon) on ‎10-03-2012 05:57 AM - last edited on ‎10-03-2012 07:51 AM by Lithium Guru

Hi Michael,

 

Having managed data, both structured and unstructured, it does not come as a surprise to me that data >> information. Your blogs and articles really help add more detail to that knowledge.

 

That said, I wonder how we can really make use of this understanding? Does this revelation lead to better data gathering and storage techniques? Or better search algorithms within big data?

 

How are companies like Twitter, Facebook, and Google managing this big data explosion?

by Lithium Guru on ‎10-03-2012 08:53 AM

Hello Mustafa,

 

Thank you for taking the time to comment on my post here.

 

I'm glad to hear that you agree with and observe the data-information inequality as part of your work. The main point of this post is to help business folks and non-data-savvy people understand the difference between data and information, because there is so much hype and exaggeration around the value of big data. If you are managing data, then you are probably already making use of this understanding.

 

For example, understanding the redundancy in data will definitely help us build better storage engines. Many column stores essentially leverage the redundancy, and therefore the compressibility, of the data within a column to build more efficient storage and retrieval systems. It will certainly help with algorithms within big data too. For example, when you write Pig scripts, your data source can be compressed to reduce storage volume.
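As a rough illustration of why column stores compress so well (this is a toy sketch with made-up data, not how any particular engine is implemented), compressing the same table row-by-row versus column-by-column shows the column layout squeezing down further, because similar values end up next to each other.

import random
import zlib

# A made-up table: 100,000 rows of (country, age, score).
random.seed(1)
rows = [
    (random.choice(["US", "DE", "JP", "BR"]),
     random.randint(18, 90),
     round(random.random(), 2))
    for _ in range(100_000)
]

# Row-oriented layout: values from different columns are interleaved.
row_blob = "\n".join(f"{c},{a},{s}" for c, a, s in rows).encode()

# Column-oriented layout: each column's values are stored contiguously.
col_blob = b"\n".join(
    "\n".join(str(v) for v in column).encode() for column in zip(*rows)
)

print("row layout   :", len(zlib.compress(row_blob, 9)), "bytes compressed")
print("column layout:", len(zlib.compress(col_blob, 9)), "bytes compressed")
# The column layout usually compresses further, because similar values
# (all country codes, then all ages, then all scores) sit next to each other.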

 

However, this knowledge probably won't help with your data collection mechanism, because that is more dependent on the measurement device. 

 

In terms of how companies like Twitter, Facebook, and Google manage this big data explosion, they all have different strategies depending on what they are trying to optimize. If they are trying to optimize storage and retrieval, they can compress the hell out of the data. If they are trying to optimize query time, they can sometimes even artificially introduce redundancy to achieve faster queries. But the key theme in all of their strategies is parallel processing and precomputing and storing intermediate features for further calculation.

 

I hope this helps.

Thank you for the comment and I hope to see you next time on Lithosphere.