How Much Information is in Your Big Data and How Can You Measure It?
Last time, I explained the subtle distinctions between data and information. I also looked at 3 examples illustrating the data-information inequality, which in most realistic scenarios (especially for big data) is: information << data.
Because data is only as valuable as the information and insights we can extract from it, it is easy to understand why people often exaggerate the value of big data. Naively, most people simply equate the amount of data with the amount of information, so big data would mean lots of information and would therefore be extremely valuable. However, this naive assumption is wrong! The data-information inequality tells us that the amount of information (the valuable stuff) is usually much smaller than the sheer volume of big data.
Today, we will take a more quantitative look at the data-information inequality. So I’d recommend doing a quick read of the prequel to this post if you missed it last time.
If you stare at the data-information inequality long enough, you may start to wonder how we measure information. In order for this inequality to hold, we must be able to quantify the amount of data as well as the amount of information and compare them. It is pretty straightforward to measure the amount of data you have, because that is just the storage volume of the data. If you are really dealing with big data, then the data would be on the order of hundreds or thousands of terabytes. But how can you quantify how much actual “information” is in the data you have?
Let me first clarify that the amount of information in a data set has nothing to do with whether that information is what you want. Saying a data set has no information when it doesn’t contain exactly the information you need is a little like saying “I can’t find my fish in the ocean, so there are no fish in the ocean.” That would be a very self-centered and narrow view of information, and frankly, plain wrong. So just because a data set doesn’t have the information you want, it doesn’t mean this data set has no information.
The key to measuring information lies in the redundant nature of most data sets, because information is only the non-redundant portions of the data. So if we can remove ALL the redundancy in a data set, then what’s left should be the non-redundant portion of the data (i.e. information). The challenge, of course, is how do we identify ALL the redundancy within a data set?
This is a nontrivial problem, because statistical redundancy (a.k.a. entropic redundancy) is not the same as duplicate data. Duplicate data is only the most obvious kind of redundancy; there are infinitely many ways that data can be redundant. For example, the facts that I am Asian and that I have black hair are statistically redundant, because one fact is not completely independent of the other. There are many ways that statistical redundancy can arise in data; some are far from obvious, and some are even cryptic. You see the challenge?
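To make "statistically redundant" a little more concrete, here is a minimal Python sketch that measures the overlap between two yes/no facts using Shannon entropy. The joint probabilities are made up purely for illustration (they are not real statistics about anyone), and the names `entropy_bits`, `joint`, and `redundancy_bits` are hypothetical helpers for this sketch only.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical joint distribution over two yes/no facts A and B.
# ASSUMPTION: these numbers are invented; they only need to sum to 1 and
# make the two facts agree most of the time.
joint = {
    (1, 1): 0.45,
    (1, 0): 0.05,
    (0, 1): 0.05,
    (0, 0): 0.45,
}

# Marginal distributions of each fact taken on its own.
p_a = [sum(p for (a, _), p in joint.items() if a == v) for v in (0, 1)]
p_b = [sum(p for (_, b), p in joint.items() if b == v) for v in (0, 1)]

h_a = entropy_bits(p_a)                 # information in fact A alone
h_b = entropy_bits(p_b)                 # information in fact B alone
h_joint = entropy_bits(joint.values())  # information in both facts together

# The overlap (mutual information) is what is counted twice
# if the two facts are stored separately.
redundancy_bits = h_a + h_b - h_joint

print(f"H(A) = {h_a:.3f} bits, H(B) = {h_b:.3f} bits")
print(f"H(A, B) = {h_joint:.3f} bits")
print(f"redundancy = {redundancy_bits:.3f} bits")
```

With these invented numbers, storing both facts costs 2 bits of data but carries only about 1.47 bits of information; the remaining half bit or so is redundancy, even though neither fact is a duplicate of the other.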
Information and Data Compression
Recall from my previous post that any data (especially big data) has lots of built-in redundancies. Compression algorithms often leverage the redundancy in the data to compress it. A lossless compression algorithm works by removing the redundancy in data and storing only the non-redundant parts (which are usually much smaller), so the size of the compressed data file is reduced. However, the compressed data file has all the non-redundant data, so the remaining (redundant) data can be reconstructed to recover the original data in its uncompressed form.
If you have an image (say 1000 x 1000 pixels) of a blue sky (Figure 1a), how much data is it? Well, you can download Figure 1a and let your operating system tell you its file size, or you can reason with me here. This image has 1000 x 1000 = 1 million pixels, each having 3 color channels (red, green and blue). Since each channel is 8 bits deep (i.e. 1 byte), that is a total of 3 million bytes of data. Since 1 MB = 1024 x 1024 = 1,048,576 bytes, this image would be 3,000,000 ÷ 1,048,576 = 2.86 MB if it’s stored as a raw bitmap. However, it compresses very well (via PNG compression) to a fairly small file size of 940 kB (Figure 1b) because there is a lot of redundancy from one pixel to the next (i.e. most of the pixels are blue).
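If you would rather let the computer do the arithmetic, the raw-bitmap calculation above can be sketched in a few lines of Python. The function name `raw_bitmap_bytes` and its parameters are hypothetical, chosen just for this sketch; it only counts pixel data and ignores any file header a real bitmap format would add.

```python
BYTES_PER_MB = 1024 * 1024  # 1 MB = 1,048,576 bytes, as used in the text

def raw_bitmap_bytes(width, height, channels=3, bits_per_channel=8):
    """Pixel data size of an uncompressed image, in bytes (no file header)."""
    return width * height * channels * bits_per_channel // 8

raw_bytes = raw_bitmap_bytes(1000, 1000)
raw_mb = raw_bytes / BYTES_PER_MB

print(f"{raw_bytes} bytes = {raw_mb:.2f} MB")  # prints "3000000 bytes = 2.86 MB"
```

Note that the result depends only on the image dimensions and bit depth, not on what the image shows.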
Now, if you have another image (same dimensions) of a city scene that is very complex with a lot of buildings and people in it (Figure 1c), how much data does it contain? Since this image is exactly the same size as Figure 1a, it contains exactly 2.86 MB of data (same as Figure 1a). Don’t believe it? Download Figure 1a and 1c and see what your operating system tells you. Now, since Figure 1c has less redundancy from one pixel to the next (i.e. the pixels have many different colors), it won’t compress as well (via PNG). Consequently, the compressed size of the city scene is 1.73 MB (Figure 1d), much larger than the blue sky. This is a reflection of the fact that the city scene actually contains more information (less redundant data) than the blue sky.
As you can see, the compressed size of an image is a pretty good proxy for the amount of information in the image. The amazing fact is that this principle applies to all data (not just image data). Even though this is only a proxy, it is often good enough to estimate the amount of information in a data set. Since most compression algorithms are not optimal, they leave some redundancy behind, so the compressed file size is actually a conservative estimate (an upper bound) of the extractable information within any data set.
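You can try the blue sky vs. city scene comparison yourself without any image files. The sketch below is an illustration under stated assumptions, not a reproduction of Figure 1: it uses small synthetic stand-ins (a single-color "sky" and seeded pseudo-random bytes as an extreme "busy scene") and Python's standard `zlib` module, which implements DEFLATE, the same compression method PNG uses. The helper name `compressed_size` is made up for this sketch.

```python
import random
import zlib

# Small synthetic stand-ins, NOT the 1000 x 1000 images from Figure 1.
WIDTH, HEIGHT, CHANNELS = 200, 200, 3
N_BYTES = WIDTH * HEIGHT * CHANNELS

# "Blue sky": every pixel has the same (R, G, B) value, so it is highly redundant.
flat_image = bytes([70, 130, 230]) * (WIDTH * HEIGHT)

# "Busy scene": seeded pseudo-random bytes, an extreme case that looks
# non-redundant to a general-purpose compressor.
rng = random.Random(0)
noisy_image = bytes(rng.getrandbits(8) for _ in range(N_BYTES))

def compressed_size(data: bytes) -> int:
    """Size in bytes after lossless zlib (DEFLATE) compression at the highest level."""
    return len(zlib.compress(data, 9))

flat_size = compressed_size(flat_image)
noisy_size = compressed_size(noisy_image)

# Lossless means the original bytes come back exactly.
roundtrip_ok = zlib.decompress(zlib.compress(flat_image, 9)) == flat_image

print(f"raw size (both images): {N_BYTES} bytes")
print(f"flat image compressed:  {flat_size} bytes")
print(f"noisy image compressed: {noisy_size} bytes")
print(f"lossless round trip:    {roundtrip_ok}")
```

Both inputs hold exactly the same amount of data, yet the flat one shrinks to a tiny fraction of its size while the noisy one barely shrinks at all. As a side note, the noisy bytes come from a seeded generator, so in principle a few lines of code describe them completely; zlib cannot see that, which is one reason the compressed size is only an upper-bound proxy.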
So to measure how much information you can derive from your big data, you can simply take your most powerful lossless compression algorithm and compress your big data to its smallest possible size. You can even try several lossless algorithms, because different compression algorithms remove different kinds of redundancy, and keep the smallest result. Since a lossless algorithm will not lose any information, the compressed size is the maximum amount of information you can possibly extract from your big data. Isn’t that cool?
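Here is one way the "try several compressors and keep the smallest" idea might look in Python, using only the standard-library `zlib`, `bz2`, and `lzma` modules. This is a minimal sketch, not a recommendation of particular algorithms; the function names `compressed_sizes` and `estimate_information_bytes` and the sample text are hypothetical.

```python
import bz2
import lzma
import zlib

def compressed_sizes(data: bytes) -> dict:
    """Compressed size in bytes under several lossless compressors."""
    return {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }

def estimate_information_bytes(data: bytes):
    """Return (compressor name, size) for the smallest compressed result.

    The smallest size is the tightest upper bound on information content
    that these particular compressors can give us.
    """
    sizes = compressed_sizes(data)
    best = min(sizes, key=sizes.get)
    return best, sizes[best]

# ASSUMPTION: a deliberately repetitive sample, standing in for real data.
sample = b"the quick brown fox jumps over the lazy dog. " * 2000

sizes = compressed_sizes(sample)
best_name, best_size = estimate_information_bytes(sample)

print(f"raw data: {len(sample)} bytes")
for name, size in sizes.items():
    print(f"{name:>5}: {size} bytes")
print(f"best estimate: {best_size} bytes (via {best_name})")
```

For real big data you would stream the input through compressor objects rather than load it all into memory, but the principle is the same: the smallest lossless result you can achieve is your estimate.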
I bet you wouldn’t think of using compression algorithms to measure the amount of information in a data set. I certainly wouldn’t. And that is the genius of Claude Shannon, who invented information theory, which enabled the entire mobile industry. And he did that by noticing this bizarre connection between compression and information content.
An Intuition for the Data-Information Inequality
Now that we can use compressed size to measure the amount of information that is extractable from a data set, let’s try to get some intuition about this measure. If I have 1 kB of information, how much information (not data) is that? We have a pretty good intuition about how much data 1 kB is. It’s not much: since each text character is 1 byte, 1 kB is only 1024 characters, roughly 150+ words, or 1 paragraph of text. But we have little intuition for how much 1 kB of information is. Remember: data ≠ information.
Surprisingly, 1 kB of information is actually quite a bit of information, literally, because a single piece of information is one independent yes-or-no fact, or 1 bit of information. So 1 kB = (1024 bytes) x (8 bits/byte) = 8192 bits of information. That is actually a lot of information. I don’t even know if I can write down 8192 facts about myself, let alone independent facts (i.e. facts that can’t be inferred from other facts). For example, my birthday and my age are not independent facts, even though they are two different facts about me (clearly, one can infer my age from my birthday).
If you have 1 kB of information (not data) that is supposed to help you make a binary decision between buying an iPhone or an Android (see Big Data Analytics: Reducing Zettabytes of Data Down to a Few Bits), then 1 kB of information may be 8192 independent facts about these smart phones or 8192 independent comparisons of each smart phone’s features. I bet you can’t even name 100 independent reasons why you got your smart phone. So don’t underestimate a mere 1 kB of information.
Is big data valuable? Yes, definitely. But its value is highly exaggerated, because the amount of entropic redundancy in data is huge, so the amount of information we can extract from data is tiny by comparison. We can measure the total amount of extractable information (through any means) via compression, and information is usually many orders of magnitude smaller than the raw data. This is not a trivial connection, and one that has enlightened even the greatest minds of this century.
However, information is measured on a totally different scale than data, even though it is measured in the same units as data (i.e. in bits, bytes, kB, MB, etc.). For example, 1 kB of data is nothing in terms of modern day storage, but 1 kB of pure information is actually a lot, more than most human brains can handle. The sheer volume of raw data that is needed to derive 1 kB of information may be up in the range of gigabytes. This difference is like a pound of steel vs. a pound of cotton: they weigh the same (both 1 lb), but it’s going to take a lot more cotton (by volume) to weigh 1 lb.
Alright, I hope this post gives you a little more appreciation for the depth behind the data-information inequality: information << data. Yet, it is simple and elegant, like Einstein’s famous mass-energy equation: E = mc². Next time we’ll try to understand why there is so much redundancy in data (any data). So stay tuned!
Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.