Why Is There So Much Statistical Redundancy in Big Data?
For the last few weeks, I’ve been talking about the data-information inequality: information << data.
- The Big Data Fallacy: Data ≠ Information
- How Much Information is in Your Big Data and How Can You Measure It?
So by now you should understand that your “information volume” is so much smaller than your “data volume” because of the redundancy in your data. You should also understand that the information I am talking about here is the entropic information, which represents the maximum amount of information that anyone can derive from a data set by any means. Moreover, the statistical redundancy in a data set is much more complex than mere duplicate data. Nevertheless, this redundancy can be quantified rigorously, so the maximum amount of extractable information can be estimated through lossless compression algorithms.
However, I haven’t explained why data (any data, regardless of how it’s created, what it represents, and how it’s stored) have so much redundancy in the first place. Statistical redundancy is pervasive and can arise in countless ways. Today, we will look at three explanations that account for most of the statistical redundancy in data.
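To make the compression idea concrete, here is a minimal sketch using Python's standard `zlib` module. The sample data and the `compression_ratio` helper are illustrative assumptions, not anything from the earlier posts, and a general-purpose compressor only gives an upper bound on the entropic information, since a better compressor might find redundancy that this one misses.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more redundancy was found)."""
    return len(zlib.compress(data, 9)) / len(data)

# Highly redundant data: the same short (made-up) record repeated many times.
redundant = b"user=42,status=ok;" * 5000

# Data with essentially no redundancy: bytes from the operating system's random source.
random_bytes = os.urandom(len(redundant))

print(f"redundant data: {compression_ratio(redundant):.4f}")
print(f"random data:    {compression_ratio(random_bytes):.4f}")
```

The repeated records shrink to a tiny fraction of their original size, while the random bytes barely compress at all. Both inputs have the same data volume, but very different information volume.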
The most obvious kind of statistical redundancy is data duplication. The majority of data duplication is created by humans, so this kind of redundancy is, typically, intentional. There are many systems in nature (e.g. DNA replication) that can produce exact copies of the same stuff. However, we also expect this to be the normal behavior for these systems, so we rarely record data on these normal behaviors. Instead, we tend to capture data only when errors or anomalies occur, and these data are rarely duplicates.
Any time we press CTRL-C, we are potentially contributing to the intentional duplication of data. We do this so often that we lose track of how much data we have actually copied. And we may copy data for many reasons:
- Sharing and re-sharing content with other people: data transfer, collaboration, etc.
- Collecting, backing up, and versioning digital content through manual or automatic processes
- Publishing for mass consumption through any media
- Co-syndicating content across multiple communication and social channels
- Repeated sharing, status updates, and check-ins at different times and/or locations
Because data duplication is so obvious, it can be easily identified in data. Duplicates are also often the least interesting in terms of the insight they can provide. Most analytics around any form of duplicate data are based on counting + simple arithmetic.
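As a rough illustration of "counting + simple arithmetic", the sketch below tallies repeats in a small, made-up list of shared items. The list contents and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical log: each entry is one share of a piece of content.
shares = ["post-a", "post-b", "post-a", "post-c", "post-a", "post-b"]

counts = Counter(shares)   # how many times each item appears
total = len(shares)        # data volume (number of entries)
unique = len(counts)       # number of distinct items

# Fraction of entries that only repeat something already seen.
duplicate_fraction = (total - unique) / total

for item, n in counts.most_common():
    print(item, n)
print(f"{duplicate_fraction:.0%} of the entries are repeats")
```

Nothing more sophisticated than a tally is needed here, which is why this kind of redundancy is both the easiest to find and the least informative.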
Slow Change: Serial Correlation
The most common mechanism that creates statistical redundancy naturally in data is that our world changes slowly relative to the sampling frequency of our measurement devices. Most modern computers are running in the gigahertz range (say a modest 2 GHz). That means the internal clock is ticking about 2 billion times a second. According to the Nyquist-Shannon sampling theorem (from the same ingenious Claude Shannon we talked about last time), a device sampling at this rate is able to accurately capture events that change as fast as 1 billionth of a second (1 ns, nanosecond). That is pretty fast!
However, natural processes follow certain dynamics that are governed by the laws of physics. Consequently, measurements and data about these systems will reflect the inherent time scale of these dynamics. For example, weather typically varies on the order of days. If there is a storm, the weather may change faster (~ hours). Still, there are much faster processes, such as cellular processes (~ seconds) or human cognitive processes (~ milliseconds). If you are interested in social systems (whether they are online or offline) like me, the dynamics involved are much more complicated and there are usually multiple time scales at work, spanning a wide range from hours to months.
All these processes are very slow compared to the nanosecond-scale clocks in our measuring devices. Therefore, almost all temporal data appear to change slowly (except data from systems that have much faster temporal dynamics than the measuring devices, such as atomic, electronic, and nuclear processes). As a result, data from one time sample to the next are more or less the same (i.e. they are highly correlated in time). Hence this type of statistical redundancy is called serial correlation (or autocorrelation).
The best example to illustrate this type of redundancy is video data. If you look at two consecutive frames of video data, the images of these two frames will look very similar most of the time (unless you happen to look at a transition frame, which is a fairly low probability event). So there is a huge amount of statistical redundancy from one video frame to the next, even though the two frames are not exactly the same. Because video data has so much temporal redundancy, we rarely store video in raw format; it is almost always stored in compressed form (e.g. MPEG).
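The gap between these time scales is easy to check with a little arithmetic. The sketch below treats the 2 GHz clock as a sampling rate, as the text does, and uses the order-of-magnitude time scales mentioned above. The exact numbers in the dictionary are rough placeholders chosen for illustration.

```python
CLOCK_HZ = 2e9                     # the 2 GHz clock from the text
nyquist_hz = CLOCK_HZ / 2          # highest frequency resolvable at that sampling rate
fastest_period_s = 1 / nyquist_hz  # period of the fastest resolvable change: 1 ns

# Rough characteristic time scales in seconds (order-of-magnitude placeholders).
time_scales_s = {
    "weather (~1 day)": 86_400.0,
    "storm (~1 hour)": 3_600.0,
    "cellular process (~1 second)": 1.0,
    "cognitive process (~1 millisecond)": 1e-3,
}

# Number of clock ticks that elapse during one characteristic time of each process.
ticks_per_scale = {name: seconds * CLOCK_HZ for name, seconds in time_scales_s.items()}

print(f"fastest resolvable period: {fastest_period_s:.0e} s")
for name, ticks in ticks_per_scale.items():
    print(f"{name}: about {ticks:.0e} clock ticks")
```

Even the fastest process in this list spans millions of clock ticks, so consecutive samples taken at anything close to the device's native rate would be nearly identical.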
Because serial correlation is much harder to identify than duplication, the insights it can provide are also more profound. However, more advanced analytics, such as time series analysis, are required to extract these insights.
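One of the simplest time series diagnostics is the lag-1 autocorrelation: the correlation between each sample and the one right after it. The sketch below is a minimal illustration on synthetic data, assuming a slowly varying sine wave with a little noise as a stand-in for a "slow" natural process. The signal, the noise level, and the helper names are all assumptions made for this example.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def lag1_autocorrelation(series):
    """Correlation between the series and itself shifted by one sample."""
    return pearson(series[:-1], series[1:])

rng = random.Random(0)  # fixed seed so the run is repeatable

# A slow process sampled much faster than it changes: one cycle per 500 samples.
slow = [math.sin(2 * math.pi * t / 500) + rng.gauss(0, 0.05) for t in range(2000)]

# The same values in random order: identical histogram, but the time structure is destroyed.
shuffled = slow[:]
rng.shuffle(shuffled)

print(f"slow signal:     {lag1_autocorrelation(slow):.3f}")
print(f"shuffled signal: {lag1_autocorrelation(shuffled):.3f}")
```

The slow signal's lag-1 autocorrelation comes out close to 1, while the shuffled copy's is close to 0, even though both contain exactly the same values. The redundancy lives in the ordering, which is why it is invisible to simple duplicate counting.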
The most fundamental mechanism that creates redundancy in data is that our world is causal. Sequences of events don’t just happen randomly in nature. Regardless of whether we understand the causal mechanism or not, most things in this world follow a causal sequence, where one event leads to another.
A direct consequence of the causal world is that many apparently unrelated events that we observe (and measure) are not independent of one another. They may have some deep underlying causes that relate them in some obscure ways. The way this manifests in the data we collect is that two or more seemingly unrelated variables we measure will be correlated, because they may have a common cause.
This means that we can infer something (not everything) about one of the variables from the other, so there is some redundancy between these two variables. Again, even though the data representing these two variables may be completely different, the cross-correlation between them is sufficient to create statistical redundancy.
The best example to illustrate this type of redundancy is financial data. If we pick two completely unrelated stocks and watch them over a long period of time, we will discover that they are not completely independent. Even though they may be from completely unrelated industries, both are subject to the same economic conditions and similar market forces. These common causes will modulate both stocks similarly and make them fluctuate more coherently, and this will increase the cross-correlation between these two stock time series.
However, the exact relationship between the two stocks may still be very complex and non-linear, and we will need powerful analytics, such as nonlinear regression, to characterize the relationship between them.
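The common-cause effect can be sketched with synthetic data. In the toy model below, two "stocks" never influence each other directly; each is just a shared "market" factor plus its own independent noise. The linear model, the equal weights, and all variable names are simplifying assumptions for illustration, not a description of real markets.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical common cause: daily market-wide movement.
market = [rng.gauss(0, 1) for _ in range(n)]

# Each stock's daily return = common market factor + its own independent noise.
stock_a = [m + rng.gauss(0, 1) for m in market]
stock_b = [m + rng.gauss(0, 1) for m in market]

# A series driven by nothing the stocks share.
unrelated = [rng.gauss(0, 1) for _ in range(n)]

print(f"stock_a vs stock_b:   {pearson(stock_a, stock_b):.3f}")
print(f"stock_a vs unrelated: {pearson(stock_a, unrelated):.3f}")
```

The two stocks end up clearly correlated purely because of the shared factor, while the unrelated series shows a correlation near zero. Note that Pearson correlation only captures the linear part of a relationship, so a complex non-linear dependence would need the more powerful analytics mentioned above.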
Although statistical redundancy can be created artificially when we copy data, the main reason there is so much built-in redundancy in data is that our world is non-random. Nature operates under the laws of physics, and many events have hidden causal relationships that we may not even understand yet. People’s behaviors are not truly random either. People’s interests and tastes for good content do not change unpredictably. Hence, the content they create, consume, and share won't be completely random. Moreover, most people behave rather consistently and predictably across time and space.
As a result, the data set that reflects these non-random processes will exhibit correlation naturally. More specifically, any data set may exhibit two types of correlation:
- Serial correlation: correlation within a single variable
- Cross-correlation: correlation between several variables
Big data, being so big, will certainly have lots of data for any single variable. This will increase the probability of finding serial correlation within any one variable. Moreover, big data also has a great diversity of data types (i.e. different variables). This will increase the probability of finding cross-correlation between any pair of variables. Therefore, as big data becomes bigger and more diverse, it will also have more statistical redundancy.
Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.