Science of Social blog

Why is there so Much Statistical Redundancy in Big Data?

by Lithium Guru on 10-08-2012 01:04 PM - edited 02-01-2013 11:55 AM

For the last few weeks, I’ve been talking about the data-information inequality: information << data.

  1. The Big Data Fallacy: Data ≠ Information
  2. How Much Information is in Your Big Data and How Can You Measure It?

 

So by now you should understand that the reason your “information volume” is so much smaller than your “data volume” is the redundancy in your data. You should also understand that the information I am talking about here is entropic information, which represents the maximum amount of information that anyone can derive from a data set by any means. Moreover, the statistical redundancy in a data set is much more complex than mere duplicate data. Nevertheless, this redundancy can be quantified rigorously, so the maximum amount of extractable information can be measured through lossless compression algorithms.
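To make that idea concrete, here is a minimal sketch in Python (using only the standard library; the sample data is invented for illustration) of how lossless compression can serve as a rough yardstick: the more a data set shrinks under compression, the more statistical redundancy it contains and the less entropic information it carries per byte.

```python
import os
import zlib

def compressed_fraction(data: bytes) -> float:
    """Return compressed size / raw size: a rough proxy for information density."""
    return len(zlib.compress(data, level=9)) / len(data)

highly_redundant = b"status update " * 10_000   # heavy duplication, little information
nearly_random = os.urandom(140_000)             # random bytes are nearly incompressible

print(f"redundant data: {compressed_fraction(highly_redundant):.3f}")   # tiny fraction
print(f"random data:    {compressed_fraction(nearly_random):.3f}")      # close to 1.0
```

The compressed size is only an upper bound on the true information content, but it already shows how little of the raw data volume is genuinely informative.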

 

However, I haven’t explained why data (any data, regardless of how it’s created, what it represents, and how it’s stored) has so much redundancy in the first place. Statistical redundancy is pervasive and can arise in countless ways. Today, we will look at three explanations that account for most of the statistical redundancy in data.

 

Data Duplication

The most obvious kind of statistical redundancy is data duplication. The majority of data duplication is created by humans, so this kind of redundancy is typically intentional. There are many systems in nature (e.g. DNA replication) that produce exact copies of the same thing. However, we also expect this to be the normal behavior for these systems, so we rarely record data on these normal behaviors. Instead, we tend to capture data only when errors or anomalies occur, and those data are rarely duplicates.

 

Any time we press CTRL-C, we are potentially contributing to the intentional duplication of data. We do this so often that we rarely notice how much data we actually copy. And we may copy data for many reasons:

  1. Sharing and re-sharing content with other people: data transfer, collaboration, etc.
  2. Collecting, backing up, and version control of digital content through manual or automatic processes
  3. Publishing for mass consumption through any media
  4. Co-syndicating content across multiple communication and social channels
  5. Repeatedly sharing, updating status, or checking in at different times and/or locations

 

Because data duplication is so obvious, it is easy to identify in data. It is also the least interesting kind of redundancy in terms of the insight it can provide. Most analytics around any form of duplicate data boils down to counting + simple arithmetic, as in the sketch below.
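Here is a minimal sketch of that counting-style analytics (the sample records and the choice of SHA-1 fingerprints are purely illustrative assumptions, not any particular platform’s pipeline):

```python
import hashlib
from collections import Counter

# Hypothetical social content records, invented for illustration.
records = [
    "Check out our new whitepaper!",
    "Check out our new whitepaper!",   # re-shared verbatim
    "Check out our new whitepaper!",   # cross-posted to another channel
    "A completely different status update",
]

# Fingerprint each record, then count: a record is a duplicate whenever
# its fingerprint has already been seen.
fingerprints = Counter(hashlib.sha1(r.encode("utf-8")).hexdigest() for r in records)
duplicate_copies = sum(count - 1 for count in fingerprints.values() if count > 1)

print(f"{len(records)} records, {len(fingerprints)} unique, {duplicate_copies} redundant copies")
```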

 

Slow Change: Serial Correlation

The most common mechanism that creates statistical redundancy naturally in data has to do with the seemingly slow-changing nature of our world relative to the temporal frequency of our measurement devices. Most modern computers run in the gigahertz range (say a modest 2 GHz), which means the internal clock ticks about 2 billion times a second. According to the Nyquist-Shannon sampling theorem (from the same ingenious Claude Shannon we talked about last time), a device sampling at that rate can faithfully capture frequencies up to half the sampling rate, so this computer can accurately measure events that change as fast as every 1 billionth of a second (1 ns, a nanosecond). That is pretty fast!
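Here is that Nyquist arithmetic spelled out as a tiny sketch (using the same modest 2 GHz clock assumed above):

```python
sampling_rate_hz = 2e9                        # a modest 2 GHz internal clock
nyquist_frequency_hz = sampling_rate_hz / 2   # fastest frequency captured faithfully
fastest_resolvable_period_s = 1 / nyquist_frequency_hz

print(f"Nyquist frequency: {nyquist_frequency_hz:.0e} Hz")                # 1e+09 Hz
print(f"Fastest resolvable change: {fastest_resolvable_period_s:.0e} s")  # 1e-09 s = 1 ns
```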

 

However, natural processes follow certain dynamics that are governed by the laws of physics. Consequently, measurements and data about these systems will reflect the inherent time scale of these dynamics. For example, weather typically varies on the order of days. If there is a storm, the weather may change faster (~ hours). Still, there are much faster processes, such as cellular processes (~ seconds) or human cognitive processes (~ milliseconds). If, like me, you are interested in social systems (whether they are online or offline), the dynamics involved are much more complicated, and there are usually multiple time scales at work, spanning a wide range from hours to months.

 

All these processes are very slow compared to the nanosecond-scale clocks in our measuring devices. Therefore, almost all temporal data appears slow changing (except data from systems that have much faster temporal dynamics than the measuring devices, such as atomic, electronic, and nuclear processes). As a result, data from one time sample to the next are more or less the same (i.e. they are highly correlated in time). Hence this type of statistical redundancy is called serial correlation (or autocorrelation).
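A minimal sketch of what serial correlation looks like in practice (the slowly drifting signal below is synthetic; any real measurement sampled much faster than it changes would behave similarly):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(10_000)

# A process that drifts slowly relative to the sampling rate,
# plus a little measurement noise (values are invented).
slow_signal = np.sin(2 * np.pi * t / 5_000) + 0.01 * rng.standard_normal(t.size)

def lag1_autocorrelation(x: np.ndarray) -> float:
    """Pearson correlation between the series and itself shifted by one sample."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(f"lag-1 autocorrelation: {lag1_autocorrelation(slow_signal):.4f}")  # very close to 1
```

A lag-1 autocorrelation near 1 means each sample is almost entirely predictable from the previous one, which is exactly the redundancy a compressor (or a time series model) exploits.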

 

The best example to illustrate this type of redundancy is video data. If you look at two consecutive frames of video, the images will look very similar most of the time (unless you happen to land on a transition frame, which is a fairly low-probability event). So there is a huge amount of statistical redundancy from one video frame to the next, even though the two frames are not exactly the same. Because video data has so much temporal redundancy, it is rarely stored in raw format and is almost always stored in compressed form (e.g. MPEG).

 

Because serial correlation is much harder to identify, the insights it can provide are also more profound. However, extracting these insights requires more advanced analytics, such as time series analysis.

 

Causality: Cross-Correlation

The most fundamental mechanism that creates redundancy in data is the causal nature of our world. Sequences of events don’t just happen randomly in nature. Regardless of whether we understand the causal mechanism or not, most things in this world follow a causal sequence, where one event leads to another.

 

A direct consequence of living in a causal world is that many apparently unrelated events that we observe (and measure) are not truly independent of one another. They may have deep underlying causes that relate them in obscure ways. This manifests in the data we collect as correlation between two or more seemingly unrelated variables, because they may share a common cause.

 

This means that we can infer something (not everything) about one of the variables from the other, so there is some redundancy between these two variables. Again, even though the data representing these two variables may be completely different, the cross-correlation between them is sufficient to create statistical redundancy.

 

The best example to illustrate this type of redundancy is financial data. If we pick two completely unrelated stocks and watch them over a long period of time, we will discover that they are not completely independent. Even though they may be from completely unrelated industries, both are subject to the same economic conditions and similar market forces. These common causes modulate both stocks similarly and make them fluctuate more coherently, and this increases the cross-correlation between the two stock time series.
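Here is a minimal sketch of that common-cause effect (the “market factor,” the loadings, and the noise levels are all invented numbers, not real financial data):

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 2_500  # roughly ten years of trading days

# A single simulated "market factor" stands in for the common economic
# conditions that affect every stock.
market_factor = rng.standard_normal(n_days)

# Two stocks from unrelated industries: each has its own idiosyncratic
# noise, but both load on the same market factor.
returns_a = 0.6 * market_factor + rng.standard_normal(n_days)
returns_b = 0.6 * market_factor + rng.standard_normal(n_days)

correlation = np.corrcoef(returns_a, returns_b)[0, 1]
print(f"cross-correlation of the two return series: {correlation:.2f}")  # roughly 0.2-0.3
```

Neither series influences the other directly, yet the shared cause alone is enough to make each one partially predictable from the other.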

 

However, the exact relationship between the two stocks may still be very complex and nonlinear, and we will need more powerful analytics, such as nonlinear regression, to characterize it.

 

Conclusion

Although statistical redundancy can be created artificially when we copy data, the main reason there is so much built-in redundancy in data is that our world is non-random. Nature operates under the laws of physics, and many events have hidden causal relationships that we may not even understand yet. People’s behaviors are not truly random either. People’s interests and tastes for good content do not change unpredictably. Hence, the content they create, consume, and share won’t be completely random. Moreover, most people behave rather consistently and predictably across time and space.

 

As a result, the data set that reflects these non-random processes will exhibit correlation naturally. More specifically, any data set may exhibit two types of correlation:

  1. Serial correlation: correlation within a single variable
  2. Cross-correlation: correlation between several variables

 

Big data, being so big, will certainly contain lots of data for any single variable. This increases the probability of finding serial correlation within any one variable. Moreover, big data also has a great diversity of data types (i.e. different variables). This increases the probability of finding cross-correlation between any pair of variables. Therefore, as big data becomes bigger and more diverse, it will also have more statistical redundancy.

 


 

Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.

 

Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him at @mich8elwu or on Google+.

 

Comments
by Emeka(anon) on ‎10-16-2012 03:24 AM
I am wondering if by statistical redundancy you also mean noise? Regardless, I hold the opinion that all data is inaccurate, and on this premise, the bigger data gets, the more likely it is to need cleaning.
by Lithium Guru ‎10-16-2012 12:39 PM - edited ‎10-16-2012 12:41 PM

Hello Emeka,

 

Thank you for asking this question.

 

Statistical redundancy is not noise. To understand what signal and noise are and how they relate to data and information, you should take a look at my reply to Larry Iron. In short, statistical redundancy exists whenever two pieces of data/information in your data set are not independent of each other. That is, you can predict something about one from the other. I just didn't want to copy and paste the answer here again, because I've already answered this exact question before.

 

Unfortunately, it does require a little understanding of statistics and what statistical independence means. 

 

Let me know if you still don't get it after you've read through my previous answer, and I will try to expand a little more.

OK, see you next time.

by Relay(anon) on ‎10-24-2012 01:09 PM

Dr. Wu,

 

Thank you for the interesting discussion. If I may, I would like to add to the discussion.

 

Given your scientific background, most likely I'll be repeating what you are well familiar with, in which case please accept my apologies upfront. In addition to the reasons mentioned, redundancy also arises due to:

(1) Multi-modal observation
(2) Multi-channel recordings

For these cases, the events under observation need not be "seemingly uncorrelated", nor necessarily temporally correlated.
Multi-channel recordings are common in scientific experiments; often the study may involve sampling with sensors across space (e.g. geophysical or neurological recordings). Depending on the spatial variation of the event, the recorded data may or may not be redundant. In the case of neural recordings, say using EEG (electroencephalograms), there is clearly a varying degree of information duplication across sensors at each instant. Similarly, worldwide recordings of a major earthquake will show redundancies.

Multi-modal recordings are simultaneous observations of a given event using different means. Again, take for example neural recordings - one can measure the neural activity using EEG and MEG (magnetoencephalograms) simultaneously. Same event, but different "views" of the event. Use of different modalities is justified on the grounds that one modality can capture certain kinds of information better than the other (or information that is absent in the other), and together they complement each other. Needless to add, there is much redundancy across these two modalities.

Given my background, I've provided scientific examples. I wouldn't be surprised if analogies can be drawn for business situations and applications.

@relay70f

 

by Lithium Guru ‎10-24-2012 10:52 PM - edited ‎10-24-2012 11:57 PM

Hello Relay,

 

Thanks for contributing to this discussion.

Please feel free to contribute to the discussion anytime. It never hurts to have a different perspective.

 

The two mechanisms for redundancy you proposed are actually subsumed under the causal nature of the world (i.e. the third mechanism in my blog). Measuring the same events under different conditions or with different modalities and technologies means that the data share the same underlying cause. And this causal nature is what leads to a lot of correlation, and therefore redundancy, in the data.

 

It is definitely another way to look at how causality creates redundancy. And you are totally right: lots of measurements are actually measuring the same thing under a different variable. At least people in the scientific community are honest about the fact that they are measuring the same thing, which is a good thing. It definitely creates less confusion that way.

 

BTW, I used to work in a neurophysiology and fMRI lab, so EEG and MEG are common technologies that I play with.  ;-)

Anyway, thanks for the discussion, and see you next time.

 

by Relay(anon) on ‎10-27-2012 09:08 AM

Hello Dr. Wu, Thank you for your response. Yes, I hear you that the cases I mention can be viewed as a special case of causality. I should have said that the 2 cases are additional examples from the scientific world. Thanks. -R