Data Science

Exploratory Data Analysis: Playing with Big Data

In my previous big data post, we discussed the three necessary criteria for information to provide insights that are valuable. Through this discussion, we learned the key to insights discovery.


By definition, an insight must tell us something we don’t already know. However, we typically don’t know what we don’t know, so we can’t search for insights directly: we have no way of knowing a priori what to look for. What we need to do is temporarily forget about the value proposition of the data analysis and look beyond what’s relevant to the immediate problem we are trying to solve. There is no guarantee that we will find anything in the land of irrelevance, but ironically that is usually where insights are discovered.


How exactly do we do this? That is the topic we will discuss today.


Exploratory Data Analysis

When a good data scientist analyzes any complex data set, especially one with high dimensionality, the first step is usually playing with the data. John Tukey (the famous 20th-century statistician who coined the term “bit” for binary digit) called this step Exploratory Data Analysis (EDA). For me, it really is just playing with the data, because there are no standard procedures or prescriptive methods for approaching it.



EDA involves looking at the data from many different angles: slicing and dicing the data along non-trivial, non-orthogonal dimensions and combinations of dimensions; transforming the data through nonlinear operators; projecting the data onto a different subspace; and then examining the resulting distribution. Regardless of what it involves, the number of things to try in each of these steps is infinite. Worse, it’s actually uncountably infinite, so we cannot even enumerate them and then perform an exhaustive search by trying each one systematically.


If this sounds like gobbledygook to you, it should. But it doesn't matter, because it is all just fancy statistical jargon for play, experimentation, and exploration. Since there are inexhaustibly many ways to explore an infinite space, use your imagination and be creative!
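To make the jargon above a little more concrete, here is a minimal sketch of three such moves (slicing by a category, applying a nonlinear transform, and projecting onto a direction that is not one of the original axes), each followed by a look at the resulting distribution. Everything in it is an assumption made for illustration: the data is synthetic, and the segment names, the "posts" and "replies" columns, and the helper functions summarize and project are hypothetical, not taken from any real analysis.

```python
import math
import random
import statistics
from collections import defaultdict

# Synthetic, hypothetical records: (segment, posts, replies).
random.seed(0)
records = [
    (
        random.choice(["new", "regular", "superuser"]),
        random.lognormvariate(2.0, 1.0),
        random.lognormvariate(1.0, 0.8),
    )
    for _ in range(500)
]


def summarize(values):
    """Return min, quartiles, and max of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def project(points, direction):
    """Project each point onto the unit vector along `direction`."""
    norm = math.hypot(*direction)
    unit = [d / norm for d in direction]
    return [sum(p * u for p, u in zip(point, unit)) for point in points]


# 1. Slice and dice: summarize posts separately for each segment.
by_segment = defaultdict(list)
for segment, posts, _ in records:
    by_segment[segment].append(posts)
slices = {seg: summarize(vals) for seg, vals in by_segment.items()}

# 2. Nonlinear transform: a log often tames heavily skewed counts.
log_posts = [math.log(posts) for _, posts, _ in records]

# 3. Projection: view (log posts, log replies) along the diagonal (1, 1).
points = [(math.log(p), math.log(r)) for _, p, r in records]
projected = project(points, (1.0, 1.0))

for seg, summary in sorted(slices.items()):
    print("posts in segment", seg, summary)
print("log posts", summarize(log_posts))
print("projection on (1, 1)", summarize(projected))
```

The choice of a log transform and of the diagonal direction is arbitrary here; the point of play is that any other transform or direction is an equally legitimate thing to try.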


Our brain is still, by far, the world's most powerful nonlinear processor. It can perform many challenging tasks (e.g. recognize complex patterns, detect obscure outliers, discover hidden relationships, etc.) far better than the most sophisticated machine learning algorithm or statistical method available to us today. However, it is only through "play" that we can realize the full potential of this evolutionarily optimized processor (i.e. the brain).


When we play with data, we get a feel for the data. We get a sense of what might be an interesting thing to look for, and where and how we might find it. This will guide the rest of the analyses downstream, and determine the direction and the course of the data exploration. Consequently, EDA is extremely important, as it will often determine the success or failure of an analytics project. Heading off in a wrong direction not only wastes time and resources, it often results in termination of funding for the project. EDA is often the most critical and the most difficult step of any analysis project, yet it is also what makes analytics fun.


With complex data sets, such as social data, you will never find anything new if you don't play with the data. Without EDA, you will find the information you look for, but you will not discover insights that you don’t already have.


Structured Play: More Than Just Imagination

One very important component of EDA is the creative application of analytical techniques to the data set. Many people have asked me, “What do I need to do to be creative?” I would say, “If it were something that I could tell you, it wouldn’t be creative anymore.”


People have been looking for the magic formula for creativity for a long time. But the novelty and originality requirement of the creative process means it can’t possibly be a formula that we can reuse over and over again. In this respect, EDA is like art, music, writing, photography, or other creative disciplines. You must be novel, original, and imaginative to be successful and find something interesting.


However, pure creativity isn’t enough. Imagination without knowledge may lead to dead ends. You can be very creative and develop a truly novel set of analyses. But if there are any logical flaws in the analyses, the result may be invalid and misleading. Having a misleading result in EDA is worse than having no result at all, because it will guide you down a path that will eventually prove to be futile.
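One way a creative analysis can mislead is sketched below. This is an illustration chosen for this example, not a flaw the post itself names: if we try enough pairings of variables, some will look related purely by chance. The sketch generates columns of pure noise (so no real relationship exists by construction) and counts how many pairs still show a sizeable correlation. The sample sizes, the 0.3 threshold, and the helper pearson are all assumptions.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# 20 hypothetical columns of independent Gaussian noise, 30 rows each.
random.seed(1)
n_rows, n_cols = 30, 20
noise = [[random.gauss(0.0, 1.0) for _ in range(n_rows)]
         for _ in range(n_cols)]

# Exhaustively "explore" every pair of columns.
pairs = list(itertools.combinations(range(n_cols), 2))
strong = [(i, j) for i, j in pairs
          if abs(pearson(noise[i], noise[j])) > 0.3]

print(f"{len(strong)} of {len(pairs)} pairs of pure-noise columns "
      f"have |r| > 0.3")
```

Any of the flagged pairs would look like a finding if examined in isolation, which is why statistical knowledge has to accompany the play.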


With most creative disciplines, the evaluation of the final result is somewhat subjective. There really isn’t a correct answer for what makes a piece of art great, or which piece of art is better. It’s mostly in the eye of the beholder. But there are objective methods for evaluating EDA, and we can objectively quantify which answer is better. So EDA is not just an art. It is also a very rigorous science, and must meet all the stringent logical requirements of mathematics and statistics.


In a way, it's kind of like rock climbing. There are many possible ways to get to the top. However, you can't just do anything you like, or it will take you too long to get there.


EDA is a kind of play, but it’s a very structured play, where one must conform to 2,000 years of rules and logic accumulated through the history of statistical science. What I think is novel and original isn’t enough, because it may simply reflect my ignorance of what others have tried and failed at in the past. EDA is one of those disciplines that requires both a high level of imagination (to be novel) and a substantial amount of domain knowledge in statistics (to prevent flawed analyses and logic). That’s why EDA is so challenging.



Albert Einstein (one of the greatest physicists of the 20th century and Nobel laureate) once said, “To raise new questions, new possibilities, and to regard old problems from a new angle, requires creative imagination and marks real advances in science.”


Likewise, this type of creative imagination is also required for us to truly advance our understanding of human behavior and the collective dynamics of social systems. It all begins with exploratory data analysis (EDA), which is really just another term for playing with data. However, pure imaginative play isn’t enough, because unconstrained EDA often leads to too many inconclusive results. Imagination and domain knowledge in statistics are both necessary to maximize the likelihood of insight discovery. That is why EDA is challenging, but it’s also what makes it fun.


So give your data scientist a little bit of freedom to play with the data. You may not find anything, but you may also find a diamond in the rough.


In the meantime, I’m happy to discuss the details of any actual analyses you may want to perform during your EDA. Next time let’s venture into the more mechanical parts of big data analytics.



Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.


Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.



Michael, nice blog post thanks. I hope my dry joke is a good response to the points you raise about the value of the creative process in performing EDA...


Q: How do you find a needle in a haystack?


The linear approach: Draw up an R&D proposal for an innovative device, costing $10m in budget and just under $20m on final completion once regulatory approval has been achieved. This device can harvest needles from any given haystack in any terrain at any time of the day or night, and be operated in-situ or remotely.


The non-linear approach: Faced with such a heterogeneous organisation of data, you assemble a bunch of friends (size dependent on amount of free food & drink) and hold a wild party on the haystack. One of the partygoers is bound to find the needle simply by stepping or sitting on it. Or, if they don't, something much more strange + interesting will appear, so that the needle is simply classified as a variant hay-straw and ignored, and the new discovery is classified as the strange attractor - that "diamond in the rough" you refer to!


Hello Stuart,

Thank you for the comment. And your joke definitely captured the creative and nonlinear aspect of Exploratory Data Analysis (EDA).

However, I must re-emphasize that pure creativity is not sufficient, because it may take you too long to find the insights. There are also a lot of rules, techniques, and methodologies to EDA. In fact, there are so many that John Tukey himself said that they cannot be catalogued.

"No catalogue of techniques can convey a willingness to look for what can be seen, whether or not anticipated. Yet this is at the heart of exploratory data analysis." -- John Tukey

So EDA should be seen as the creative and nonlinear exploration of data within a vast boundary of tools developed in mathematical and statistical science. And because math and statistics are so general, and their boundary so vast, they rarely constrain the creative aspects of EDA.

Alright, thanks again for the conversation.
And I hope to see you again next time.

Thank you MikeW, this really resonates. Nice to hear that the human mind is for the moment unsurpassed when it comes to pattern recognition. I found this Wind Map webpage to be a nice metaphor for data points being linked to form vectors that subsequently form a dynamic system.



Hello TommyL,


Thanks for commenting on my blog.


Yeah, if you read my bio, you will know that prior to my entry into the industry, I was a computational neuroscientist doing research on primate visual processing. There are just so many ways in which the human visual system is superior to machine vision. Pattern recognition is one, but the ability to generalize is another that is very hard for machines to match. We see a Chihuahua and a German shepherd, and we know they are both dogs despite the stark difference in appearance.


I'm glad this post resonated with you.


And thanks for sharing the wind map page. There are facilities that map data to visual and auditory patterns (like the wind map) specifically to leverage humans' ability to recognize patterns and detect outliers in the data. Here are a few examples:

  1. Allosphere in UCSB
  2. Deloitte's Analytic HIVE (Highly Immersive Visual Environment) 
  3. ASU's Decision Theater 

OK, thanks for commenting and sharing my work.

I hope to see you next time.


