If 99.99% of Big Data is Irrelevant, Why Do We Need It?
Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and group behavior in online communities and social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics and its application to Social CRM. He's a regular blogger on the Lithosphere's Building Community blog and previously wrote in the Analytic Science blog. You can follow him on Twitter or Google+.
OK, let’s pick up where we left off. In my last post, we examined the first step in any big data processing engine: searching and filtering. In essence, the goal is to separate the relevant data (signal) from the irrelevant data (noise). If you’ve analyzed any form of big data, you probably noticed that the signal-to-noise ratio is pretty low. Most of the data are noise, and only a tiny fraction is signal. The question then is, why do we need big data?
Depending on which camp you’re from, there are many answers to this excellent question. I will talk about three in this post, but keep in mind that there are many more. Just because you can track, store, and analyze big data, doesn’t mean you should.
The Uncertain Future
One of the most common arguments favoring big data is that data is versatile and doesn’t really have a shelf life. Even though you don’t need it today, its relevance and utility may become apparent in the future. And since you never know what you might need in the future, you might as well store everything that you can now.
This argument is almost tautological. That means it is irrefutably true no matter how you interpret it. Since the future is inaccessible (at least for now), and humans are risk averse, we will always want to hedge against the unknown future. The only question left is how cheaply we can track and store these data. If it is cheap, this approach makes sense!
Although data storage is relatively cheap these days, a big data initiative has hidden costs beyond the mere cost of hard drives. Since big data are too big to be stored in, or analyzed on, conventional databases, you need a completely new stack of technology for their capture, storage and analysis. This stack is known as the SMAQ stack (i.e. Storage, MapReduce and Query). One of the most popular SMAQ stacks is based on Hadoop, an open source framework whose distributed file system and MapReduce engine are modeled on Google File System (GFS) and Google's MapReduce. Because it is open source, the SMAQ stack itself isn’t expensive. The real cost is the new talent needed to use this stack effectively so enterprises can derive insights from the big data.
Despite the fact that big data technology is relatively cheap, the total cost of ownership (TCO) of any big data initiative may still be quite high when you factor in the cost of human resources. So, big data is definitely an investment that may not be right for everyone.
Your Signal is My Noise
Let’s look at a different argument for big data. Although each person's relevant data is not big at all, the overlap between different people's relevant data is also tiny. What is relevant to me may be completely useless to you and vice versa; your signal is probably somebody else's noise. Since we usually don’t know who will be looking at these data, we must store everything we can in order to better serve everyone.
The small overlap in relevance is most apparent in Data as a Service (DaaS) vendors like Social Media Monitoring (SMM, a.k.a. Listening Platforms). If you are a company or a brand using SMM, you are probably concerned with the conversation about you and your competitors. That is actually a very tiny fraction of the conversation on social media because there are conversations about hundreds of thousands of different brands out there. Every brand will be interested in the conversation about itself, and every brand will have a different set of competitors. Since no one knows which brand will subscribe to which DaaS, DaaS vendors need to be prepared to serve all brands by storing all conversations on the social web.
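To make the overlap argument concrete, here is a minimal Python sketch. Everything in it is hypothetical: the toy posts, the brand names, the watchlists, and the helper names `relevant_ids` and `jaccard` are invented for illustration and do not come from any real SMM product. It shows how each brand's relevant slice can be small, how two brands' slices can barely overlap, and why a vendor serving every brand ends up needing the union of all of them.

```python
from itertools import combinations

# Hypothetical toy corpus: each post lists the brands it mentions.
posts = [
    {"id": 1, "brands": {"acme"}},
    {"id": 2, "brands": {"globex"}},
    {"id": 3, "brands": {"initech"}},
    {"id": 4, "brands": {"umbrella"}},
    {"id": 5, "brands": {"acme", "globex"}},
    {"id": 6, "brands": set()},
    {"id": 7, "brands": set()},
    {"id": 8, "brands": {"hooli"}},
    {"id": 9, "brands": {"initech", "hooli"}},
    {"id": 10, "brands": set()},
]

# Hypothetical watchlists: each subscriber cares about itself and its competitors.
watchlists = {
    "acme": {"acme", "globex"},
    "initech": {"initech", "hooli"},
    "umbrella": {"umbrella"},
}


def relevant_ids(posts, watchlist):
    """Return the IDs of posts that mention at least one watched brand."""
    return {p["id"] for p in posts if p["brands"] & watchlist}


def jaccard(a, b):
    """Overlap between two sets as |intersection| / |union| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Each subscriber's "signal": the slice of the corpus relevant to it.
signal = {name: relevant_ids(posts, wl) for name, wl in watchlists.items()}

for name, ids in signal.items():
    print(f"{name}: {len(ids)} of {len(posts)} posts are relevant")

for a, b in combinations(signal, 2):
    print(f"overlap({a}, {b}) = {jaccard(signal[a], signal[b]):.2f}")

# A vendor that does not know its subscribers in advance must keep the union.
vendor_needs = set().union(*signal.values())
print(f"vendor must store {len(vendor_needs)} of {len(posts)} posts")
```

With real data the per-brand slices would be far smaller relative to the whole corpus than in this ten-post toy, but the shape of the argument is the same: small individual slices, little overlap, large union.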
Now, if you are not a DaaS provider (e.g. SMM or VRM) you might not need all these “big” data. For a brand, all you really need are the conversations about you and your competitors. There are several options for getting these data.
- You can capture and store the data yourself
- You can buy the data (with a big check)
- Or you can subscribe to a DaaS provider and get these data at a much lower cost
Maybe You Don’t Need Big Data
Both arguments above hinge on the fact that the precise use of the big data is unknown. We don’t know what questions we may need to answer, and we don’t know what data can help answer them.
Sometimes, however, we do know the questions we need to answer. In fact, we often have some very specific business questions with regard to social media. What is the ROI? Which technology is most engaging? Who are your most valuable influencers? In these cases, you don't need “big” data. You just need the “right” data, the relevant data, the precise data that addresses your question! And that is usually a pretty small data set; sometimes it can even be loaded and analyzed on a beefy personal computer.
Alright, there are probably hundreds of reasons for and against big data. I’ve talked about three here. What are your arguments for or against big data?
Although there is little dispute about the utility of big data, collecting and storing these data by yourself may not be the most economical way to get them. So when should you start thinking seriously about your own big data initiative?
- If you have access to the talent and can do it cheaply. That includes the talent to extract and analyze the relevant data in order to derive insights and value from it
- If you are a DaaS provider and need the data to serve your customers
- If you have specific questions, you may not need your own initiative at all: all you really need is the “right” data, which is usually not big!
Finally, a little preview of what’s coming. Without going too deep into the technical details, next time we will address the question of where big data came from and how it got so big. See you next time!