Data Science

Why is Big Data So Big?

Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and group behavior in online communities and social networks.


Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics and its application to Social CRM. He's a regular blogger on the Lithosphere's Building Community blog and previously wrote in the Analytic Science blog. You can follow him on Twitter or Google+.



Time flies! Is this already the fifth article I've written in this analytic science mini-series? Previous posts are compiled below for easy access. If you missed any of them before, now is your chance to catch up.

  1. Making Sense of Social Data: Visualization of Twitter Interactions
  2. Big Data Analytics: Reducing Zettabytes of Data Down to a Few Bits
  3. Searching and Filtering Big Data: The 2 Sides of the “Relevance” Coin
  4. If 99.99% of Big Data is Irrelevant, Why Do We Need It?


The Moving Target

Although I’ve been talking about big data for a while, I realized that I never really defined it. How big is big? What are the precise criteria for a data set to be considered big data?


If you ask around, most big data practitioners would probably say that big data is any data that is too big to be stored, managed and analyzed via conventional database technologies. So the “data” in big data can really be anything. It doesn’t have to be social media data, and it is certainly not limited to user-generated content. It can be genomic, financial, environmental, or even astronomical. Although this definition is very simple and easy to understand, I didn’t like it, because its meaning actually changes over time.


According to Moore’s law, the speed and storage capacity of computing devices are increasing at an exponential rate. Many data sets that were once too big can now be stored and analyzed easily. So what was once considered big data isn’t big anymore. Likewise, big data today may not be big in the future as computing power continues to increase.
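As a rough sketch of this effect, here is what exponential growth looks like under an idealized fixed doubling period (the two-year period and the function name are assumptions for illustration, not part of any formal law):

```python
def projected_capacity(years, doubling_period_years=2.0, base=1.0):
    """Capacity under exponential growth that doubles every fixed period."""
    return base * 2 ** (years / doubling_period_years)

# A data set that strains today's hardware may be routine in a decade:
print(projected_capacity(10))   # -> 32.0, i.e. roughly 32x today's capacity
```

With capacity up ~32x in ten years, a data set that was once "too big" quietly stops being big data.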


As you can see, it is difficult to pinpoint precisely how big the data needs to be for it to be considered big data; this criterion is a moving target. Rather than trying to define big data, we will take a different approach and try to identify some of its common traits. But keep in mind that these traits are not strict definitions, and they do change over time.


The Data Capturing Devices

One of the most obvious characteristics of big data is that the devices for capturing those data are either already ubiquitous or becoming ubiquitous. Examples are cell phones, digital cameras, digital video recorders, etc. When any data capturing device becomes ubiquitous, there is a high probability that whatever data those devices are capturing will eventually become big data. This is pretty obvious, because more data capturing devices translate directly into a proportional increase in data production rate.


Besides the increase in capturing units, there is also an increase in the variety of data sensors and input devices. The GPS and accelerometer on your smartphone capture very different types of information, even though the data are really just a bunch of numbers. There is also an increase in the variety of input devices (i.e. different ways for a device to capture the same type of information). For example, search queries used to be captured strictly via a keyboard; now they can also be captured via any camera equipped with OCR, virtual keyboards on your smartphone or tablet, voice recognition, etc.


The variety of data sensors and input devices not only increases the data production rate, it also produces an explosion of metadata for segmentation. Using the search function as an example, what used to be just queries can now be segmented into queries from computers vs. queries from mobile devices. Those from mobile devices can be further segmented into those input via a virtual keyboard vs. camera vs. voice. Likewise, queries can also be segmented according to their geo-location using GPS data. All of this is valuable information that tells us how users are using the search function, and it certainly contributes to the size of big data.
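A toy sketch of this kind of metadata segmentation (the query log, field names, and values here are all made up for illustration):

```python
from collections import Counter

# Hypothetical search-query log: each query carries metadata captured
# by the device alongside the query text itself.
query_log = [
    {"text": "coffee near me",  "device": "mobile",  "input": "voice"},
    {"text": "python tutorial", "device": "desktop", "input": "keyboard"},
    {"text": "weather",         "device": "mobile",  "input": "virtual_keyboard"},
    {"text": "menu",            "device": "mobile",  "input": "camera_ocr"},
]

# Segment queries by device, then by input method within mobile:
by_device = Counter(q["device"] for q in query_log)
mobile_by_input = Counter(q["input"] for q in query_log if q["device"] == "mobile")

print(by_device)         # mobile vs. desktop query counts
print(mobile_by_input)   # voice vs. virtual keyboard vs. camera, mobile only
```

Every extra metadata field multiplies the number of possible segments, which is exactly why this metadata explosion adds to the size of big data.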


Increased Data Resolution

Another major contributor to the bigness of big data is that data resolution is increasing rapidly. This is largely a consequence of Moore’s Law, which says that the density of integrated circuits (ICs) doubles approximately every 2 years. This means higher density CCDs in cameras and recorders, or equivalently, higher image resolution. As a result, images and videos will take up more of your storage volume and make your data even bigger.


Many scientific instruments, medical diagnostics, satellite imaging systems, and telescopes benefit tremendously from this increase in spatial resolution. What used to be a blur due to a lack of resolution is now crystal clear. This can mean the difference between finding a star or a planet in a distant galaxy vs. not. And if it is a tumor we are looking for, this could mean the difference between life and death.


Higher density ICs also mean faster CPUs, which allow you to capture data at a higher sampling rate. This increases the data resolution in a different dimension: time. Increased temporal resolution means that instead of storing 1,800 frames of data for a minute of video (30 fps), you now have to store 3,600 frames for that same minute of video (60 fps). This will certainly make your data bigger, but the benefit can also be huge, especially for time-sensitive data, for example, financial data, market reaction data, and audience measurements. The difference of a few seconds can mean the difference between making and losing millions of dollars.
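The frame-count arithmetic is simple enough to sketch:

```python
def frame_count(duration_seconds, fps):
    """Number of frames captured for a clip at a given sampling rate."""
    return duration_seconds * fps

print(frame_count(60, 30))   # 1800 frames for one minute at 30 fps
print(frame_count(60, 60))   # 3600 frames: double the data for the same minute
```

Doubling the sampling rate doubles the data volume for the very same minute of footage, with nothing else changing.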


Therefore, any data that is experiencing a rapid increase in data resolution (whether it is spatial, temporal or any other dimension) is likely to evolve into big data.


Super-Linear Scaling of Data Production Rate

Although there are a few more common traits among big data, in the interest of time I will talk about just one more here. I call this property “super-linear scaling of data production rate.”


When the rate of data production scales super-linearly with the number of data producers, the data they create will likely grow rapidly into big data. The key concept here is super-linearity: for each additional data producer, there is a disproportionately greater increment in the rate of data production.


Super-linear scaling is basically the network effect of data production. This property is particularly relevant to social data, because nearly all social media interactions scale super-linearly with the number of users. For example, if you have 4 users, the number of possible interactions among them is 6 (see figure 1a). But if the number of users doubles to 8, the number of potential interactions among them more than quadruples, to 28 (see figure 1b). This is the power of super-linear scaling (a.k.a. the network effect).
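The pairwise-interaction count here is just the familiar n·(n−1)/2 formula, which a tiny sketch can confirm:

```python
def possible_interactions(n_users):
    """Distinct pairwise interactions among n users: n * (n - 1) / 2."""
    return n_users * (n_users - 1) // 2

print(possible_interactions(4))   # 6 possible interactions (figure 1a)
print(possible_interactions(8))   # 28 possible interactions (figure 1b)
```

Because the count grows quadratically, doubling the user base more than quadruples the potential interactions; that is the super-linearity at work.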


Because the majority of social media data are generated through interactions between users, as more users adopt social media, the data production rate will increase super-linearly. That is why, if you start capturing any social media data now, it is very likely to grow into big data very soon.



Since the precise criterion for “big” data is a moving target, it is useful to examine how “big” data are generated and try to identify the common traits that contribute to their “bigness.” There are at least three major factors that contribute to the bigness of big data.

  1. Ubiquity and variety of data capturing devices for different types of information
  2. Increased data resolution
  3. Super-linear scaling of data production rate with data producers


BTW, my speaking engagement schedule is getting pretty packed. I'll be presenting at two meetings for the Consortium of Service Innovation (CSI) today, talking about boosting the relevance of internal search algorithms to make support and knowledge content more findable. It's a search algorithm I've been working on that has both social and contextual sensitivity. Then there's another CSI program team meeting March 21-23 in Reston, VA, where I will be talking about the Reputation Model for effective Intelligent Swarming. I am also very honored to be invited to Deloitte University (Westlake, TX) to participate in the ON Social Insights meeting (March 12-13). Lots of travel coming up for me, so I might not have time to write as much as I'd like to. But I'll do my best to keep up my blogging pace.


Alright, now we know a bit about where big data come from. Next time we will take a quick look at the big data processing pipeline: where the big data go, and what analytics/data scientists (like me) do with them. So stay tuned for more big data!




First of all thank you for the interesting posts!

I'd like to throw in another perspective on the term: traditional data consumption (business applications) relies on a semantically and syntactically "aligned" data structure.

I think that through and with "BigData" we are on the edge of changing that "static" view of data structures: by being able to combine data from various "unrelated" sources like social media, traffic reports, website statistics, business applications (HR, transactional systems), etc., we are actually empowered to create new insights, new knowledge.

I think this is BigData - not primarily in the sense of the volume of data, but in the sense of its unstructured, semantically and syntactically diverse data structures, and the technical capabilities we are given to find "the needle in the haystack".

What do you think of this view?




First of all thank you for the interesting blogs!

Let me put another perspective on the term.

Traditionally, data and its consumption was pretty much a linear thing. What I mean by that is that all computational systems rely on an agreed data structure to be processed. Creating new insights in traditional business areas like finance, insurance, wholesale, etc. was always difficult: you had to know (grasp) the problem, define the semantics and structure, and implement the solutions.


Through "Big Data" we are able to get a much broader view of the "problem" without even knowing it... Big Data enables us to process diversely structured and (at first hand) semantically disconnected data - from unstructured social media data to transactional data (e.g. financial data) - and combine these with "environmental" data (location, time, physical) to form a representation of the world "in data".

It gives us the ability to "play with data" in a much more natural way. Data becomes a "creative mass".

I think this is a pretty good definition of Big Data: it enables you to look at a lot of diverse data and find patterns without knowing exactly what to look for.


What do you think about that?




Hello Hardy,

First of all, thank you for the comments, both of them (I assume both of the comments above are from you; if I’m wrong please speak up). I appreciate you taking the time to put forth your perspective on big data. Big data is quite a loaded term and means different things to different people.


Although many types of big data do allow the creative combination of various data sources and the mashing up of many related data, I don't believe it is a requirement. What you mentioned is a different property of data called high dimensionality. High-dimensional data are data that can be combined in many ways to create new perspectives; therefore, they can potentially generate new insights that we don’t already know.


For example, DNA microarray data have very high dimensionality, and by looking at different combinations of dimensions, we get a different perspective on the data, which can potentially generate insights. Since the number of combinations grows exponentially with the number of dimensions, microarray data can be combined and mashed up in ways you can’t imagine.
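A quick illustration of how fast those combinations grow (the dimension count of 20 here is made up; real microarrays measure thousands of genes, which makes the explosion far more extreme):

```python
from math import comb

d = 20  # hypothetical number of measured dimensions (e.g. genes)

# Distinct two-dimensional views (pairs of dimensions) of the data:
print(comb(d, 2))    # 190 pairwise views

# Non-empty subsets of dimensions grow as 2^d - 1:
print(2 ** d - 1)    # 1048575 possible combinations from just 20 dimensions
```

Even a modest 20 dimensions already yields over a million subsets to explore, which is the sense in which high-dimensional data can be recombined "in ways you can't imagine."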


Another point is that whether data is structured or unstructured has nothing to do with how big it is. There are structured data that are really big (e.g. radio astronomy data). And there are unstructured data that are small enough to fit on your laptop. For example, free-form interviews or free-form survey data are usually totally unstructured, but because the sample size of these data sets is often limited, they can usually fit on a researcher’s laptop.


Lastly, I think many people equate big data to social media data. That is actually a very narrow view of big data. There are many big data sets that have nothing to do with social, and they can actually be bigger than social media data. Social media data just happen to be available, accessible, and of market interest, so people know about them. It happens that social media data are “big” in the big data sense, and they also happen to be high dimensional as well as unstructured. However, these 3 properties of social media data are not necessarily correlated in other types of data.


Nevertheless, what you said is true if you restrict your view of big data to only social media data. As a data scientist, I just like to take a more general view.


Alright, I hope this gives you a different view on big data.

Thanks again for your perspectives, and I hope to see you next time.


Hey Michael

..yeah, both were from me. Sorry for that, but after registering and not seeing my first post, I thought it was gone. Must have been an IE9 bug..


I completely agree with you. What I tried to explain was similar to what you laid out. BigData is not only about social media data - but it really earns the term when you start to combine "high dimensional" and "unstructured" data for a certain purpose. IMHO it's not (only) about the arbitrary size, nor where it comes from (sensors, GPS, etc.) or its "quality" (resolution), but the actual "diversity" and the directed "purpose" - the questions you are seeking to solve with the data.

So - even though I'm not a data scientist - I get to see a lot of interest sparking up from "combining" data and getting a 360 degree view of your customers. To me - these are all questions to be solved by "Big Data" and "Big Insight".


Thanks for blogging about these topics!




Hello Hardy,


Glad to see you back and thanks for continuing the discussion on this interesting topic.


First of all, it is not an IE9 problem; it is our moderation policy. Posts from non-registered users are moderated by human moderators before they go public. That is why you didn't see it immediately. Our fault  ;-)


Good observation! Diversity as well as purpose is certainly a property of much big data today. Moreover, much of the data was recorded with a very different purpose in mind than what it is eventually used for. And it is through the creative combination of different data sources that we get this data explosion. Because we are combining data in new ways, we end up deriving new data. I think you are trying to refer to these diverse, multi-purpose, high-dimensional, unstructured, recombinable types of data as big data, right?


If that is the case, then many "big data" would not need today's big data technology (e.g. the SMAQ stack) for their capture, storage, and processing. Although that is fine and it’s probably just terminology, it seems a little weird to me. I would consider this a confusing terminology, since the term big data carries the hidden implication that the data needs to be captured, stored, and processed by big data technologies. I would classify your data as smart data; then there wouldn’t be any terminology conflict. So big data still refers to volume, as most people in the industry use the term, but smart data can be big or not so big in volume while having all the nice properties you mentioned.


That seems to be a more consistent way of looking at data in general.  ;-)

Thanks again for the discussion. I think I will expand on this idea in a later blog.

So thank you and hope to see you again next time.


Hi Michael,

Moore's law is not about storage space but about the number of transistors. And it is not exponential. Moore's observation was that the number of transistors was doubling every 18 months. Maybe a power law?

The law is nowadays applied indirectly to processing speed, capacity, and storage, but I would say those associations are incorrect, since they were not Moore's initial observation, and also because new technologies are in play nowadays for storage that Moore could not even envision in 1965.


Plus, as you mention, big data is a moving target, and so is processing capacity. The gap between the data available and the processing capabilities a person has access to (either on devices or in the cloud) will keep on increasing, for the simple fact that the data available has different multipliers than processing capacity. Big data is generated by: the number of content producers (people or devices: both increasing, in huge amounts) * the volume of data (exploding due to high resolutions, redundancy, and the ease of data collection).




Hello Elio,


Thank you for the comment. I apologize for the late reply as I’ve been on the road quite a bit, and just returned from Deloitte Univ. yesterday. 


You are correct. Moore’s original statement is about transistor count. The period of doubling can vary from 18 to 24 months, and it is an effect of both transistor packing and solid-state advancements in transistor technology. However, the capacity and performance of many digital devices are strongly linked to Moore’s Law. There are equivalent laws for transistor density, hard disk storage (a.k.a. Kryder’s Law), network bandwidth (Butter’s Law), image resolution (Hendy’s Law), etc. Since these related formulations can often be seen as a consequence of Moore’s prediction, even though Moore couldn’t have known about all the technology that exists today, I want to give him some credit for predicting the increased performance in storage, bandwidth, as well as resolution.


That is the whole point of a prediction: we make one precisely because we don’t know everything. If we knew everything about the future, it wouldn’t be a prediction anymore; we would just know it. For example, I wouldn’t give anyone credit for predicting what would happen if someone dropped a grand piano from the Empire State Building, because we all know it will drop and shatter. We know what would happen even though no one has actually done it, so I wouldn’t call this a prediction. It’s just knowledge, or fact. Likewise, if Moore had known about every modern solid-state technological advance, his law wouldn’t be so impressive anymore. It would just be an obvious fact that everyone knows.


Finally, I like your simple formulation. And that is precisely what I’ve talked about in this article. Big data is big for many reasons, but some of the most important contributors are, as you mentioned,


  1. An increased number of content producers, which includes people and devices
  2. Increased data volume, which includes increases in resolution as well as redundancy (as a result of the ease of copying and sharing content).


But there are many more. For example, the ease of creating new data from existing data, which consists of extracting data derivatives from pre-existing data, as well as novel combinations and mash-ups of existing data. We’ve only scratched the surface here! If we were given the time, I bet we could write a whole book on this topic.  ;-)


Anyway, thank you for your comment and correction.

I appreciate the discussion, and I hope to see you again next time.


Thanks for your feedback. You might be right about the SMAQ stack - but you sure get a lot of bang for the buck from it.. :-)

If I look at more "traditional" or especially "proprietary" hardware and software stacks, then you would have to invest much more to get the same flexibility, stability, and computational power. What I like about SMAQ is the fact that you can start fairly small - test your data and application design - and scale fairly easily with quite low requirements.

To me that's also one facet of Big Data and its underlying technological concepts (MapReduce, etc.).


And I'd like to pick up on your response to Elio: isn't BigData also exactly about the "problem" you are stating in the last paragraph? (Data/information) "overload" and the (base) tools to overcome it (massively parallel computation for data mining, ML, etc.)?


A book sounds good - but make it an ebook..




Hello Hardy,


Nice to see you back again and thanks for continuing the discussion.


Yes, the SMAQ stack is relatively much cheaper, with respect to the cost of the talent necessary to use these technologies. But eventually people will acquire these skills, and educational institutions will begin to adopt and teach them too. So hopefully the cost of talent will come down over time as the supply of data scientists increases.


If you mean big data technology, such as Hadoop, then the ease of scaling is certainly a defining property of big data technology. However, I like to point out that big data is not the same as big data technology. Big data refers to the data, and big data technology refers to the hardware and software (e.g. the SMAQ stack) for storing and processing the big data.


But it is true that most big data do tend to create an information overload; that is why we need search and filtering to winnow it down to the data we want. This, however, is a consequence of big data, not how it is created. Besides, data mining and machine learning are not limited to big data, so I wouldn’t consider them big data technology. They can be applied to any data.


Lithium and I did publish an e-book at the beginning of this year, but it’s not on big data. Maybe that will be the next book.  ;-)


Alright, I hope I’ve addressed your questions.

Thanks again for continuing the conversation.


Hi Michael,

Enjoyed the read. I agree with you that Big Data is not just about volume (which is a byproduct factor) and not purely about structured vs. unstructured data.


The essence of Big Data is really a combination of factors that have come together lately, especially in the past few years. The increase in the number of sources is definitely one of the top ones, and that is compounded by the frequency and velocity of data generation, along with a multitude of data types.


And while I agree with you that unstructured data is not Big Data, data explosion in the unstructured space (from mobile, social etc.) has contributed significantly to the Big Data growth and hype.


At the end of the day, I think the key is not whether you are tethered to Big Data or not -- the important factor is how you are planning to use it and mine it for action that will help your business.





Hello Ned,


Thanks for the comment and the affirmation.


It is certainly true that big data, as it is now, is a combination of many factors. However, the reason we needed big data technology is primarily the volume of the big data. It just happens that the other properties, such as the variety of sources, the unstructured nature, etc., come along with it. That is why I wanted to examine the reasons that lead to the production of big data.


What I wanted to convey is that although big data has these properties, they are not inherent to the "bigness" of big data.


But I totally agree with you that the important thing is not whether you have big data or not. You can even buy the data if you need it. And it will only get cheaper. It is what you do with it that is valuable. Otherwise, it's just hard drives gathering dust.


Look forward to more conversation in the future.

Thanks for the comment again, and see you next time.


