Searching and Filtering Big Data: The 2 Sides of the “Relevance” Coin
Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and group behavior in online communities and social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics and its application to Social CRM. He's a regular blogger on the Lithosphere's Building Community blog and previously wrote in the Analytic Science blog. You can follow him on Twitter or Google+.
Alright, a little announcement before we begin today. First, I apologize that I haven’t been keeping up with my blogging as much as I’d like to. I’ve been totally swamped with our analytics architecture project, not to mention the external engagements that I already have scheduled and committed to through June. Speaking of that, I will be giving a closing keynote today at the Social Media Strategy Summit (SMSS) in Las Vegas. So if you happen to be at The Mirage, please drop by and say hello!
A couple weeks ago, Lithium launched “The Science of Social,” a compilation of most of my research work at Lithium. It is re-written for a business audience. As such, it is not designed to go as deep as my blog here (maybe that is the next book, if I ever get the chance to write it!).
Quite a few people have been asking on Twitter how they can get the book, so I thought I’d briefly answer that here. Since this book is intended to be an easy-to-read business book, we decided that the main distribution format would be an e-book.
- The Kindle version is available for $4.99 on Amazon
- The iBook version WILL BE available from the Apple store soon (it is pending approval from Apple as we speak)
- However, hard copies of the book (both soft and hard cover) are available from Blurb
NOTE: Any proceeds collected from the sale of these books will be donated to charity, so I’ll keep you up to date on that in the future.
Now, let’s get back to big data. Last time we talked about one of the most important functions of analytics: helping us make better decisions.
To achieve that, analytics faces the challenge of reducing hundreds of terabytes of data down to a few bits that we can decide and act on. Today, we will describe two of the most commonly used data reduction techniques. These are not new, and you are probably familiar with them already, but I’d like to mention them briefly for the sake of completeness before we move on to the more advanced techniques later in this mini-series.
If You Know What Data You Need – Search
If you know the data you need to help you make a decision, then the simplest data reduction technique is a search. This turns the data reduction problem into an information retrieval (IR) problem, which we know how to solve very effectively. At the very least, we can leverage an open source IR library (e.g. Lucene) or ask Google for help.
Search is arguably the most efficient way to reduce data, but the caveat is that we must know what data we are looking for a priori. Because search is so efficient, search engines can be applied at the web scale to find and retrieve the data we need. This is why Google, Microsoft, Yahoo!, etc. are able to make a business out of their search technology.
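To make the efficiency point a bit more concrete, here is a minimal sketch of the inverted index idea that IR libraries are generally built around. It is only an illustration: the toy documents and the function names build_index and search are my own assumptions, not Lucene's API. The point is that once the index is built, a query only touches the entries for its own terms rather than the whole data set.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's documents and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical toy corpus, for illustration only.
docs = {
    1: "big data analytics for decisions",
    2: "search engines retrieve relevant data",
    3: "filtering removes irrelevant data",
}
index = build_index(docs)
hits = search(index, "relevant data")
```

Notice that the query has to name the terms up front. That is the a priori knowledge search depends on.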
What Happens When You Don’t Know the Data You Need?
However, as with many things in life, we often don’t know the data that will best help us with the decision in front of us. In these situations, we often resort to filtering: the process of selectively eliminating the data that are not relevant to our decision. Although the implementations of search and filter technology are quite different, they essentially solve the same problem. At the abstract level, both narrow the data down to a much smaller set that is relevant to our decision. With search, we do it by finding and retrieving the relevant data directly, whereas with filtering, we do it by successively removing the irrelevant data, leaving behind the relevant pieces.
Because search is very efficient, we can start with a blank page like Google’s home page and then populate it with more and more relevant data through query refinement. Filtering is less efficient, because it often requires showing samples from the entire data set so the user can decide what to filter out. That is, the user has to look through the sample data to determine what’s irrelevant. Therefore, true filtering functions are rarely applied to very large data sets at the web scale.
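As a rough sketch of what successive filtering looks like, assume a handful of made-up records and a hypothetical helper called filter_out. Each round stands in for the user looking at a sample, deciding what is irrelevant, and removing it. Every round scans all of the remaining records, which is the cost that makes true filtering hard to apply at the web scale.

```python
def filter_out(records, is_irrelevant):
    """Scan every record and keep only those not judged irrelevant."""
    return [r for r in records if not is_irrelevant(r)]

# Hypothetical records, for illustration only.
records = [
    {"id": 1, "type": "video", "year": 2009},
    {"id": 2, "type": "text", "year": 2012},
    {"id": 3, "type": "text", "year": 2008},
    {"id": 4, "type": "image", "year": 2012},
]

# Round 1: after looking at a sample, the user decides videos are irrelevant.
remaining = filter_out(records, lambda r: r["type"] == "video")

# Round 2: looking at what is left, the user decides older records are irrelevant.
remaining = filter_out(remaining, lambda r: r["year"] < 2010)

remaining_ids = [r["id"] for r in remaining]
```

The user never said what they wanted here. They only said what they did not want, one round at a time, after seeing the data.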
Blurring the Boundary Between Searching and Filtering
Now, if you are Google, Microsoft, or you simply have lots of computing power, you can fake a filter by having your machines look through all the data and pre-compute attributes on the data set (e.g. date, location, media type).
Although these pre-computed filters function like a filter and give the user the ability to eliminate irrelevant data, they are really a search, because you must know what data you need before you can apply them. For example, you must know a priori that the relevant data is within the last 24 hours in order to apply that filter. If you don’t know that, you are back to square one. The pre-computed filters won’t help you; you must look at the data in order to determine its relevance.
In short, pre-computed filters (like those on the left panel of Google) are not real filters; they are really just searches in disguise. And they are implemented as searches underneath the filter-like user interface. Don’t believe me? You can get the same result simply by specifying the filter conditions as part of your search query or by using Google’s advanced search.
With modern technologies, the difference between search and filter is really more of an academic distinction. However, it does have some design implications. Since search is much more efficient, when in doubt, always apply search before filtering. Because search often returns a much smaller result set with relatively little effort from the user, we can start with a rather general search and subsequently filter on this smaller data set to find the relevant data. Most successful search engines (e.g. Google) do this. Remember, real filters require the user to examine sample data, determine their relevance, and then remove the irrelevant pieces. From this perspective, query refinement is a form of data filtering, because users must examine some of the top search results before they know how to refine the query to extract the relevant data they need.
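Here is a small sketch of why a pre-computed filter is a search underneath. It assumes a toy index in which attribute values such as media type are stored as extra terms with a made-up "media:" prefix. This is not how Google implements it, just one plausible way to show the equivalence: clicking the filter and adding the condition to the query do the same index lookups and return the same result.

```python
from collections import defaultdict

# Hypothetical documents with one pre-computed attribute (media type).
docs = {
    1: {"text": "big data search", "media": "news"},
    2: {"text": "big data filter", "media": "video"},
    3: {"text": "search engines", "media": "news"},
}

# Pre-compute a single index over both words and attribute values.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for word in doc["text"].split():
        index[word].add(doc_id)
    index["media:" + doc["media"]].add(doc_id)

def lookup(terms):
    """Intersect the document sets for all terms (an AND query)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

# "Filter" interface: run the query, then click the media = news filter.
via_filter = lookup(["data"]) & lookup(["media:news"])

# Plain search: put the same condition directly into the query.
via_search = lookup(["data", "media:news"])
```

Both paths only work if you already know that news is what you want, which is exactly the a priori knowledge a search requires.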
The first step in making big data useful is to identify the relevant data. Clearly the data can’t be useful if it is not even relevant. We typically search and then filter to winnow the big data down to the relevant data set.
Ironically, the relevant data is usually a much smaller data set; in fact, many orders of magnitude smaller. This poses an interesting conundrum: although we have the technology to track, store, and process data at the web scale, most of the data are irrelevant! That is why search technologies were developed hand-in-hand with most big data technologies. Without search and filter technologies, big data is essentially useless.
Alright, so as not to give away the spoiler, I’d better stop now. Next time we will look at the implications of search and filtering for the value of big data. Stay tuned for more on big data analytics.