John Bantleman
Today’s enterprise is faced with collecting, managing and analyzing vast quantities of data that arrive from multiple sources at an extremely fast rate. Setting aside the oft-discussed social media and sentiment data, which certainly contribute to the Big Data cause, it’s not uncommon to see multi-terabyte volumes generated daily inside a traditional enterprise. Facebook, now touted as one of the largest Hadoop production environments, generates tens of terabytes each day and has amassed a farm of 2,000 machines holding 12 PB; that is not your typical everyday architecture. However, it’s not unusual to see 500 TB environments that create a couple of terabytes of data every day looking to Hadoop to lower cost and scale more predictably. Financial trade data, RFID tags, web clickstreams, network logs and communications detail records (aka machine-generated data) all contribute to the daily onslaught, and all of it must be collected, stored and analyzed to some degree (see TDWI survey results in figure below). Most financial services institutions experience 100% annual data growth, which can no longer be sustained using traditional RDBMS and data warehouse products; this is the key reason we see rapid adoption of Hadoop.
At the same time, vast data has created new and ever-changing demands from business users who want access to both historical and current data regardless of type or volume. Increasingly they are tasked with understanding what happened in the last 30 minutes, what happened at the same time last year and how that compares with what happened, say, three years ago, and they need quick responses to support better decisions. While Hadoop has definitely helped meet new business demands over the past couple of years, its approach is largely batch, which means that even simple queries can take hours, especially in a Big Data Hadoop / MapReduce environment. In a world where decisions are made in minutes and even seconds, business users increasingly expect fast answers, regardless of the platform and technology stack.
Hive has emerged as a promising technology complement to Hadoop for queries, but it is still batch, and it requires not only specific training but potentially manual professional services time to convert all the existing SQL statements running in the enterprise. MapReduce and Pig enable your data scientists to perform investigative analytics and are ideal tools for questions such as “show me a graph for all the Web pages visited by a segment of customers in the last month, how long they stayed on each page, at what point they clicked off and the time lag between subsequent visits”. By contrast, a question as simple as “show me volume of visitors by day for a specific Web page” could easily be executed using standard SQL, with a much faster response time than a Pig script. We are seeing a lot more engineering innovation now around making Hadoop / MapReduce more “user-friendly” as well as faster. Hadoop adoption will increase as the entire ecosystem of providers makes it visually and functionally easier to use and, more importantly, as response times improve from hours to minutes or from minutes to seconds. This is the next big hurdle in Hadoop-based environments.
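To make the “volume of visitors by day” question concrete, here is a minimal sketch of how it could look in standard SQL. It runs against an in-memory SQLite database through Python purely for illustration, not against Hive or Hadoop. The table name page_visits, its columns, the sample rows and the helper visitors_by_day are all assumptions made for this example, and “volume” is taken here to mean distinct visitors per day.

```python
import sqlite3

# Hypothetical clickstream rows: (visitor_id, page, visited_at).
SAMPLE_VISITS = [
    ("alice", "/pricing", "2012-03-01 09:15:00"),
    ("bob",   "/pricing", "2012-03-01 10:02:00"),
    ("alice", "/pricing", "2012-03-01 17:40:00"),  # repeat visit, same day
    ("carol", "/pricing", "2012-03-02 08:30:00"),
    ("bob",   "/home",    "2012-03-02 11:00:00"),  # different page
]

def visitors_by_day(rows, page):
    """Return [(day, distinct_visitors), ...] for one page, ordered by day."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema; a real clickstream table would differ.
        conn.execute(
            "CREATE TABLE page_visits "
            "(visitor_id TEXT, page TEXT, visited_at TEXT)"
        )
        conn.executemany("INSERT INTO page_visits VALUES (?, ?, ?)", rows)
        # A single GROUP BY answers the whole question.
        query = """
            SELECT date(visited_at) AS day,
                   COUNT(DISTINCT visitor_id) AS visitors
            FROM page_visits
            WHERE page = ?
            GROUP BY date(visited_at)
            ORDER BY day
        """
        return conn.execute(query, (page,)).fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for day, visitors in visitors_by_day(SAMPLE_VISITS, "/pricing"):
        print(day, visitors)
```

The point of the sketch is only that the question reduces to one filter and one aggregation, which is the kind of query business users already know how to write; the path analysis described above (time on page, exit points, lag between visits) does not collapse so neatly.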
Organizations have a huge opportunity to improve their bottom line by giving business users direct access to Big Data. Enabling this without compromising performance is the next big thing. We have figured out how to manage the vast; fast is now where the engineering effort is concentrated. Fast means speed, which of course could mean different things to different businesses, but essentially it is the ability to intercept events that are potentially harmful or risky as they are taking place, rather than analyzing the impact after the event. Vast without fast means the business value is harder to quantify.

