By John Bantleman
While we mostly agree that Big Data constitutes some combination of the 4 V’s, it does have other attributes, generally driven by the specific industry and the overall business use-case. According to Gartner’s Big Data Hype Cycle report issued in July: “Although big data is not necessarily just about MapReduce and Hadoop, many organizations consider these distributed processing technologies to be the only relevant ‘big data technology’; there are alternatives.” While Hadoop as a platform is taking off, not all enterprises are yet equipped to adopt it in a serious way. In a previous blog post, it was quite revealing to find that many production Hadoop environments don’t actually contain the multi-structured data types (the Variety V) the platform was originally architected to manage. While this may change over time, I also believe that structured and semi-structured transactions are the heartbeat of most organizations today, and therefore the source of the richest enterprise information.
Many organizations have gotten their Hadoop feet wet with use-cases such as pre-processing or staging data before loading it into a traditional data warehouse, something I previously coined “Raw is More”. Enterprise Hadoop adoption tends to center on the business challenge of “we’ve got many new sources of data which we need to quickly capture, retain in raw form, query and integrate with more traditional data sets to derive new business value”. Very quickly, the Hadoop requirements for solving this problem become focused on uptime, reliability, security and overall performance or real-time access. In fact, Hortonworks’ recent open source product announcement focused on exactly that.

We know Big Data use-cases come in numerous forms, and probably the biggest challenge with the 4 V’s definition is that it focuses more on describing attributes of the data itself than on what you are trying to do with it. Quickly rising to the top of the list is overall query performance and a move away from the batch approach of MapReduce. Most analytical environments aim for faster analytics, and predictive analytics, now positioned in the Plateau of Productivity in Gartner’s Big Data Hype Cycle, is something most have strived toward for many years.

Let’s examine a few real-world examples to shed light on the wide range of Big Data use-cases and how business needs vary in terms of data “consumption” at different stages or ages of data. Consider a large online retailer that needs to track customer behavior in order to identify target segments for a new product offering. It will likely track current behavior by examining web click-stream data, the products purchased over, say, the last two quarters, and average sales prices, and will include social or sentiment data from Twitter, Facebook or email, all of which give the marketing organization a good measure of market fit.
By contrast, the same retailer might also want to understand seasonal spikes. If they are about to launch a new product during the holiday season, they may want to find out what happened when Christmas Eve fell on a Saturday. For this type of analysis they need to go back 7 years (the last time Christmas Eve fell on a Saturday) to determine exact buying patterns. However, the current data warehouse may not hold data that old, and so we have a problem! If it’s stored on offline tape, good luck retrieving it quickly; you now need to build in additional weeks to retrieve, reinstate and analyze it. The historical data in this use-case is vital for accurate decision-making, and it has nothing to do with what’s happening in the business now, last week or last quarter.
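The calendar lookback itself is easy to automate. As a minimal sketch (in Python, an assumption since the post contains no code), this finds the recent years in which December 24 fell on a Saturday, telling the analyst how far back the warehouse history must reach:

```python
from datetime import date

SATURDAY = 5  # Monday is 0 in Python's datetime.weekday()

# Scan a window of recent years for a Saturday Christmas Eve.
saturday_eves = [
    year for year in range(1990, 2013)
    if date(year, 12, 24).weekday() == SATURDAY
]
print(saturday_eves)
```

The point of the exercise is that the matching years are sparse and irregular, so the required lookback can easily exceed what the current warehouse retains.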
Let’s take a banking example, where a specific SEC requirement dictates that all customer transactions be retained for five years after the account becomes inactive. This data is required by law to be available for query and audit, so storing it alongside current customer data will not only impact storage capacity and hardware but will likely degrade overall analytical performance. Performing record-level deletion of those five-year-old inactive transactions is a unique banking requirement, and quite different from the Big Data problem of storing all transactions for current analytics and KPIs. In fact, storing inactive customer data doesn’t even affect the bank’s ability to drive revenue or margin; it’s an ongoing operational cost that simply cannot be avoided.

There are many more Big Data use-cases that focus on how organizations handle their older, colder data for greater business benefit. Banking, financial services, utilities and communications don’t have as much freedom, as they are heavily regulated to keep historical data online for specific timeframes that vary by region. Of course, there are use-cases in financial services where older data is a very necessary asset for richer decision-making, such as high-frequency trading environments where quants need access to years of historical data in order to build better algorithms. A similar use of older, colder Big Data is a research company that needs historical data to produce richer research outcomes, say for a genome sequencing project, spurring highly lucrative medical and commercial developments.
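The record-level deletion described in the banking example above can be sketched as a simple retention filter. The record layout, field names and dates here are hypothetical; a real implementation would work against the bank’s actual schema and activity rules:

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical record layout -- the real schema comes from the bank.
@dataclass
class Transaction:
    account_id: str
    last_activity: date  # most recent activity on the owning account
    amount: float

RETENTION = timedelta(days=5 * 365)  # the five-year retention window

def purge_expired(transactions, today):
    """Record-level deletion: drop transactions whose account has been
    inactive longer than the retention window."""
    return [t for t in transactions if today - t.last_activity <= RETENTION]

ledger = [
    Transaction("A-1", date(2004, 6, 1), 250.0),  # inactive well past 5 years
    Transaction("B-2", date(2011, 3, 15), 90.0),  # still inside the window
]
kept = purge_expired(ledger, today=date(2012, 1, 1))
```

Running the filter keeps only the in-window record; at warehouse scale the same logic becomes a scheduled purge job rather than an in-memory pass.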
There is no question that figuring out the best technology approach to storing data going back 5-10 or even 20+ years is a painful undertaking for IT. All too often I meet with large banks struggling to retain hundreds of terabytes of older customer data in their central data warehouse, which becomes very costly even at $10-15k per terabyte (depending on the chosen solution). This is exactly where Hadoop becomes the ideal platform for storing the older, colder data, and where scale simply happens. I look forward to the day when enterprises employ a well-thought-out strategy for information management, matching best-fit technologies to business purpose and value.
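A back-of-envelope calculation shows why “hundreds of terabytes” at warehouse prices gets painful. The $10k-$15k per-terabyte figure is the one quoted above; the 300 TB volume is an assumed example, not a number from the post:

```python
# Assumed example volume; the per-TB cost range is quoted in the post.
terabytes = 300
low = terabytes * 10_000   # $10k per terabyte
high = terabytes * 15_000  # $15k per terabyte
print(f"Warehouse cost for {terabytes} TB: ${low:,} to ${high:,}")
```

At that rate, a 300 TB cold-data footprint alone runs into the millions of dollars, which is the economic opening for a cheaper Hadoop tier.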
According to Gartner’s Big Data Hype Cycle report: “Organizations with information management practices in place should not try to apply them as-is to these new information types, as they will likely cause existing governance structures and quality programs to collapse under the additional weight of the information being processed.” The report goes on to state: “Gartner believes that through 2015, organizations integrating high-value, diverse, new information types and sources into a coherent information management infrastructure will outperform their industry peers financially by more than 20%.”
If you want to be bold in your business, employ the best information management technology so business users can continue to access your cold, old data for deeper insights.