May 29, 2012

Book-end Your Data Warehouse on Hadoop and Keep Your Detailed Data Longer!

By Deirdre Mahon, VP of Marketing

Probably the single biggest driver of the rapid adoption of Hadoop and MapReduce is that enterprises have seriously questioned how they can sustain the level of investment in traditional data warehouses needed to deliver fast analytics to the business. The obvious close second is how to harness all the unstructured data, or interactions, including social data, now readily available for further business insights. Keeping pace with the volume and growth is certainly one technology challenge, but storing the raw detailed data so that data scientists can figure out what is actually happening poses an even greater one. Hadoop gives us the ability to scale at low cost and, with that, the ability to keep the raw or detailed data for longer timeframes so predictive analytics can take place. Last month’s data is useful, but having access to the last five quarters of data provides much greater insight.

When you consider the amount of data being generated by enterprises today, it’s astounding to also consider that no one is throwing any of it away. When one recent customer was asked, “How long do you keep the data for?” the answer quickly came back: “From now on.” Financial institutions must keep trade data for 7 years, and with annual growth rates of 100%, year 7 is bigger than years 1-6 combined. Figuring out predictable scalability is a very real challenge, especially when your IT budget is flat.
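The arithmetic behind that last claim is easy to check. The short sketch below assumes a constant 100% annual growth rate and an arbitrary starting volume of one unit; the function name and the units are illustrative only and do not come from any customer data.

```python
def yearly_volumes(first_year_volume, annual_growth, years):
    """Return the volume generated in each year, assuming constant growth.

    annual_growth is a fraction, so 1.0 means 100% growth (doubling).
    """
    return [first_year_volume * (1 + annual_growth) ** n for n in range(years)]


# Assumption: year 1 produces 1 unit of data and volume doubles every year.
volumes = yearly_volumes(first_year_volume=1.0, annual_growth=1.0, years=7)

prior_years_total = sum(volumes[:6])  # years 1 through 6 combined
year_seven = volumes[6]               # year 7 on its own

print("Years 1-6 combined:", prior_years_total)
print("Year 7 alone:", year_seven)
```

Under these assumptions, years 1-6 add up to 63 units while year 7 alone is 64, so the newest year always slightly outweighs all retained history put together.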

Bloor conducted an interesting research project earlier this year called Database Revolution: Fit for Purpose. It was an excellent way to approach the problem and its solution: figure out exactly what you are trying to achieve for the business within the constraints you already have, whether technical, budgetary or resource bound. It seems the most common topic of the day on Big Data and Hadoop is “what are you trying to understand about this new data set and how can you derive value for the business?” Without the right business questions, technology solutions and skills in place, many will struggle.

I think we all agree that the traditional data warehouse is not going away any time soon. Let’s be honest: it has undergone a couple of decades of investment and has served well in providing near real-time analysis of what is actually happening, as well as predicting what will happen and readying the business for the unexpected. However, once the traditional warehouse reaches 20-30 TB, it becomes compromised in its ability to conduct fast analytics while at the same time ingesting volumes of new data arriving at “network speed”, unless of course you wish to throw more costly resources at the problem. Business users who have grown accustomed to rapid-response queries, reports and dashboards will quickly become dissatisfied when the warehouse is consumed with writing new data sets.

Most large enterprises have a Hadoop project underway, some further along than others, and it’s interesting to see how differently organizations approach their Big Data initiatives. For organizations that have mature BI environments in place and much investment in the data warehouse, questions arise around what to use Hadoop/MapReduce for and how the data warehouse is best leveraged going forward. In some cases, Hadoop sits alongside the data warehouse, augmenting existing BI activities; in other cases, it’s a stand-alone, newly funded project to deliver a distinct set of intelligence. For environments where data sets are very large and the velocity of data creation reaches billions of records per day, it is not feasible to load the detailed data into the warehouse because of the associated costs. Front-ending an existing warehouse with a purpose-built database that can ingest billions of transactions, and compress and de-duplicate the data to a much smaller size, while at the same time providing ongoing query access through SQL, MapReduce or a BI tool, is now a more favored approach.

Once the analysis is conducted on Hadoop/MapReduce or SQL, you can extract aggregates or sub-sets of that data and move them to the warehouse for ongoing complex analytics by business analysts. You never lose the ability to drill into the detailed data in Hadoop, so it satisfies your data scientists. Additionally, when the data warehouse reaches capacity or much of the data is less frequently accessed, you can offload on the “other book-end” to a dedicated environment for the history. You avoid putting anything on tape, which is highly risky, and you can even reinstate the data to the warehouse down the road, should the need arise.
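As a rough illustration of the "extract aggregates" step, the sketch below rolls a handful of detailed records up into per-day summaries of the kind that might be moved to the warehouse, while the detail stays behind for drill-down. The record layout (trade date, account, amount) and the function name are hypothetical, and plain Python stands in for whatever SQL or MapReduce job would do this at real scale.

```python
from collections import defaultdict

# Hypothetical detailed records kept in the front-end store:
# (trade_date, account_id, amount)
detail_records = [
    ("2012-05-28", "A1", 100.0),
    ("2012-05-28", "A2", 250.0),
    ("2012-05-29", "A1", 75.0),
]


def daily_aggregates(records):
    """Roll detailed records up into one summary row per trade date."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for trade_date, _account_id, amount in records:
        totals[trade_date]["count"] += 1
        totals[trade_date]["total"] += amount
    return dict(totals)


# Only the compact summary would be loaded into the warehouse;
# detail_records remain available for later drill-down.
summary = daily_aggregates(detail_records)
for trade_date in sorted(summary):
    print(trade_date, summary[trade_date])
```

The point of the design is that the warehouse receives a few summary rows per day instead of every transaction, which keeps it small and responsive.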

Cost aside, you want to give the business what it needs: access to the right data at the right time in order to derive the best business value. SQL access on your Hadoop/MapReduce environment has become a key requirement for today’s bank, telco or government institution. Fast ingestion of structured and semi-structured data in the realm of tens of billions of daily records is also a key requirement, and that is a very realistic number for most large banks and communications providers today. I believe fit for purpose is the only way to manage enterprise Big Data so business users can continue to conduct fast analytics. If you can achieve this and control costs, you will realize big gains.