By Deirdre Mahon, VP of Marketing
It’s interesting to watch how large enterprises roll out Hadoop and its related technology stack to give the business access to rich new data. Combining that new data with existing data, which predominantly lives in a data warehouse, is by no means an easy challenge for either the business or the IT group.
Before Hadoop, storing and managing all the raw detailed data was impossible for a number of reasons: depending on the speed of data creation, ingesting or writing to the warehouse is technically challenging, and with growth rates above 50% per year you hit scale issues within a few years. Today’s relational data warehouses are ideal for deep analytics against data from multiple source systems, typically accessed by business analysts who need to track trends and KPIs. Cost of scale is the prime economic driver, and access to multi-structured raw data is the technology enabler, behind Hadoop adoption in the enterprise. Keeping all the detailed data not only helps data scientists figure out exactly what is going on but also satisfies the business demand for better predictability, which in turn delivers competitive advantage.
Front-ending existing data warehouses with a dedicated data hub for all the raw detailed data is becoming a popular architectural approach. Business analysts want access to the EDW for various canned reports, but they also need to understand what is happening in the marketplace, where they don’t yet know the question to ask. Traditional data warehousing is built on the reverse of raw data: unleash a group of smart business analysts, spend a few years looking at the problem seven ways from Sunday, and build a model that can answer the questions you believe the business user wants to ask.
These models don’t work for everything: they rely on stars, dimensions, aggregates, and rollups, and they don’t satisfy everything the business wants and needs, namely access to the raw data to better understand the true patterns of what is really going on. Organizations are moving away from aggregating facts and statistics toward understanding real behavior as it happens. With raw detail you can see behavior, almost as it is happening. Access to raw data is Big Data.
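The contrast between a warehouse rollup and raw detail can be sketched in a few lines of Python. The event data, field names, and failed-login scenario here are entirely hypothetical, chosen only to show what an aggregate discards:

```python
from collections import Counter

# Hypothetical raw login events kept in full detail: (user, minute, outcome).
raw_events = [
    ("alice", 540, "ok"), ("bob", 541, "fail"), ("bob", 541, "fail"),
    ("bob", 542, "fail"), ("bob", 542, "fail"), ("carol", 600, "ok"),
]

# Warehouse-style rollup: one aggregate count per outcome for the day.
daily_rollup = Counter(outcome for _, _, outcome in raw_events)
print(daily_rollup)  # Counter({'fail': 4, 'ok': 2})

# Raw detail preserves behavior: the same four failures, viewed at full
# granularity, all belong to one user in a two-minute window -- a pattern
# the daily rollup has already averaged away.
fail_by_user = Counter(u for u, _, o in raw_events if o == "fail")
print(fail_by_user.most_common(1))  # [('bob', 4)]
```

The rollup answers the question the model designer anticipated ("how many failures today?"); only the raw events can answer the one nobody thought to ask.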
As your data growth continues, you scale by adding more servers to the Hadoop cluster, but security and built-in availability, not just at the hardware level but at the data level, are also critical for today’s enterprise. Accessing the data in Hadoop through both MapReduce and SQL has become a key requirement as users demand faster analysis and response times. At last week’s Hadoop Summit 2012 in San Jose, CA, there was an excellent presentation by Zions Bank about their security data warehouse running on Hadoop, where the presenter’s message was clear: keep everything! Keep all the detail, because as soon as you decide not to retain certain fields or tables, the business will need them for a specific query. Whether your driver is economic or technical, be sure to investigate your capacity and scale requirements, because Raw is More and will give you the competitive edge to understand what is truly happening.
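The MapReduce half of that access pattern can be sketched in plain Python. This is only an illustration of the map/shuffle/reduce phases on hypothetical log records; a real job would be distributed across the cluster by Hadoop’s framework rather than run in one process:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical raw log lines, kept in full detail: "user outcome".
records = ["alice ok", "bob fail", "bob fail", "carol ok"]

# Map phase: emit a (key, 1) pair per record -- here keyed by outcome.
mapped = [(line.split()[1], 1) for line in records]

# Shuffle phase: bring equal keys together, as the framework does
# between the map and reduce phases.
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the counts for each key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'fail': 2, 'ok': 2}
```

The SQL route reaches the same answer declaratively (a `GROUP BY` over the raw table); the point is that both paths operate on the full detail rather than a pre-built aggregate.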