by John Bantleman
I recently had the privilege of attending an internal event at a major US bank on the topic of Big Data. What differentiated this from many other Big Data events was that the focus was placed on the issues and requirements surrounding the theme of “Cheap and Deep”. The challenge of storing and retrieving multi-source data over long periods of time cannot be solved if economic cost is the limiting factor in meeting analytical demand. Through time I have spent with other customers in different industries, it has become clear that “Cheap and Deep” is what everyone is looking for.

One major US retailer has a petabyte-plus data warehouse but can only store five quarters of transactional detail; to understand what happens when Christmas falls on a Saturday, it needs seven years of data. Another customer, a major US bank, has a petabyte-plus warehouse but today is forced to keep three petabytes on tape, with limited (to zero) ability to bring that data back online to support regulatory and analytical (quant) requirements. A major global investment bank is seeing trading data volumes grow at close to 100%, outstripping infrastructure and driving up costs; the resulting increase in regulatory scrutiny and requirements for deeper data analytics are further compounded by the stipulation that this data must remain available for the next 7-10 years. Homeland security requirements to capture Internet activity and broader communication records also force companies to manage petabytes of data every year. And communications providers, already among the largest data managers on the planet, now forecast a 10-100x increase in data volume once 4G and LTE are fully deployed over the coming two years.
What is clear is that the requirement to manage massive data at scale is a continuing theme. Simply deleting the data, rolling it to tape, or summarizing it doesn’t match the business imperatives, and the solutions that have traditionally been applied to these needs (data warehouses) are just too expensive.
Interestingly, these requirements are less about the unstructured-data (variety) capabilities of Hadoop and much more about the ability to scale to an unprecedented volume and velocity of data. In many cases the attraction of Hadoop is its effectively unlimited scalability, which is of genuine interest, but the lack of SQL access and other standard interfaces becomes an issue. In other areas, such as scale-out NAS and object stores, the key requirement is to leverage SQL technology on low-cost commodity storage and virtualized (cloud) servers.
Big Data is clearly about Hadoop, but it’s not only about the technology stack; the primary consideration should be the ability to store unlimited data over virtually unlimited time periods, which is what businesses now mean by “Cheap and Deep.” The availability of low-cost, massively scalable infrastructure, combined with the ability to access data in a way that satisfies all business needs and requirements, offers a completely different strategy for how data is managed, and it opens the way for organizations to store and manage data without traditional limits.