By John Bantleman, CEO
As Big Data moves into the mainstream enterprise, one major theme has emerged and is driving the next generation of Big Data solutions: the need to extend beyond the batch limitations of MapReduce and deliver high-performance (real-time) analytics. Hadoop is in effect moving from a file-based system for managing unstructured data to a data operating platform that addresses the broader needs of managing Big Data within the enterprise. A major limitation to growing adoption has been that the core data processing environment, while massively scalable, is arguably a slow batch system, requiring organizations to move what could be hundreds of terabytes to a secondary data warehouse to deliver fast analytics for the business.
RainStor has been leading this charge for quite some time now, and over the last year we have seen a number of announcements that make the shift a very clear trend. These include newer vendors such as Hadapt and, more recently, Cloudera, the Hadoop distribution leader, with its Impala announcement. They are matched by more established vendors, including Microsoft, which recently announced PolyBase. Essentially, what these products deliver (or promise to) is high-performance analytics directly against data running on the Hadoop platform; the approaches vary, but speed is the common denominator driving all of these solutions. The undercurrent across these initiatives is also the fact that enterprise database administrators have been using SQL for years: it makes no sense to re-invent what has been working well for decades, let alone ignore the reality of existing skilled resources.
So with the next generation of Big Data solutions, users will be able to move fluidly between MapReduce jobs, where they are faced with a very large data set that needs to be sifted and transformed, and full SQL-based BI tools accessing the same data. This essentially removes the need to migrate or move the data to a completely different environment to conduct business analytics. This brings key advantages: first, it leverages the low-cost scalability of Hadoop, which costs a fraction of proprietary analytical appliances; more importantly, it removes the need for disparate data environments or silos. In many of the initial use cases where Hadoop has been adopted, it serves as a pre-processing engine that feeds aggregated data into the analytical warehouse, which creates a separation between the analytical application and the detailed big data. By mashing up analytics on the same platform, there is always the ability to access the detail and perform analytics beyond the limits of the summary data that often feeds dashboards and other visualization or BI tools.
From a technical perspective, what we are talking about is essentially two access paths to the same data: MapReduce and the more conventional SQL access. Depending on the question being asked, these offer different capabilities and benefits, so bringing them together increases the power and capabilities of the Hadoop platform. The next phase will be for these access paths to converge: the ability to embed SQL within a MapReduce job or, for example, to call a MapReduce user-defined function within a SQL statement. Consider a large telco operator conducting web clickstream analytics. MapReduce would be the ideal tool for the long-running job of stitching sessions together based on customer profiles, determining which pages were viewed and for how long; a SQL query or a BI tool could then ask which customer demographic stayed on a specific web page for the longest duration. We also believe this will be extended even further, and an area we are particularly interested in is the ability to interoperate on the same data with MapReduce, SQL and even free-text search. Having the most flexible access methods is critical: depending on the query and the user's skill set, they can determine the best path and most efficient mechanism to glean the answers they are looking for. Hadoop MapReduce and SQL mashed up make the perfect enterprise match and, used correctly, might just be a heavenly thing.
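The clickstream example above can be sketched end to end in miniature: a map/reduce-style pass that sessionizes raw clicks into per-page dwell times, followed by a SQL query joining against a demographics table. This is a toy illustration of the pattern, not any vendor's actual implementation; the sample events, the `demographics` table and its columns are all invented for the sketch, and an in-memory SQLite database stands in for a SQL-on-Hadoop engine.

```python
import sqlite3

# Hypothetical clickstream events: (user_id, page, timestamp_seconds).
# In a real Hadoop job these would be read from HDFS; a tiny sample
# is inlined here to keep the sketch self-contained.
events = [
    ("u1", "/home", 0), ("u1", "/pricing", 40), ("u1", "/pricing", 130),
    ("u2", "/home", 10), ("u2", "/blog", 20),
]

def map_phase(events):
    """'Map' step: key each click by user, as a MapReduce mapper would."""
    keyed = {}
    for user, page, ts in events:
        keyed.setdefault(user, []).append((ts, page))
    return keyed

def reduce_phase(keyed):
    """'Reduce' step: per user, sort clicks by time and compute the dwell
    time on each page (time until the next click; the last click of a
    session is dropped in this simplified version)."""
    rows = []
    for user, clicks in keyed.items():
        clicks.sort()
        for (ts, page), (next_ts, _) in zip(clicks, clicks[1:]):
            rows.append((user, page, next_ts - ts))
    return rows

rows = reduce_phase(map_phase(events))

# SQL step: load the sessionized output and answer the business question
# ("which demographic stayed longest on /pricing?") with a join and an
# aggregate, as a BI tool would issue it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dwell (user_id TEXT, page TEXT, seconds INTEGER)")
conn.executemany("INSERT INTO dwell VALUES (?, ?, ?)", rows)
conn.execute("CREATE TABLE demographics (user_id TEXT, segment TEXT)")
conn.executemany("INSERT INTO demographics VALUES (?, ?)",
                 [("u1", "18-24"), ("u2", "25-34")])

top = conn.execute("""
    SELECT d.segment, AVG(w.seconds) AS avg_dwell
    FROM dwell w JOIN demographics d ON w.user_id = d.user_id
    WHERE w.page = '/pricing'
    GROUP BY d.segment
    ORDER BY avg_dwell DESC
""").fetchall()
print(top)  # the segment with the longest average dwell on /pricing
```

The point of the sketch is the division of labor: the sifting and transforming of raw detail happens in the map/reduce pass, while the declarative question is left to SQL over the same data, with no export to a separate warehouse in between.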