Mark Smith's Analyst Perspectives

Hadoop Gets Easier with Cloudera Version 3

Posted by Mark Smith on Nov 28, 2010 2:45:19 PM

Managing large volumes of enterprise data continues to challenge IT organizations as they administer and store no longer just terabytes but now petabytes of data, with costs increasing accordingly. Data of this size complicates the underlying issues of where and how to store it on low-cost hardware and manage it efficiently. One attempt at a solution is Hadoop, an open source community-based project. It began as part of Yahoo and was led by Doug Cutting, who applied the MapReduce concepts for large-scale distributed computing alongside a distributed file system. Yahoo itself runs the largest deployment of Hadoop. Doug Cutting is not new to the open source world, having been involved in the creation of Lucene, an open source search technology, among many other open source community projects.

The Hadoop Distributed File System (HDFS) spreads segments of data across servers and replicates them. Through this replication across servers it achieves built-in fail-over capability that does not require a redundant array of independent disks (RAID). This technology is recognized across the industry, and many large software companies, including IBM, have announced intentions to support it. Now there is a need for a commercial, licensed version of Hadoop, as organizations want more than just a file system; they require an entire management system to ensure it can operate in production like other databases. This is where Cloudera comes in. In 2008 Doug Cutting saw an opportunity to build a software company around Hadoop that provides licensed and supported versions as well as services and training. The company acquired venture financing and now has reference customers that include Bank of America and Samsung.
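To illustrate why replication across servers can stand in for RAID, here is a minimal sketch in Python. It is not how HDFS is implemented: the round-robin placement, the node names and the function names `place_blocks` and `readable_blocks` are all assumptions made for illustration, and real HDFS placement is more sophisticated. The replication factor of 3 is likewise just a parameter of the sketch.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin).

    Hypothetical placement policy for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed number of nodes")
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes, wrapping around the end of the list.
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the set of blocks that still have a replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical servers
    placement = place_blocks(8, nodes, replication=3)
    # Losing one server leaves every block readable from another replica.
    survivors = readable_blocks(placement, failed_nodes={"node2"})
    print(len(survivors), "of", len(placement), "blocks still readable")
```

In this sketch data becomes unreadable only when every server holding a copy of a block is down at the same time, which is the property that lets ordinary disks on low-cost servers replace a RAID array.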

Cloudera just announced Cloudera Distribution for Hadoop Version 3 at its developer summit. It brings new enterprise-class capabilities for Hadoop and incorporates 11 additional open source projects into this release. Those projects include Oozie for workflow, Pig for dataflow, Hive for SQL query and table support, Flume for streaming data, Sqoop for data integration, ZooKeeper for coordination services, and Hue, a user interface framework that provides the Cloudera Desktop. Hadoop uses MapReduce as a parallel data processing framework. You can try Hadoop easily by downloading a VMware image or spinning it up on the Amazon Elastic Compute Cloud (EC2), which is part of Amazon Web Services. Cloudera is led by an industry database veteran, Mike Olson, who has provided his perspective on the advances in version 3.
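For readers new to MapReduce, the following is a minimal single-process sketch of the model in Python, using the classic word-count example. It is an illustration under stated assumptions, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and in a real Hadoop job the map and reduce steps run as parallel tasks across many servers rather than in one loop.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big deal"]))
```

Because each map call looks at only one line and each reduce call looks at only one key, the work can be split across servers without the pieces needing to talk to each other, which is what makes the model suit very large data sets.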

The Hadoop project has brought other open source software providers to market, such as Pentaho, which is supporting it through data integration and as part of its business intelligence (BI) platform and tools. A new BI provider called GOTO Metrics has announced a platform that uses Hadoop to manage collections of information that can provide metrics and other information critical for decision support. These two vendors are not yet official partners of Cloudera but are supporting Hadoop through its interfaces to the data store.

To investigate Hadoop, you can download the beta release of version 3 from Cloudera without any hassle. I liked the review of this new release by Cutting, who personalizes both it and the underlying technology. To help you consider whether Hadoop is right for you, Cloudera offers a learning site in its developer center. Keep in mind that while Hadoop is free to download, you will have to invest time and resources in any enterprise production deployment. In that case it makes sense to purchase a support license to have access to experts who can help you in a pinch. All in all, Hadoop is an alternative approach that revolves around what is called “Big Data” and “NoSQL.” I think it will challenge not only the giant database vendors IBM, Oracle and Teradata but also newer ones like Netezza. Cloudera will try to ride the Hadoop wave, and it will be interesting to see how far the company can advance into the market with new customer deployments and expansion of its ecosystem.

Let me know your thoughts or come and collaborate with me on Facebook, LinkedIn and Twitter.


Mark Smith – CEO & EVP Research

Topics: Cloudera, Data, Information Management (IM), Strata+Hadoop

Written by Mark Smith

Mark is responsible for the overall direction of Ventana Research and drives the global research agenda covering both business and technology areas. He defined the blueprint for Information Management and Performance Management as the linking together of people, processes, information and technology across organizations to drive effective results. Mark is an expert in technology for business, from Performance Management, Business Intelligence and Analytics to Information Management, across finance, operations and IT. Mark has held CMO, product development and research roles at companies such as SAP, META Group, Oracle and IRI Software. He has experience across major industries including banking, consumer products, food and beverage, insurance, manufacturing, pharmaceutical, and retail and consumer services.