You are currently browsing the tag archive for the ‘Hadoop’ tag.
I had the pleasure of attending Cloudera’s recent analyst summit. Presenters reviewed the work the company has done since its founding six years ago and outlined its plans to use Hadoop to further empower big data technology to support what I call information optimization. Cloudera’s executive team has the co-founders of Hadoop who worked at Facebook, Oracle and Yahoo when they developed and used Hadoop. Last year they brought in CEO Tom Reilly, who led successful organizations at ArcSight, HP and IBM. Cloudera now has more than 500 employees, 800 partners and 40,000 users trained in its commercial version of Hadoop. The Hadoop technology has brought to the market an integration of computing, memory and disk storage; Cloudera has expanded the capabilities of this open source software for its customers through unique extension and commercialization of open source for enterprise use. The importance of big data is undisputed now: For example, our latest research in big data analytics finds it to be very important in 47 percent of organizations. However, we also find that only 14 percent are very satisfied with their use of big data, so there is plenty of room for improvement. How well Cloudera moves forward this year and next will determine its ability to compete in big data over the next five years.
Cloudera’s technology supports what it calls an enterprise data hub (EDH), which ties together a series of integrated components for big data that include batch processing, analytic SQL, a search engine, machine learning, event stream processing and workload management; this is much like the way relational databases and tools evolved in the past. These features also can deal with the types of big data most often used, according to our research: 40 percent or more use five types, from transactional data (60%) to machine data (42%). Hadoop combines layers of the data and analytics stack from collection, staging and storage to data integration and integration with other technologies. For its part Cloudera has a sophisticated focus on both engineering and customer support. Its goal is to enable enterprise big data management that can connect and integrate with other data and applications from its range of partners. Cloudera also seeks to facilitate converged analytics. One of these partners, Zoomdata, demonstrated the potential of big data analytics in analytic discovery and exploration through its visualization on the Cloudera platform; its integrated and interactive tool can be used by business people as well as professionals in analytics, data management and IT.
Cloudera latest major release with Cloudera Enterprise 5 brought a range of enterprise advancements from in-memory processing, resource management, data management, data protection to name a few. Cloudera offers a range of product options that they announced to make it easier to embrace their Hadoop technology. Cloudera Express is its free version of Hadoop, and it provides three editions licensed through subscription: basic, flex and data hub. The Flex Edition of Cloudera Enterprise has support for analytic SQL, search, machine learning, event stream processing and online NoSQL through the Hadoop components HBase, Impala, Spark and Navigator; a customer organization can have one of these per Hadoop cluster. The Enterprise Data Hub (EDH) Edition enables use of any of the components in any configuration. Cloudera Navigator is a product for managing metadata, discovery and lineage, and in 2014 it will add search, annotation and registration on metadata. Cloudera uses Apache Hive to support SQL through HiveQL, and Cloudera Impala provides a unique interface to the Hadoop file system HDFS using SQL. This is in line with what our research shows organizations prefer: More than half (52%) use standard SQL to access Hadoop. This range of choices in getting to data within Hadoop helps Cloudera’s customers realize a broad range of uses that include predictive customer care, market risk management, customer experience and other areas where very large volumes of information can be applied for applications that were not cost-effective before. With EDH Edition Cloudera can compete directly with large players IBM, Oracle, SAS and Teradata, all of which have ambitions to provide the hub of big data operations for enterprises.
Having open source roots, community is especially important to Hadoop. Part of building a community is providing training to certify and validate skills. Cloudera has enrolled more than 50,000 professionals in its Cloudera University and works with online learning provider Udacity to increase the number of certified Hadoop users. It also has developed academic relationships to promote Hadoop skills being taught to computer science students. Our research finds that this sort of activity is necessary: The most common challenge in big data analytics processes for two out of three (67%) organizations is not having enough skilled resources; we have found similar issues in the implementation and management of big data. The other aspect of a community is to enlist partners that offer specific capabilities. I am impressed with Cloudera’s range of partners, from OEMs and system integrators to channel resellers such as Cisco, Dell, HP, NetApp and Oracle to support in the cloud from Amazon, IBM, Verizon and others.
To help it keep up Cloudera announced it has raised another $160 million from the likes of T. Rowe Price, Michael Dell Ventures and Google Ventures to add to financing from venture capital firms. With this funding Cloudera outlined its investment focus for 2014 which will concentrate on advancing database and storage, security, in-memory computing and cloud deployment. I believe that it will need to go further to meet the growing needs for integration and analytics and prove that it can provide a high-value integrated offering directly as well as through partners. Investing in its Navigator product also is important, as our research finds that quality and consistency of data is the most challenging aspect of the big data analytics process in 56 percent of organizations. At the same time, Cloudera should focus on optimizing its infrastructure for the four types of data discovery that are required according to our analysis.
Cloudera’s advantage is being the focal point in the Hadoop ecosystem while others are still trying to match its numbers in developers and partners to serve big data needs. Our research finds substantial growth opportunity here: Hadoop will be used in 30 percent of organizations through 2015 and another 12 percent are planning to evaluate it. Our research also finds a significant lead for Cloudera in Hadoop distributions, but other options like Hortonworks and MapR are growing. The research finds that the most of these organizations are seeking the ability to respond faster to opportunities and threats; to do that they will need to have a next generation of skills to apply to big data projects. Our research in information optimization finds that over half (56%) of organizations are planning to use big data and Hadoop will be a key focus for those efforts. Cloudera has a strong position in the expanding big data market because it focuses on the fundamentals of information management and analytics through Hadoop. But it faces stiff competition from the established providers of RDBMSs and data appliances that are blending Hadoop with their technology as well as from a growing number of providers of commercial versions of Hadoop. Cloudera is well managed and has finances to meet these challenges; now it needs to be able to show many high-value production deployments in 2014 as the center of business’s big data strategies. If you are building a big data strategy with Hadoop, Cloudera must be in the evaluation priority for an organization.
CEO & Chief Research Officer
Cisco Systems has announced its intent to acquire Composite Software, which provides data virtualization to help IT departments interconnect data and systems; the purchase is scheduled to complete in early August. Cisco of course is known for its ability to interconnect just about anything with its networking technology; this acquisition will help it connect data better across networks. Over the last decade Composite had been refining the science of virtualizing data but had reached the peak of what it could do by itself, struggling to grow enough to meet the expectations of its investors, board of directors, employees, the market and visionary CEO Jim Green, who is well-known for his long commitment to improving data and technology architectures. According to press reports on the Internet, Cisco paid $180 million for Composite, which if true would be a good reward for people who have worked at Composite for some time and who were substantive shareholders.
Data virtualization is among the topics we explored in our benchmark research on information management. Our findings show that the largest barrier to managing information well is the dispersion of data across too many applications and systems, which impacts two-thirds (67%) of organizations. Our research and discussions with companies suggest that data virtualization can address this problem; it is one aspect of information optimization, an increasingly important IT and business strategy, as I have pointed out. Composite has made data virtualization work in technology initiatives including business analytics and big data but overall in information architecture.
Since my last formal analysis that I wrote about Composite, its technology has advanced significantly. One area of focus had been to increase virtualization of big data technologies, realized recently in its 6.2 SP3 release furthering its support of Hadoop. Over the last two years Composite has supported distributions from Apache, Cloudera and Hortonworks, interfacing to MapReduce and Hive since its 6.1 release. In the 6.2 release in 2012, Composite added improved techniques for processing analytics and virtualizing the resulting data sets, which came not only from traditional data sources but also from HP Vertica, PostgreSQL and improved interfaces with IBM and Teradata systems. These releases and expanded data connections have made Composite’s the most versatile data virtualization technology in the industry.
As Composite Software becomes part of Cisco, it is worth remembering that acquiring technology companies has been a major part of building the company Cisco is today; the many acquisitions have expanded its product portfolio and brought together some of the brightest minds in networking technology. Less often noted is Cisco’s mixed success in acquisitions of data-related software for enterprise use. Among the examples here is CxO Systems, which had a complex event processing (CEP) and data processing technology that Cisco eventually used to advance its event processing across network technologies; the software no longer exists in stand-alone form. Another example is Latigent, a provider of call center reporting on detailed call data that was supposed to help Cisco cultivate more services; eventually this software was shown not to fit with Cisco’s technology products or services portfolio and disappeared. Cisco’s business model has been to market and sell to data center and networking environments, and it has not been successful in selling stand-alone software outside this area. It is not easy to compete with IBM, Oracle, SAP and others when it comes to technology with which Cisco lacks experience and will impact its success in monetizing the investment in purchasing Composite Software.
Emerging computing architectures will change the way IT departments will operate in the future, but it’s an open question who will lead that change from traditional IT software stacks to hardware and networking technology. Cisco plans to transform the data center and networking from software and hardware stacks, partly through virtualizing software inside premises and out into the Internet and cloud computing. We see IT architectures changing in a haphazard, vendor-driven approach while most CIOs wants to revise their architectures in an efficient manner. Data and information architectures have moved from relying on databases and applications to adapt innovative technology such as in-memory processing and grid architectures that support new trends in big data, business analytics, cloud computing, mobile technology and social collaboration. These technology innovations and pressure to use them in business are straining the backbones of enterprise networks that have to adapt and grow rapidly. Most companies today still have data centers and networking groups and are increasing their network bandwidth and buying new networking technology from Cisco and others. This will not change soon, but they need better intelligence on the networking of data in these technologies. This I think is what Cisco is looking for with the acquisition of Composite, hoping to refine the infrastructure to help in virtualization of data across the network. This approach can play well into what Cisco articulates as a strategic direction around the concept it calls the Internet of Everything (IoE). The challenge will be to convince CIOs that this energetic of a change by Cisco is necessary for improving connectivity of data across the network. Data virtualization will play a key role in efficiently interconnecting networks and systems but where this is primarily applied within the data center of networking compared to where it is centered in the stack of information management technology today.
I think the acquisition makes sense for Composite Software as it could not grow fast enough to satisfy its technological ambitions and needed more deployments to support its innovations. Cisco for its part will be tested over time to determine what it makes of Composite. I expect that Cisco will fundamentally change its roadmap and support for enterprise IT as it adjusts the technology to work with its data center and networking technology; such a change inevitably will impact its existing customers and partners. The company will place Composite in its Services organization (which I was told includes software); I take this as an indication of more maturity needed by Cisco to bring software acquisitions into its product and services portfolios and manage as software with the right level of dedicated marketing and sales resources. For those who want to learn more about data virtualization, I recommend reading the educational content of Composite in the next couple of months before its website and material disappear from the Internet. This may sound harsh but is how companies like Cisco and others digest acquisitions and eliminate the companies past and by doing so eliminate critical information. No one is saying so and could not get an answer, but the reality is that Composite will become one item in the long product and services list on Cisco’s website; my guess is it will become part of the data center and virtualization services or unified computing services; to see how many Cisco has, just check its website.
Data virtualization is just beginning to mature and attract CIOs and IT professionals who care about the ‘I’ in their titles that stands for information. Composite Software deserves credit for its focus on data virtualization. Our research shows that data virtualization is an advancing priority in information management but ranks lower than others among initiated projects (11%), as the chart shows. Composite’s technology arrived ahead of the full maturing of IT organizations limiting its full potential. Within Cisco its people will have to adapt to the new owner’s wishes, which will diffuse some of its database-centered virtualization to focus more on the network, unless Cisco decides to expand its and Composite’s combined R&D team.
Congratulations to Composite Software for its innovative contributions and sad to see its independence and passion for data virtualization to disappear. I will be watching to see if Cisco understands the company’s real value, not just for data centers and networking but for bridging to enterprise IT and the market for information optimization.
CEO & Chief Research Officer