Big data holds great promise for many organizations today, but realizing it requires technology to facilitate integration of various data stores, as I recently pointed out. Our big data integration benchmark research makes it clear that organizations are aware of the need to integrate big data, but most have yet to address it: In this area our Performance Index analysis, which assesses the competency and maturity of organizations, concludes that only 13 percent reach the highest of four levels, Innovative. Furthermore, while many organizations are sophisticated in dealing with the information itself, they are less able to handle the people-related areas, lacking the right level of training in the skills required to integrate big data. Most said that the training they provide is only somewhat adequate or inadequate.
Big data is still new to many organizations, and they face challenges in integrating big data that prevent them from gaining full value from their existing and potential investments. Our research finds that many lack confidence in processing large volumes of data: More than half (55%) of organizations characterized themselves as only somewhat confident or not confident in their ability to accomplish that task. They have even less confidence in their ability to process data that arrives at high velocity, with only 29 percent expressing confidence in that area. In dealing with the variety of big data, confidence is somewhat stronger, as more than half (56%) declared themselves confident or very confident. Assurance in one aspect is often found in others: 86 percent of organizations that said they are very confident in their ability to integrate the variety of big data are satisfied with how they manage the storage of big data. Similarly, 91 percent of those that are confident or very confident in their data quality are satisfied with the way they manage the storage of big data.
Turning to the technology being used, we find only one-third (32%) of organizations satisfied with their current data integration technology, but twice as many (66%) are satisfied with their data integration processes for loading and creating big data. A substantial majority (86%) of those very confident in their ability to integrate the needed variety of big data are satisfied with their existing data integration processes. Those that are not satisfied said the process is too slow (61%), analytics are hard to build and maintain (50%), and data is not readily available (39%). These findings indicate that making a commitment to data integration, for big data and otherwise, can pay off in confidence and satisfaction with the processes for doing it. Additionally, organizations that use dedicated data integration technology are satisfied much more often (86%) than those that don't use dedicated technology (52%).
New types of big data technologies are being introduced to meet expanding demand for storage and use of information across the enterprise. One of those fast-growing technologies is open source Apache Hadoop, along with commercial enterprise versions of it, which provides a distributed file system to manage large volumes of data. The research finds that currently 28 percent of organizations use Hadoop and about as many more (25%) plan to use it in the next two years. Nearly half (47%) have Hadoop-specific skills to support big data integration. For those with limited resources, open source Hadoop can be affordable, and adopters can automate and interface with it using SQL in addition to its native interfaces; about three in five organizations now use each of these options. Hadoop can be a capable tool for implementing big data but must be integrated with other information and operational systems.
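The distributed file system mentioned above is the core of what Hadoop provides. As a rough illustration (my sketch, not Hadoop's actual code), an HDFS-style file system cuts a large file into fixed-size blocks and replicates each block on several nodes, so data survives node failures and can be read in parallel; the block size, node names and placement policy below are simplified stand-ins:

```python
# Illustrative sketch of HDFS-style block placement (not Hadoop source code).
BLOCK_SIZE = 8          # bytes per block here; HDFS uses large blocks (e.g. 128 MB)
REPLICATION = 3         # copies kept of each block (the HDFS default)
NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster

def place_blocks(data: bytes):
    """Return a block map: block index -> (block bytes, nodes holding a replica)."""
    block_map = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        # Simple round-robin placement; real HDFS also weighs racks and node load.
        replicas = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
        block_map[idx] = (block, replicas)
    return block_map

data = b"big data needs distribution"   # 27 bytes -> 4 blocks
layout = place_blocks(data)
# Reading the blocks back in order reassembles the original file.
assert b"".join(block for block, _ in layout.values()) == data
print(len(layout), "blocks;", layout[0][1])
```

Losing any single node still leaves two replicas of every block, which is why commodity hardware suffices for very large volumes of data.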
Big data is not found only in conventional in-house information environments. Our research finds that data integration processes are most often applied between systems deployed on-premises (58%), but more than one-third (35%) are integrating cloud-based systems, which reflects the progress cloud computing has made. Nonetheless, cloud-to-cloud integration remains least common (18%). In the next year or two 20 to 25 percent of organizations plan additional support for all types of integration; those being considered most often are cloud-to-cloud (25%) and on-premises-to-cloud (23%), further reflecting movement into the cloud. In addition, nearly all (95%) organizations using cloud-to-cloud integration said they have improved their activities and processes. This finding confirms the value of integration of big data regardless of what types of systems hold it. With a growing number of organizations using cloud computing, data integration is a critical requirement for big data projects; more than one-quarter (28%) of organizations are deploying big data integration into cloud computing environments.
Because business units and processes have an intense need for big data, integration requires IT and business people to work together to build efficient processes. The largest percentage of organizations in the research (44%) have business analysts work with IT to design and deploy big data integration. Another one-third assign IT to build the integration, and half that many (16%) have IT use a dedicated data integration tool. The research finds some distrust in involving the business side. Almost one in four (23%) said they are resistant or very resistant to allowing business users to integrate big data that IT has not prepared first, and the majority (51%) resist somewhat. For more than half (58%) the IT group responsible for BI and data warehouse systems also is the key stakeholder for designing and deploying big data integration; no other option is used by more than 11 percent.
It is not surprising that IT is the department that most often facilitates big data and needs integration the most (55%). The most frequent issue arising between business units and IT is entrenchment of budgets and priorities (in 42% of organizations). Funding of big data initiatives most often comes from the general IT budget (50%); line-of-business IT budgets (38%) are the second-most commonly used. It is understandable that IT dominates this heavily technical function, but big data is beneficial only when it advances the organization's goals for information that the business needs. Management should ensure that IT works with the lines of business to enable them to get the information they need to improve business processes and decision-making, rather than settling for a more cost-effective and efficient way to store it.
Overcoming these challenges is a critical step in the planning process for big data. The research confirms my analysis that big data won't work well without integration. We urge organizations to take a comprehensive approach to big data and evaluate dedicated tools that can mitigate risks that others have already encountered.
CEO and Chief Research Officer
I had the pleasure of attending Cloudera's recent analyst summit. Presenters reviewed the work the company has done since its founding six years ago and outlined its plans to use Hadoop to further empower big data technology to support what I call information optimization. Cloudera's executive team includes co-founders who developed and used Hadoop while working at Facebook, Oracle and Yahoo. Last year they brought in CEO Tom Reilly, who had led successful organizations at ArcSight, HP and IBM. Cloudera now has more than 500 employees, 800 partners and 40,000 users trained in its commercial version of Hadoop. The Hadoop technology has brought to the market an integration of computing, memory and disk storage; Cloudera has expanded the capabilities of this open source software for its customers through unique extension and commercialization of open source for enterprise use. The importance of big data is undisputed now: For example, our latest research in big data analytics finds it to be very important in 47 percent of organizations. However, we also find that only 14 percent are very satisfied with their use of big data, so there is plenty of room for improvement. How well Cloudera moves forward this year and next will determine its ability to compete in big data over the next five years.
Cloudera’s technology supports what it calls an enterprise data hub (EDH), which ties together a series of integrated components for big data that include batch processing, analytic SQL, a search engine, machine learning, event stream processing and workload management; this is much like the way relational databases and tools evolved in the past. These features also can deal with the types of big data most often used, according to our research: 40 percent or more use five types, from transactional data (60%) to machine data (42%). Hadoop combines layers of the data and analytics stack from collection, staging and storage to data integration and integration with other technologies. For its part Cloudera has a sophisticated focus on both engineering and customer support. Its goal is to enable enterprise big data management that can connect and integrate with other data and applications from its range of partners. Cloudera also seeks to facilitate converged analytics. One of these partners, Zoomdata, demonstrated the potential of big data analytics in analytic discovery and exploration through its visualization on the Cloudera platform; its integrated and interactive tool can be used by business people as well as professionals in analytics, data management and IT.
Cloudera's latest major release, Cloudera Enterprise 5, brought a range of enterprise advancements, including in-memory processing, resource management, data management and data protection. Cloudera also announced a range of product options to make it easier to adopt its Hadoop technology. Cloudera Express is its free version, and the company offers three editions licensed through subscription: Basic, Flex and Data Hub. The Flex Edition of Cloudera Enterprise has support for analytic SQL, search, machine learning, event stream processing and online NoSQL through the Hadoop components HBase, Impala, Spark and Navigator; a customer organization can have one of these per Hadoop cluster. The Enterprise Data Hub (EDH) Edition enables use of any of the components in any configuration. Cloudera Navigator is a product for managing metadata, discovery and lineage, and in 2014 it will add search, annotation and registration on metadata. Cloudera uses Apache Hive to support SQL through HiveQL, and Cloudera Impala provides a unique interface to the Hadoop file system HDFS using SQL. This is in line with what our research shows organizations prefer: More than half (52%) use standard SQL to access Hadoop. This range of choices in getting to data within Hadoop helps Cloudera's customers realize a broad range of uses that include predictive customer care, market risk management, customer experience and other areas where very large volumes of information can be applied for applications that were not cost-effective before. With the EDH Edition Cloudera can compete directly with large players IBM, Oracle, SAS and Teradata, all of which have ambitions to provide the hub of big data operations for enterprises.
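The appeal of SQL access through Hive or Impala is that analysts can reuse familiar queries against data in HDFS. As a runnable local stand-in (my illustration, not Cloudera code; the table and columns are hypothetical), the same kind of standard aggregate query is executed here against an in-memory SQLite table instead of a cluster:

```python
# Local stand-in for a SQL-on-Hadoop query: SQLite substitutes for Hive/Impala
# so the sketch runs without a cluster. Table and column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_events (region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO web_events VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 50.0)],
)

# On a Hadoop cluster, a standard query like this could be submitted largely
# unchanged through HiveQL or Impala; here SQLite evaluates it directly.
rows = conn.execute(
    "SELECT region, SUM(revenue) AS total "
    "FROM web_events GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)  # [('east', 200.0), ('west', 50.0)]
```

The difference on a real cluster is scale, not syntax: Hive compiles such queries to distributed jobs, while Impala executes them with its own engine directly against HDFS.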
Because of its open source roots, community is especially important to Hadoop. Part of building a community is providing training to certify and validate skills. Cloudera has enrolled more than 50,000 professionals in its Cloudera University and works with online learning provider Udacity to increase the number of certified Hadoop users. It also has developed academic relationships to promote Hadoop skills being taught to computer science students. Our research finds that this sort of activity is necessary: The most common challenge in big data analytics processes for two out of three (67%) organizations is not having enough skilled resources; we have found similar issues in the implementation and management of big data. The other aspect of a community is enlisting partners that offer specific capabilities. I am impressed with Cloudera's range of partners: OEMs and system integrators, channel resellers such as Cisco, Dell, HP, NetApp and Oracle, and cloud support from Amazon, IBM, Verizon and others.
To help it keep up, Cloudera announced it has raised another $160 million from the likes of T. Rowe Price, Michael Dell Ventures and Google Ventures, adding to earlier financing from venture capital firms. With this funding Cloudera outlined its investment focus for 2014, which will concentrate on advancing database and storage capabilities, security, in-memory computing and cloud deployment. I believe that it will need to go further to meet the growing needs for integration and analytics and prove that it can provide a high-value integrated offering directly as well as through partners. Investing in its Navigator product also is important, as our research finds that quality and consistency of data is the most challenging aspect of the big data analytics process in 56 percent of organizations. At the same time, Cloudera should focus on optimizing its infrastructure for the four types of data discovery that are required according to our analysis.
Cloudera's advantage is being the focal point in the Hadoop ecosystem while others are still trying to match its numbers in developers and partners to serve big data needs. Our research finds substantial growth opportunity here: Hadoop will be used in 30 percent of organizations through 2015, and another 12 percent are planning to evaluate it. Our research also finds a significant lead for Cloudera in Hadoop distributions, but other options like Hortonworks and MapR are growing. The research finds that most of these organizations are seeking the ability to respond faster to opportunities and threats; to do that they will need a next generation of skills to apply to big data projects. Our research in information optimization finds that over half (56%) of organizations are planning to use big data, and Hadoop will be a key focus of those efforts. Cloudera has a strong position in the expanding big data market because it focuses on the fundamentals of information management and analytics through Hadoop. But it faces stiff competition from the established providers of RDBMSs and data appliances that are blending Hadoop with their technology, as well as from a growing number of providers of commercial versions of Hadoop. Cloudera is well managed and has the finances to meet these challenges; now it needs to show many high-value production deployments in 2014 as the center of businesses' big data strategies. If you are building a big data strategy with Hadoop, Cloudera should be a priority in your evaluation.
CEO & Chief Research Officer