Data is a commodity in business. To become useful information, data must be put into a specific business context. Without information, today’s businesses can’t function. Without the right information, available to the right people at the right time, an organization cannot make the right decisions, take the right actions, or compete effectively and prosper. Information must be crafted and made available to employees, customers, suppliers, partners and consumers in the forms they want it at the moments they must have it. Optimizing information in this manner is essential to business success. Yet I see organizations today focusing on investments in big data because they believe it can effortlessly bring analysts insights. That premise is incorrect.
Organizations must do everything they can to ensure that they can turn content and data into information. Just loading data into a new technology – in this case big data – won’t do the trick. The process requires people who know how to get the right data and put it into the right forms, and it is more art than science. Indeed, the challenge becomes more difficult as economic and competitive pressures make it harder to enlist qualified data scientists. Adapting to this situation – being able to act faster and smarter while relying on fewer human assets – requires properly configured, easy and immediate access to the information needed to address a business need.
The typical model in organizations – a request to an analyst team for answers, copying and pasting from existing reports and spreadsheets into a single document, or perhaps a request to the IT organization for reports or changes to how data is presented, followed by gathering other documents and information about the situation internally or on the Internet – is far too slow. In fact, our technology innovation research shows that 44 percent of organizations identify data-related tasks as barriers to spending more time on analytic ones. Organizations need their information to be available on demand to anyone authorized, in the form needed and through the channel requested, regardless of how or where it is stored. Not being able to deliver the right information degrades the quality of decisions made and actions taken and weakens the business processes on which teams and individuals rely. Our information management research found that in 67 percent of organizations data continues to be spread across too many applications and systems. Continuing down this path wastes time and resources and impedes analysts and the business in their work to assess situations and guide optimization efforts.
This challenge has increased interest in new information optimization technology, which we plan to study as part of our research agenda on big data and information management. Such technology can collect data in any format, assemble and integrate it as needed and enable individuals to access and work with it as information in the desired forms. The data can take any of a wide array of forms, structured or not, and may originate in a report or a document or be stored in any of a variety of business intelligence tools, applications and database systems. Information optimization technology can access that data and present it dynamically in ways simple and flexible enough for anyone to use it without training or assistance from others.
Under the hood, though, careful planning is required to satisfy the broad range of information demands both inside and outside of the enterprise. Organizations must find, evaluate and put in place the technology and platforms to assemble and present the information. This is not just a data integration challenge of moving structured data across information systems, but a broader undertaking to empower the business to capture information from many systems and sources.
An information optimization platform must be able to handle the volume of information requests from groups and individuals based on all the data residing inside and outside the organization. Moreover, as these requests are certain to increase, it must offer scalability and the potential to meet users’ growing performance needs. To satisfy the many types of users and skill levels, it must rank high for usability, offer flexibility in user interfaces and be accessible from many applications, portals and mobile devices. In short, the platform must become an effective foundation element of the organization’s information architecture and support a range of needs, from capturing, modeling and converting data while maintaining security to automating the process and sharing and distributing information.
Understanding the need for and choosing technology to support information optimization is not easy. Information optimization is a new focus in the enterprise software market, a new segment that builds on existing investments in business intelligence, reports, business applications, content and document management, information systems and information management, and benefits from recent advances in business analytics and big data to lift them to a higher level of value and use. Building on our past research on information applications and information management, we are examining what is necessary for organizations to deliver unified information faster and better than ever before.
Today, many organizations lack the skills, process and technology to optimize the use of information. Our technology innovation research found that more than half of organizations (51%) say that not having enough skilled resources is the major barrier to advancing their competencies to use new and innovative technology. Traditional IT architectures have not advanced to meet the business needs for information optimization solutions today, and since information optimization is not easy, organizations are facing challenges in identifying the right types of integration technologies, adapting them to their particular needs and assembling information for business use.
The advent of big data technologies such as Hadoop provides new and rich repositories that can be exploited if technology is designed to access them. Organizations can choose from a vast array of new big-data-oriented technologies, including Hadoop, in-memory databases, data appliances and RDBMSes. In addition, it is essential to access information from across the Internet, including information on customers, feeds from distributors and suppliers that are specific to an organization’s industry, and even online systems and applications operating in cloud computing environments. Organizations must apply the right level of security to ensure that data is protected and made available only to those authorized to see it. All this must happen with information that is not static but proceeds along a workflow and must be available when people need to collaborate. Also, information should be optimized to make it accessible from mobile technology and easily found via search technology.
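To illustrate the kind of programmatic access this implies, here is a minimal sketch of reading a file from HDFS through the WebHDFS REST interface that Hadoop exposes. It assumes WebHDFS is enabled on the cluster; the host name and file path are hypothetical placeholders.

```python
# Minimal sketch: read a file from HDFS via the WebHDFS REST API.
# Assumes WebHDFS is enabled; host, port and path are hypothetical.
import requests

NAMENODE = "http://namenode.example.com:50070"  # hypothetical namenode address
HDFS_PATH = "/data/feeds/suppliers/part-00000"  # hypothetical file in HDFS

# op=OPEN returns the file content; the namenode redirects the request
# to a datanode, and requests follows that redirect automatically.
resp = requests.get(NAMENODE + "/webhdfs/v1" + HDFS_PATH, params={"op": "OPEN"})
resp.raise_for_status()
print(resp.text[:500])  # show the first 500 characters
```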
Organizations are taking dramatically different approaches to optimizing the use of information to make it available on demand from any source at any time. Designing more effective information processes is critical to maturing an organization’s approach to information optimization and its use of the rich content, data, documents and reports that are key elements of enterprise information. Considering the benefits of using technology to enable information optimization and taking advantage of big data and existing investments is critical, and our technology innovation research finds that business improvement initiatives are leading the way in 60 percent of organizations.
We have announced new research on information optimization that will search out and document emerging best practices from early adopters and will determine the extent to which companies have adopted or plan to implement them. We will investigate the IT infrastructures and information-related needs of organizations’ business areas to help establish the real value and business case for investment in information optimization.
CEO and Chief Research Officer
The big-data landscape just got a little more interesting with the release of EMC’s Pivotal HD distribution of Hadoop. Pivotal HD takes Apache Hadoop and extends it with a data loader and command center capabilities to configure, deploy, monitor and manage Hadoop. Pivotal HD, from EMC’s Pivotal Labs division, integrates with Greenplum Database, a massively parallel processing (MPP) database from EMC’s Greenplum division, and uses HDFS as the storage technology. The combination should help organizations realize a key part of big data’s value: information optimization.
Greenplum and EMC have been working with Hadoop technology to provide robust database and analytic technology offerings. EMC is using Hadoop and HDFS as a foundation to support a new generation of information architectures, on top of which the company provides a value-added layer of data and analytic processing to support a range of big data needs. The aim is to address one of the benefits of big data technology, which is to increase the speed of analysis; our big data benchmark research found that to be a key benefit for 70 percent of organizations.
EMC is placing a bet by building its distribution on top of Apache Hadoop 2.0.2, which has yet to be officially released. The company is testing its software on a thousand-node cluster to ensure it will be ready. While EMC calls Pivotal HD the most powerful Hadoop distribution, it is one of many providers building on Hadoop technologies and commercializing them for organizations looking for direct support and services or for value-added technology on top of Hadoop. Oddly, however, EMC’s new offering appears to compete with its own licensing of MapR for a product it calls Greenplum MR.
EMC has given the advanced database processing technology that comes with Pivotal HD a new name: HAWQ. It provides the ability to run ANSI SQL against big data in an optimized manner through a query parser and optimizer, with its own HAWQ nodes executing queries against HDFS data nodes. HAWQ also has its own Xtension Framework for adaptability to other technologies. HAWQ improves upon the performance of regular SQL on Hadoop because it is a specialized technology that manages distributed, optimized queries against data in Hadoop.
By supporting SQL as the language to get to Hadoop, HAWQ simplifies and standardizes access to big data, providing query optimization through its query planning and pipelining methods. Providing a SQL interface and an ODBC connection is not new; many Hadoop distributions now provide ODBC connectivity, including Cloudera, Hortonworks and MapR. EMC, however, uses its optimized query engine and SQL connection in HAWQ as an accelerator, which lets it stack its software technology up against any data and analytic technology, not just Hadoop. The question for organizations considering an investment in this approach is whether they are limiting their access to future Hadoop advancements by investing in HAWQ technology that operates only with the Pivotal HD distribution, or whether the gains provide enough immediate value to offset any challenges in optimizing their Hadoop infrastructure. It is my belief that an organization that adopts the HAWQ path will need to invest in an information architecture that includes integration technology at the HDFS level, as businesses will inevitably be operating against varying flavors of Hadoop.
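Because HAWQ presents itself through standard SQL and ODBC, any ODBC-capable client can query data in HDFS as if it were a relational table. Here is a minimal sketch in Python using the pyodbc library, assuming a HAWQ ODBC driver has been configured; the data source name and table are hypothetical placeholders.

```python
# Minimal sketch: query HAWQ over ODBC with ordinary ANSI SQL.
# The DSN and table name are hypothetical; assumes a configured HAWQ ODBC driver.
import pyodbc

conn = pyodbc.connect("DSN=hawq_cluster")  # hypothetical ODBC data source
cursor = conn.cursor()

# Standard SQL; HAWQ parses, optimizes and executes it against HDFS data nodes.
cursor.execute("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM sales_orders
    GROUP BY region
    ORDER BY revenue DESC
""")
for region, orders, revenue in cursor.fetchall():
    print(region, orders, revenue)

conn.close()
```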
Another area of differentiation EMC promises for HAWQ is performance. EMC claims dramatic performance improvement using its query optimizer and SQL versus using Hive to access HDFS or Cloudera Impala and native Hadoop; in fact, it claims 19 to 648 times faster performance on its own benchmark. Since these benchmarks were not run independently, it is hard to place significant value in them for now. I made inquiries to several Hadoop software providers, including Cloudera, and they said these metrics are probably not accurate and invited performance comparisons against their technologies. Clearly these benchmarks should have been released to the Hadoop community so its members could design optimized queries using Hive for more accurate comparisons, but EMC is hoping that its results will entice IT professionals to try it for themselves.
EMC’s stature in the market and its work with a broad range of technology partners make it an important player in big data. Tableau Software is one of those partners, providing discovery and analytics on data from HAWQ and Pivotal HD. Cirro also announced support for Pivotal HD, enabling a new generation of what I call big data integration. These partnerships give EMC a more complete, enterprise-ready stack of big data technologies, from analyst tools to connectivity to other data sources.
EMC can deliver its big data technology through a variety of deployment methods, including public cloud with OpenStack and Amazon Web Services (AWS), private cloud using VMware, and on-premises. Our big data research shows faster growth planned for hosted (59%) and software-as-a-service (65%) deployments than for on-premises ones. While EMC cannot publicly name its customer references, and I have yet to validate them, the company says they include some of the largest banks and manufacturers.
Meanwhile, the Hadoop community’s new project Tez provides an alternative that bypasses MapReduce to improve performance, using Hadoop YARN for a more efficient runtime and better query performance. Also, the Stinger Initiative is a project to improve Hive’s support for interactive queries.
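For context on what these projects target, here is a minimal sketch of the kind of interactive Hive query they aim to speed up, issued from Python with the PyHive client. The host and table names are hypothetical placeholders, and the engine setting assumes a Hive version that supports Tez.

```python
# Minimal sketch: issue a Hive query of the kind projects such as Tez and
# Stinger aim to speed up. Host and table names are hypothetical placeholders.
from pyhive import hive

conn = hive.connect(host="hiveserver.example.com", port=10000)
cursor = conn.cursor()

# On Hive versions that support Tez, the execution engine can be switched
# per session so the same query bypasses MapReduce (an assumption here).
cursor.execute("SET hive.execution.engine=tez")

cursor.execute("""
    SELECT region, COUNT(*) AS orders
    FROM sales_orders
    GROUP BY region
""")
print(cursor.fetchall())

conn.close()
```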
EMC acknowledges the open source efforts focused on improving the performance of accessing HDFS and looks forward to those advancements and to incorporating them into its Pivotal HD product, but it points to its query optimizer and ANSI SQL support as a better approach. It also did not deny that its performance comparisons could have been better optimized. But EMC is betting that its HAWQ efforts and its reliance on the next release of Apache Hadoop 2, open source technology expected to be released in 2013, will place it in a good market position.
This move to introduce Pivotal HD Enterprise and HAWQ is clearly an opportunity to accelerate EMC’s efforts. Greenplum’s technology needed assistance to grow its adoption as it competes with approaches that encompass not only Hadoop but also in-memory, appliance and RDBMS technology. Only time will tell how EMC’s focus on big data with Pivotal HD and HAWQ will play out. The battle among big data providers continues to be very competitive, with dozens of approaches. As each organization moves from experimentation to development to production, it must carefully determine what technology will best meet its unique needs. Organizations should evaluate HAWQ and Pivotal HD not just on the merits of performance or SQL access but on IT’s architectural and management needs, which span adaptability, manageability, reliability and usability, and on the business value of this technology compared with other Hadoop and big data approaches.
CEO & Chief Research Officer