Introduction
Large enterprises at the Hadoop Summit in San Jose (June 9-11, 2015) spoke clearly about the need for management and security as they scale up their Hadoop clusters for Big Data Analytics. The scale-ups in many industry giants underscored the importance of Hadoop for gaining business insights, and finding patterns in large data repositories, some with hundreds of millions of records.
At the same time, Hortonworks, which co-hosted this annual Hadoop Summit in San Jose, CA with Yahoo!, highlighted its new product announcement on data management and security: Hortonworks Data Platform 2.3 (HDP 2.3). Key features are an updated Ambari for operations and management, Ranger for data security, and Atlas for data governance.
Management and security have been a concern for the Big Data community for several years. As Hadoop gets deployed in more mainstream applications, particularly in regulated industries, data security becomes critical for any organization considering Hadoop.
User Experiences
The conference highlighted deployment stories from large enterprises (e.g. Verizon, Home Depot, Disney) and large cloud services companies (e.g. Yahoo!, Twitter, eBay, LinkedIn). Managers at the large enterprises have moved beyond pilot programs and proof-of-concept (PoC) projects – leading to data analysis that is changing business processes, including customer service, fraud-detection and logistics/distribution. For many of these organizations, the benefits derived from Hadoop and Big Data can be easily quantified in terms of positive business returns.
Cloud service providers have arguably been leveraging Hadoop longer – often for web analytics to optimize data delivery. Both types of users – in enterprises and cloud services – are scaling up their Hadoop clusters, often to many hundreds of server nodes. By now, the cloud service provider use case is clear: Yahoo! and Hortonworks pioneered the Hadoop technology, and it is widely used at CSPs such as Microsoft, Amazon, Google, Rackspace, as well as Twitter, LinkedIn, and Facebook.
Enterprise Deployments
The emphasis in the marketplace has changed, compared with the 2014 Hadoop Summit conference, which was organized by Hortonworks with multiple conference sponsors. Last year’s conference focused on YARN as a Hadoop technology supporting multiple workloads running simultaneously on a single Hadoop cluster.
This year’s conference focused on improved operational support and data security, a clear sign of the maturation of Hadoop. Those factors support the strong impetus inside enterprises to gather corporate data into centralized “data lakes” – and to adopt Hadoop to analyze that Big Data.
Examples of successful enterprise deployments abound. A centralized repository of millions of patient records becomes more valuable when more lines of business (LoBs) can harvest information about medication protocols and treatment outcomes. Knowing which consumer-goods promotions work, and which don’t, reduces the opportunity costs of bringing new products to market. Finding the videos that will “go viral” brings more attention to a company’s marketing campaigns.
Likewise, the ecosystem around Hadoop is growing. It includes enterprise software providers (e.g., Microsoft, Oracle, SAP, SAS, Splunk, Syncsort, Tableau, Attunity and VMware); enterprise systems providers (e.g., Cisco, Dell, EMC, HP, IBM, Intel, Microsoft, Supermicro); switch companies (e.g., Arista, Brocade); and an array of smaller companies tying their market opportunities to Hadoop (e.g., Attunity, Bright Computing, DataBricks, Dataguise, Datameer, Impetus, PepperData). Interestingly, two other providers of Hadoop distributions, Cloudera and MapR, also had booths and breakout talks at the conference.
Hadoop Adoption at Enterprise Sites
The main-tent enterprise customer panel at the Hadoop Summit made these enterprise priorities clear. Executives from Home Depot (retail), Rogers (telecommunications), Schlumberger (oil and gas), and Verizon (wireless services) described the business impact of using Hadoop analytics to identify data trends rapidly and to drive business change.
Rob Smith, Executive Director of IT, Verizon Wireless, said that Verizon adds billions of records each day, and then finds deep business value in analyzing the transactional data, including social media inputs. “Now we have that holistic view of customer calls, to our call center, that may [ultimately] drive changes to a policy or processes.” Verizon sees business benefits by applying this feedback quickly and by improving business processes.
At Home Depot, timeliness and product inventory drive top-line revenue. “We’re a customer service organization,” Sam Gentsch, Manager, Information Technology, at the Home Depot retail chain, told the Day Three keynote customer panel. “We really want to get the right product at the right time and the right place. That’s what it’s all about, and Hadoop gives us a lot of insight there.”
For Schlumberger, the global oil/gas exploration giant, the entire business is data-intensive, from initial exploration to seismic analysis and global supply chain activities. To the extent that data analysis can be optimized, business processes can be refined. “Hadoop is a horizontal [technology] for us,” Anil Varma, Vice President of Data and Analytics at Schlumberger, told the customer panel. “Slowly, more and more people understand that if they have data upstream and downstream from them, they have more opportunities to optimize.”
Hortonworks Data Platform (HDP) 2.3, Management and Security
Management and security are top improvements in Hortonworks Data Platform (HDP) 2.3. This release supports Apache Ambari 2.1 – an administration, management and monitoring tool for Hadoop that provides Web-based views for Hadoop operators and developers. This is especially important for hybrid cloud environments, linking enterprise and cloud data centers. Neuralytix believes management for efficient processing in cloud-enabled environments is key to meeting enterprise SLAs and enterprise expectations.
Another Hadoop administration tool, Apache Ranger, improves security for applications and data. The HDP 2.3 release supports new levels of encryption for data at rest, and it also addresses the emerging demand for improved data governance via Apache Atlas.
Best practices will need to be developed to accompany the new functionality in Hadoop distributions. As was true for earlier generations of security technology, it’s more efficient when customers “start” with a secure environment and then “open it up,” as opposed to attempting to secure it after it has been put into production. And management changes must fit into broader enterprise-wide management frameworks for software-defined infrastructure in datacenters.
What’s Next?
Many sites using Hadoop today plan to change their traditional enterprise data warehouses in 2016 – extracting key data for analysis in Hadoop clusters, or, in some cases, replacing traditional enterprise data warehouses (EDWs) with Hadoop-centered data lakes. These two technologies will live side-by-side, because Hadoop is being used in large enterprises with substantial investments in more traditional IT infrastructure.
It is important to note that EDW and Big Data/Hadoop provide different outcomes. EDW is typically prescriptive and static; in contrast, Hadoop is generally deployed in more dynamic environments to find opportunities that one did not even know existed in the first place.
The HDP 2.3 release addresses many of these enterprise-led concerns about management and security. As Hadoop technologies mature and gain wider acceptance, Neuralytix expects more of these “best practices” for Hadoop cluster management to be developed and deployed in enterprises.
Hadoop will become more familiar to those working inside, and outside, of the datacenter. As the demographics of the IT world change, and as data scientists leverage Hadoop within enterprise lines of business (LoB), Hadoop will be more widely viewed as a catalyst of change and transformation for business planning. And it will be viewed as a logical next step for data analysis that sparks new business strategies, extending the spectrum of analytics tools already in place.
The Need for Best Practices
Neuralytix believes that the human side of analytics needs to become front-and-center in these discussions. First, companies must decide who will be “eyes on” via role-based access to the centralized data. Second, managers must set the priorities for Hadoop use among business users, such as data scientists, who can discover and identify “patterns in the data.” Finally, awareness of Hadoop analytics tools must become more generalized among business units.