Microsoft deepens its commitment to Apache Hadoop and open source analytics
By Arindam Chatterjee Principal Group Product Manager, Azure Databricks
DATAWORKS SUMMIT, SAN JOSE, Calif., June 18, 2018 – Earlier today, Microsoft Corporation deepened its commitment to the Apache Hadoop ecosystem and to its partnership with Hortonworks, which has brought the best of Apache Hadoop and open source big data analytics to the cloud. Since the start of the partnership nearly six years ago, hundreds of the largest enterprises have chosen Azure HDInsight and Hortonworks to run Hadoop, Spark and other open source analytics workloads on Azure. During this time, Microsoft has also become one of the leading committers to Apache projects, sharing with the open source community its experience running one of the largest data lakes on the planet.
Azure HDInsight is a fully managed cluster service that enables customers to process and gain insights from massive amounts of data using Hadoop, Spark, Hive, HBase, Kafka, Storm and distributed R. Azure HDInsight offers the latest Hortonworks Data Platform (HDP) distribution and related open source projects on Linux. The service is available in 26 public regions and in Azure Government Clouds in the US and Germany.
The Big Data and Hadoop community has rapidly evolved over the past few years. Azure HDInsight has supported these innovations and made them available to enterprise customers by adding support for projects like Apache Spark, Apache Kafka (for scalable data ingestion) and Hive LLAP (for data warehousing and interactive querying) to the service.
Enterprises looking to “lift and shift” their on-premises Hadoop and Spark workloads to Azure HDInsight value its industry-leading 99.9 percent availability SLA, its integration with Azure Log Analytics and Active Directory, and its automated self-healing services. With support for Virtual Network (VNET) based network isolation, transparent data encryption and a wide array of industry-standard certifications (including GDPR), Azure HDInsight is well suited to some of the most security- and compliance-focused enterprises in highly regulated industries such as healthcare, finance and banking.
For developers, data scientists and analysts, who are at the core of any big data application, Azure HDInsight offers rich development and debugging capabilities in the tool of their choice: IntelliJ, Eclipse, VSCode, Jupyter and Zeppelin notebooks, and more.
The Hadoop world has always had a rich and varied ecosystem of projects and applications supported by many commercial software providers. Azure HDInsight supports this ecosystem by making the most popular big data applications available on Azure Marketplace to customers through a simple one-click install experience.
“Open source analytics solutions are at the core of Microsoft’s analytics strategy,” said Ryan Waite, Director of Big Data Analytics for Azure. “We are continuing to invest in Azure HDInsight to make it easier and more efficient for customers to adopt Hadoop in Azure. In December 2017, we announced up to a 52 percent price reduction in Azure HDInsight and today we are delighted to reaffirm our commitment to the Hadoop community. Within Microsoft, we have not only adopted Apache YARN for our internal data lake, we have enhanced it to meet the increased scale and efficiency requirements. All of these improvements have been contributed back to the open source community, most recently in the Apache YARN 2.9 release.”
Hadoop on IaaS
While Azure HDInsight is best suited to customers looking for a PaaS-like service offering where they can focus on their applications, some customers may want to run their own Hadoop/Spark distributions or have more control over how they operate their clusters.
Azure is still the best platform for such do-it-yourself (DIY) customers. With access to a very broad set of VM SKUs, Premium Disks for performance and Reserved Instances or Low Priority VMs for cost control, customers can balance performance and cost within the same Azure environment.
Microsoft is contributing to Hadoop
Microsoft’s commitment to Apache projects in general, and to Hadoop and Spark in particular, goes beyond just making these services available on Azure. Services like Azure Data Lake Analytics and Microsoft’s largest internal data lake now run on Apache Hadoop and YARN. Microsoft’s seven committers have added nearly 200,000 lines of code to YARN, actively contributing back all the learnings and improvements needed to run YARN efficiently and reliably across data-center-scale clusters (tens of thousands of nodes). This includes leading the Apache Hadoop 2.9 release, where Microsoft contributed new capabilities such as YARN federation (dealing with scale), opportunistic containers (pushing higher cluster utilization), and deadline-based scheduling and resource prediction (enabling SLA-bound jobs), together with numerous bug fixes and performance improvements.
Get started with Azure HDInsight today
“At Johnson Controls, we use Azure HDInsight for performing real-time and batch analysis of sensor data that we collect from over 6,000 Connected Industrial Chillers deployed across the world. Azure HDInsight offers us the scalability, reliability, high-availability, performance, security and ease of deployment that we need in our production infrastructure to consistently deliver value to our customers,” said Vaidhyanathan Venkiteswaran, Platform Engineering Manager, Data Enabled Business, Johnson Controls.
Come join the many enterprises that are already using Hadoop and Spark on Azure to build a variety of applications such as batch processing, ETL, data warehousing, machine learning, IoT and more.
We hope you take full advantage of today’s announcements, and we are excited to see what you will build with Azure. Read this developer guide to learn more about implementing big data pipelines and architectures on Azure HDInsight. Stay up to date on the latest Azure HDInsight news and features by following us on Twitter at #HDInsight and @AzureHDInsight. For questions and feedback, please reach out to AskHDInsight@microsoft.com.