• 7 min read

Behind the scenes of Azure Data Lake: Bringing Microsoft’s big data experience to Hadoop

Microsoft just announced Azure Data Lake, a set of big data storage and analytics services including Azure HDInsight that enables developers, data scientists and analysts to perform all types of processing and analytics on data of any size or shape and across multiple platforms and programming languages.

Microsoft just announced Microsoft Azure Data Lake, a set of big data storage and analytics services including Azure HDInsight that enables developers, data scientists and analysts to perform all types of processing and analytics on data of any size or shape and across multiple platforms and programming languages. (See Azure Data Lake landing page, Azure Data Lake signup page, Azure Data Lake blog.)

Microsoft is a data-driven company that has been using big data extensively for many years, and we now operate some of the largest big data services in the world. Our Cosmos service manages exabytes of diverse data (ranging from clickstreams and telemetry to documents, multimedia and tabular data) in clusters that each span in excess of fifty thousand machines. Thousands of developers within Microsoft have been using Scope, a SQL-like language for scale-out data processing based on Dryad for optimizing joins. Scope naturally supports Map-Reduce style programming through an enhanced GROUP BY operator that allows C# code to reduce partitions. It also features beautiful developer tools in Visual Studio that greatly aid programmer productivity; see this video.

Hundreds of thousands of jobs, and even more interactive queries, are run every day on the same shared environments, driving hundreds of PBs of daily I/O. Our internal customers are Microsoft businesses, ranging from very large (and well known!) users such as Bing, Office 365, Skype, Windows, and Xbox Live to numerous small and medium-sized groups within the company.

On a personal note, I came to Microsoft three years ago from Yahoo!, where I had worked closely with Hadoop and many open source tools, and wanted to bring the power of this ecosystem to Microsoft’s big data efforts.  When I came here and saw how people were using Cosmos and Scope, the thing that struck me was how productive engineers and analysts were. The ease of use of the tools and the extent to which people shared data, made possible by the scalable environments that enabled many large datasets to be operated on by many users, were central to this productivity and experimentation.  Ever since, our goal has been to marry the productivity of Microsoft’s internal ecosystem with the openness and flexibility of the Hadoop ecosystem. This has been an amazing journey, and a great opportunity to learn what real customers want when they use big data platforms and services.  We are bringing our experience in building, running (and using!) all these platforms and services to the new Azure Data Lake.

An important development in recent years is that Microsoft has committed to open source software, in particular Apache Hadoop. Internally, many groups use Hadoop, including Hive, HBase, Kafka, Storm, and Spark, with Apache YARN underneath.

Externally, we offer the HDInsight managed Hadoop service, which is one of the most popular and rapidly growing Azure services. The power of the Hadoop ecosystem flows from the ability to access all your data via HDFS through a variety of analytic tools that run on top of a shared environment. The broad adoption of HDFS and YARN for standardized storage and resource management is a key to Hadoop’s extensibility. By using the interfaces to these components as “sockets” to plug into, a wide array of analytic engines has emerged, built as “application masters” or AMs in the YARN framework. , now part of Azure Data Lake.

Azure Data Lake is built to be part of the Hadoop ecosystem, using HDFS and YARN as key touch points. The Azure Data Lake Store is optimized for Azure, but supports any analytic tool that accesses HDFS. Azure Data Lake uses Apache YARN for resource management, enabling YARN-based analytic engines to run side-by-side. The Azure HDInsight managed Hadoop service already supports the most widely used open source engines, including Hive, HBase, Storm and Spark. We are also introducing a new query language called U-SQL, as part of the Azure Data Lake Analytics service. It is aligned with T-SQL from a language perspective, and is designed to additionally support Map-Reduce style execution of user code through a generalization of SQL’s GROUP BY construct. U-SQL runs natively on YARN as an AM, just like any other analytic engine from the Hadoop ecosystem. Thus, users can run U-SQL right alongside Hive, Spark, Storm, etc., on data stored in the Azure Data Lake! Customers can securely store any data in the Azure Data Lake Store, and mix and match the right analytic tools for their application from OSS as well as from vendors (Microsoft included) as long as the tools run as part of the Hadoop ecosystem, plugging into HDFS and YARN.

Members of the big data team at Microsoft have been contributing to the development of YARN since the inception of the Apache YARN project, and several of our code contributions now ship as part of Apache YARN releases.  With YARN being increasingly deployed within the company, we are continuously enhancing YARN in order to meet our growing needs, for example, the much greater scale and sharing that must be supported in our environments. We are bringing our learnings back to the Hadoop community, and working with them in the spirit of OSS by initiating or contributing to several Apache projects related to YARN, some of which I mention below, together with the Microsoft scenarios that motivated them.

One of our initial contributions to YARN was support for work-conserving preemption (see YARN-45), which is especially important for long-running jobs such as we find commonly in our larger environments. Previously, the YARN capacity scheduler was unable to provide high utilization while guaranteeing cluster sharing invariants. The scheduler had to leave resources fallow, adversely affecting utilization. Adding preemption support to the scheduler addressed this problem. This feature has been part of YARN since the Hadoop 2.1 release and has since become a widely deployed YARN feature.

Rayon (see YARN-1051) is a resource reservation component that ships with the Hadoop 2.6 release, motivated by our sharing of production clusters with experimentation. As workload consolidation is becoming commonplace, the YARN resource manager required enhancements to support allocation guarantees for production jobs that share the cluster with best-effort jobs. Rayon allows users/operators to reserve resources for production jobs and thereby obtain predictable performance.

Mercury (see YARN-2877) and Tetris (see YARN-2745) enhance the YARN scheduler. Mercury addresses low-latency allocations and utilization of otherwise idle resources to improve cluster throughput, reflecting some aspects of Scope job scheduling. Tetris uses bin-packing algorithms to compactly fit tasks on machines and thereby improves utilization. We are working closely with the community (through discussions, code reviews, etc.) to get our code accepted into Apache YARN.

Given our extensive development on YARN, we conceived and developed REEF at Microsoft to ease building of Java and C# applications on YARN and other resource managers. It has since been incubated at Apache and attracted numerous committers from several academic and commercial organizations.

The much larger cluster sizes typical in Microsoft (recall that Cosmos clusters can be larger than 50,000 nodes; Hadoop clusters elsewhere are an order of magnitude smaller) have also motivated us to start an effort on scale-out resource management (see YARN-2915). Apache YARN has been deployed in production settings to manage clusters that are up to 6,000 nodes in size. To scale YARN to clusters of datacenter scale, i.e., manage hundreds of thousands of nodes, we are pursuing a federation-style approach. That is, individual sub-clusters are about 6,000 nodes in size—allowing us to leverage the community’s work on building robust YARN clusters of that size—and are composed together to form larger YARN clusters. Federation presents the entire cluster fleet of computers as a single system that is API-compatible with standard YARN. In particular, this means that individual jobs can span the entire federation. We are currently working on the Federation implementation on a separate branch in Apache. All discussions and code reviews happen on this branch, which allows us to openly engage with the OSS community.

We have shared our work widely through papers and presentations at academic and industry conferences, and members of the Microsoft big data team are among the authors of the first academic paper describing YARN, Apache Hadoop YARN: Yet Another Resource Negotiator, which won the best paper award at the ACM SoCC 2013 conference.

Microsoft’s contributions to Hadoop extend beyond YARN. Some of the other notable contributions include:

  • Hadoop on Azure and Windows: Azure HDInsight, one of the fastest growing Azure services, is the result of a collaboration between Microsoft and Hortonworks. Microsoft has contributed numerous JIRAs (listed here) enabling Hadoop services to run on Azure and on Windows. This includes a driver for WASB (JIRA available here) that allows Hadoop applications to access data stored on Windows Azure Store.

  • Hive and ORC: We proposed, designed and jointly (with Hortonworks) developed the vectorization support in Hive. In addition, Microsoft engaged with the community to design the Optimized Row Columnar (ORC) format.

  • OAuth2 support in WebHDFS: With the goal of making cloud-based big data stores accessible via WebHDFS, we have added OAuth2 support as part of the change detailed here. It is expected to be available with Apache Hadoop release 2.8.

  • Spark Kernel for Jupyter: We worked with the Jupyter community on a proposal to decouple Jupyter IPython `notebooks from Spark clusters. Remote Spark execution through a REST endpoint enables support for all languages (Scala, Python, R) from the Jupyter notebook.  Spark is the first back end provided by this proposal, and it can easily be extended to other big data back ends.

In summary, Azure Data Lake builds on our experience building and operating big data platforms and services at Microsoft, including Cosmos, Scope, Hive, Spark and IPython notebooks. To give customers their choice of analytic tools with open interfaces, we have architected and built Azure Data Lake as a part of the Hadoop ecosystem, building on HDFS and YARN. I would like to take this moment to thank all the people in Microsoft and the open source community who have contributed to these incredible software systems, and advanced our collective ability to get the most out of our increasingly rich and expanding data collections.

Please take a dip in our Data Lake, and give us your feedback!