At Microsoft, like many companies using data for competitive advantage, opportunities for insight abound and our analytics needs were scaling fast – almost out of control. We invested in Yet Another Resource Manager (YARN) to meet the demands of an exabyte-scale analytics platform and ended up creating the world’s largest YARN Cluster.
In big data, how big is really big?
Yarn is known to scale to thousands of nodes, but what happens when you need to tens of thousands of nodes? The Cloud & Information Service at Microsoft is a highly specialized team of experts that work on applied research and science initiatives focusing on data processing and distributed systems. This blog explains how CISL and the Microsoft Big Data team met the challenge of complex scale and resource management – and ended up implementing the world's largest YARN cluster to drive its exabyte-sized analytics.
For more than a decade Microsoft has depended on internal version of the publicly available Azure Data Lake for its own super-sized analytics. The volume of data and complexity of calculation has caused it to scale to several larger clusters. To the best of our knowledge, Microsoft is currently running the largest Yarn cluster in the world at over 50 thousand nodes in a single cluster. Exabytes of data are processed daily from business units like Bing, AdCenter, MSN and Windows live. More than 15,000 developers use it across the company.
Microsoft’s journey began in 2007 with a manageably small number of machines. Today, in 2018, the system operates on hundreds of thousands of machines. It boasts individual clusters of up to 50 thousand nodes with more than 15,000 developers innovating on data that is counted in exabytes – and growing.
The challenges of scale
More users means multiple frameworks on same cluster
As a big data users base matures, users want to run more and more diverse applications on a shared data and compute infrastructure. If you build silos, you destroy utilization. MSFT bet on YARN from the Apache Hadoop community to get around this. YARN is the resource management framework that enables arbitrary applications to share the same compute resources to access a common pool of data in a distributed file system. Since betting on YARN, Microsoft rolled up its sleeves and contributed innovations to the open source Apache Hadoop community, which lead to several Microsoft employees to become committers and PMC members for the project.
Scale matters – use larger clusters to avoid fragmentation
As data and application volumes grow, scale becomes an issue. The inexperienced solution to meet demand is to build more smaller clusters. This, like the ideas of silos for different applications, is unhealthy, slow and error prone because resources become fragmented, harming utilization, and data must be continuously copied across clusters. Best practice is to consolidate work into a small number of large clusters. This is what motivated Microsoft to embrace YARN and extend it to support scalability to data center sized clusters (tens of thousands of nodes).
Microsoft contributed to YARN a feature called "YARN Federation" [1-3], which allows us to operate multiple smaller clusters (each between 2-5 thousand nodes), and tie them together in a way that is transparent to users, and enables sophisticated load balancing via policies.
Squeezing the last ounce of power – no core left behind
In a standard cluster the central resource manager assigns resources. The process of centralized resource negotiation has inherent inefficiencies due to its heartbeat-based communications. The heart beats are non-variable rhythm at which commands can be executed. While this allows large scale coordination of an execution it also limits the maximum utilization of resources. The inexperienced solution is to throw more nodes in the cluster to achieve desired performance; this can incur unacceptable overheads. Since resources may become idle in between the heartbeats, Microsoft introduced in YARN the notion of opportunistic containers, inherited from our internal infrastructure. These containers are queued locally at the node and run exclusively on scavenged capacity – in other words the resources that are available in between heartbeats. Opportunistic containers also enable resource overbooking. Just like airlines overbook a flight to ensure that it leaves full, opportunistic containers do the same and ensure that idle resources are fully utilized. Since they are scavengers they can easily be preempted by higher priority containers, and re-queued – or even promoted to high priority. At Microsoft, across our fleet, we estimate benefits of using opportunistic containers to be in the order of 15-30 percent, which translates into 100s of millions of dollars savings per year. Microsoft has contributed this featured to YARN , for a deeper technical discussion see [5-6].
Taking it to the next level – use reservations to meet SLOs in a busy cluster
As your customer base matures, people rely on the system for mission critical work which means more recurrent jobs with explicit deadlines. Existing technologies don’t support this use case. That’s why in YARN Microsoft introduced reservations to support time bases SLOs.
Reservations guarantee that the resources you need will be available at the precise time you need them. Before, it was gruesome and painful for users. They had to get in line and carefully tune queue sizes and application priorities to obtain the timeliest execution of their job. Even with this they were always subject to the resources being freed up and hence could be made to wait. The system was inherently unreliable. Today, with the new “deadline reservations” capability, users get the resources they signed up for precisely at the time they need them.
With this new capability, every reservation request comes with some flexibility (time or shape). Customers are incentivized to expose more flexibility to increase their chances of being admitted. Think about this in the context of booking a hotel room for your seashore vacation. You need your room to be available during some time period (when you can get vacation…) and you have preferences about rooms size and room amenities (king, queen, …). You increase your chances of finding a hotel room by indicating your flexibility on the characteristic of the room or time period.
In the case of YARN the reservation system takes into account the flexibilities you indicate in order to provide you the most optimized resources that meet your need at the time you need them. By leveraging these flexibilities, it is able to densely pack the cluster agenda to achieve the highest possible utilization of the overall systems. This translates into lower operating costs and thus better price for users. This feature is also open-sourced by Microsoft to Apache Hadoop [7-8].
From here your journey will only grow
Wherever you are on your Big Data Analytics journey, expect it to grow. More sources of data are coming on line daily – and business analysts and data scientists are finding more ways to use that data. Efficiently operating Big Data infrastructure is hard – but we are making it easier for you.
Microsoft invests continually in making its systems efficient. These efforts translate into benefits for the larger community in three ways:
- This massive analytics system is used every day to drive improvement in to Microsoft products and services.
- The techniques exposed here benefit the developers and analysts who use the Azure services to provide critical insights to their own businesses.
- And lastly, anyone using YARN benefits from the enhancements Microsoft has contributed to the open source community.
All of the contributions referenced above are available to the public in any release since Apache Hadoop 2.9.
 The initial YARN Federation effort is tracked in JIRA
 Ongoing extensions to Federation are tracked in JIRA
 "Hydra: a federated resource manager for data-center scale analytics", to appear in NSDI 2019
 Opportunistic tokens extensions are tracked in JIRA: Extend YARN to support distributed scheduling and Scheduling of OPPORTUNISTIC containers through YARN RM
 Efficient Queue Management for Cluster Scheduling Jeff Rasley, Konstantinos Karanasos, Srikanth Kandula, Rodrigo Fonseca, Sriram Rao, Milan Vojnovic European Conference on Computer Systems (EuroSys)
 Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters Konstantinos Karanasos, Sriram Rao, Carlo Curino, Chris Douglas, Kishore Chaliparambil, Giovanni Matteo Fumarola, Solom Heddaya, Raghu Ramakrishnan, Sarvesh Sakalanaga USENIX Annual Technical Conference (USENIX ATC'2015) USENIX – Advanced Computing Systems Association July 1, 2015
 YARN Admission Control/Planner: enhancing the resource allocation model with time
 "Reservation based scheduling: if you are late don't blame us!", SoCC 2014