“Continuing our Azure reliability series to be as transparent as possible about key initiatives underway to keep improving availability, today we turn our attention to Azure Active Directory. Microsoft Azure Active Directory (Azure AD) is a cloud identity service that provides secure access to over 250 million monthly active users, connecting over 1.4 million unique applications and processing over 30 billion daily authentication requests. This makes Azure AD not only the largest enterprise Identity and Access Management solution, but easily one of the world’s largest services. The post that follows was written by Nadim Abdo, Partner Director of Engineering, who is leading these efforts.” - Mark Russinovich, CTO, Azure
Our customers trust Azure AD to manage secure access to all their applications and services. For us, this means that every authentication request is a mission-critical operation. Given the critical nature and scale of the service, our identity team's top priority is its reliability and security. Azure AD is engineered for availability and security using a truly cloud-native, hyper-scale, multi-tenant architecture, and our team runs a continual program of raising the bar on both.
Azure AD: Core availability principles
Engineering a service of this scale, complexity, and mission criticality to be highly available in a world where everything we build on can and does fail is a complex task.
Our resilience investments are organized around the set of reliability principles below:
Our availability work adopts a layered defense approach: reduce the possibility of customer-visible failure as much as possible; if a failure does occur, scope down its impact as much as possible; and finally, reduce the time it takes to recover and mitigate a failure as much as possible.
Over the coming weeks and months, we will dive deeper into how each of these principles is designed and verified in practice, and provide examples of how they work for our customers.
Azure AD is a global service with multiple levels of internal redundancy and automatic recoverability. Azure AD is deployed in over 30 datacenters around the world leveraging Azure Availability Zones where present. This number is growing rapidly as additional Azure Regions are deployed.
For durability, any piece of data written to Azure AD is replicated to at least 4 and up to 13 datacenters, depending on your tenant configuration. Within each datacenter, data is again replicated at least 9 times, both for durability and to scale out capacity to serve authentication load. To illustrate: in our smallest region, at any point in time there are at least 36 copies of your directory data available within our service. Finally, writes to Azure AD are not completed until a successful commit to an out-of-region datacenter.
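As a back-of-the-envelope check on the copy counts above (a hypothetical sketch of the arithmetic, not a description of the real replication protocol):

```python
# Replica math from the paragraph above; constants are illustrative.
MIN_DATACENTERS_PER_TENANT = 4   # smallest-region tenant configuration
MAX_DATACENTERS_PER_TENANT = 13
COPIES_PER_DATACENTER = 9        # intra-datacenter replication factor

# Total copies of directory data available at any point in time.
min_total_copies = MIN_DATACENTERS_PER_TENANT * COPIES_PER_DATACENTER  # 36
max_total_copies = MAX_DATACENTERS_PER_TENANT * COPIES_PER_DATACENTER
```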
This approach gives us both durability of the data and massive redundancy: multiple network paths and datacenters can serve any given authorization request, and the system automatically and intelligently retries and routes around failures, both inside a datacenter and across datacenters.
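At a very high level, the retry-and-route-around behavior can be sketched as trying each available replica in turn; the function and stub names below are hypothetical, and the real routing is network-level and far more sophisticated:

```python
def authenticate(request, replicas):
    """Try each replica in turn, routing around failures (illustrative only)."""
    last_err = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as err:
            last_err = err  # this copy is unreachable; try the next one
    raise RuntimeError("all replicas unavailable") from last_err

# Hypothetical replicas for illustration.
def failed_replica(request):
    raise ConnectionError("datacenter offline")

def healthy_replica(request):
    return {"token": "ok", "request": request}
```

A request that hits a failed replica first is transparently served by a healthy one, which is the property the paragraph above describes.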
To validate this, we regularly exercise fault injection and validate the system’s resiliency to failure of the system components Azure AD is built on. This extends all the way to taking out entire datacenters on a regular basis to confirm the system can tolerate the loss of a datacenter with zero customer impact.
No single point of failure (SPOF)
As mentioned, Azure AD itself is architected with multiple levels of internal resilience, but our principles extend even further, requiring resilience in all of our external dependencies as well. This is expressed in our no-single-point-of-failure (SPOF) principle.
Given the criticality of our services, we don't accept SPOFs in critical external systems such as the Domain Name System (DNS), content delivery networks (CDNs), or the telco providers that transport our multi-factor authentication (MFA) traffic, including SMS and voice. For each of these systems, we use multiple redundant systems in a full active-active configuration.
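An active-active configuration means every provider carries live traffic, so losing one simply shifts its share to the others. A minimal sketch of that idea, with entirely hypothetical provider names:

```python
import itertools

class ActiveActivePool:
    """Spread traffic across redundant providers, skipping any that fail.
    Illustrative sketch only; names and interfaces are hypothetical."""

    def __init__(self, providers):
        self._providers = list(providers)
        self._cycle = itertools.cycle(self._providers)

    def send(self, message):
        # Try at most one full rotation of providers before giving up.
        for _ in range(len(self._providers)):
            provider = next(self._cycle)
            try:
                return provider(message)
            except ConnectionError:
                continue  # this provider is down; the others absorb its traffic
        raise RuntimeError("no provider available")

# Hypothetical SMS providers for illustration.
def sms_provider_down(message):
    raise ConnectionError("carrier outage")

def sms_provider_up(message):
    return "delivered:" + message
```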
Much of the work on this principle came to completion over the last calendar year. To illustrate: when a large DNS provider recently had an outage, Azure AD was entirely unaffected because we had an active-active path to an alternate provider.
Azure AD is already a massive system, running on over 300,000 CPU cores, and able to rely on the massive scalability of the Azure cloud to dynamically and rapidly scale to meet any demand. This includes both natural increases in traffic, such as a 9 AM peak in authentications in a given region, and huge surges in new traffic served by Azure AD B2C, which powers some of the world's largest events and frequently sees rushes of millions of new users.
As an added level of resilience, Azure AD over-provisions its capacity and a design point is that the failover of an entire datacenter does not require any additional provisioning of capacity to handle the redistributed load. This gives us the flexibility to know that in an emergency we already have all the capacity we need on hand.
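The design point above amounts to an N-1 capacity check: after losing any single datacenter, the remaining ones must still absorb peak load. A hypothetical sketch of that invariant (all numbers and names are illustrative):

```python
def survives_datacenter_loss(num_datacenters, capacity_per_datacenter, peak_load):
    """True when the remaining datacenters can absorb the peak load after
    any one datacenter fails over (an N-1 over-provisioning check)."""
    return (num_datacenters - 1) * capacity_per_datacenter >= peak_load
```

For example, five datacenters each provisioned for 30 units of load can lose any one of them and still serve a peak of 100 units, with no emergency provisioning required.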
Safe deployment ensures that changes (code or configuration) progress gradually from internal automation, to internal Microsoft self-hosting rings, to production. Within production, we adopt a very gradual ramp-up of the percentage of users exposed to a change, with automated health checks gating progression from one deployment ring to the next. This entire process takes over a week to fully roll out a change across production, and can at any time rapidly roll back to the last well-known healthy state.
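The ring-by-ring gating might be sketched, at a very high level, like this; the ring names and the health-check interface are hypothetical:

```python
def progress_change(rings, healthy):
    """Walk a change through deployment rings, gating each step on a health
    check; stop and back out of an unhealthy ring (illustrative sketch)."""
    deployed = []
    for ring in rings:
        deployed.append(ring)
        if not healthy(ring):
            deployed.pop()  # roll back to the last known-healthy state
            break
    return deployed

# Hypothetical ring progression, from internal automation out to production.
RINGS = ["automation", "self-host", "prod-1pct", "prod-10pct", "prod-100pct"]
```

A regression caught at the 10-percent production ring halts there, leaving the change live only in the rings that passed their health checks.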
This system regularly catches potential failures in what we call our ‘early rings’ that are entirely internal to Microsoft and prevents their rollout to rings that would impact customer/production traffic.
To support the health checks that gate safe deployment and give our engineering team insight into the health of the systems, Azure AD emits a massive amount of internal telemetry, metrics, and signals used to monitor the health of our systems. At our scale, this is over 11 petabytes a week of signals feeding our automated health monitoring systems. Those systems in turn trigger alerting to automation, as well as to our engineers, who staff the service 24/7/365 and respond to any potential degradation in availability or quality of service (QoS).
Our journey here is expanding that telemetry to provide visibility into not just the health of the services, but metrics that truly represent the end-to-end health of a given scenario for a given tenant. Our team is already alerting on these metrics internally, and we're evaluating how to expose this per-tenant health data directly to customers in the Azure Portal.
Partitioning and fine-grained fault domains
A good analogy for understanding Azure AD is the set of compartments in a submarine, each designed to be able to flood without affecting either the other compartments or the integrity of the entire vessel.
The equivalent for Azure AD is a fault domain: the scale units that serve a set of tenants in one fault domain are architected to be completely isolated from the scale units of other fault domains. These fault domains provide hard isolation for many classes of failures, such that the "blast radius" of a fault is contained within a given fault domain.
Until now, Azure AD has consisted of five separate fault domains. Over the last year we have been increasing this number, and by next summer it will reach 50 fault domains; many services, including Azure Multi-Factor Authentication (MFA), are moving to become fully isolated in those same fault domains.
This hard-partitioning work is designed to be a final catch-all that scopes any outage or failure to no more than 1/50, or about 2 percent, of our users. Our objective is to increase this even further, to hundreds of fault domains, in the following year.
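One common way to realize this kind of partitioning is to pin each tenant deterministically to a fault domain, so that a fault in any one domain can touch at most its share of tenants. A hypothetical sketch (the hashing scheme and constant are illustrative, not Azure AD's actual mechanism):

```python
import hashlib

NUM_FAULT_DOMAINS = 50  # the target described above; illustrative constant

def fault_domain(tenant_id: str) -> int:
    """Deterministically pin a tenant to one isolated fault domain."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_FAULT_DOMAINS

# Any single fault's blast radius is bounded by one domain's share of tenants.
BLAST_RADIUS = 1 / NUM_FAULT_DOMAINS  # 1/50, or about 2 percent
```

Because the mapping is a pure function of the tenant identifier, a tenant always lands in the same compartment, and a flooded compartment leaves the other 49 dry.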
A preview of what’s to come
The principles above aim to harden the core Azure AD service. Given the critical nature of Azure AD, we're not stopping there. Future posts will cover new investments we're making, including rolling out in production a second, completely fault-decorrelated identity service that can provide seamless fallback authentication support in the event of a failure in the primary Azure AD service.
Think of this as the equivalent to a backup generator or uninterruptible power supply (UPS) system that can provide coverage and protection in the event the primary power grid is impacted. This system is completely transparent and seamless to end users and is now in production protecting a portion of our critical authentication flows for a set of M365 workloads. We’ll be rapidly expanding its applicability to cover more scenarios and workloads.
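The backup-generator behavior can be sketched as a transparent fallback path; all names below are hypothetical, and the real system is far more involved than a single try/except:

```python
def authenticate_with_fallback(request, primary, backup):
    """Serve from the primary identity service; if it fails, answer from the
    fault-decorrelated backup. Callers cannot tell which path served them."""
    try:
        return primary(request)
    except ConnectionError:
        return backup(request)  # the backup "generator" takes over seamlessly

# Hypothetical services for illustration.
def primary_down(request):
    raise ConnectionError("primary unavailable")

def backup_service(request):
    return {"token": "issued-by-backup"}
```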
We look forward to sharing more on our Azure Active Directory Identity Blog, and to hearing your questions and topics of interest for future posts.