Advancing service resilience in Azure Active Directory with its backup authentication service

“Continuing our Advancing Reliability blog series, which highlights key updates and initiatives related to improving the reliability of the Azure platform and services, today we turn our focus to Azure Active Directory (Azure AD). We laid out the core availability principles of Azure AD as part of this series back in 2019, so I’ve asked Nadim Abdo, Corporate Vice President, Engineering, to provide the latest update on how our engineering teams are working to ensure the reliability of the identity and access management services that are so critical to customers and partners.” - Mark Russinovich, CTO, Azure

The most critical promise of our identity services is ensuring that every user can access the apps and services they need without interruption. We’ve been strengthening this promise to you through a multi-layered approach, leading to our improved promise of 99.99 percent authentication uptime for Azure Active Directory (Azure AD). Today, I am excited to share a deep dive into generally available technology that allows Azure AD to achieve even higher levels of resiliency.

The Azure AD backup authentication service transparently and automatically handles authentications for supported workloads when the primary Azure AD service is unavailable. It adds a layer of resilience on top of the multiple levels of redundancy in Azure AD. You can think of it as a backup generator or uninterruptible power supply designed to provide additional fault tolerance while staying completely transparent and automatic to you. This system operates in the Microsoft cloud but on separate and decorrelated systems and network paths from the primary Azure AD system. This means that it can continue to operate in case of service, network, or capacity issues across many Azure AD and dependent Azure services.

What workloads are covered by the service?

This service has been protecting Outlook Web Access and SharePoint Online workloads since 2019. Earlier this year we completed backup support for applications running on desktops and mobile devices, or “native” apps. All Microsoft native apps, including Office 365 and Teams, plus non-Microsoft and customer-owned applications running natively on devices, are now covered. No special action or configuration changes are required to receive the backup authentication coverage.

Starting at the end of 2021, we will begin rolling out support for more web-based applications. We will be phasing in apps using OpenID Connect, starting with Microsoft web apps like Teams Online and Office 365, followed by customer-owned web apps that use OpenID Connect and Security Assertion Markup Language (SAML).

How does the service work?

When a failure of the Azure AD primary service is detected, the backup authentication service automatically engages, allowing the user’s applications to keep working. As the primary service recovers, authentication requests are re-routed back to the primary Azure AD service. The backup authentication service operates in two modes:

Normal mode: The backup service stores essential authentication data during normal operating conditions. Successful authentication responses from Azure AD to dependent apps generate session-specific data that is securely stored by the backup service for up to three days. The authentication data is specific to a device-user-app-resource combination and represents a snapshot of a successful authentication at a point in time.
Outage mode: Any time an authentication request fails unexpectedly, the Azure AD gateway automatically routes it to the backup service. The backup service then authenticates the request, verifies that the artifacts presented (such as the refresh token and session cookie) are valid, and looks for a strict session match in the previously stored data. An authentication response, consistent with what the primary Azure AD system would have generated, is then sent to the application. Upon recovery, traffic is dynamically re-routed back to the primary Azure AD service.
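The two modes above can be pictured with a small sketch. This is only an illustration under stated assumptions, not how the backup service is actually built: the class `BackupSessionStore`, its methods, and the in-memory dictionary are all hypothetical names invented for this example. Only the three-day window and the device-user-app-resource key come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional, Tuple

# Storage window described in the post: up to three days.
STORAGE_WINDOW = timedelta(days=3)

# Hypothetical key shape: (device, user, app, resource).
SessionKey = Tuple[str, str, str, str]


@dataclass(frozen=True)
class StoredSession:
    """Snapshot of one successful authentication (hypothetical shape)."""
    key: SessionKey
    stored_at: datetime


class BackupSessionStore:
    """Toy in-memory stand-in for the backup service's stored session data."""

    def __init__(self) -> None:
        self._sessions: Dict[SessionKey, StoredSession] = {}

    def record_success(self, key: SessionKey, now: datetime) -> None:
        # Normal mode: keep the latest successful authentication for this key.
        self._sessions[key] = StoredSession(key=key, stored_at=now)

    def find_match(self, key: SessionKey, now: datetime) -> Optional[StoredSession]:
        # Outage mode: require a strict match on all four parts of the key,
        # and only accept snapshots that are still inside the storage window.
        session = self._sessions.get(key)
        if session is None:
            return None
        if now - session.stored_at > STORAGE_WINDOW:
            return None
        return session
```

In this sketch, a request from the same device, user, app, and resource one day after a recorded success finds a match, while a request from a different device, or one made four days later, does not.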

Diagram showing clients and services such as Outlook and Exchange Online accessing tokens, including cached access tokens from the new Backup Auth service

Routing to the backup service is automatic, and its authentication responses are consistent with those usually coming from the primary Azure AD service. This means that the protection kicks in with no need for application modifications or manual intervention.
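To make the routing idea concrete, here is a minimal sketch of a fallback wrapper. It assumes a much simpler model than the real Azure AD gateway: `route_request`, `PrimaryUnavailable`, and the handler signature are hypothetical names chosen for this example, and the actual failure detection and re-routing logic is not described in this post.

```python
from typing import Callable


class PrimaryUnavailable(Exception):
    """Hypothetical signal that the primary service failed unexpectedly."""


# Hypothetical handler type: takes a request, returns an auth response.
AuthHandler = Callable[[str], str]


def route_request(request: str, primary: AuthHandler, backup: AuthHandler) -> str:
    """Try the primary first; use the backup only on an unexpected failure."""
    try:
        return primary(request)
    except PrimaryUnavailable:
        # Outage path: the caller receives a response through the same
        # interface, so the application needs no changes.
        return backup(request)
```

Because every request in this sketch tries the primary first, traffic returns to the primary as soon as it stops failing, which loosely mirrors the re-routing on recovery described above.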

Note that the priority of the backup authentication service is to keep users productive when they access an app or resource where authentication was recently granted. Requests of this type make up the majority of requests to Azure AD: 93 percent, in fact. “New” authentications beyond the three-day storage window, where access was not recently granted on the user’s current device, are not currently supported during outages, but most users access their most important applications daily from a consistent device.

How are security policies and access compliance enforced during an outage?

The backup authentication service continuously monitors security events that affect user access to keep accounts secure, even if these events are detected right before an outage. It uses Continuous Access Evaluation to ensure that sessions that are no longer valid are revoked immediately. Examples of security events that would cause the backup service to restrict access during an outage include changes to device state, account disablement, account deletion, access being revoked by an admin, or detection of a high user risk event. Only once the primary authentication service has been restored would a user with a security event be able to regain access.
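As a rough illustration of this behavior, the sketch below tracks which users the backup path should refuse. It is an assumption-laden toy, not the Continuous Access Evaluation implementation: `SecurityEvent`, `OutageAccessGuard`, and their methods are hypothetical names, and the event list simply mirrors the examples given above.

```python
from enum import Enum, auto
from typing import Set


class SecurityEvent(Enum):
    """Event types taken from the examples in the post."""
    DEVICE_STATE_CHANGED = auto()
    ACCOUNT_DISABLED = auto()
    ACCOUNT_DELETED = auto()
    ACCESS_REVOKED_BY_ADMIN = auto()
    HIGH_USER_RISK = auto()


class OutageAccessGuard:
    """Hypothetical tracker of users the backup path must not serve."""

    def __init__(self) -> None:
        self._restricted: Set[str] = set()

    def on_event(self, user: str, event: SecurityEvent) -> None:
        # Any listed security event restricts the user on the backup path.
        self._restricted.add(user)

    def may_serve(self, user: str) -> bool:
        # The backup path never lifts a restriction in this sketch; regaining
        # access is left to the primary service once it is restored.
        return user not in self._restricted
```

The design choice worth noting is that the guard has no "unblock" operation: in this model a restricted user stays restricted on the backup path for the duration of the outage.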

In addition, the backup authentication service enforces Conditional Access policies. Policies are re-evaluated by the backup service before granting access during an outage to determine which policies apply and whether the required controls for applicable policies like multi-factor authentication (MFA) have been satisfied. If an authentication request is received by the backup service and a control like MFA has not been satisfied, then that authentication would be blocked.

Conditional Access policies that rely on conditions such as user, application, device platform, and IP address are enforced using real-time data as detected by the backup authentication service. However, certain policy conditions (such as sign-in risk and role membership) cannot be evaluated in real time and are instead evaluated based on resilience settings. Resilience defaults enable Azure AD to safely maximize productivity when a condition (such as group membership) is not available in real time during an outage. The service will evaluate a policy assuming that the condition has not changed since the latest access just before the outage.

While we highly recommend that customers keep resilience defaults enabled, there may be some scenarios where admins would rather block access during an outage when a Conditional Access condition cannot be evaluated in real time. For these rare cases, administrators can disable resilience defaults per policy within Conditional Access. If resilience defaults are disabled for a policy, the backup authentication service will not serve requests that are subject to that policy’s real-time conditions, meaning those users may be blocked during a primary Azure AD outage.
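The interplay between policy conditions, controls such as MFA, and resilience defaults can be summarized in a small decision function. This is a simplified sketch under stated assumptions: the `Policy` fields, the function `evaluate_during_outage`, and the three result strings are hypothetical, and real Conditional Access evaluation involves many more conditions and controls than shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical, heavily simplified Conditional Access policy."""
    requires_mfa: bool
    # True if the policy depends on a condition that cannot be evaluated
    # in real time during an outage (for example, sign-in risk).
    uses_non_realtime_condition: bool
    resilience_defaults_enabled: bool = True


def evaluate_during_outage(
    policy: Policy,
    mfa_satisfied: bool,
    condition_matched_at_last_access: bool,
) -> str:
    """Return 'grant', 'block', or 'not_served' for a backup-path request."""
    if policy.uses_non_realtime_condition:
        if not policy.resilience_defaults_enabled:
            # Admin opted out: the backup service does not serve the request.
            return "not_served"
        # Resilience defaults: assume the condition has not changed since
        # the latest access just before the outage.
        applies = condition_matched_at_last_access
    else:
        # Sketch assumption: real-time conditions already matched.
        applies = True
    if applies and policy.requires_mfa and not mfa_satisfied:
        return "block"
    return "grant"
```

For example, a policy that requires MFA and depends on a non-real-time condition blocks a request without MFA when the condition matched at the last access, but is not served at all if resilience defaults were disabled for that policy.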

What is next?

The Azure AD backup authentication service helps users stay productive in the unlikely scenario of an Azure AD primary authentication outage. The service provides another transparent layer of redundancy to our service, running on decorrelated Microsoft cloud systems and network pathways. In the future, we will continue to expand protocol support, scenario support, and coverage beyond public clouds, and we will expand the visibility of the service for our advanced customers.

Thank you for your ongoing trust and partnership.
