Understanding and leveraging Azure SQL Database’s SLA

Published: July 31, 2019

Principal Program Manager, Azure SQL Database

When data is the lifeblood of your business, you want to ensure your databases are reliable, secure, and available when called upon to perform. Service level agreements (SLAs) set an expectation for uptime and performance, and are a key input for designing systems to meet business needs. We recently published a new version of the SQL Database SLA, guaranteeing the highest availability among relational database services and introducing the industry’s first business continuity SLA. These updates further cement our commitment to ensuring your data is safe and that the apps and processes your business relies upon continue running in the face of a disruptive event.

As we indicated in the recent service update, we made two major changes in the SLA. First, Azure SQL Database now offers a 99.995% availability SLA for zone redundant databases in its business critical tier. This is the highest SLA in the industry among all relational database services. It is also backed by a monthly cost credit of up to 100% when the SLA is not maintained. Second, we offer a business continuity SLA for databases in the business critical tier that are geo-replicated between two different Azure regions. That SLA comes with very strong guarantees of a five-second recovery point objective (RPO) and a 30-second recovery time objective (RTO), including a 100% monthly cost credit when the SLA is not maintained. Azure SQL Database is the only relational database service in the industry offering a business continuity SLA.

The following table provides a quick side by side comparison of different cloud vendors’ SLAs.

Platform | Availability: uptime | Availability: max credit | Business continuity: RTO | RTO max credit | Business continuity: RPO | RPO max credit
Azure SQL Database | 99.995% | 100% | 30 seconds | 100% | 5 seconds | 100%
AWS RDS | 99.95% | 100% | n/a | n/a | n/a | n/a
GCP Cloud SQL | 99.95% | 50% | n/a | n/a | n/a | n/a
Alibaba ApsaraDB | 99.9% | 25% | n/a | n/a | n/a | n/a
Oracle Cloud | 99.99% | 25% | n/a | n/a | n/a | n/a

Data current as of July 18, 2019 and subject to change without notice.

Understanding availability SLA

The availability SLA reflects SQL Database’s ability to automatically handle disruptive events that periodically occur in every region. It relies on in-region redundancy of compute and storage resources, constant health monitoring, and self-healing operations that use automatic failover within the region. These operations rely on synchronously replicated data and incur zero data loss. Therefore, uptime is the most important metric for availability. Azure SQL Database will continue to offer a baseline 99.99% availability SLA across all of its service tiers, but it now provides a higher 99.995% SLA for the business critical or premium tiers in the regions that support availability zones.

The business critical tier, as the name suggests, is designed for the most demanding applications, both in terms of performance and reliability. By integrating this service tier with Azure availability zones (AZs), we leverage the additional fault tolerance and isolation that AZs provide. This in turn allows us to offer a higher availability guarantee, using compute and storage redundancy across AZs and the same self-healing operations. Because the compute and storage redundancy is built in for business critical databases and elastic pools, using availability zones comes at no additional cost to you. Our documentation, “High-availability and Azure SQL Database,” provides more details on how the business critical service tier leverages availability zones. You can also find the list of regions that support AZs in our documentation, “What are Availability Zones in Azure.”

99.99% availability means that for any database, including those in the business critical tier, downtime should not exceed 52.56 minutes per year. Zone redundancy increases availability to 99.995%, which means a maximum downtime of only 26.28 minutes per year, a 50% reduction. A minute of downtime is defined as a period during which all attempts to establish a connection failed. To achieve this level of availability, all you need to do is select the zone redundant configuration when creating a business critical database or elastic pool. You can do so programmatically using a create or update database API, or in the Azure portal as illustrated in the following screenshot.
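The downtime figures above follow directly from the SLA percentages. As a quick check, the sketch below converts an availability percentage into a yearly downtime budget, assuming a 365-day year. The function name max_downtime_minutes_per_year is ours for illustration and is not part of any Azure API.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def max_downtime_minutes_per_year(availability_percent: str) -> float:
    """Return the yearly downtime budget, in minutes, for an SLA percentage.

    The percentage is passed as a string (for example "99.995") so it can be
    converted exactly, without binary floating-point rounding.
    """
    unavailable_fraction = 1 - Fraction(availability_percent) / 100
    return float(unavailable_fraction * MINUTES_PER_YEAR)


for sla in ("99.99", "99.995"):
    print(f"{sla}% -> {max_downtime_minutes_per_year(sla):.2f} minutes per year")
```

Running it reproduces the 52.56 and 26.28 minute budgets quoted above.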

Screenshot of create or update database API in Azure portal

We recommend using the Gen5 compute generation because the zone redundant capacity is based on Gen5 in most regions. The conversion to a zone redundant configuration is an asynchronous online process, similar to what happens when you change the service tier or compute size of the database. It does not require quiescing your application or taking it offline. As long as your connectivity logic is properly implemented, your application will not be interrupted during this transition.

Understanding business continuity SLA

Business continuity is the ability of a service to quickly recover and continue to function during catastrophic events with an impact that cannot be mitigated by the in-region self-healing operations. While these types of unplanned events are rare, their impact can be dramatic. Business continuity is implemented by provisioning stand-by replicas of your databases in two or more geographically separated locations. Because of the long distances between those locations, asynchronous data replication is used to avoid performance impact from network latency. The main trade-off of using asynchronous replication is the potential for data loss. The active geo-replication feature in SQL Database is designed to enable business continuity by creating and managing geographically redundant databases. It’s been in production for several years and we have plenty of telemetry to support very aggressive guarantees.

There are two common metrics used to measure the impact of business continuity events. Recovery time objective (RTO) measures how quickly the availability of the application can be restored. Recovery point objective (RPO) measures the maximum expected data loss after availability is restored. Not only do we provide SLAs of five seconds for RPO and 30 seconds for RTO, but we also offer an industry-first 100% service credit if these SLAs are not met. That means if any of your database failover requests does not complete within 30 seconds, or any time the replication lag exceeds five seconds at the 99th percentile within an hour, you are eligible for a service credit for 100% of the monthly cost of the secondary database in question. To qualify for the service credit, the secondary database must have the same compute size as the primary. Note, however, that these metrics should not be interpreted as a guarantee of automatic recovery from a catastrophic outage. They reflect Azure SQL Database’s reliability and performance when synchronizing your data, and the speed of the failover when your application requests it. If you prefer a fully automated recovery process, you should consider auto-failover groups with an automatic failover policy, which has a one-hour RTO.

To measure the duration of the failover request, i.e. the RTO compliance, you can run the following query against sys.dm_operation_status in the master database on the secondary server. Please be aware that the operation status information is only kept for 24 hours.

-- Duration of each completed forced failover (state = 2), most recent first
SELECT DATEDIFF(s, start_time, last_modify_time) AS [Failover time in seconds]
FROM sys.dm_operation_status
WHERE major_resource_id = '<my_secondary_db_name>'
  AND operation = 'ALTER DATABASE FORCE FAILOVER ALLOW DATA LOSS'
  AND state = 2
ORDER BY start_time DESC;
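If you export the start_time and last_modify_time columns for those operations, checking them against the 30-second RTO is straightforward. The sketch below is a hypothetical helper: the failover_breaches name and the input format (pairs of datetime values) are our assumptions, and the sample rows are made up for illustration.

```python
from datetime import datetime

RTO_SECONDS = 30  # recovery time objective stated in the SLA


def failover_breaches(operations, rto_seconds=RTO_SECONDS):
    """Return (start_time, duration_seconds) for failovers slower than the RTO.

    `operations` is an iterable of (start_time, last_modify_time) datetime
    pairs, for example rows exported from sys.dm_operation_status.
    """
    breaches = []
    for start_time, last_modify_time in operations:
        duration = (last_modify_time - start_time).total_seconds()
        if duration > rto_seconds:
            breaches.append((start_time, duration))
    return breaches


# Made-up sample rows for illustration only.
sample = [
    (datetime(2019, 7, 18, 10, 0, 0), datetime(2019, 7, 18, 10, 0, 12)),
    (datetime(2019, 7, 18, 14, 30, 0), datetime(2019, 7, 18, 14, 30, 41)),
]
print(failover_breaches(sample))
```

Only the second sample row, which took 41 seconds, is reported as a breach.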

The following query against sys.dm_replication_link_status in the primary database will show replication lag in seconds, i.e. the RPO compliance, for the secondary database created on partner_server. You should run the same query every 30 seconds or less to have a statistically significant set of measurements per hour.

SELECT link_guid, partner_server, replication_lag_sec FROM sys.dm_replication_link_status
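To turn the sampled values into an hourly compliance check, one option is to compute the 99th percentile of replication_lag_sec for each hour and compare it with the five-second RPO. The sketch below assumes you have collected (timestamp, lag) pairs yourself. The function names and the nearest-rank percentile method are our assumptions, and the calculation used for the SLA itself may differ.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

RPO_SECONDS = 5  # recovery point objective stated in the SLA


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def hours_exceeding_rpo(samples, rpo_seconds=RPO_SECONDS):
    """Map each hour whose 99th-percentile lag exceeds the RPO to that lag.

    `samples` is an iterable of (timestamp, replication_lag_sec) pairs.
    """
    by_hour = defaultdict(list)
    for timestamp, lag in samples:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        by_hour[hour].append(lag)
    breaches = {}
    for hour, lags in by_hour.items():
        p99 = percentile_nearest_rank(lags, 99)
        if p99 > rpo_seconds:
            breaches[hour] = p99
    return breaches


# Made-up data: one sample every 30 seconds for an hour, two of them slow.
start = datetime(2019, 7, 18, 10, 0, 0)
lags = [1] * 118 + [9, 9]
samples = [(start + timedelta(seconds=30 * i), lag) for i, lag in enumerate(lags)]
print(hours_exceeding_rpo(samples))
```

With 120 samples per hour, the nearest-rank 99th percentile is the 119th smallest value, so a single slow sample does not flag the hour but two do.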

Combining availability and business continuity to build mission critical applications

What does the updated SLA mean to you in practical terms? Our goal is enabling you to build highly resilient and reliable services on Azure, backed by SQL Database. But for some mission critical applications, even 26 minutes of downtime per year may not be acceptable. Combining a zone redundant database configuration with a business continuity design creates an opportunity to further increase availability for the application. This SLA release is the first step toward realizing that opportunity.