[Updated on 03/03/2015]
Improvement to the Estimated Recovery Time (ERT) and Recovery Point Objective (RPO) for Basic, Standard and Premium database tiers.
In this post we will continue the conversation about the business continuity scenarios and discuss the newly released Standard Geo-Replication feature of Azure SQL Database. Standard Geo-Replication allows you to configure your database to asynchronously replicate committed transactions from the primary database to the secondary in a predefined Azure region. Before diving into the details it would be useful to summarize the full range of the business continuity features now available in preview and discuss them in the context of the new service tiers we announced in April 2014. You can find the details of the specific characteristics of the service tiers in this article.
Business continuity model
In the previous post on Active Geo-Replication, Tobias already defined the business continuity challenge. As a quick refresher there are a few concepts you need to be familiar with to get the most out of this post:
- Disaster recovery (DR): a process of restoring the normal business function of the application
- Point in time restore: the ability to restore the database to a point in time in the past (within the backup retention period) in order to recover from data corruption caused by a human mistake or programmatic error
- Estimated Recovery Time (ERT): The estimated duration for the database to be fully functional after a restore/failover request.
- Recovery Point Objective (RPO): The amount of most recent data changes (time interval) the application could lose after recovery.
The following table shows the differences of the business continuity features across the service tiers:
| Feature | Basic | Standard | Premium |
|---|---|---|---|
| Point In Time Restore | Any restore point within 7 days | Any restore point within 14 days | Any restore point within 35 days |
| Geo-Restore | ERT < 12h, RPO < 1h | ERT < 12h, RPO < 1h | ERT < 12h, RPO < 1h |
| Standard Geo-Replication | Not included | ERT < 30s, RPO < 5s | ERT < 30s, RPO < 5s |
| Active Geo-Replication | Not included | Not included | ERT < 30s, RPO < 5s |
The key idea behind the Business Continuity/Disaster Recovery (BCDR) model is that an application’s performance profile drives which business continuity solution will meet its needs. If your application processes low volumes of data with a low rate of updates and the Basic tier provides the appropriate performance level, geo-restore should meet its recovery needs. If your application processes higher volumes of data, requires the higher performance levels of the Standard tier, and needs a more aggressive RPO, standard geo-replication will likely be the right disaster recovery solution. Active geo-replication is designed for applications that process very high volumes of updates and require the most advanced recovery with the lowest RPO. In this way SQL Database’s BCDR model provides increasingly robust business continuity features as you scale from Basic to Premium. And because a more read-oriented workload on a higher tier may need only a less aggressive RPO, each higher service tier also includes the BCDR features available in the lower tiers.
How is Standard Geo-replication different from Active Geo-replication?
Let’s take a closer look at the user experience and how it differs from active geo-replication.
First of all, standard geo-replication is built on the same technology as active geo-replication but is optimized for less write-intensive applications. The following key considerations were used in defining the feature: (a) it targets applications with an update rate that does not justify an aggressive disaster recovery SLA; (b) these applications need only a simple recovery workflow that does not require sophisticated monitoring logic to make a failover decision; and (c) these applications are typically cost sensitive. To meet these expectations we introduced the following simplifications:
1. One secondary database is created in a Microsoft defined “DR paired” Azure region. The list of the DR pairs can be found here.
2. The secondary is visible in the master database but cannot be directly connected to until failover is completed (offline secondary).
3. The secondary database is charged at a discounted rate as it is not readable (offline).
4. The ability to fail over is enabled by the service in the case of a long-lasting outage of the datacenter hosting the primary database. Declaring the outage may take up to one hour from the beginning of the incident. The portal will show the impacted servers as “degraded”.
5. In the case of an outage, if the application or customer doesn’t initiate the failover and the primary location has not recovered within 24 hours of the incident, all databases with standard geo-replication will automatically fail over to the secondary location.
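Under these constraints, configuring protection reduces to a single operation. A minimal sketch using the classic Azure PowerShell cmdlets the post references (the server names `myserver` and `mypartnerserver` and the database name `mydb` are placeholders; the partner server must already exist in the DR paired region):

```powershell
# Sketch: create an offline (non-readable) continuous copy of the database
# in the Microsoft-defined DR paired region. All names are placeholders.
Start-AzureSqlDatabaseCopy -ServerName "myserver" `
                           -DatabaseName "mydb" `
                           -PartnerServer "mypartnerserver" `
                           -ContinuousCopy -OfflineSecondary

# Check the replication status of the continuous copy relationship.
Get-AzureSqlDatabaseCopy -ServerName "myserver" -DatabaseName "mydb"
```

The same operations are available through the REST API and the Azure Management Portal discussed later in the post.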
Figure 1 depicts a typical configuration of Standard Geo-replication:
Figure 1. A database can have one offline secondary in the DR paired region.
As you can see, the focus of standard geo-replication is to enable database recovery from a large-scale outage of the Azure SQL Database service in a region. For that, only a subset of the capabilities of the more powerful active geo-replication is needed. However, some scenarios you could enable with the latter are not possible with standard geo-replication. The following table summarizes the differences:
| Scenario | Standard Geo-Replication | Active Geo-Replication |
|---|---|---|
| Online application upgrade | No | Yes |
| Online application relocation | No | Yes |
| Read load balancing | No | Yes |
Why would I use Standard Geo-replication instead of Active Geo-replication with a Premium database?
Premium databases can be protected by either or both standard geo-replication and active geo-replication. So when would you choose standard geo-replication over the more powerful active geo-replication? Standard geo-replication has been designed as a simpler and cheaper DR solution, particularly suited to applications with a lower update rate. If a premium database primarily serves a high-volume, read-oriented workload, standard geo-replication may be a good fit. With standard geo-replication you lose the ability to select the location of the secondary, to have read access to up to four secondaries, and to have full control over when and where you fail over. In return, you get a simplified monitoring and failover workflow that is managed by Microsoft. But you don’t have to choose between them: with a premium database you can create an offline secondary with standard geo-replication for DR purposes and then create one or more readable secondaries to support load balancing for intensive read workloads.
Standard geo-replication is designed specifically to provide a DR solution for data tier outages. Unless there is an Azure SQL Database service outage in the region hosting the primary, you will not be able to initiate a failover to the secondary in the paired region. Standard geo-replication doesn’t support manually failing over a database because of an application tier outage. Active geo-replication provides the failover and location control you need for this scenario.
If a region has an extended outage, Microsoft will enable failover of all databases protected by standard geo-replication. The goal is for the service to determine whether a failover is required within one hour of the incident. Once failover is enabled, you can initiate it by terminating the geo-replication relationship on the secondary database. This approach keeps the failover logic simple: an application can wait for the failover-enabled flag and then decide to fail over or wait for the datacenter to recover. If your application needs to optimize for availability and can tolerate an RPO of 5 seconds, it should fail over as soon as failover is enabled by the service. If your application is sensitive to data loss, you may opt to wait for the SQL Database service to recover; if the service recovers, no data loss will occur. However, if the primary datacenter (and the databases) is not recovered within 24 hours, failover will be initiated by the service automatically. In that case the application will experience data loss (with an RPO < 5 seconds) and 24 hours of downtime. In all cases, whether you fail over the database yourself or wait until it happens automatically, you must still reconfigure your applications to connect to the failed-over databases.
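Once the service has enabled failover, the actual failover is a single termination issued against the secondary. A hedged sketch with the classic Azure PowerShell cmdlets (server and database names are placeholders):

```powershell
# Sketch: terminate the continuous copy relationship from the SECONDARY server,
# turning the offline secondary into a standalone, writable database.
# -ForcedTermination is used because the primary is unreachable during the outage.
Stop-AzureSqlDatabaseCopy -ServerName "mypartnerserver" `
                          -DatabaseName "mydb" `
                          -ForcedTermination
```

After this completes, the application’s connection strings must be repointed at `mypartnerserver`, matching the reconfiguration step described above.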
Once you have completed the failover you will want to ensure that the new primary is also protected. As part of enabling failover, the service will update its DR pairing configuration to use an alternate region. This allows you to initiate geo-replication from the new primary to protect it. Because seeding the new secondary will take some time (potentially hours, depending on the size of the database), you will need to decide whether to restore availability during this seeding period and accept the risk of operating the application without protection. The most prudent approach is to wait until the failed-over database is protected by its new secondary before activating the application. After seeding is completed, the DR configuration may look like the one depicted in Figure 2:
Figure 2. Application can create a new secondary database after failover using the updated DR pairing.
Disaster Recovery Drills
Because database failover can involve data loss and is a disruptive process, the failover workflow should be tested periodically to ensure the application’s readiness. This process is called a DR drill. In addition to being a good engineering practice, it is also required by most industry security standards as part of compliance certification. While failover from the offline secondary is not enabled under normal circumstances, you can still test the overall DR workflow by stopping geo-replication from the primary database. Note that if the primary is active at the point of termination, any transactions committed on the primary but not yet replicated to the secondary will be lost. After termination the secondary becomes fully accessible and the application can use it as the new primary. Because of the possibility of data loss, and because the primary is unprotected during the drill, we don’t recommend performing DR drills on production databases. Instead we recommend creating a test copy of the database in the same region, creating an offline secondary of the copy, and then using the copy and its secondary to verify the application’s failover workflow in a test context.
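The recommended drill can be sketched end to end with the same classic PowerShell cmdlets; the copy name `mydb-drill` and the server names are placeholders, and the drill runs entirely against the copy so production stays protected:

```powershell
# 1. Create a one-time copy of the production database in the same region.
Start-AzureSqlDatabaseCopy -ServerName "myserver" -DatabaseName "mydb" `
                           -PartnerDatabase "mydb-drill"

# 2. Create an offline secondary of the copy in the DR paired region.
Start-AzureSqlDatabaseCopy -ServerName "myserver" -DatabaseName "mydb-drill" `
                           -PartnerServer "mypartnerserver" `
                           -ContinuousCopy -OfflineSecondary

# 3. Force-terminate on the secondary side to simulate failover, then point a
#    test instance of the application at the now-writable secondary.
Stop-AzureSqlDatabaseCopy -ServerName "mypartnerserver" -DatabaseName "mydb-drill" `
                          -ForcedTermination
```

Remember to drop the drill databases afterward to stop billing for them.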
Standard geo-replication offers a subset of the REST API already available with active geo-replication. To manage standard geo-replication you can use this API, the provided PowerShell cmdlets, or the Azure Management Portal. Because this feature is in preview, you need to sign up for the preview of the new Azure SQL Database service tiers first. The easiest way to enable standard geo-replication is by using the geo-replication tab in the Azure Management Portal, as shown in Figure 3.
Figure 3. Use the Azure Management portal to create and monitor the status of the offline geo-secondary.
It is important to note that you can opt out of geo-replication at any time. There are two ways to do it: you can either terminate the geo-replication relationship on the primary database or simply drop the secondary database. The first method is most useful for canceling an operation initiated by mistake. If it is issued before seeding of the secondary database is completed, you will not be billed for the secondary database; if it is issued after that, you will have to delete the former secondary database separately and pay the pro-rated daily cost of an extra database. The second method automatically terminates the geo-replication, drops the secondary database, and stops the billing for it in a single step.
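The two opt-out paths sketched in classic PowerShell (all names are placeholders):

```powershell
# Option 1: friendly termination from the PRIMARY server. If issued before
# seeding completes, no secondary is billed; if issued after, the former
# secondary survives and must be dropped separately to stop its billing.
Stop-AzureSqlDatabaseCopy -ServerName "myserver" -DatabaseName "mydb"
Remove-AzureSqlDatabase   -ServerName "mypartnerserver" -DatabaseName "mydb"

# Option 2: drop the secondary directly; this terminates geo-replication,
# removes the database, and stops billing in a single step.
Remove-AzureSqlDatabase   -ServerName "mypartnerserver" -DatabaseName "mydb"
```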
Standard geo-replication provides a powerful disaster recovery solution targeting applications with moderate update rates. It is not as flexible as active geo-replication, but we believe it meets the needs of a large category of applications. Please tell us what you think! We’re listening closely to feedback, so sign up for the preview of the new service tiers and try standard geo-replication. You can read more about it in this article.