Improvement to allow full user control over failover to the secondary database.
In this post we will continue the conversation about the business continuity scenarios and discuss the newly released Standard Geo-Replication feature of Azure SQL Database. Standard Geo-Replication allows you to configure your database to asynchronously replicate committed transactions from the primary database to the secondary in a predefined Azure region. Before diving into the details it would be useful to summarize the full range of the business continuity features now available in preview and discuss them in the context of the new service tiers we announced in April 2014. You can find the details of the specific characteristics of the service tiers in this article.
Business continuity model
In the previous post on Active Geo-Replication, Tobias already defined the business continuity challenge. As a quick refresher there are a few concepts you need to be familiar with to get the most out of this post:
- Disaster recovery (DR): a process of restoring the normal business function of the application
- Point in time restore: the ability to restore the database to a point in time in the past (within the backup retention period) in order to recover from data corruption caused by a human mistake or programmatic error
- Estimated Recovery Time (ERT): The estimated duration for the database to be fully functional after a restore/failover request.
- Recovery Point Objective (RPO): The amount of most recent data changes (time interval) the application could lose after recovery.
The following table shows how the business continuity features differ across the service tiers:
|Feature||Basic||Standard||Premium|
|Point In Time Restore||Any restore point within 7 days||Any restore point within 14 days||Any restore point within 35 days|
|Geo-Restore||ERT < 12h RPO < 1h||ERT < 12h RPO < 1h||ERT < 12h RPO < 1h|
|Standard Geo-Replication||Not included||ERT < 30s RPO < 5s||ERT < 30s RPO < 5s|
|Active Geo-Replication||Not included||Not included||ERT < 30s RPO < 5s|
The key factor defining the Business Continuity/Disaster Recovery (BCDR) model is that an application’s performance profile can drive which business continuity solution will meet its needs. If your application processes low volumes of data with a low rate of updates and the Basic tier provides the appropriate performance level then geo-restore should meet its recovery needs. If your application processes higher volumes of data and requires the higher performance levels of Standard tier and a more aggressive RPO, then standard geo-replication will likely be the right disaster recovery solution. Active geo-replication has been designed for applications that process very high volumes of reads and writes and would like to use the secondaries to achieve both higher resiliency to failures and higher read scale by load balancing the read operations. In this way SQL Database’s BCDR model provides increasingly robust business continuity features as you scale from Basic to Premium.
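The selection logic described above can be sketched in a few lines of Python. This is only an illustration, not part of any Azure SDK: the ERT/RPO ceilings are copied from the table, the three table columns are assumed to correspond to Basic, Standard, and Premium in that order, and the names `FEATURES`, `TIER_ORDER`, and `cheapest_option` are hypothetical.

```python
HOUR = 3600  # seconds

# (ERT ceiling, RPO ceiling) in seconds, taken from the table in this post.
# Features marked "Not included" for a tier are simply left out.
FEATURES = {
    "Basic": {"Geo-Restore": (12 * HOUR, 1 * HOUR)},
    "Standard": {
        "Geo-Restore": (12 * HOUR, 1 * HOUR),
        "Standard Geo-Replication": (30, 5),
    },
    "Premium": {
        "Geo-Restore": (12 * HOUR, 1 * HOUR),
        "Standard Geo-Replication": (30, 5),
        "Active Geo-Replication": (30, 5),
    },
}
TIER_ORDER = ["Basic", "Standard", "Premium"]  # assumed lowest to highest


def cheapest_option(max_ert_s, max_rpo_s, need_readable_secondary=False):
    """Return (tier, feature) for the lowest tier meeting the targets, or None."""
    for tier in TIER_ORDER:
        for feature, (ert, rpo) in FEATURES[tier].items():
            # Only active geo-replication offers readable secondaries.
            if need_readable_secondary and feature != "Active Geo-Replication":
                continue
            if ert <= max_ert_s and rpo <= max_rpo_s:
                return tier, feature
    return None
```

For example, `cheapest_option(60, 10)` picks standard geo-replication on the Standard tier, while passing `need_readable_secondary=True` moves the answer to active geo-replication on Premium.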
How is Standard Geo-replication different from Active Geo-replication?
Let’s take a closer look at the user experience and how it differs from active geo-replication.
First of all, standard geo-replication is built on the same technology as active geo-replication but is optimized for applications that use geo-replication only to protect the application from regional failures. The following list shows how standard geo-replication is different from active geo-replication:
- Only one secondary database can be created in a Microsoft defined “DR paired” Azure region. The list of the DR pairs can be found here.
- The secondary is visible in the master database but cannot be directly connected to until failover is completed (offline secondary).
- The secondary database is charged at a discounted rate as it is not readable (offline).
As a result, some scenarios that you can enable with active geo-replication are not possible with standard geo-replication. The following table summarizes the differences:
|Scenario||Standard Geo-replication||Active Geo-replication|
|Online application upgrade||No||Yes|
|Online application relocation||No||Yes|
|Read load balancing||No||Yes|
Figure 1 depicts a typical configuration of Standard Geo-replication:
Figure 1. A database can have one offline secondary in the DR paired region.
Why would I use Standard Geo-replication instead of Active Geo-replication with a Premium database?
Premium databases can be protected by using either standard geo-replication or active geo-replication. So when would you choose standard geo-replication over the more powerful active geo-replication? Standard geo-replication has been designed for applications that use geo-replication only to achieve a disaster recovery SLA. If the application has a high-volume, read-oriented workload and could benefit from read-scale load balancing in addition to fast disaster recovery, active geo-replication is a better fit.
Standard geo-replication is designed specifically to provide a DR solution with low downtime for data tier regional outages.
If a region has an extended outage, you will receive an alert in the Portal and will see your SQL Database servers’ state set to Degraded. At that point the application can either initiate the failover or wait for the datacenter to recover. If your application needs to optimize for higher availability and can tolerate an RPO of 5 seconds, it should fail over as soon as you receive an alert or detect database connectivity failures. If your application is sensitive to data loss, you may opt to wait for the SQL Database service to recover; in that case no data loss will occur. If you do initiate the failover, you must reconfigure your applications to connect to the new primary database.
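The failover decision in the paragraph above comes down to two inputs: whether an outage has been detected, and whether the application can accept the data loss bounded by the RPO. A minimal sketch, assuming the server state is available as a string and using the hypothetical names `should_fail_over` and `STANDARD_GEO_REPLICATION_RPO_S`:

```python
# RPO bound for standard geo-replication quoted in this post, in seconds.
STANDARD_GEO_REPLICATION_RPO_S = 5


def should_fail_over(server_state, acceptable_data_loss_s, connectivity_failed=False):
    """Return True if the application should initiate failover now.

    server_state: state shown for the SQL Database server, e.g. "Degraded".
    acceptable_data_loss_s: how many seconds of recent changes the
        application is willing to lose.
    connectivity_failed: True if the application itself detected
        database connectivity failures.
    """
    outage_detected = server_state == "Degraded" or connectivity_failed
    tolerates_rpo = acceptable_data_loss_s >= STANDARD_GEO_REPLICATION_RPO_S
    return outage_detected and tolerates_rpo
```

An application that cannot lose any committed data would pass `acceptable_data_loss_s=0`, and the function would always recommend waiting for the service to recover.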
Once you have completed the failover you will want to ensure that the new primary is also protected as soon as possible. Since primary region recovery may take time you will have to wait for your server to change from Degraded back to Online status. This will allow you to initiate geo-replication from the new primary to protect it. Until seeding of the new secondary is completed your new primary will remain unprotected. After seeding is completed, the DR configuration may look like the one depicted in Figure 2:
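The "wait for Online, then re-protect" step can be automated with a simple polling loop. The sketch below makes no assumption about the management API: `get_server_state` and `start_geo_replication` are hypothetical callables that you would implement with the PowerShell cmdlets, the REST API, or whatever tooling you already use.

```python
import time


def reprotect_when_online(get_server_state, start_geo_replication,
                          poll_interval_s=60, max_polls=120, sleep=time.sleep):
    """Poll until the server reports "Online", then start geo-replication.

    Returns True if geo-replication was started, False if polling gave up
    after max_polls attempts. Until seeding of the new secondary finishes,
    the new primary is still unprotected.
    """
    for _ in range(max_polls):
        if get_server_state() == "Online":
            start_geo_replication()
            return True
        sleep(poll_interval_s)
    return False
```

Injecting `sleep` keeps the helper easy to exercise in a test without real delays.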
Figure 2. Application can create a new secondary database after failover.
Disaster Recovery Drills
Because database failover can involve data loss and is a disruptive process, the failover workflow should be tested periodically to ensure the application’s readiness. This process is called a DR drill. In addition to being a good engineering practice, it is also required by most industry security standards as part of compliance certification. You can test the overall DR workflow by stopping geo-replication from the secondary database at any time. Note that if the primary is active at the point of termination, any transactions committed on the primary but not yet replicated to the secondary will be lost. After termination the secondary becomes fully accessible and the application can use it as the new primary. Because data loss is possible and the primary is not protected during the drill, we don’t recommend performing DR drills on production databases. Instead, we recommend creating a test copy of the database in the same region, creating a secondary of that copy, and then using the copy and its secondary to verify the application’s failover workflow in a test context.
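To see why terminating replication on an active primary can lose data, it helps to model asynchronous replication as two lists where the secondary trails the primary. This is a toy model for illustration only; the class `ToyGeoPair` and its methods are hypothetical and say nothing about how SQL Database is implemented internally.

```python
class ToyGeoPair:
    """Toy model of a primary with one asynchronously replicated secondary."""

    def __init__(self):
        self.primary = []    # transactions committed on the primary
        self.secondary = []  # transactions that have reached the secondary

    def commit(self, txn):
        """Commit on the primary without waiting for the secondary."""
        self.primary.append(txn)

    def replicate(self, count=None):
        """Ship up to `count` pending transactions (all if None)."""
        pending = self.primary[len(self.secondary):]
        if count is not None:
            pending = pending[:count]
        self.secondary.extend(pending)

    def terminate(self):
        """Stop replication; return transactions the secondary never received."""
        return self.primary[len(self.secondary):]


pair = ToyGeoPair()
for txn in ["t1", "t2", "t3"]:
    pair.commit(txn)
pair.replicate(count=2)   # the secondary lags by one transaction
lost = pair.terminate()   # "t3" was committed but never replicated
```

After termination the secondary holds only `t1` and `t2`, which is the state the application would see if it started using the secondary as its new primary.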
To manage standard geo-replication you can use the same APIs as with active geo-replication, including the PowerShell cmdlets and the REST API, or you can use the Azure Management Portal. The easiest way to enable standard geo-replication is to use the geo-replication tab in the Azure Management Portal, as shown in Figure 3.
Figure 3. Use the Azure Management portal to create and monitor the status of the offline geo-secondary.
Note that you can opt out of geo-replication at any time. There are two ways to do it: you can terminate the geo-replication relationship on either the primary or the secondary database. The first method is most useful for cancelling an operation that was initiated by mistake. If it is issued before seeding of the secondary database is completed, you will not be billed for the secondary database. If it is issued after that, you will have to delete the former secondary database separately and pay the pro-rated hourly cost of an extra database. The second method automatically terminates geo-replication, drops the secondary database, and stops billing for it in a single step.
Standard Geo-replication provides a powerful disaster recovery solution for applications that have moderate update rates and need to dramatically reduce downtime during regional outages. Please tell us what you think! We’re listening closely to the feedback. You can read more about Standard Geo-replication in this article.