Business continuity model

In the previous post on Active Geo-Replication, Tobias already defined the business continuity challenge. As a quick refresher, there are a few concepts you need to be familiar with to get the most out of this post:
- Disaster recovery (DR): the process of restoring the normal business function of the application.
- Point-in-time restore: the ability to restore the database to a point in time in the past (within the backup retention period) in order to recover from data corruption caused by human error or a programmatic error.
- Estimated Recovery Time (ERT): the estimated time for the database to become fully functional after a restore or failover request.
- Recovery Point Objective (RPO): the time interval of the most recent data changes that the application could lose after recovery.
|Capability||Basic tier||Standard tier||Premium tier|
|Point In Time Restore||Any restore point within 7 days||Any restore point within 14 days||Any restore point within 35 days|
|Geo-Restore||ERT < 12h, RPO < 1h||ERT < 12h, RPO < 1h||ERT < 12h, RPO < 1h|
|Standard Geo-Replication||Not included||ERT < 30s, RPO < 5s||ERT < 30s, RPO < 5s|
|Active Geo-Replication||Not included||Not included||ERT < 30s, RPO < 5s|
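To make the table easier to reason about, here is a minimal Python sketch that encodes it as data and lists which recovery features on a given tier meet an application's ERT and RPO targets. It assumes the three columns correspond to the Basic, Standard, and Premium service tiers. The names `Bounds`, `FEATURES`, and `features_meeting` are illustrative only and are not part of any Azure API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

HOUR = 3600  # seconds


@dataclass(frozen=True)
class Bounds:
    """Upper bounds quoted in the table, in seconds."""
    ert: int
    rpo: int


GEO_RESTORE = Bounds(ert=12 * HOUR, rpo=1 * HOUR)
GEO_REPLICATION = Bounds(ert=30, rpo=5)

# None means "Not included". The tier names are an assumption about the
# three table columns; the bounds are transcribed from the table.
FEATURES: Dict[str, Dict[str, Optional[Bounds]]] = {
    "Basic": {
        "geo-restore": GEO_RESTORE,
        "standard geo-replication": None,
        "active geo-replication": None,
    },
    "Standard": {
        "geo-restore": GEO_RESTORE,
        "standard geo-replication": GEO_REPLICATION,
        "active geo-replication": None,
    },
    "Premium": {
        "geo-restore": GEO_RESTORE,
        "standard geo-replication": GEO_REPLICATION,
        "active geo-replication": GEO_REPLICATION,
    },
}

# Point-in-time restore retention per tier, in days.
RETENTION_DAYS = {"Basic": 7, "Standard": 14, "Premium": 35}


def features_meeting(tier: str, max_ert: int, max_rpo: int) -> List[str]:
    """Return the features on `tier` whose quoted bounds fit the targets (seconds)."""
    return sorted(
        name
        for name, bounds in FEATURES[tier].items()
        if bounds is not None and bounds.ert <= max_ert and bounds.rpo <= max_rpo
    )
```

For example, `features_meeting("Standard", max_ert=60, max_rpo=10)` returns `['standard geo-replication']`, while the same targets on `"Basic"` return an empty list because only geo-restore is available there.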

How is Standard Geo-replication different from Active Geo-replication?

Let’s take a closer look at the user experience and how it differs from active geo-replication. First of all, standard geo-replication is built on the same technology as active geo-replication but is optimized for applications that use geo-replication only to protect the application from regional failures. The following list shows how standard geo-replication is different from active geo-replication:
- Only one secondary database can be created, and it must be in a Microsoft-defined “DR paired” Azure region. The list of the DR pairs can be found here.
- The secondary is visible in the master database but cannot be directly connected to until failover is completed (offline secondary).
- The secondary database is charged at a discounted rate as it is not readable (offline).
|Scenario||Standard Geo-replication||Active Geo-replication|
|Online application upgrade||No||Yes|
|Online application relocation||No||Yes|
|Read load balancing||No||Yes|
Figure 1. A database can have one offline secondary in the DR paired region.

Why would I use Standard Geo-replication instead of Active Geo-replication with a Premium database?

Premium databases can be protected by using either standard geo-replication or active geo-replication. So when would you choose standard geo-replication over the more powerful active geo-replication? Standard geo-replication has been designed for applications that use geo-replication only to meet a disaster recovery SLA. If the application has a high-volume, read-oriented workload and could benefit from read-scale load balancing in addition to fast disaster recovery, active geo-replication is a better fit.
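The choice described here can be sketched as a small decision helper. This is only an illustration of the reasoning in this post, not an official sizing tool. The scenario names come from the scenario table earlier in the post, and the function name `recommend_geo_replication` is hypothetical.

```python
from typing import Iterable

# Scenarios that, per the scenario table, only active geo-replication supports.
ACTIVE_ONLY_SCENARIOS = frozenset({
    "online application upgrade",
    "online application relocation",
    "read load balancing",
})

# Both flavors cover plain disaster recovery.
COMMON_SCENARIOS = frozenset({"disaster recovery"})


def recommend_geo_replication(required_scenarios: Iterable[str]) -> str:
    """Return "standard" or "active" for the given set of required scenarios."""
    needed = {scenario.lower() for scenario in required_scenarios}
    unknown = needed - ACTIVE_ONLY_SCENARIOS - COMMON_SCENARIOS
    if unknown:
        raise ValueError(f"unknown scenarios: {sorted(unknown)}")
    # Any requirement beyond disaster recovery needs a readable secondary.
    return "active" if needed & ACTIVE_ONLY_SCENARIOS else "standard"
```

An application that only needs disaster recovery gets `"standard"`; adding `"read load balancing"` to the requirements changes the answer to `"active"`.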

Database failover

Standard geo-replication is designed specifically to provide a DR solution with low downtime for data tier regional outages. If a region has an extended outage, you will receive an alert in the Portal and will see your SQL Database servers’ state set to Degraded. At that point the application can either initiate the failover or wait for the datacenter to recover. If your application needs to optimize for higher availability and can tolerate an RPO of 5 seconds, it should fail over as soon as you receive an alert or detect database connectivity failures. If your application is sensitive to data loss, you may opt to wait for the SQL Database service to recover; in that case no data loss will occur.

If you initiate the failover, you must reconfigure your applications to connect to the new primary databases. Once you have completed the failover, you will want to ensure that the new primary is also protected as soon as possible. Since primary region recovery may take time, you will have to wait for your server to change from Degraded back to Online status. This will allow you to initiate geo-replication from the new primary to protect it. Until seeding of the new secondary is completed, your new primary will remain unprotected. After seeding is completed, the DR configuration may look like the one depicted in Figure 2:
Figure 2. Application can create a new secondary database after failover.
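The failover decision described in this section can be summarized as a small policy function. The sketch below is a simplified model written for illustration. It does not call any Azure API, and the names `ServerState`, `Action`, and `choose_action` are assumptions rather than service-defined identifiers. How you actually observe the server state or trigger the failover depends on the management tool you use.

```python
from enum import Enum


class ServerState(Enum):
    ONLINE = "Online"
    DEGRADED = "Degraded"


class Action(Enum):
    CONTINUE = "continue normal operation"
    FAIL_OVER = "initiate failover to the secondary"
    WAIT = "wait for the service to recover"


# RPO bound quoted for standard geo-replication in this post.
GEO_REPLICATION_RPO_SECONDS = 5


def choose_action(state: ServerState,
                  connectivity_failing: bool,
                  tolerable_loss_seconds: float) -> Action:
    """Pick an action from the observed state and the app's data-loss tolerance."""
    outage_suspected = state is ServerState.DEGRADED or connectivity_failing
    if not outage_suspected:
        return Action.CONTINUE
    # Failing over may lose up to the RPO worth of recent changes.
    if tolerable_loss_seconds >= GEO_REPLICATION_RPO_SECONDS:
        return Action.FAIL_OVER
    # Data-loss-sensitive applications wait; no data is lost on recovery.
    return Action.WAIT
```

After a failover, the remaining steps from the text still apply: point the application at the new primary, wait for the server to return to Online, and then create a new secondary so that the new primary is protected again.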

Disaster Recovery Drills

Because database failover is associated with data loss and is a disruptive process, the failover workflow should be tested periodically to ensure the application’s readiness. This process is called a DR drill. In addition to being a good engineering practice, it is also required by most industry security standards as part of compliance certification. You can test the overall DR workflow by stopping geo-replication from the secondary database at any time. Note that if the primary is active at the point of termination, any transactions committed on the primary but not yet replicated to the secondary will be lost. After termination the secondary will become fully accessible and the application can use it as the new primary. Because of the possibility of data loss, and because the primary is not protected during the drill, we don’t recommend performing DR drills on production databases. Instead, we recommend creating a test copy of a database in the same region, creating a secondary of the copy, and then using the copy and its secondary to verify the application’s failover workflow in a test context.
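To illustrate why terminating replication while the primary is active can lose data, here is a deliberately simplified model. It assumes transactions replicate in commit order and that everything committed up to a timestamp `replicated_through` has reached the secondary. Both that parameter and the function `transactions_at_risk` are hypothetical names used for the example, not values exposed by the service.

```python
from typing import List, Tuple


def transactions_at_risk(commits: List[Tuple[str, float]],
                         replicated_through: float) -> List[str]:
    """Return IDs of transactions committed on the primary after the last
    replicated point; these would be lost if replication stopped now."""
    return [
        txn_id
        for txn_id, committed_at in commits
        if committed_at > replicated_through
    ]


# (transaction id, commit time in seconds) on the primary
commits = [("t1", 100.0), ("t2", 101.5), ("t3", 103.0)]
lost = transactions_at_risk(commits, replicated_through=101.5)
```

Here `lost` is `["t3"]`: the one transaction committed after the last replicated point. Running the drill against a test copy, as recommended above, keeps this kind of loss away from production data.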

Management Tools

To manage standard geo-replication you can use the same APIs as with active geo-replication, including the PowerShell cmdlets and the REST API, or you can use the Azure Management Portal. The easiest way to enable standard geo-replication is by using the geo-replication tab in the Azure Management Portal, as shown in Figure 3.
Figure 3. Use the Azure Management portal to create and monitor the status of the offline geo-secondary.

Note that you can opt out of geo-replication at any time. There are two ways of doing it. You can either terminate the geo-replication relationship on the primary or secondary database. The first method is most useful as a way to cancel an operation initiated by mistake. If it is issued before the secondary database seeding is completed, you will not be billed for the secondary database. If it is issued after that, you will have to delete the former secondary database separately and pay the pro-rated hourly cost of an extra database. The second method automatically terminates the geo-replication, drops the secondary database, and stops the billing for it in a single step.