This is the first in a short series of posts focusing on business continuity and disaster recovery. In this post we’ll look at business continuity scenarios in general and then spotlight use of Active Geo-replication, now available in the Premium service tier of SQL Database. And for more on active geo-replication take a look this lively and informative Channel 9 video in which Sasha Nosov and Scott Klein explore how it works and how it helps address real business scenarios.
Business continuity – what is it?
Business continuity refers to the mechanisms, policies and procedures that enable a business to continue operating in the face of disruption, particularly to its computing infrastructure.
From a database perspective, there are four major potential disruption scenarios:
- Local hardware or software failures affecting the database node such as a disk-drive failure.
- Data corruption or deletion – typically caused by an application bug or human error. Such failures are intrinsically application-specific and cannot as a rule be detected or mitigated automatically by the infrastructure.
- Datacenter outage, possibly caused by a natural disaster. This scenario requires some level of geo-redundancy with application failover to an alternate datacenter.
- Upgrade or maintenance errors – unanticipated issues that occur during planned upgrades or maintenance to an application or database may require rapid rollback to a prior database state.
How does Azure SQL Database enable business continuity?
Azure SQL Database has been built from the ground up as a robust highly available database service using a system in which three or more replicas of each database are maintained at all times, with updates committed to at least two replicas before responding. This HA system addresses the first scenario, local hardware and software failures.
The new service tiers, Basic, Standard, and Premium, currently in preview, take business continuity to the next level, with features that address the remaining three scenarios above:
Data corruption or deletion. Basic, Standard, and Premium databases support a self-service point-in-time restore capability that allows a database to be restored to an earlier state. This protects against application or user errors that result in data corruption or deletion. Full backups are taken weekly, differential backups daily and transaction log are backed up every 5 minutes. For Basic databases, backups are retained for 7 days, for Standard they are retained for 14 days, and for Premium for 35 days. You can restore a database, including a recently deleted database, to any point in time during its retention period.
Datacenter outage. Basic, Standard and Premium databases service tiers are also protected against a prolonged datacenter outage which may require a database to be recovered to another datacenter in another region. Three geo-redundancy solutions are offered:
- Geo-restore* for Basic, Standard and Premium enables recovery to a geo-redundant copy of the last daily backup. [*coming to preview soon]
- Standard geo-replication* for Standard and Premium databases extends the local HA system to create and maintain additional secondary replicas in a paired region. These secondaries are offline and not accessible unless there is a data center disruption, at which point they are brought online so that applications can be failed over to them. [*coming to preview soon]
- Active geo-replication, available for Premium databases, provides the richest solution with the least risk of data loss and the most rapid recovery time. It extends standard geo-replication with up to 4 geo-replicated secondaries that are online and readable at all times, and which can also be used for load balancing or to provide low-latency access to replicated data anywhere in the world. Active geo-replication is available now in preview.
Upgrade or maintenance errors. Using active geo-replication you can create a continuously replicated copy of a database that can be frozen immediately prior to applying updates or maintenance to the database or an application. Should any errors be detected during or after the process it is then easy to fallback quickly to this copy.
And while different solutions are available with different service tiers it is important to remember that it is easy to upgrade or downgrade a database between service tiers. So you might, for example, choose to upgrade a Standard database to Premium and use active geo-replication before applying an important upgrade. Once the upgrade is complete the database can be downgraded again to reduce its cost.
Active Geo-replication in detail
Now that we can see where active geo-replication fits in the business continuity landscape, let’s take a closer look at how it works and how it can be used.
Figure 1. A Premium database can have up to four readable secondaries in the same or different regions.
Active geo-replication relationships can be created and managed via the Azure management portal, PowerShell, or the REST API. In the portal, you can manage the replication relationship from either the primary or secondary. From the primary you can monitor the replication status of each of the secondaries.
Figure 2. Use the Azure Management portal to create and monitor the status of up to four geo-secondaries.
Up to four readable secondaries can be created, each with the same name as the primary but located on a different server. Secondaries when first created are seeded with the current state of the primary. Once each secondary has ‘caught up’ it is then maintained as a continuous copy of the primary. Secondary databases, like all databases, are also protected locally using the normal HA system.
Unlike the local HA replication model, geo-replication from the primary to the secondary is asynchronous. Transactions applied to the primary are copied to and applied to the secondary but the primary is not blocked while waiting for this to occur. Changes are buffered making the replication system resilient to temporary connection problems or high-latency when replicating to a distant location.
To ensure that transactions being applied to the secondaries do not bottleneck the primary the secondary must be given the same or higher performance level as the primary.
Secondaries are readable so can support independent read-only workloads. This feature can be used to load balancing complex query workloads across multiple databases or to provide lower latency data access to applications in other parts of the world.
Figure 3. If a datacenter outage affects the primary, replication relationships can be terminated and applications failed over to a secondary.
Replication relationships are manually managed, allowing you to terminate a relationship at any point. If you terminate from the primary then you can choose whether to terminate immediately and lose any pending transactions or to terminate after applying all pending transactions.
In the case of a datacenter outage affecting the primary, failover is still a manual task, allowing you full control of if and when this is done. Terminating the relationship is done from a secondary database as the primary database will be unavailable. Terminating from the secondary is always immediate and will lose any transactions that had not been replicated at the point the primary became unavailable. How much if any data you might lose will depend on how active the primary was at the point it failed and what if any buffering of transactions was occurring across the connection. A decision to terminate the replication relationship should balance your concern for possible data-loss and your desire to get applications back up again.
Once you have terminated a relationship to a secondary database it becomes a normal read-write database. At this point you can failover application(s) which will have full access to the database. As each secondary is a discrete database with the same name as the primary but in a different server you will need to reconfigure your application(s) with an updated connection string.
Once you have managed the failover process you will want to reconstruct the same pattern of geo-replication relationships you were using previously only with the new production database as the primary. This will ensure you have the geo-redundancy and load-balancing that your applications and business continuity policies require.
Figure 4. After terminating the original relationship to the secondary, new geo-replication relationships should be created to protect the new primary and support any load balancing requirements.
Active geo-replication can be integrated into various application architecture patterns that make different assumptions about which layers and components are most at risk and which components might be geographically distributed. See this article for more information on this topic.
Summary
Active geo-replication not only provides powerful geo-redundancy features to protect a database from a datacenter outage but also enables other business continuity scenarios. Active geo-replication is available today in preview with Premium databases.
Read more about business continuity in SQL Database including active-geo replication, or take a few minutes to watch Sasha and Scott on Channel 9 discussing how active geo-replication can help you protect your business.
If you want to try out active geo-replication or the other business continuity features as they become available you will need to sign-up for the preview of the new service tiers. We’re listening closely to feedback so please try it out and let us know what you think.