We recently carried out an engagement where a customer needed to move their services from the US South Central region to the US East region data centers. The engagement included capacity planning for an appropriately sized Premium SKU and leveraging Active Geo-Replication for high availability and rollback.
The customer was looking to migrate their Azure services from the US South Central region to the US East region. The primary area where they required assistance was their database: they needed predictability around any potential downtime, which should be as low as possible and could be no longer than 6 hours.
The customer is the developer of a geosocial mobile game. The game has awareness of towns and cities across the world, with a presence in all but 2 African countries. Players primarily participate based on their physical geographic location.
From a planning perspective we considered multiple options. These weren't particularly long discussions, but the questions they raised are worth noting, as they are typical of such conversations:
- How big is the database?
- What is the current SKU used for the database?
- How much downtime can you tolerate?
- How much data loss can you tolerate, if any, during the migration?
- How much money are you willing to spend?
- Can you tolerate the use of recently published services and/or services that are still in preview?
- What are the current concerns related to moving the data?
- What are you not worried about?
For this customer, the most important thing was to complete the move, so we had a fair amount of latitude in terms of what we could do.
The database was approximately 35GB. About 5GB of the data needed to be online, while the remainder could be added over time; that is, it could essentially remain offline during the migration.
The database was on the Business edition SKU and didn't leverage any specific Azure features, such as Federations.
They could tolerate up to 8 hours of downtime, but managing the expectations of the business and consumers was much easier if the downtime was less than 2 hours. Additionally, certain days of the week and hours of the day were more amenable to downtime, so control over when the downtime occurred was a desired requirement.
The customer was cost conscious but open to all ideas, provided cost considerations were taken into account.
The customer was wary of the DBCopy process, having found that DBCopy combined with export/import for moving large sets of data took much longer than 10 hours, which severely impacted the business.
Moving the compute nodes was not a concern, and neither was there an immediate need to move the existing blob storage data. The solution is architected such that no cross-DC logic is required for accessing the existing blobs.
Some options included:
- Leverage Active Geo-Replication functionality
- Stop all connections, export data, copy files, import into US East region
- Create DBCopy, Export data, copy files, restore in US East region
The Active Geo-Replication option was the preferred choice because:
- Lowest downtime
- No data loss, during planned migration
- Rollback option is available
- Recovery from a failure during execution is as simple as re-enabling the resource in the US South Central data center
- Lowest TCO
- Aligned with future plans (Upgrade to Premium SKU)
The one concern was that the service was in preview, but this was considered a tolerable risk. After reviewing their historical performance, it was decided that a P1 would be sufficient for their needs. We also wondered whether the replication would be too resource intensive, but given that the peak was understood to be a one-time event and the foreseeable resource requirement was low, the decision was to remain on P1 and be ready to upgrade to P2 at a moment's notice.
As part of the capacity planning process, the customer was pointed to the article "Basic, Standard and Premium Preview for Azure SQL Database" (http://msdn.microsoft.com/en-us/library/azure/dn369873.aspx).
The customer had to sign up for the Premium SKU preview. This must be done through the portal, and it takes seconds for the process to complete.
Once signed up, the database can be upgraded to Premium. Refer to "Changing Database Service Tiers and Performance Levels" (http://msdn.microsoft.com/en-us/library/dn369872.aspx) for the guidance we leveraged for planning. The formula below was actually quite accurate, predicting about 12 hours of time spent upgrading:
3 * (5 minutes + database size / 150 MB/minute)
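As a quick sanity check, the formula can be evaluated for a database of roughly the customer's size (about 35 GB; the helper function name is ours, not part of the MSDN guidance):

```python
# Estimate the Azure SQL Database upgrade time using the MSDN formula:
#   3 * (5 minutes + database size / 150 MB per minute)
def estimated_upgrade_minutes(db_size_mb: float) -> float:
    return 3 * (5 + db_size_mb / 150)

# The customer's database was roughly 35 GB.
minutes = estimated_upgrade_minutes(35 * 1024)
print(f"{minutes / 60:.1f} hours")  # → 12.2 hours
```

This lines up with the roughly 12 hours the upgrade actually took.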
The upgrade occurred smoothly. At one point several connections failed to connect, but this was brief and expected, as mentioned in the MSDN article.
The next step was to create a replica in the desired region, namely US East. There was uncertainty about how long it would take and about the resources required to carry out the replication. The steps are well documented in "Failover in an Active Geo-Replication Configuration" (http://msdn.microsoft.com/en-us/library/azure/dn741337.aspx). The customer said it best when responding to the question of how long the initial synchronization of the replica took:
"It [Initial Synchronization] was no more than 40 minutes*. I'm not sure on actual copy time. This is quite amazing considering making a local bacpac file of this database takes 4 hours."
*Completely online experience
It should be noted that this execution plan was undertaken in between code deployments. Much of it could have occurred within a 24-hour period had the process been purely technical. The customer chose to wait a couple of days before moving to the next action of the execution plan, preferring to make a significant code change on the existing topology rather than in a new data center. They were concerned about whether the DDL changes would replicate. All changes are replicated except for those that rely on updating the master database, such as logins; these must be made in each environment, whereas creating a user can be done at the source. Refer to "Security Configuration for Active Geo-Replication" (http://msdn.microsoft.com/en-us/library/azure/dn741335.aspx), which specifically addresses the topic of logins.
The last step was to switch to the new region. Below are the steps the customer shared for the benefit of others undertaking such a task. For terminating the Active Geo-Replication topology, the instructions are well described in "Terminate a Continuous Copy Relationship" (http://msdn.microsoft.com/en-us/library/azure/dn741323.aspx).
- A week in advance, upgrade the database to Premium and set up geo-replication to the East data center. This could have been done the same day; it took less than 40 minutes for a 32GB database to fully replicate from US South Central to East.
- [Day of move] Create blob containers in East to match the names of the containers in US South Central. (we did not elect to move blob data at this time, but you would do that here)
- Deploy new Web and Worker roles to East. Point new deployment at DB and blob storage in East.
- Setup any needed SSL certs in new (East) environment
- (Downtime begins) Take the old and new Web deployments offline. We do this with a configuration switch in the application that shows users a friendly message that the site is under maintenance. Only admins can get into the site/API normally.
- Change DNS to point to new web roles IP addresses
- Deploy new code to the old data center (US South Central) with connection strings for DB and Blob pointing at East
- Configure East DB to allow connections from US South Central IP Address (may not be required)
- Set new deployment (in staging area) in US South Central to offline mode, swap with production area
- Stop DB Active Geo-Replication (Planned Termination of a Continuous Copy Relationship)
- Steps carried out using the portal
- Set US South Central DB into Read-Only mode
- Set US East region DB into read/write mode (after replication is completed) (may not be required)
- (Downtime ended) Re-enable both web sites/API in both US South Central and US East region data centers
- Set up automatic Azure DB backups on the new DB.
- Using the automated export under the "configure" section on the DB in the management portal.
- After the DNS change has fully propagated around the world, delete the application servers in US South Central.
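The application-level configuration switch used during the downtime window can be sketched roughly as follows. This is a minimal illustration of the pattern, not the customer's actual implementation; the flag, admin list, and function names are hypothetical:

```python
# Hypothetical maintenance-mode switch: when downtime begins, the flag is
# flipped via configuration so non-admin users see a friendly message,
# while admins can still reach the site/API normally.
MAINTENANCE_MODE = True          # flipped on when downtime begins
ADMIN_USERS = {"ops-admin"}      # placeholder admin list

def handle_request(user: str) -> str:
    if MAINTENANCE_MODE and user not in ADMIN_USERS:
        return "The site is down for maintenance. Please check back soon."
    return "200 OK"

print(handle_request("player42"))   # maintenance message
print(handle_request("ops-admin"))  # normal response
```

Keeping the switch in application configuration, rather than tearing down the deployment, is what made the downtime window controllable and easy to end.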
The following quote from the customer does a great job of articulating the experience:
“The move went very smoothly. We were down for about 1 hour, and most of that time was waiting for our new code to deploy, configuration updates and prod/staging swaps. The actual process of shutting down the Azure DB replication and changing read/write modes was simple.”