Reduce RTO by using Azure Traffic Manager with Azure Site Recovery

When an administrator sets up an application for disaster recovery, one of the major goals is to keep the recovery time objective (RTO) as low as possible. Unavailability of a critical application, even for a short time, can lead to significant loss of business. The 2014 annual report on the State of Global Disaster Recovery Preparedness estimates the cost of losing critical applications for an hour to be more than $300,000.

The significant factors that contribute to the time required for an application to become available on the recovery site are as follows:

  1. Recovery is cumbersome, error-prone, and manual. Azure Site Recovery (ASR) provides one-click orchestration to automate the recovery process. ASR also allows the failover to be tested using Test Failover.
  2. Updating DNS with the new address on which the application now runs takes a lot of time. ASR and Azure Traffic Manager combined provide a best-in-class solution to reduce this time. Let me show you how.

The first of these two factors depends on the complexity and size of the application you are recovering. The time taken by the second factor is much more unpredictable, because the action can span organization and team boundaries and can be manual. Azure Traffic Manager reduces the time due to the second factor to a large extent. We will look at it quantitatively later in this blog.

To illustrate the use of Traffic Manager, we will use the example of a public website called www.contoso.com that is currently hosted on the primary site on-premises and is protected using Azure Site Recovery. We will look at how Traffic Manager should be set up when the recovery site is on-premises and also when Azure is used as the recovery site.

Setting up Azure Traffic Manager

1. Make the following entry in a public DNS Server

www.contoso.com     IN     CNAME      contoso.trafficmanager.net

2. Create a new Traffic Manager profile from the Azure Management Portal. You can read up more about creating and configuring an Azure Traffic Manager profile. Name the Traffic Manager profile ‘contoso’ and choose ‘Failover’ as the Load Balancing Method. The ‘Failover’ load balancing method directs all traffic to the first configured Traffic Manager endpoint, unless that endpoint is found to be unhealthy, in which case the traffic is directed to the next endpoint, and so on.

3. Configure the Traffic Manager profile with two external endpoints. You can read up more about configuring Azure Traffic Manager with external endpoints. Below are the PowerShell cmdlets used to configure the profile that was created in step 2:

$profile = Get-AzureTrafficManagerProfile -Name "contoso"

Add-AzureTrafficManagerEndpoint -TrafficManagerProfile $profile -DomainName "primary.contoso.com" -Status "Enabled" -Type "Any" | Set-AzureTrafficManagerProfile

Add-AzureTrafficManagerEndpoint -TrafficManagerProfile $profile -DomainName "recovery.contoso.com" -Status "Enabled" -Type "Any" | Set-AzureTrafficManagerProfile

The above PowerShell cmdlets can be used when both the primary and recovery sites are on-premises. If you are using Azure as the recovery site, use the following cmdlet to set up the secondary endpoint:

Add-AzureTrafficManagerEndpoint -TrafficManagerProfile $profile -DomainName "azureapp.cloudapp.net" -Status "Enabled" -Type "Any" | Set-AzureTrafficManagerProfile

There are two things to be noted here:

  • ASR creates a cloud service only at the time of failover. Therefore, if you are configuring the Traffic Manager profile before a failover, the cloud service won’t exist yet. Because of this, we set up the Azure endpoint as an ‘External’ endpoint as well.
  • ASR creates a cloud service using the name of the recovery plan. For example, if the name of your recovery plan is AzureApp, ASR would create a cloud service named azureapp.cloudapp.net. If the name is not available, ASR adds a GUID to disambiguate it. It is therefore recommended that you give the recovery plan a name that is unlikely to be used by anyone else, so that you can predict the name of the cloud service created after failover with high confidence.

4. Make the following entries in public DNS

primary.contoso.com     A    160.220.220.10*
recovery.contoso.com    A     170.220.220.10*

* The IPs used here are for illustration only.

These are the IP addresses of your primary and recovery sites.

In case Azure is being used as the recovery site, you don’t need to make any entry corresponding to ‘recovery’. Azure will give a public IP to the cloud service when it is created.

5. By default, the TTL of a Traffic Manager profile is set to 300 seconds. You can configure it to as low as 30 seconds. This TTL contributes to the RTO of your application. We will look at this in detail in the Time to Live (TTL) section below.

Below is a pictorial representation of Traffic Manager setup with ASR. This shows the state before a failover. Red arrows represent DNS queries, dotted red arrows represent target of DNS response returned and green arrows represent application access.

[Figure: Traffic Manager setup with ASR before failover]

The picture below shows the state after a failover. Red arrows represent DNS queries, dotted red arrows represent target of DNS response returned and green arrows represent application access.

[Figure: Traffic Manager setup with ASR after failover]
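
The ‘Failover’ load balancing method chosen in step 2 can be pictured with a small Python sketch. This is only an illustration of the ordering rule described above, not Traffic Manager's actual implementation; the function name `pick_endpoint` and the health-check callbacks are hypothetical.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint, in configured order, that passes the health check."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    # What the real service does when every endpoint is unhealthy is outside this sketch.
    return None


# Endpoints in the order they were added to the 'contoso' profile.
endpoints = ["primary.contoso.com", "recovery.contoso.com"]

# Before failover: every endpoint is healthy, so the primary is chosen.
before_failover = pick_endpoint(endpoints, lambda e: True)

# After failover: the primary no longer answers, so the next endpoint is chosen.
after_failover = pick_endpoint(endpoints, lambda e: e != "primary.contoso.com")

print(before_failover)
print(after_failover)
```

Because the order of the endpoints encodes the priority, nothing has to be changed at failover time: the same list yields the recovery site as soon as the primary stops passing its health check.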

Time to Live (TTL)

Time to Live (TTL) is the length of time for which a client caches a DNS entry. For a particular record, the client would not query DNS twice within the span of the TTL.

Each DNS record has a TTL associated with it. In this case we have set up three sets of DNS records:

  1. www.contoso.com – Let us assume that the TTL associated with this record is TTL1
  2. contoso.trafficmanager.net – Let us assume that the TTL associated with this record is TTL2
  3. primary.contoso.com, recovery.contoso.com, azureapp.cloudapp.net – Let us assume that the TTL associated with these records is TTL3.

The first record would never change: www.contoso.com would always point to contoso.trafficmanager.net. Therefore, there is no harm in using a high value for TTL1. TTL2 determines how frequently clients query Traffic Manager. The records mentioned in the third point are also not expected to change once they are set up, so a high value could be used for TTL3 as well.

Since the records mentioned in the first and third points above never change, only TTL2 is of interest. The DNS-related part of the application's RTO depends only on TTL2.

Traffic Manager takes around 2 minutes to refresh its state. You can read about it in more detail at Azure Traffic Manager monitoring. This implies that once the application has been failed over to the recovery site, it will take around 2 minutes for Traffic Manager to find that out. Add to that TTL2, which is the time for which clients won’t query Traffic Manager again. The RTO of your application due to the DNS change in this case would be 2 minutes + TTL2. If you have set TTL2 to 30 seconds, the RTO would be 2.5 minutes.
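
The arithmetic above can be written as a short Python sketch. The function name `dns_rto_seconds` is hypothetical, and the 120-second refresh time is simply the "around 2 minutes" figure quoted above, not a guaranteed value.

```python
# "Around 2 minutes" for Traffic Manager to notice the failover, per the text above.
TRAFFIC_MANAGER_REFRESH_SECONDS = 120


def dns_rto_seconds(ttl2_seconds, refresh_seconds=TRAFFIC_MANAGER_REFRESH_SECONDS):
    """DNS contribution to RTO: detection time plus the time clients keep the old answer."""
    if ttl2_seconds < 0:
        raise ValueError("TTL must be non-negative")
    return refresh_seconds + ttl2_seconds


# Compare the lowest configurable TTL (30 s) with the default (300 s).
for ttl2 in (30, 300):
    total = dns_rto_seconds(ttl2)
    print(f"TTL2 = {ttl2:>3} s -> DNS part of RTO = {total} s ({total / 60:.1f} min)")
```

With the default TTL of 300 seconds, the same formula gives 7 minutes, which is why lowering TTL2 matters.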

Benefits of using Azure Traffic Manager

Setting up Traffic Manager the way described in this blog gives the following benefits:

1. It removes the dependency on whoever manages DNS at the time of failover. In most organizations, adding or modifying DNS records is handled either by a separate team or by someone outside the organization, which makes altering DNS records very challenging. The SLA promised by the team managing the DNS infrastructure varies from organization to organization and would impact the RTO of the application. In most cases, the promised SLA would be in hours rather than minutes.

2. It frontloads all the work that has to be done with respect to DNS. No manual or scripted action is required at the time of actual failover. This helps in the following ways:

  • Without Traffic Manager, a lot of time is spent deciding that a DNS change has to be made. This increases the RTO of the application. With Traffic Manager, this step is automated.
  • Without Traffic Manager, there is a high risk of error from a scripted or manual DNS update. With Traffic Manager, this risk is removed. Traffic Manager continues to monitor the endpoints defined in the profile and switches the traffic to the new site once the failover is complete.
  • With Traffic Manager, even failback is automated. Without it, failback also has to be managed separately.

Because of the above factors, the impact of DNS changes on RTO can be reduced to as low as 2.5 minutes.

Impact of DNS Resolvers

When a DNS query is made, it doesn’t always go up to the authoritative DNS Server. There could be one or more DNS Resolvers in between that cache the records.

Does the TTL experienced by the client increase because of the number of DNS resolvers between the client and the authoritative DNS server?

The answer is no. DNS resolvers ‘count down’ the TTL and only pass on a TTL value that reflects the elapsed time since the record was cached.

To illustrate this, we will use an example where there is a DNS resolver between the client and the authoritative DNS.

[Figure: DNSResolver1 between the client and the authoritative DNS]

  • If the TTL for a particular record is 300 seconds and DNSResolver1 queries the authoritative DNS at time T, it will get a TTL of 300 seconds.
  • If the client queries DNSResolver1 at time T + 100 seconds, it will get a TTL of 300 – 100 = 200 seconds.

You can see that even if one more DNS resolver is added in the chain the same logic can be extended to it.

Therefore, the number of DNS resolvers in the chain doesn’t increase the TTL. The record gets refreshed at the client once the TTL has elapsed, irrespective of the number of DNS resolvers in the chain.
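
The count-down behaviour can be simulated in a few lines of Python. This is a simplified model for illustration: the `Authoritative` and `Resolver` classes are hypothetical, time is a plain number of seconds, and real resolvers have additional behaviour that is not modelled here.

```python
class Authoritative:
    """Authoritative DNS: always answers with the full configured TTL."""

    def __init__(self, ttl):
        self.ttl = ttl

    def query(self, now):
        return self.ttl


class Resolver:
    """Caching resolver that hands out only the TTL remaining on its cached copy."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.expires_at = None

    def query(self, now):
        if self.expires_at is None or now >= self.expires_at:
            # Cache miss or expired entry: fetch again from upstream.
            self.expires_at = now + self.upstream.query(now)
        return self.expires_at - now


auth = Authoritative(ttl=300)
resolver1 = Resolver(auth)
resolver2 = Resolver(resolver1)  # a second resolver added to the chain

T = 0
resolver1.query(T)                      # resolver1 caches the record with TTL 300
ttl_at_100 = resolver1.query(T + 100)   # client asks resolver1 at T + 100
ttl_via_two = resolver2.query(T + 150)  # client asks through resolver2 at T + 150

print(ttl_at_100, ttl_via_two)
```

In this model every cache in the chain expires the record at the same moment, T + 300, no matter how many resolvers sit between the client and the authoritative server.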

Monitoring applications using Azure Traffic Manager

Below are examples of how to configure Azure Traffic Manager for two specific applications, SharePoint and Exchange OWA. The configuration may vary based on the particular application that you are setting up for disaster recovery using ASR.

Authenticated Microsoft SharePoint

As of today, Traffic Manager doesn’t have the capability to monitor authenticated endpoints. If your organization has set up a SharePoint server, it is likely that user access to it is authenticated. Traffic Manager expects HTTP code 200 in response to the URL being queried. If the SharePoint URL requires authentication, it won’t return HTTP code 200. To work around this, you can set up a dummy website on a different port, say 8080, on the same IIS server that hosts the web front end of the SharePoint server. This website should not require authentication. Instead of monitoring SharePoint running on port 80, the Traffic Manager profile would have to be configured to monitor the port on which the dummy website is hosted.

Microsoft Exchange OWA

If you have set up Outlook Web Access (OWA) for your Exchange server, it is usually accessed through a URL similar to www.contoso.com/owa. This URL returns a non-200 HTTP code when accessed. Therefore, if you configure the Traffic Manager profile to monitor the ‘/owa’ URL, monitoring won’t work. It is recommended that you monitor ‘/’ instead.
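
To see why the monitored path matters in both cases, the following Python sketch runs a toy web server on the local machine and probes it the way a monitor that accepts only HTTP 200 would. Everything here is an assumption made for illustration: the paths, the 302 and 401 status codes, and the `probe` function are hypothetical, and the sketch does not call Traffic Manager itself.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical status codes for a toy site; a real deployment may answer differently.
RESPONSES = {"/": 200, "/owa": 302, "/authenticated": 401}


class ToySite(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer with the configured status code, or 404 for unknown paths.
        self.send_response(RESPONSES.get(self.path, 404))
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def probe(host, port, path):
    """Mimic a monitor that accepts only HTTP 200 and does not follow redirects."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), ToySite)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

results = {path: probe("127.0.0.1", port, path) for path in RESPONSES}
server.shutdown()
server.server_close()
print(results)
```

Only the path that answers with 200 is reported as healthy, which is the reason for monitoring ‘/’ or an unauthenticated dummy site rather than the application URL itself.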

To recap, in this blog we looked at how Azure Traffic Manager in conjunction with Azure Site Recovery can be used to significantly reduce Recovery Time Objective of applications and save millions of dollars that could be lost because of the downtime of critical applications.

If you have further questions, please visit the Azure Site Recovery forum on MSDN for additional information and to engage with other customers.

You can also check out additional product information, and sign-up for a free Azure trial to start trying out Microsoft Azure using Azure Site Recovery.

Acknowledgement

Special thanks to Jonathan Tuliani of Azure Traffic Manager Team, for his significant help in completing this blog.