Introducing health integrated rollouts to Azure Deployment Manager

Posted on May 8, 2019

Principal Program Manager, Cloud and AI

At Ignite 2018, we unveiled the first preview of Azure Deployment Manager, which lets you perform staged rollouts of Azure Resource Manager resources. To monitor the health of your services, you can couple integrated health checks with these region-by-region rollouts, or with multiple phases within a region. We’re happy to announce that health integration features are now available in the Azure Deployment Manager preview. With health integrated rollouts, the deployment stops automatically if unacceptable signals are detected. This gives you time to troubleshoot and limits the scope of the impact. This same set of tools is already used internally by hundreds of Microsoft services to carry out safe and reliable deployments. It helps ensure high availability and prevents, or dramatically reduces, service downtime caused by regressions in updates.

Azure Deployment Manager automates the steps of deploying to a pre-production environment and verifying service health there over a period of time. It then progressively moves on to other environments and performs the necessary health checks along the way.

This feature is for you if:

  • You’re deploying your service across multiple regions 
  • You’re testing various configurations
  • You already use a health monitoring service of some kind

Determining service health

Health monitoring providers offer several mechanisms to monitor your services and alert you to health issues. Azure Monitor is one such offering. It fires alerts when certain thresholds are exceeded, and exceeding some of those thresholds indicates a problem with service health. For example, suppose you deploy a new update and memory and CPU utilization spike beyond expected levels. Azure Monitor and similar providers will notify you so that you can take corrective action.

These health providers typically offer REST APIs so that you can examine the status of your service’s monitors programmatically. The REST APIs return either a simple healthy/unhealthy signal (usually conveyed by the HTTP response code) or detailed information about the signals they receive.

The new healthCheck step in Azure Deployment Manager lets you declare the HTTP status codes that indicate a healthy service. For more complex REST results, you can specify regular expressions that indicate a healthy response when they match. To make this even easier, the Azure Deployment Manager team is working closely with the top health monitoring providers to pre-author these HTTP status codes and regular expressions. If you use one of these providers, you can simply copy and paste this part of the healthCheck step.
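The decision a healthCheck step makes for each poll can be sketched in a few lines of Python. This is a minimal illustration, not Azure Deployment Manager's implementation. The function name `is_healthy` is hypothetical. Requiring *every* pattern to match, rather than any one of them, is an assumption made for the example.

```python
import re

def is_healthy(status_code, body, success_codes, healthy_patterns=()):
    """Classify one poll of a health endpoint (hypothetical helper).

    The HTTP status code must be one of success_codes. If regular
    expressions are supplied, each one must also match somewhere in the
    response body (an assumption; a provider might only need one match).
    """
    if status_code not in success_codes:
        return False
    # With no patterns, the status code alone decides health.
    return all(re.search(pattern, body) for pattern in healthy_patterns)

# Simple provider: the status code is the whole signal.
print(is_healthy(200, "", {200}))
# Detailed provider: inspect the body as well.
print(is_healthy(200, '{"status": "Degraded"}', {200}, [r'"status"\s*:\s*"Healthy"']))
```

Separating the status-code check from the body check mirrors the two kinds of REST responses described above.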

The process of getting set up with Azure Deployment Manager health checks is straightforward.

  1. Create your health monitors through a health service provider of your choice.
  2. Create one or more healthCheck steps as part of your Azure Deployment Manager rollout.
  3. Fill out the healthCheck steps with the following information:
    • The URI for the REST API for your health monitors (as defined by your health service provider)
    • Authentication information (only API-key style auth is currently supported)
    • HTTP status codes or regular expressions that define a healthy response
  4. Invoke the healthCheck steps at the appropriate time in your Azure Deployment Manager rollout.
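The following Python sketch makes the inputs from step 3 concrete by showing what one authenticated poll needs. `HealthCheckSpec` and `build_request` are hypothetical names, not part of any Azure SDK. Providers differ on whether the API key goes in a header or in the query string, so both options are shown as assumptions. The request is only built here, not sent.

```python
from dataclasses import dataclass, field
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request

@dataclass
class HealthCheckSpec:
    """Hypothetical container for the inputs a healthCheck step needs."""
    uri: str                    # REST endpoint for your health monitors
    api_key_name: str           # provider-specific name of the API-key parameter
    api_key_value: str          # the key itself; keep it out of source control
    api_key_in: str = "header"  # "header" or "query" (assumption for illustration)
    success_status_codes: list = field(default_factory=lambda: [200])
    healthy_patterns: list = field(default_factory=list)

def build_request(spec: HealthCheckSpec) -> Request:
    """Build, but do not send, the GET request used to poll health."""
    if spec.api_key_in == "query":
        # Append the key to any query parameters already in the URI.
        parts = urlsplit(spec.uri)
        query = parse_qsl(parts.query) + [(spec.api_key_name, spec.api_key_value)]
        url = urlunsplit(parts._replace(query=urlencode(query)))
        return Request(url, method="GET")
    if spec.api_key_in == "header":
        return Request(spec.uri, headers={spec.api_key_name: spec.api_key_value}, method="GET")
    raise ValueError(f"unsupported api_key_in: {spec.api_key_in!r}")

req = build_request(HealthCheckSpec(uri="https://example.com/health", api_key_name="code",
                                    api_key_value="secret", api_key_in="query"))
print(req.full_url)
```

Passing such a request to `urllib.request.urlopen` would return the status code and body. Those two values are what the success codes and patterns in the spec are evaluated against.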

Phases of a health check

At this point, Azure Deployment Manager knows how to query the health of your service and at which phases of your rollout to do so. Azure Deployment Manager also allows fine-grained control over the timing of these checks. A healthCheck step runs in a simple three-phase process with configurable durations. This process is flexible enough to support the health monitoring needs of the largest Microsoft services that make up Azure itself.


Wait

  • After a deployment operation completes, it may not make sense to check service health yet, because the update has not reached a steady state. Services need time to start emitting health signals, and the health monitoring provider needs time to aggregate them into something useful. VMs may also be starting for the first time, rebooting, or reconfiguring based on new data. Service health may oscillate between healthy and unhealthy states during this turbulent period.
  • During the Wait phase, service health is not monitored. This gives the deployed resources time to bake before the health check process begins.

Elastic

  • Since it is often impossible to know how long resources will take to bake before they become stable, the Elastic phase allows for a flexible time period between when the resources are potentially unstable and when they are required to maintain a healthy, steady state.
  • When the Elastic phase begins, Azure Deployment Manager will poll the provided REST endpoint for service health periodically. The polling interval is set to three minutes for the Azure Deployment Manager preview, but will be configurable in the future.
  • If the health monitor comes back with signals indicating that the service is unhealthy, those signals are ignored, the Elastic phase is maintained, and polling continues.
  • As soon as the health monitor comes back with signals indicating that the service is healthy, the Elastic phase ends and the HealthyState phase begins.
  • Thus, the duration specified for the Elastic phase is the maximum amount of time that can be spent polling for service health before a healthy response is considered mandatory.
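The Elastic phase behaves like a bounded polling loop. The sketch below simulates it on a minute-based clock, using the three-minute preview polling interval. `poll` is a hypothetical stand-in for querying your health monitor. The sketch treats an expired Elastic window as a hand-off to the HealthyState phase, where healthy signals become mandatory. That behavior is an interpretation of the description above, not a documented algorithm.

```python
POLL_INTERVAL_MIN = 3  # preview polling interval stated in this post

def run_elastic_phase(poll, max_elastic_min):
    """Return the minute offset at which the Elastic phase ends (sketch).

    poll(t) -> True if the health endpoint reports healthy at minute t.
    Unhealthy results are ignored. The phase ends at the first healthy
    poll, or when max_elastic_min elapses without one.
    """
    t = 0
    while t < max_elastic_min:
        if poll(t):
            return t  # first healthy signal: HealthyState begins now
        t += POLL_INTERVAL_MIN  # unhealthy signal ignored; keep polling
    return max_elastic_min  # window spent: healthy responses are now mandatory

# A service that settles six minutes after the Wait phase ends:
print(run_elastic_phase(lambda t: t >= 6, max_elastic_min=30))
```

Because unhealthy polls are ignored, a long Elastic duration costs nothing when the service settles quickly. The duration only caps how long a slow service is tolerated.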

HealthyState

  • During the HealthyState phase, service health is continually polled at the same interval as the Elastic phase.
  • The service is expected to maintain healthy signals from the health monitoring provider for the entire specified duration.
  • If at any point an unhealthy response is detected, Azure Deployment Manager will stop the entire rollout and return the REST response carrying the unhealthy service signals.
  • Once the HealthyState phase duration has ended, the healthCheck is complete, and deployment continues to the next step.
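The HealthyState phase is the strict counterpart of the Elastic phase: any unhealthy poll is fatal. The following simulation reuses the three-minute preview interval on a minute-based clock. `poll` and `run_healthy_state_phase` are hypothetical names, and the exact poll times are assumptions made for illustration.

```python
POLL_INTERVAL_MIN = 3  # preview polling interval stated in this post

def run_healthy_state_phase(poll, start_min, healthy_state_min):
    """Simulate the HealthyState phase (sketch).

    poll(t) -> True if the health endpoint reports healthy at minute t.
    Returns (True, None) if every poll in the window is healthy, or
    (False, t) with the minute of the first unhealthy poll, at which
    point the whole rollout would stop.
    """
    end = start_min + healthy_state_min
    t = start_min
    while t < end:
        if not poll(t):
            return False, t  # any unhealthy signal stops the rollout
        t += POLL_INTERVAL_MIN
    return True, None  # window passed: the rollout continues to its next step

# Healthy from minute 6 for a 15-minute HealthyState window:
print(run_healthy_state_phase(lambda t: True, start_min=6, healthy_state_min=15))
# A regression that surfaces at minute 12:
print(run_healthy_state_phase(lambda t: t < 12, start_min=6, healthy_state_min=15))
```

Under this model, the longest a single healthCheck step can take is the Wait duration plus the maximum Elastic duration plus the HealthyState duration.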

Next steps

Start using Azure Deployment Manager with a couple of resources: