Improving Azure Virtual Machine resiliency with predictive ML and live migration

We’re committed to ensuring that you can run your workloads reliably on Azure. One area where we’re investing heavily to improve reliability is the combination of machine learning and live migration to predict and proactively mitigate potential failures.

This post was co-authored by Catherine Burke, Niko Pamboukas, Yingnong Dang, Omar Khan, and Alistair Speirs from the Microsoft Azure team.

Since early 2018, Azure has been using live migration in response to a variety of failure scenarios such as hardware faults, as well as regular fleet operations like rack maintenance and software/BIOS updates. Our initial use of live migration to handle failures gracefully allowed us to reduce the impact of failures on availability by 50 percent.

To further push the envelope on live migration, we knew we needed to look at the proactive use of these capabilities, based on good predictive signals. Using our deep fleet telemetry, we enabled machine learning (ML)-based failure predictions and tied them to automatic live migration for several hardware failure cases, including disk failures, IO latency, and CPU frequency anomalies.

We partnered with Microsoft Research (MSR) on building our ML models that predict failures with a high degree of accuracy before they occur. As a result, we’re able to live migrate workloads off “at-risk” machines before they ever show any signs of failing. This means VMs running on Azure can be more reliable than the underlying hardware.

From a VM perspective, live migration should have minimal impact; in fact, none of our customers have reported any issues stemming from their VM being live migrated. The VM state and all network connections are preserved during live migration. In the final phase of the live migration, VMs are paused for a few seconds and moved to their new hosts. A small number of performance-sensitive workloads may also notice a slight degradation in the few minutes leading up to the VM pause.

Hardware failure prediction

Initially, we have focused our ML models on predicting disk failures, which are a top driver of hardware faults. Predicting disk failures in an environment as large as Azure is complicated, and we had to overcome multiple challenges to be successful. Our disk prediction model had to consider:

  • A wide variety of health signals: Some examples include guest VM performance degradation, host operating system behavior, and disk telemetry.
  • Different customer workloads: Different workloads exhibit different symptoms of disk failures. Disk-intensive workloads may see disk failure happen shortly after an early symptom is observed, while a node with relatively less disk-intensive workloads may not see this for a few weeks or months.
  • Different disk manufacturers: The behavior and failure patterns of disks vary from manufacturer to manufacturer and even model-to-model.
  • Imbalanced failure rates: In general, only 1 out of 10,000 nodes exhibits signs of disk failure. Classical machine learning approaches do not handle such imbalances well.

Addressing these challenges required us to design an approach that was both holistic in how it gathered signals, and flexible enough to resist false positives.
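To make the imbalance problem concrete, here is a minimal sketch of why plain accuracy is a misleading metric at a 1-in-10,000 failure rate. The data is synthetic and the function name is hypothetical; it is an illustration, not part of the Azure model.

```python
def accuracy_and_recall(labels, predictions):
    """Return (accuracy, recall) for binary labels where 1 means 'failing'."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    actual_positives = sum(labels)
    accuracy = correct / len(labels)
    recall = true_positives / actual_positives if actual_positives else 0.0
    return accuracy, recall


# Synthetic fleet: 1 failing node out of 10,000 (the ratio quoted above).
labels = [1] + [0] * 9_999

# A useless "model" that always predicts "healthy".
always_healthy = [0] * 10_000

accuracy, recall = accuracy_and_recall(labels, always_healthy)
print(f"accuracy={accuracy:.4f} recall={recall:.1f}")  # accuracy=0.9999 recall=0.0
```

A model that never predicts a failure scores 99.99 percent accuracy while catching none of the failing nodes, which is why the evaluation has to focus on how well the few real failures are surfaced.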

First, we use telemetry at both the system and disk levels. System-level signals include host IO performance counters and system events. Disk-level signals come from S.M.A.R.T. data (a standard disk telemetry format). We apply comprehensive feature engineering to learn from these heterogeneous signals.

Second, we treat the problem as a ranking problem instead of a classification problem. After ranking the disk failure probabilities, we use an optimization model to identify the top N disks with the highest likelihood of failing (N is determined by the estimated optimal cost/benefit tradeoff).
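The rank-then-select idea can be sketched as follows. The cost model here is an assumption made for illustration (a fixed cost for acting on a healthy disk and a fixed cost for missing a real failure), and `select_top_n`, the disk IDs, and the scores are all hypothetical; the post does not describe the actual optimization model.

```python
def select_top_n(scores, cost_false_alarm, cost_missed_failure):
    """Rank disks by predicted failure probability and pick the top N.

    scores: dict mapping disk ID -> predicted failure probability.
    N is the prefix length of the ranking that maximizes the cumulative
    expected net benefit under the assumed (hypothetical) cost model.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_n, best_benefit, running = 0, 0.0, 0.0
    for n, (_, probability) in enumerate(ranked, start=1):
        # Expected gain from acting on this disk: the failure cost avoided,
        # minus the cost wasted if the disk was actually healthy.
        running += (probability * cost_missed_failure
                    - (1 - probability) * cost_false_alarm)
        if running > best_benefit:
            best_n, best_benefit = n, running
    return [disk_id for disk_id, _ in ranked[:best_n]]


scores = {"disk-a": 0.92, "disk-b": 0.40, "disk-c": 0.03, "disk-d": 0.001}
print(select_top_n(scores, cost_false_alarm=1.0, cost_missed_failure=10.0))
# ['disk-a', 'disk-b']
```

Because the selection works on the ordering rather than on a fixed probability threshold, it stays meaningful even when absolute probabilities are tiny, which is the usual situation with heavily imbalanced data.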

Third, we had to customize the ranking approach to factor in the heterogeneity of the signals as well as their correlation, which renders less sophisticated model training and validation approaches unusable.

The following is a real example from October 30, 2018 in which our disk failure prediction helped to protect real customer workloads:

  • At 01:59:26, we predicted that a disk had a high probability of failure. This failure could impact the five VMs that were running on the node.
  • At 02:10:38, we started to use live migration to migrate these five VMs off the node. The blackout time ranged from 0.1 to 1.6 seconds.
  • The node was then removed from production for detailed diagnostics.
  • At 06:20:34, the node failed the disk stress test and was sent for repair.
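The intervals in this timeline are easy to work out. The short sketch below does the arithmetic, assuming all three timestamps fall on October 30, 2018; the variable names are illustrative only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps from the example above (assumed to be on the same day).
predicted = datetime.strptime("2018-10-30 01:59:26", FMT)
migration_started = datetime.strptime("2018-10-30 02:10:38", FMT)
stress_test_failed = datetime.strptime("2018-10-30 06:20:34", FMT)

# Time from the prediction to the start of live migration.
reaction_time = migration_started - predicted

# Time from the start of live migration to the failed stress test.
diagnosis_gap = stress_test_failed - migration_started

print(reaction_time)   # 0:11:12
print(diagnosis_gap)   # 4:09:56
```

Migration began about 11 minutes after the prediction, and the disk fault was confirmed by the stress test a little over four hours later. The stress test time is when the fault was confirmed, not necessarily when the disk would have failed in production.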

For more details on our innovative use of ML, please see our papers on disk failure prediction and node failure prediction.

How live migration works behind the scenes

At a high level, live migration consists of three main phases: pre-migration, brownout, and blackout. In the pre-migration phase, the live migration orchestrator chooses the best destination node, exports VM configurations, and sets up authorization. During this phase, the VM remains running on the source node with no impact to availability or performance. Next is the brownout phase, during which the memory and disk state is transferred from the source node to the destination node. During this phase, the VM is still running, but there may be minor performance degradation due to the additional work being done.

The length of the brownout depends on VM size (particularly memory and disk) and the rate at which memory is changing. It is typically on the order of minutes; for our most common VM sizes, the brownout ranges from 1 to 30 minutes. The final phase of live migration is the blackout phase. Once the brownout phase ends, the VM is in a suspended state on both the source and destination nodes.

The Azure live migration agent transfers additional Azure-specific state information before starting the destination VM. The length of the blackout depends on the amount of VM state remaining to be transferred after the VM is paused. As was the case in the example above, the blackout is generally in the range of low single-digit seconds or less.
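To show why brownout length tracks memory size and the rate of memory change, while blackout length tracks the state left over at pause time, here is a minimal sketch of a generic iterative pre-copy model. This is a textbook simplification and an assumption on our part, not a description of Azure's implementation; the function name, parameters, and numbers are hypothetical.

```python
def estimate_phases(memory_gb, dirty_rate_gb_per_s, bandwidth_gb_per_s,
                    blackout_budget_gb, max_rounds=30):
    """Estimate (brownout_seconds, blackout_seconds) for a pre-copy migration.

    Each brownout round copies the outstanding state while the VM keeps
    running, so new memory is dirtied during the copy. The VM is paused
    once the outstanding state fits within blackout_budget_gb, or after
    max_rounds if the copy is not converging.
    """
    remaining_gb = memory_gb
    brownout_s = 0.0
    rounds = 0
    while remaining_gb > blackout_budget_gb and rounds < max_rounds:
        round_s = remaining_gb / bandwidth_gb_per_s
        brownout_s += round_s
        # Memory dirtied while this round was copying must be sent again.
        remaining_gb = dirty_rate_gb_per_s * round_s
        rounds += 1
    # Blackout: the VM is paused while the last remaining state is sent.
    blackout_s = remaining_gb / bandwidth_gb_per_s
    return brownout_s, blackout_s


quiet = estimate_phases(16, 0.1, 1.0, blackout_budget_gb=0.5)
busy = estimate_phases(16, 0.5, 1.0, blackout_budget_gb=0.5)
print(quiet, busy)
```

In this toy model, a VM that dirties memory five times faster needs more copy rounds, so its brownout is longer, and it reaches the pause with more state outstanding, so its blackout is longer too. Real systems also transfer disk and platform-specific state, which the sketch ignores.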

Limitations

While we are proud of these platform achievements, we know there is always more to be done. We are progressively increasing the scenarios in which we use live migration and expanding its technical capabilities. The following scenarios are not yet supported with live migration:

VM availability is critical to our customers’ success on the Azure platform, and machine learning and live migration are pivotal to Azure’s commitment to this customer promise. We use live migration to perform platform updates transparently and to recover from a variety of hardware and software faults. Machine learning has been instrumental in increasing the effectiveness of live migration. While there are technical limitations to live migration, we are continuously prioritizing improvements.

Our goal is to ensure that VMs are never interrupted due to underlying platform issues and operations. In future posts, we will share further improvements to our resilient foundation that allows us to maximize the availability of your applications running in Azure.