Improving Azure Virtual Machines resiliency with Project Tardigrade
By Mark Russinovich Chief Technology Officer and Technical Fellow, Microsoft Azure
“Our goal is to empower organizations to run their workloads reliably on Azure. With this as our guiding principle, we are continuously investing in evolving the Azure platform to become fault resilient, not only to boost business productivity but also to provide a seamless customer experience. Last month I published a blog post highlighting several initiatives underway to keep improving in this space, as part of our commitment to provide a trusted set of cloud services. Today I wanted to expand on the mention of Project Tardigrade – a platform resiliency initiative that improves high availability of our services even during the rare cases of spontaneous platform failures. The post that follows was written by Pujitha Desiraju and Anupama Vedapuri from our compute platform fundamentals team, who are leading these efforts.” Mark Russinovich, CTO, Azure
This post was co-authored by Mukhtar Ahmed, Senior Software Engineer; Pujitha Desiraju, Program Manager II; Jim Cavalaris, Principal Software Engineer; Gaurav Jagtiani, Principal Software Engineering Manager; and Anupama Vedapuri, Principal Program Manager.
Codenamed Project Tardigrade, this effort draws its inspiration from the eight-legged microscopic creature, the tardigrade, also known as the water bear. Virtually impossible to kill, tardigrades can be exposed to extreme conditions and still manage to wiggle their way to survival. This is exactly what we want our servers to emulate when we consider resiliency, hence the name Project Tardigrade. Just as a tardigrade survives a wide range of extreme conditions, this project involves building resiliency and self-healing mechanisms across multiple layers of the platform, ranging from hardware to software, all with a view towards safeguarding your virtual machines (VMs) as much as possible.
How does it work?
Project Tardigrade is a broad platform resiliency initiative that employs numerous mitigation strategies to ensure that your VMs are not impacted by unanticipated host behavior. This includes enabling components to self-heal and quickly recover from potential failures to prevent impact to your workloads. Even in the rare cases of critical host faults, our priority is to preserve and protect your VMs from these spontaneous events so that your workloads run seamlessly.
One example recovery workflow is highlighted below, for the uncommon event in which a customer-initiated VM operation fails due to an underlying fault on the host server. To carry out the failed VM operation successfully, as well as to proactively prevent the issue from affecting other VMs on the server, the Tardigrade recovery service is notified and begins executing failover operations.
The following phases briefly describe the Tardigrade recovery workflow:
Phase 1: This step has no impact on running customer VMs. It simply recycles all services running on the host. In the rare case that the faulted service does not successfully restart, we proceed to Phase 2.
Phase 2: Our diagnostics service runs on the host to systematically collect all relevant logs and dumps, so that we can thoroughly diagnose the reason for the failure in Phase 1. This comprehensive analysis allows us to identify the root cause of the issue and thereby prevent recurrences in the future.
Phase 3: At a high level, we reset the OS into a healthy state with minimal customer impact to mitigate the host issue. During this phase we first preserve the state of each VM in RAM, and then reset the OS. While the OS swiftly resets underneath, running applications on all VMs hosted on the server briefly 'freeze' as the CPU is temporarily suspended. The experience is similar to a network connection that is temporarily lost but quickly resumed thanks to retry logic. After the OS is successfully reset, the VMs consume their stored state and resume normal activity, thereby avoiding any VM reboots.
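The escalation through the three phases can be sketched as a small state machine. This is only an illustrative sketch based on the description above, not Azure's implementation: the `Host` class, its `restart_services` callback, the `Outcome` values, and the log messages are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Outcome(Enum):
    RECOVERED_BY_SERVICE_RESTART = "recovered_by_service_restart"
    RECOVERED_BY_OS_RESET = "recovered_by_os_reset"


@dataclass
class Host:
    """Hypothetical stand-in for a host server; not an Azure API."""
    vm_states: Dict[str, str]             # VM name -> in-memory VM state
    restart_services: Callable[[], bool]  # True if the faulted service came back
    log: List[str] = field(default_factory=list)


def recover(host: Host) -> Outcome:
    # Phase 1: recycle host services. Running VMs are not touched.
    host.log.append("phase1: restart host services")
    if host.restart_services():
        return Outcome.RECOVERED_BY_SERVICE_RESTART

    # Phase 2: collect logs and dumps so the Phase 1 failure can be diagnosed.
    host.log.append("phase2: collect diagnostics")

    # Phase 3: preserve VM state in RAM, reset the host OS, then let the
    # VMs consume the preserved state so they resume without rebooting.
    host.log.append("phase3: preserve VM state")
    preserved = dict(host.vm_states)
    host.log.append("phase3: reset host OS")
    host.vm_states = {}          # simulate the old OS instance going away
    host.vm_states = preserved   # VMs resume from their stored state
    host.log.append("phase3: resume VMs")
    return Outcome.RECOVERED_BY_OS_RESET
```

In this sketch, a host whose services restart cleanly never leaves Phase 1, while a host whose faulted service stays down falls through to diagnostics collection and the state-preserving OS reset.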
With the above principles we ensure that the failure of any single component in the host does not impact the entire system, making customer VMs more resilient to unanticipated host faults. This also allows us to recover quickly from some of the most extreme forms of critical failures (like kernel-level failures and firmware issues) while still retaining the virtual machine state that you care about.
Currently we use the Tardigrade recovery workflow described above to catch and quickly recover from potential software failures on hosts in the Azure fleet. In parallel, we are continuously improving our technical capabilities and expanding the range of host failure scenarios that this resiliency initiative can address.
We are also looking to explore the latest innovations in machine learning to strengthen the proactive capabilities of Project Tardigrade. For example, we plan to leverage machine learning to predict more types of host failures as early as possible, such as by detecting abnormal resource utilization patterns on a host that may impact its workloads. We will also leverage machine learning to help recommend appropriate repair actions (like Tardigrade recovery steps, potentially live migration, etc.), thereby optimizing our fleetwide recovery options.
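To make the idea of detecting abnormal resource utilization concrete, here is a deliberately simple sketch that flags samples deviating sharply from a trailing window. It is a plain statistical rule rather than a machine learning model, and it is not how Azure detects host faults; the function name, window size, threshold, and sample data are all assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence


def abnormal_samples(
    utilization: Sequence[float],
    window: int = 5,
    threshold: float = 3.0,
    min_sigma: float = 1.0,
) -> List[int]:
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations.

    `min_sigma` is a floor on the standard deviation so that a perfectly
    flat history does not cause a division by zero.
    """
    flagged = []
    for i in range(window, len(utilization)):
        history = utilization[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), min_sigma)
        if abs(utilization[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical CPU utilization samples (percent) with one sudden spike.
cpu = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
spikes = abnormal_samples(cpu)
```

With these made-up samples, only the spike at index 7 is flagged. A production system would presumably use richer signals and learned models, but the flagged indices play the same role: an early hint that a host may need a repair action.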
As customers continue to shift business-critical workloads onto the Microsoft Azure cloud platform, we are constantly learning and improving so that we can continue to meet customer expectations around interruptions from unplanned failures. Reliability is and continues to be a core tenet of our trusted cloud commitments, alongside compliance, security, privacy, and transparency. Across all of these areas, we know that customer trust is earned and must be maintained, not just by saying the right thing but by doing the right thing. Platform resiliency as practiced by Project Tardigrade is already strengthening VM availability by ensuring that underlying host issues do not affect your VMs.
We will continue to share further improvements on this project and others like it, to be as transparent as possible about how we’re constantly improving platform reliability to empower your organization.