Skip to main content Explore View all products (200+) Microsoft Foundry Azure Copilot GitHub Copilot Azure Kubernetes Service (AKS) Azure Cosmos DB Azure Database for PostgreSQL Azure Arc Microsoft Fabric Linux virtual machines in Azure Foundry Models Foundry Agent Service Foundry IQ Foundry Tools Foundry Control Plane Observability in Foundry Control Plane Azure OpenAI in Foundry Models Azure Speech in Foundry Tools Azure Machine Learning View all databases Azure Cosmos DB Azure DocumentDB Azure SQL Azure Database for PostgreSQL Azure Managed Redis Microsoft Fabric Azure Databricks Linux virtual machines in Azure Windows Server on Azure Azure Functions Azure Virtual Machine Scale Sets Azure API Management Azure Container Apps Azure Kubernetes Service (AKS) Azure Kubernetes Fleet Manager Azure Container Registry Azure Red Hat OpenShift Azure Container Instances Azure Container Storage Azure Arc Azure Local Microsoft Defender for Cloud Azure Monitor Microsoft Sentinel Azure Migrate View all solutions (40+) Cloud solutions for small and medium businesses Cloud migration and modernization center Data analytics for AI Azure Databases AI apps and agents Microsoft Marketplace Microsoft Sovereign Cloud AI apps and agents Responsible AI with Azure AI Infrastructure Data analytics for AI Machine learning operations (MLOps) Low-code application development on Azure Integration Services Serverless computing DevOps Migration and modernization center .NET apps migration Databases on Azure Linux on Azure Oracle on Azure SAP on the Microsoft Cloud Adaptive cloud High-performance computing (HPC) Infrastructure as a service (IaaS) Resiliency Azure Essentials Azure Accelerate FinOps on Azure Microsoft Marketplace Azure pricing overview Create an Azure account Free Azure services Flexible purchase options Pricing calculator FinOps on Azure Maximize ROI from AI Azure savings plans Azure reservations Azure Hybrid Benefit Virtual Machines Azure SQL Microsoft Foundry Microsoft Fabric Azure Kubernetes Service (AKS) Microsoft Defender for Cloud Software Development Companies Microsoft Marketplace Find a partner Get started with Azure Customer stories Analyst reports, white papers, and e-books Videos Learn more about cloud computing Documentation Explore Azure portal Developer resources Quickstart templates Resources for startups Developer community Students Azure for partners Blog Events and Webinars Learn Support Contact Sales Get started with Azure Sign in
  • 4 min read

Improving Azure Virtual Machines resiliency with Project Tardigrade

Codenamed Project Tardigrade, this effort draws its inspiration from the eight-legged microscopic creature, the tardigrade also known as the water bear. Virtually impossible to kill, tardigrades can be exposed to extreme conditions, but somehow still manage to wiggle their way to survival.

“Our goal is to empower organizations to run their workloads reliably on Azure. With this as our guiding principle, we are continuously investing in evolving the Azure platform to become fault resilient, not only to boost business productivity but also to provide a seamless customer experience. Last month I published a blog post highlighting several initiatives underway to keep improving in this space, as part of our commitment to provide a trusted set of cloud services. Today I wanted to expand on the mention of Project Tardigrade – a platform resiliency initiative that improves high availability of our services even during the rare cases of spontaneous platform failures. The post that follows was written by Pujitha Desiraju and Anupama Vedapuri from our compute platform fundamentals team, who are leading these efforts.” Mark Russinovich, CTO, Azure


This post was co-authored by Mukhtar Ahmed, Senior Software Engineer; Pujitha Desiraju, Program Manager II; Jim Cavalaris, Principal Software Engineer; Gaurav Jagtiani, Principal Software Engineering Manager; and Anupama Vedapuri, Principal Program Manager

Codenamed Project Tardigrade, this effort draws its inspiration from the eight-legged microscopic creature, the tardigrade also known as the water bear. Virtually impossible to kill, tardigrades can be exposed to extreme conditions, but somehow still manage to wiggle their way to survival. This is exactly what we envision our servers to emulate when we consider resiliency, hence the name Project Tardigrade. Similar to a tardigrade’s survival across a wide range of extreme conditions, this project involves building resiliency and self-healing mechanisms across multiple layers of the platform ranging from hardware to software, all with a view towards safeguarding your virtual machines (VMs) as much as possible.

An image of a tardigrade.

How does it work?

Project Tardigrade is a broad platform resiliency initiative which employs numerous mitigation strategies with the purpose of ensuring your VMs are not impacted due to any unanticipated host behavior. This includes enabling components to self-heal and quickly recover from potential failures to prevent impact to your workloads. Even in the rare cases of critical host faults, our priority is to preserve and protect your VMs from these spontaneous events to allow your workloads to run seamlessly.

One example recovery workflow is highlighted below, for the uncommon event in which a customer initiated VM operation fails due to an underlying fault on the host server. To carry out the failed VM operation successfully, as well proactively prevent the issue from potentially affecting other VMs on the server, the Tardigrade recovery service will be notified and will begin executing failover operations.

The following phases briefly describe the Tardigrade recovery workflow:

Phase 1:

This step has no impact to running customer VMs. It simply recycles all services running on the host. In the rare case that the faulted service does not successfully restart, we proceed to Phase 2.

Phase 2:

Our diagnostics service runs on the host to collect all relevant logs/dumps systematically, to ensure that we can thoroughly diagnose the reason for failure in Phase 1. This comprehensive analysis allows us to ‘root cause’ the issue and thereby prevent reoccurrences in the future.

Phase 3:

At a high level, we reset the OS into a healthy state with minimal customer impact to mitigate the host issue. During this phase we preserve the states of each VM to RAM, after which we begin to reset the OS into a healthy state. While the OS swiftly resets underneath, running applications on all VMs hosted on the server briefly ‘freeze’ as the CPU is temporarily suspended. This experience is similar to a network connection temporarily lost but quickly resumed due to retry logic. After the OS is successfully reset, VMs consume their stored state and resume normal activity, thereby circumventing any potential VM reboots.

With the above principles we ensure that the failure of any single component in the host does not impact the entire system, making customer VMs more immune to unanticipated host faults. This also allows us to recover quickly from some of the most extreme forms of critical failures (like kernel level failures and firmware issues) while still retaining the virtual machine state that you care about.

Going forward

Currently we use the aforementioned Tardigrade recovery workflow to catch and quickly recover from potential software host failures in the Azure fleet. In parallel we are continuously innovating our technical capabilities and expanding to different host failure scenarios we can combat with this resiliency initiative.

We are also looking to explore the latest innovations in machine learning to harness the proactive capabilities of Project Tardigrade. For example, we plan to leverage machine learning to predict more types of host failures as early as possible. For example, to detect abnormal resource utilization patterns of the host that may potentially impact its workloads. We will also leverage machine learning to help recommend appropriate repair actions (like Tardigrade recovery steps, potentially live migration, etc.) thereby optimizing our fleetwide recovery options.

As customers continue to shift business-critical workloads onto the Microsoft Azure cloud platform, we are constantly learning and improving so that we can continue to meet customer expectations around interruptions from unplanned failures. Reliability is and continues to be a core tenet of our trusted cloud commitments, alongside compliance, security, privacy, and transparency. Across all of these areas, we know that customer trust is earned and must be maintained, not just by saying the right thing but by doing the right thing. Platform resiliency as practiced by Project Tardigrade is already strengthening VM availability by ensuring that underlying host issues do not affect your VMs.

We will continue to share further improvements on this project and others like it, to be as transparent as possible about how we’re constantly improving platform reliability to empower your organization.

English (United States)
Your Privacy Choices Opt-Out Icon Your Privacy Choices
Consumer Health Privacy Sitemap Contact Microsoft Privacy Manage cookies Terms of use Trademarks Safety & eco Recycling About our ads