Azure IaaS: Keep critical applications running with built-in resiliency at scale

Azure IaaS provides foundational capabilities across compute, storage, and networking to help organizations stay resilient.

This blog post is the second part of a blog series called Azure IaaS which will share best practices and guidance to help you build a trusted infrastructure platform—from performance, resiliency, and security to scalability and cost efficiency.

Disruption should not be treated as an edge case. It is a reality organizations must be prepared to navigate. That preparation starts with resiliency as a core design principle, not an afterthought. Businesses depend on a broad set of applications to run daily operations, from essential internal systems to mission-critical workloads. And across that landscape, hardware issues, maintenance events, zonal disruptions, and even regional incidents can all affect availability.

The goal of a resilient infrastructure is not to assume disruptions will never happen. It is to ensure services remain available, impacts stay contained, and recovery happens quickly when events occur. In that sense, resiliency is what helps organizations maintain continuity, protect customer trust, and operate with confidence even when conditions change.

Azure IaaS is purpose-built to offer a resilient operating environment, delivering enterprise grade-resiliency. But outcomes ultimately depend on how product features across compute, storage, and networking are brought together within customer environments to help maintain availability through disruptions. Resiliency is a shared responsibility: Azure IaaS helps organizations start from a resilient platform foundation with built-in capabilities for availability, continuity, and recovery, while customers design and configure workloads to meet their specific business and operational requirements.

Designing for resiliency is not a one-time decision, and it is rarely simple. As architectures grow more distributed and workload requirements become more demanding, the Azure IaaS Resource Center provides a centralized destination for tutorials, best practices, and guidance organizations need to build and operate resilient infrastructure with greater confidence.

Explore the Azure IaaS Resource Center

Resiliency built into the foundation of mission-critical applications

When an application is truly mission critical, downtime is not just inconvenient; it can disrupt customer transactions, delay operations, interrupt employee productivity, and create real financial and reputational impact. That is why resilient design starts with one important shift in mindset: not asking whether disruption will happen but designing for how the application will behave when it does.

Azure IaaS helps customers do that with built-in capabilities that support isolation, redundancy, failover, and recovery across the infrastructure stack. The value of those capabilities is not just technical. It is operational. They help organizations reduce the blast radius of disruption, improve continuity, and recover with greater predictability when critical services are under pressure.

Keep applications available with resilient compute design

Compute resiliency starts with placement and isolation. For example, if all the virtual machines supporting an application sit too close together from an infrastructure perspective, a localized event can affect more of the workload than expected.

For applications that need both scale and availability, Virtual Machine Scale Sets help automate deployment and management while distributing instances across availability zones and fault domains. This is especially valuable for front-end tiers, application tiers, and other distributed services where maintaining enough healthy instances is key to staying online.

For broader protection, availability zones provide datacenter-level isolation within a region. Each zone has independent power, cooling, and networking, which allows organizations to architect applications across zones so that if one zone is affected, healthy instances in another zone can continue serving the workload.

Together, these capabilities help organizations reduce single points of failure and design compute architectures that are better prepared to absorb localized infrastructure events, planned maintenance, and zonal disruptions.

Build continuity and recovery on a resilient storage foundation

When disruption occurs, organizations need confidence that application data is still durable, accessible, and recoverable. Azure provides multiple storage redundancy models to support those needs. Locally redundant storage (LRS) keeps multiple copies of data within a single datacenter. Zone-redundant storage (ZRS) replicates data synchronously across availability zones within a region, helping protect against zonal failures. For broader cross-geographical resiliency scenarios, geo-redundant storage (GRS) and read-access geo-redundant storage (RA-GRS) extend protection to a secondary region.

For managed disks and virtual machine-based workloads, recovery is also shaped by capabilities such as snapshots, Azure Backup, and Azure Site Recovery. These are not just backup features in the abstract. They are mechanisms that help define how much data an organization could lose and how quickly an application can be restored after an incident.

That is why storage decisions should not be treated as only a performance or capacity conversation. For stateful applications especially, storage is central to recovery point objectives, recovery time objectives, and the broader question of how the business resumes operation after disruption.

Keep network traffic moving when conditions change

A workload is not truly available if users and dependent services cannot reach it. Even when compute and storage remain healthy, traffic disruption can still turn a manageable infrastructure event into a customer-facing outage.

That is where networking plays a distinct resiliency role. Azure networking services help maintain reachability by distributing traffic across healthy resources and redirecting around issues when conditions change. Azure Load Balancer helps spread traffic across available instances. Application Gateway adds intelligent Layer 7 routing for web applications. Traffic Manager uses DNS-based routing across endpoints, while Azure Front Door helps direct and failover internet traffic at a global level.

For customers, the value here is practical. Good networking design means that when one instance, zone, or endpoint becomes unavailable, traffic can move to a healthy path instead of stopping altogether. That can be the difference between a brief, invisible reroute and an outage your users immediately feel.

In mission-critical environments, resilient networking is what connects healthy infrastructure to real-world continuity.

Tailor resiliency to what each workload demands

Not all workloads require the same resiliency approach, and recognizing those differences is central to effective architecture and design. A stateless application tier may benefit most from autoscaling, zone distribution, and rapid instance replacement. A stateful workload may require stronger replication, backup, and failover planning because continuity depends just as much on the integrity of the data as the availability of the compute layer.

Mission-critical workloads often demand more from every layer of the stack. They may need tighter recovery targets, broader failure isolation, and more rigorously tested recovery paths than lower-priority internal systems. That does not mean every workload requires the highest possible level of redundancy. It means resiliency architecture should be guided by business impact.

Azure IaaS gives customers flexibility. The same platform can support different patterns depending on workload criticality, operational needs, and acceptable tradeoffs around cost, complexity, and recovery speed.

Make every migration a chance to build greater resiliency

Whether organizations are migrating existing applications or deploying new ones on Azure, the transition point is one of the best opportunities to build resiliency in from the start. It is the moment to reexamine architecture choices, eliminate inherited single points of failure, and design for stronger continuity across compute, storage, and networking.

Too often, a move to the cloud simply recreates existing infrastructure patterns and carries forward the same risks. But migration or new deployment can be much more valuable than that. For example, Carne Group recently shared how its move to Azure helped turn migration into a broader resiliency strategy, combining Azure Site Recovery with Terraform-based landing zones to streamline cutover while strengthening recovery readiness and operational resilience.

With IaC in place, we could easily build a duplicate site in another region. Even in the event of a worst-case scenario, we could be back up and running more or less in the same day.
Stéphane Bebrone, Global Technology Lead at Carne Group

Learn more about how Carne Group uses Azure

This is also where infrastructure as code and deployment automation play an important role. Using repeatable deployment templates and CI/CD workflows helps teams standardize resilient architectures, reduce configuration drift, and recover environments more consistently when changes or disruptions occur.

Azure Site Recovery is a foundational Azure capability for regional resilience, enabling workloads to be replicated and restarted in another Azure region on demand. Customers retain control over where and when workloads move, aligning recovery behavior with capacity, compliance, and regional availability needs.

Services such as Azure Migrate, Azure Storage Mover, and Azure Data Box support different migration scenarios. GitHub and pipeline-based deployment practices then help operationalize resiliency over time.

In that sense, this is bigger than migration alone. Whether a workload is being moved, modernized, or built new on Azure, resiliency should be part of the deployment strategy from the beginning, not added later.

Maintain resiliency after deployment as workloads evolve

Resiliency must also be maintained over time. As workloads grow and change, configuration drift, new dependencies, and evolving recovery expectations can weaken the architecture originally put in place. The most resilient organizations periodically validate readiness through testing, drills, fault simulations, and observability practices that help teams identify issues early, understand root cause, and make informed corrections. Resiliency in Azure was released in preview at Ignite to help organizations assess, improve, and validate application resiliency, with a public preview planned for Microsoft Build 2026.

Azure IaaS provides foundational capabilities across compute, storage, and networking, but resilient outcomes result from how those capabilities are combined and operationalized. By designing with disruption in mind, organizations can create architectures that stay available more consistently, protect critical data more effectively, and recover more predictably when incidents occur.

To go deeper, explore the Azure IaaS Resource Center for tutorials, best practices, and guidance across compute, storage, and networking to help you design and operate resilient infrastructure with greater confidence.

Did you miss these posts in the Azure IaaS series?

Azure IaaS: Keep critical applications running with built-in resiliency at scale

Tech Community

Resiliency built into the foundation of mission-critical applications

Keep applications available with resilient compute design

Build continuity and recovery on a resilient storage foundation

Keep network traffic moving when conditions change

Tailor resiliency to what each workload demands

Make every migration a chance to build greater resiliency

Maintain resiliency after deployment as workloads evolve

Tech Community

Azure IaaS: Explore new resources for building a stronger, more efficient infrastructure

Powering multi-cluster workloads with seamless cross‑cluster networking for Azure Kubernetes Fleet Manager

Azure NetApp Files for EDA workloads: From revolution to breakthrough at scale

Explore
Microsoft Foundry

Tech Community

Resiliency built into the foundation of mission-critical applications

Keep applications available with resilient compute design

Build continuity and recovery on a resilient storage foundation

Keep network traffic moving when conditions change

Tailor resiliency to what each workload demands

Make every migration a chance to build greater resiliency

Maintain resiliency after deployment as workloads evolve

Tech Community

ExploreMicrosoft Foundry

Explore
Microsoft Foundry