
Azure reliability

Get the tools and training you need to design and operate mission-critical systems with confidence

Reliability is a shared responsibility

Achieve your organization's reliability goals for all of your workloads by starting with the resilient foundation of the Azure cloud platform. Design and operate your mission-critical applications with confidence, knowing that you can trust your cloud because Azure prioritizes transparency—always keeping you informed and able to act quickly during service issues.

If you're looking to optimize an existing application on Azure, get started with the Azure Well-Architected Framework, a set of guiding tenets across five core pillars: reliability, security, performance efficiency, cost optimization, and operational excellence.

Start with a reliable foundation on Azure infrastructure

Learn about ongoing Microsoft investments to maintain and improve cloud platform reliability in Azure CTO and Technical Fellow Mark Russinovich's Advancing Reliability blog series, including these four recent topics: network reliability through intelligent software, safe development with AIOps—introducing Gandalf, resiliency threat modeling for large distributed systems, and low- and no-impact maintenance.

The Microsoft network links more than 60 Azure regions, 220 Azure datacenters, and 170 edge sites over more than 165,000 miles of terrestrial and subsea fiber worldwide, and it connects to the rest of the internet at strategic global edge points of presence. Learn more about Microsoft network reliability in this two-part blog post.

The continuous monitoring of health metrics is a fundamental part of the deployment process, and this is where AIOps plays a critical role. In this blog post, learn how AI and machine learning are used to empower DevOps engineers, monitor the Azure deployment process at scale, detect issues early, and make rollout or rollback decisions based on impact scope and severity.
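The rollout or rollback decision described above can be pictured with a small sketch. This is an illustration only, not Azure's deployment tooling: the `HealthSignal` fields, the `decide` function, and both thresholds are hypothetical names and values chosen to show how impact scope and severity might combine into a single decision.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"
    ROLLBACK = "rollback"


@dataclass
class HealthSignal:
    # Impact scope: fraction of updated nodes reporting a regression (0.0 to 1.0).
    affected_fraction: float
    # Severity of the worst regression seen: 0 = none, 1 = minor, 2 = major, 3 = critical.
    severity: int


def decide(signal: HealthSignal,
           scope_threshold: float = 0.05,
           rollback_severity: int = 2) -> Decision:
    """Hypothetical policy; the thresholds are assumptions, not Azure's values."""
    if signal.severity == 0:
        # No regression observed, so the rollout keeps going.
        return Decision.CONTINUE
    if (signal.severity >= rollback_severity
            and signal.affected_fraction >= scope_threshold):
        # A serious regression with wide scope: undo the change.
        return Decision.ROLLBACK
    # Anything in between: stop expanding the rollout and investigate.
    return Decision.PAUSE
```

Under these assumed thresholds, a critical regression on 10 percent of nodes leads to a rollback, while the same regression on 1 percent of nodes only pauses the rollout.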

Find out how Azure service engineering teams use “postmortems” as a tool for better understanding what went wrong, how it went wrong, and the customer impact of outages—and get insights into postmortem and resiliency threat modeling processes.

Learn about the no- and low-impact update technologies—including hot patching, memory-preserving maintenance, and live migration—that Azure uses to maintain its infrastructure with little or no customer impact or downtime.

Choose the right Azure resilience capabilities for your needs

Find out which Azure high availability, disaster recovery, and backup capabilities to use with your apps. Also, learn how to select the compute, storage, and geographic (local, zonal, and regional) redundancy options that are right for you.
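When comparing redundancy options, a rough availability estimate can help. The sketch below uses the standard formulas for components in series and in parallel; it assumes failures are independent, which real systems only approximate, and the function names and figures are illustrative rather than Azure SLA values.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of redundant copies when any one copy is enough to serve."""
    return 1.0 - (1.0 - single) ** copies


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    product = 1.0
    for availability in components:
        product *= availability
    return product


def downtime_minutes_per_30_days(availability: float) -> float:
    """Expected unavailable minutes in a 30-day window."""
    return (1.0 - availability) * 30 * 24 * 60
```

For example, two independent copies at 99 percent each give about 99.99 percent combined, while two 99 percent components in series drop to about 98.01 percent. An availability of 99.9 percent corresponds to roughly 43 minutes of downtime in 30 days.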

Enable built-in resilience

Take advantage of optional Azure services and features to achieve your specific reliability goals.

Availability zones

Run critical workloads across datacenters with independent power, cooling, and networking.

Availability sets

Achieve redundancy within a datacenter by collocating or separating resources.

Azure Traffic Manager

Implement automatic failover, optimize traffic, and combine on-premises and cloud systems.

Azure Site Recovery

Replicate on-premises and Azure workloads from a primary site to a secondary location.

Azure Backup

Back up data with a simple, secure, and cost-effective recovery and restoration solution.

Azure Storage

Create and store multiple copies of your data with redundancy options for any scenario.
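The automatic failover mentioned for Azure Traffic Manager above can be pictured as priority-based endpoint selection: traffic goes to the most-preferred endpoint that is currently healthy. The sketch below is a simplified model of that idea, not the Traffic Manager API; `Endpoint`, `pick_endpoint`, and the endpoint names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    priority: int   # Lower number means more preferred.
    healthy: bool   # Result of the latest health probe (assumed to exist).


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, if any."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.priority)


# Example: the primary is down, so traffic fails over to the secondary.
endpoints = [
    Endpoint("primary-region", priority=1, healthy=False),
    Endpoint("secondary-region", priority=2, healthy=True),
    Endpoint("on-premises", priority=3, healthy=True),
]
selected = pick_endpoint(endpoints)
```

In this example `selected` is the secondary-region endpoint. Once the primary passes its health probe again, the same function returns it, which is how failback works in this simplified model.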

Monitor your cloud so that it isn’t a black box

Ensure long-term reliability with monitoring tools to identify, diagnose, and track anomalies—and optimize your reliability and performance.

Azure Chaos Studio

Systematically improve resilience with controlled chaos.

Azure Service Health

Identify resource issues and resolve them using a customizable dashboard.

Azure Monitor

Collect, analyze, and act on telemetry data from Azure and on-premises environments.

Azure Application Insights

Get intelligent insights into app usage and diagnose anomalies.

Network Watcher

Monitor, diagnose, and gain insights into network performance and health.

Azure Advisor

Optimize apps and systems for reliability with recommendations based on usage telemetry.

Reliability trusted by organizations of all sizes

ClearBank builds infrastructure resilience, customer trust, and competitive value

"Ensuring end-to-end reliability and resiliency is a team effort. We get the tools from Azure, and we set up the systems and processes to put it all together."

Tom Harris, Chief Technology Officer, ClearBank

Kodak Alaris boosts productivity by improving ERP resilience

"The one thing I don't want is my CIO coming to me because there's a problem with our ERP. The truth is, it never happens anymore—it's a real testament to our ERP's reliability in Azure."

Joseph Calabrese, IT Operations Manager, Kodak Alaris

Serbia’s largest airport soars with automated recovery

"We wanted a business continuity plan for recovery for the business systems we need to run the airport, but without the expense of commissioning and maintaining secondary infrastructure. We also wanted to ensure recovery is fast and automated in the event of any failure."

Marko Marković, IT Department Director, AD Aerodrom Nikola Tesla Beograd

Marie Curie provides more stable, reliable services

"In the last two and a half years, we've had one outage which has been due to cloud infrastructure failing. It just almost instantly gave us stability, space to breathe, enabled us to focus on bringing real value to the organization."

Ivan Delany, IT Director, Marie Curie

Juvare drives reliability and integrity for their incident platform

"We architected our solution to spread workloads across different availability zones and regions, to maintain both client requirements for geographic data residency but also to ensure that if one particular part of our infrastructure was having a problem, it reduced the blast radius."

Bryan Kaplan, Chief Information Officer, Juvare

GEP improved the reliability of its logistics platform

"We use AKS or Azure Kubernetes Service inbuilt node pools...say your primary node pool is down, within the cluster you're automatically able to failover to the second availability zone."

Nithin Prasad, Principal Engineer, GEP

Documentation, training, and resources

Azure Architecture Center

Build reliable solutions using established patterns and best practices.

Microsoft Learn

Gain new skills to help you make your apps and systems more reliable with free Microsoft Learn modules.

Site Reliability Engineering (SRE)

Learn how to use SRE, a discipline that helps organizations achieve the appropriate level of reliability in their systems, services, and products.

Learn more about architecting for reliability, one of the five pillars of architectural excellence in the Azure Well-Architected Framework.