Azure reliability

Get the tools and training you need to design and operate mission-critical systems easily and with confidence.

Reliability is a shared responsibility

Explore the shared responsibilities that customers and Azure have in designing and operating resilient apps and systems. Improve the reliability of your workloads by implementing high availability, disaster recovery, backup and monitoring on the trusted Azure cloud.

If you are using the Azure Well-Architected Framework to improve overall workload quality, the resources on this page will help you make improvements in line with the reliability pillar.

Start with a reliable foundation on Azure infrastructure

Learn about the technologies and processes that keep Azure infrastructure reliable.

Azure continuously delivers innovative new services and security by making constant improvements to its infrastructure. Learn how Azure minimises the impact of these updates on customers by using a rigorous change automation process.

Find out how Azure is working to ensure high availability for your VMs by making them less susceptible to component failures through the Project Tardigrade platform resiliency initiative.

Learn about no- and low-impact update technologies – including hot patching, memory-preserving maintenance and live migration – that Azure uses to maintain its infrastructure with little or no customer impact or downtime.

Discover how Azure is pioneering the field of AIOps to meet the challenges of the ever-increasing scale and complexity of cloud infrastructure, and to constantly improve service quality.

Choose the right Azure resilience capabilities for your needs

Find out which Azure high-availability, disaster recovery and backup capabilities to use with your apps. Also, learn how to select the compute, storage and geographic (local, zonal and regional) redundancy options that are right for you.
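As a rough illustration of why the choice of redundancy level matters, the sketch below estimates how availability changes when independent copies of a component run in parallel and when components are chained in series. The function names and the 99% and 99.9% figures are illustrative assumptions, not Azure SLA values, and the parallel formula assumes failures are fully independent, which real deployments only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent replicas when any one is enough."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per (non-leap) year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures: one instance at 99% versus two independent copies.
    for copies in (1, 2, 3):
        a = parallel_availability(0.99, copies)
        print(f"{copies} copies: {a:.6f} "
              f"(~{downtime_minutes_per_year(a):.1f} min/year down)")
```

Under these assumptions, a second independent copy moves a 99% component to roughly 99.99%, while chaining components in series always lowers the combined figure below that of the weakest component.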

Add specialised services based on your needs

Take cloud resilience to the next level with additional Azure products and services.

Availability Zones

Run critical workloads across data centres with independent power, cooling and networking.

Availability sets

Achieve redundancy within a data centre by collocating or separating resources.

Azure Traffic Manager

Implement automatic failover, optimise traffic and combine on-premises and cloud systems.

Azure Site Recovery

Replicate on-premises and Azure workloads from a primary site to a secondary location.

Azure Backup

Back up data with a simple, secure and cost-effective recovery and restoration solution.

Azure Storage

Create and store multiple copies of your data with redundancy options for any scenario.

Maintain high reliability and optimise performance

Ensure long-term reliability with monitoring tools to identify, diagnose and track anomalies – and optimise for performance and cost.

Azure Service Health

Identify resource issues and resolve them using a customisable dashboard.

Azure Monitor

Collect, analyse and act on telemetry data from Azure and on-premises environments.

Azure Application Insights

Get intelligent insights into app usage and diagnose anomalies.

Network Watcher

Monitor, diagnose and gain insights into network performance and health.

Azure Advisor

Optimise apps and systems for reliability with recommendations based on usage telemetry.

Reliability trusted by organisations of all sizes

For cardiac patients, EarlyWarning app can’t miss a beat

ThoughtWire is bringing its EarlyWarning app to Azure to help pre-empt and prevent cardiac arrest in hospitals by providing real-time data analysis on patients’ critical information and alerting clinicians if action is needed.

ThoughtWire

Serbia’s largest airport soars with automated recovery

"We wanted a business continuity plan for recovery for the business systems we need to run the airport, but without the expense of commissioning and maintaining secondary infrastructure. We also wanted to ensure recovery is fast and automated in the event of any failure."

– Marko Marković, IT Department Director, AD Aerodrom Nikola Tesla Beograd

Clinical alerts delivered at any scale with Stat

"We need 100 percent reliability in mission-critical apps. That's what we get from Azure SignalR Service and the other Microsoft solutions that we used to create Stat. We're very pleased."

– John McConnell, Supervisor of Solution Architecture and Development, University of Vermont Medical Center

Stable insurance app delivers premium customer experience

"Before we went through the end-of-year period, the stability and performance of the calculator were my main concerns . . . but it ended up running the smoothest out of all the solutions in our application landscape. It's very stable, and web pages even load slightly faster now."

– Pieter Van Soerland, IT Manager, Zilveren Kruis

Logistics company keeps delivering with disaster recovery solution

"We sought to implement the disaster recovery on Azure as its cloud services allowed for immediate deployment. Additionally, transferring from CAPEX to OPEX model resulted in huge savings."

– Maged Kamal, Senior Director – Information Technology and General Manager at LEDD Technologies
Gulf Warehousing Company

Making SAP even more flexible and resilient with Azure

"Cloud computing is easily one of the best IT achievements in the business world. When we decided to move to it with the help of Microsoft Azure, we got efficiency, reliability, flexibility, and speed as much as we wanted. We still think we received nothing but benefits."

– Erick Cardenas, IT Administrator, KOT Insurance Company AG

Documentation, training and resources

Azure Architecture Center

Build reliable solutions using established patterns and best practices:

Azure Well-Architected Framework

Improve workloads using five pillars of excellence: reliability, cost optimisation, operational excellence, performance efficiency and security. Get started with the interactive assessment.

Azure Application Architecture Guide

Take a structured approach to building scalable, resilient and highly available apps based on the real-world experiences of other Azure customers.

Microsoft Learn

Gain new skills to help you make your apps and systems more reliable with free Microsoft Learn modules.

Site Reliability Engineering (SRE)

Learn how to use SRE, a discipline that helps organisations achieve the appropriate level of reliability in their systems, services and products.
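One common way SRE teams reason about an "appropriate level of reliability" is an error budget: the amount of unavailability a service level objective (SLO) permits over a window. The sketch below is a minimal illustration of that arithmetic. The function names, the 30-day window and the 99.9% target are assumptions chosen for the example, not values taken from Azure or from any specific SRE course.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an SLO allows over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% SLO with 10.8 minutes of downtime so far.
    print(f"budget: {error_budget_minutes(0.999):.1f} min")
    print(f"remaining: {budget_remaining(0.999, 10.8):.0%}")
```

A team might use a figure like this to decide whether to keep shipping changes or to pause and invest in stability when the remaining budget runs low.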

Learn more about architecting for reliability, one of the five pillars of architectural excellence in the Azure Well-Architected Framework.