Get the tools and training you need to design and operate mission-critical systems easily and with confidence
Reliability is a shared responsibility
Explore the shared responsibilities that customers and Azure have in designing and operating resilient apps and systems. Improve the reliability of your workloads by implementing high availability, disaster recovery, backup, and monitoring on the trusted Azure cloud.
For anyone using the Azure Well-Architected Framework to improve their overall workload quality, the resources on this page will help you make improvements in line with the reliability pillar.
Start with a reliable foundation on Azure infrastructure
Learn about the technologies and processes Azure uses to keep its infrastructure reliable.
Azure continuously delivers innovative new services and security by making constant improvements to its infrastructure. Learn how Azure minimizes the impact of these updates on customers by using a rigorous change automation process.
Learn about no- and low-impact update technologies—including hot patching, memory-preserving maintenance, and live migration—that Azure uses to maintain its infrastructure with little or no customer impact or downtime.
Choose the right Azure resilience capabilities for your needs
Find out which Azure high availability, disaster recovery, and backup capabilities to use with your apps. Also, learn how to select the compute, storage, and geographic (local, zonal, and regional) redundancy options that are right for you.
Add specialized services based on your needs
Take cloud resilience to the next level with additional Azure products and services.
Run critical workloads across datacenters with independent power, cooling, and networking.
Achieve redundancy within a datacenter by collocating or separating resources.
Implement automatic failover, optimize traffic, and combine on-premises and cloud systems.
Replicate on-premises and Azure workloads from a primary site to a secondary location.
Back up data with a simple, secure, and cost-effective recovery and restoration solution.
Create and store multiple copies of your data with redundancy options for any scenario.
Maintain high reliability and optimize performance
Ensure long-term reliability with monitoring tools to identify, diagnose, and track anomalies—and optimize for performance and cost.
Identify resource issues and resolve them using a customizable dashboard.
Collect, analyze, and act on telemetry data from Azure and on-premises environments.
Get intelligent insights into app usage and diagnose anomalies.
Reliability trusted by organizations of all sizes
For cardiac patients, EarlyWarning app can’t miss a beat
ThoughtWire is bringing its EarlyWarning app to Azure to help pre-empt and prevent cardiac arrest in hospitals by providing real-time data analysis on patients’ critical information and alerting clinicians if action is needed.
Serbia’s largest airport soars with automated recovery
"We wanted a business continuity plan for recovery for the business systems we need to run the airport, but without the expense of commissioning and maintaining secondary infrastructure. We also wanted to ensure recovery is fast and automated in the event of any failure."
– Marko Marković, IT Department Director, AD Aerodrom Nikola Tesla Beograd
Clinical alerts delivered at any scale with Stat
"We need 100 percent reliability in mission-critical apps. That's what we get from Azure SignalR Service and the other Microsoft solutions that we used to create Stat. We're very pleased."
– John McConnell, Supervisor of Solution Architecture and Development, University of Vermont Medical Center
Stable insurance app delivers premium customer experience
"Before we went through the end-of-year period, the stability and performance of the calculator were my main concerns . . . but it ended up running the smoothest out of all the solutions in our application landscape. It's very stable, and web pages even load slightly faster now."
– Pieter Van Soerland, IT Manager, Zilveren Kruis
Logistics company keeps delivering with disaster recovery solution
"We sought to implement the disaster recovery on Azure as its cloud services allowed for immediate deployment. Additionally, transferring from CAPEX to OPEX model resulted in huge savings."
– Maged Kamal, Senior Director – Information Technology and General Manager at LEDD Technologies
Making SAP even more flexible and resilient with Azure
"Cloud computing is easily one of the best IT achievements in the business world. When we decided to move to it with the help of Microsoft Azure, we got efficiency, reliability, flexibility, and speed as much as we wanted. We still think we received nothing but benefits."
– Erick Cardenas, IT Administrator, KOT Insurance Company AG
Documentation, training, and resources
Azure Architecture Center
Build reliable solutions using established patterns and best practices:
Improve workloads using five pillars of excellence: reliability, cost optimization, operational excellence, performance efficiency, and security. Get started with this interactive assessment.
Take a structured approach to building scalable, resilient, and highly available apps based on the real-world experiences of other Azure customers.
Gain new skills to help you make your apps and systems more reliable with free Microsoft Learn modules.
Site Reliability Engineering (SRE)
Learn how to use SRE, a discipline that helps organizations achieve the appropriate level of reliability in their systems, services, and products.