Advancing Microsoft Azure reliability

Published on July 15, 2019

Chief Technology Officer, Microsoft Azure

Reliance on cloud services continues to grow for industries, organizations, and people around the world. So now more than ever it is important that you can trust that the cloud solutions you rely on are secure, compliant with global standards and local regulations, keep data private and protected, and are fundamentally reliable. At Microsoft, we are committed to providing a trusted set of cloud services, giving you the confidence to unlock the potential of the cloud.

Over the past 12 months, Azure has operated core compute services at 99.995 percent average uptime across our global cloud infrastructure. However, at the scale Azure operates, we recognize that uptime alone does not tell the full story. We experienced three unique and significant incidents that impacted customers during this time period: a datacenter outage in the South Central US region in September 2018, Azure Active Directory (Azure AD) Multi-Factor Authentication (MFA) challenges in November 2018, and DNS maintenance issues in May 2019.
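
To put the 99.995 percent figure in perspective, the short sketch below converts an uptime percentage into the downtime it implies. It assumes the 12-month period is a 365-day year and treats the figure as a simple average; the function name is illustrative and does not describe how Azure measures or reports uptime.

```python
# Minutes in a 365-day year (assumption: the 12-month period is not a leap year).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(uptime_percent: float, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the average downtime, in minutes, implied by an uptime percentage."""
    return period_minutes * (100.0 - uptime_percent) / 100.0

# 99.995 percent average uptime leaves roughly 26 minutes of downtime per year.
print(round(downtime_minutes(99.995), 2))  # 26.28
```

An average of about 26 minutes per year says nothing about how that downtime is distributed, which is why a single significant incident can matter more to an affected customer than the aggregate number suggests.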

Building and operating a global cloud infrastructure of 54 regions made up of hundreds of evolving services is a large and complex task, so we treat each incident as an important learning moment. Outages and other service incidents are a challenge for all public cloud providers, and we continue to improve our understanding of the complex ways in which factors such as operational processes, architectural designs, hardware issues, software flaws, and human factors can align to cause service incidents. All three of the incidents mentioned were the result of multiple failures that only through intricate interactions led to a customer-impacting outage. In response, we are creating better ways to mitigate incidents through steps such as redundancies in our platform, quality assurance throughout our release pipeline, and automation in our processes. The capability of continuous, real-time improvement is one of the great advantages of cloud services, and while we will never eliminate all such risks, we are deeply focused on reducing both the frequency and the impact of service issues while being transparent with our customers, partners, and the broader industry.

Ensuring reliability is a fundamental responsibility for every Azure engineer. To augment these efforts, we have formed a new Quality Engineering team within my CTO office, working alongside our Site Reliability Engineering (SRE) team to pioneer new approaches to deliver an even more reliable platform. To keep improving our reliability, here are some of the initiatives that we already have underway:

  • Safe deployment practices – Azure approaches change automation through a safe deployment practice framework which aims to ensure that all code and configuration changes go through a cycle of specific stages. These stages include dev/test, staging, private previews, a hardware diversity pilot, and longer validation periods before a broader rollout to region pairs. This has dramatically reduced the risk that software changes will have negative impacts, and we are extending this mechanism to include software-defined infrastructure changes, such as networking and DNS.
  • Storage-account level failover – During the September 2018 datacenter outage, several storage stamps were physically damaged, requiring their immediate shutdown. Because it is our policy to prioritize data retention over time-to-restore, we chose to endure a longer outage to ensure that we could restore all customer data successfully. A number of you have told us that you want more flexibility to make this decision for your own organizations, so we are empowering customers by previewing the ability to initiate your own failover at the storage-account level.
  • Expanding availability zones – Today, we have availability zones live in the 10 largest Azure regions, providing an additional reliability option for the majority of our customers. Work is also underway to bring availability zones to the next 10 largest Azure regions between now and 2021.
  • Project Tardigrade – At Build last month, I discussed Project Tardigrade, a new Azure service named after the nearly indestructible microscopic animals also known as water bears. This effort will detect hardware failures or memory leaks that can lead to operating system crashes just before they occur, so that Azure can freeze virtual machines for a few seconds while the workloads are moved to a healthy host.
  • Low- to zero-impact maintenance – We’re investing in improving zero-impact and low-impact update technologies including hot patching, live migration, and in-place migration. We’ve deployed dozens of security and reliability patches to host infrastructure in the past year, many of which were implemented with no customer impact or downtime. We continue to invest in these technologies to bring their benefits to even more Azure services.
  • Fault injection and stress testing – Validating that systems will perform as designed in the face of failures is possible only by subjecting them to those failures. We’re increasingly fault injecting our services before they go to production, both at a small scale with service-specific load stress and failures, and at regional and AZ scale with full region and AZ failure drills in our private canary regions. Our plan is to eventually make these fault injection services available to customers so that they can perform the same validation on their own applications and services.
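
As a rough illustration of how the safe deployment stages and fault injection ideas in the list above fit together, the sketch below gates a change through each stage and halts at the first unhealthy one. The stage names come from the list; the `staged_rollout` function, the health-check interface, the change name, and the injected fault are all hypothetical and are not Azure's actual tooling.

```python
from typing import Callable, List, Tuple

# Stage names follow the safe deployment bullet above; treating each stage as
# a single pass/fail health gate is a simplifying assumption.
STAGES: List[str] = [
    "dev/test",
    "staging",
    "private preview",
    "hardware diversity pilot",
    "region pair rollout",
]

def staged_rollout(
    change: str, is_healthy: Callable[[str, str], bool]
) -> Tuple[bool, List[str]]:
    """Advance a change through STAGES, halting at the first unhealthy stage.

    Returns (fully_deployed, stages_completed).
    """
    completed: List[str] = []
    for stage in STAGES:
        if not is_healthy(change, stage):
            return False, completed  # stop before the change spreads further
        completed.append(stage)
    return True, completed

def always_healthy(change: str, stage: str) -> bool:
    """Hypothetical health check for a change with no regressions."""
    return True

def fail_in_pilot(change: str, stage: str) -> bool:
    """Injected fault: a regression that only appears on some hardware."""
    return stage != "hardware diversity pilot"

ok, done = staged_rollout("dns-config-v2", always_healthy)
blocked, reached = staged_rollout("dns-config-v2", fail_in_pilot)
print(ok, done)
print(blocked, reached)
```

In the second call, the injected fault stops the change after the private preview stage, so it never reaches the region pair rollout. Deliberately failing a gate in this way is a small-scale analogue of the failure drills described above: the only way to confirm that a gate holds is to make it fail.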

Look for us to share more details of our internal architecture and operations in the future. While we are taking all of these steps to improve foundational reliability, Azure also provides you with high availability, disaster recovery, and backup solutions that can enable your applications to meet business availability requirements and recovery objectives. We maintain detailed guidance on designing reliable applications, including best practices for architectural design, monitoring application health, and responding to failures and disasters.

Reliability is and continues to be a core tenet of our trusted cloud commitments, alongside compliance, security, privacy, and transparency. Across all these areas, we know that customer trust is earned and must be maintained, not just by saying the right thing but by doing the right thing. Microsoft believes that a trusted, responsible, and inclusive cloud is grounded in how we engage as a business, develop our technology, conduct our advocacy and outreach, and serve the communities in which we operate. Microsoft is committed to providing a trusted set of cloud services, giving you the confidence to unlock the potential of the cloud.