On November 18, 2014, many of our Microsoft Azure customers experienced a service interruption that impacted Azure Storage and several other services, including Virtual Machines. Following the incident, we posted a blog that outlined a preliminary Root Cause Analysis (RCA) to ensure customers understood how we were working to address the issue. Since that time, our highest priority has been actively investigating and mitigating this incident. Today, we’re sharing our final RCA, which includes a comprehensive outline of the steps we’ve taken to prevent this situation from happening again, as well as the steps we’re taking to improve our communications and support response. We sincerely apologize for, and recognize, the significant impact this service interruption may have had on your applications and services. We appreciate the trust our customers place in Microsoft Azure, and I want to personally thank everyone for the feedback, which will help us continually improve.
Root Cause Analysis
On November 18th [PST] (November 19th [UTC]), Microsoft Azure experienced a service interruption that resulted in intermittent connectivity issues with the Azure Storage service in multiple regions. Dependent services, primarily Azure Virtual Machines, also experienced secondary impact due to the connectivity loss with the Azure Storage service.
This document illustrates the standard storage update process, describes how a departure from that process resulted in the interruption, explains the cause of the resulting secondary impact suffered by some customers, discusses deficiencies in customer communications and support, and concludes with a summary of the steps we are taking to prevent a recurrence of similar interruptions and improve our communications and support response.
Microsoft Azure Storage Deployments
There are two types of Azure Storage deployments: software deployments (i.e. publishing code) and configuration deployments (i.e. changing settings). Both software and configuration deployments require multiple stages of validation and are incrementally deployed to the Azure infrastructure in small batches. This progressive deployment approach is called ‘flighting.’ When flights are in progress, we closely monitor health checks. As continued usage and testing demonstrate successful results, we deploy the change to additional slices across the Azure Storage infrastructure.
The following is the typical deployment process:
- First, we deploy updates to an internal test environment where they are tested and validated.
- After test environment validation passes, we deploy updates to pre-production environments that run production-level workloads.
- Following validation on the pre-production environment, we deploy updates to a small slice of the production infrastructure.
- After completing validation on each slice, we flight updates to incrementally larger slices. When flighting these updates, the slices are selected to be geographically isolated, such that any new issue that might be exposed will only impact one region from a set of paired regions (e.g. West Europe and North Europe). A simplified sketch of this progression follows.
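To make the flighting progression concrete, here is a minimal sketch of a staged rollout with health checks between slices. This is illustrative only: the stage names, slice lists, and the deployment and health-check functions are assumptions for the sketch, not Azure’s actual tooling.

```python
# Illustrative sketch of a staged ("flighted") rollout; all names are hypothetical.

FLIGHT_STAGES = [
    {"name": "test",           "slices": ["test-cluster"]},
    {"name": "pre-production", "slices": ["preprod-cluster"]},
    {"name": "production-1",   "slices": ["prod-slice-01"]},
    # Later stages widen the rollout while keeping paired regions (e.g. West
    # Europe / North Europe) in separate flights, so at most one of the pair
    # is exposed to any new issue.
    {"name": "production-2",   "slices": ["prod-slice-02", "prod-slice-03"]},
]

def deploy_to_slice(update_id, slice_name):
    print(f"deploying {update_id} to {slice_name}")   # stand-in for the real deployment action

def health_check(slice_name):
    return True                                       # stand-in for monitoring signals (alerts, error rates)

def flight_update(update_id):
    for stage in FLIGHT_STAGES:
        for slice_name in stage["slices"]:
            deploy_to_slice(update_id, slice_name)
        # Only advance to the next, larger slice after this stage looks healthy.
        if not all(health_check(s) for s in stage["slices"]):
            raise RuntimeError(f"Health checks failed in stage {stage['name']!r}; "
                               f"halting rollout of {update_id}")

if __name__ == "__main__":
    flight_update("storage-update-001")
```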
Overview of the Storage Incident
We are continuously looking for ways to improve the performance of all aspects of our platform. In this case, we developed a software change to improve Azure Storage performance by reducing the CPU footprint of the Azure Storage Table Front-Ends.
We deployed the software change using the described flighting approach with the new code disabled by default using a configuration switch. We subsequently enabled the code for Azure Table storage Front-Ends using the configuration switch within the Test and Pre-Production environments. After successfully passing health checks, we enabled the change for a subset of the production environment and tested for several weeks.
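The configuration-switch pattern described above can be sketched as follows. This is a simplified illustration; the flag names, settings store, and handler functions are assumptions rather than Azure’s actual configuration system.

```python
# Illustrative configuration switch (feature flag); all names are hypothetical.

CONFIG = {
    # New code ships disabled by default and is enabled per front-end type
    # only after it passes validation in test and pre-production.
    "table_frontend.enable_cpu_optimization": True,
    "blob_frontend.enable_cpu_optimization": False,
}

def serve_with_optimized_path(request):
    return f"optimized:{request}"      # stand-in for the new code path

def serve_with_existing_path(request):
    return f"existing:{request}"       # stand-in for the proven, existing code path

def handle_table_request(request):
    # The switch decides at runtime which code path serves the request.
    if CONFIG["table_frontend.enable_cpu_optimization"]:
        return serve_with_optimized_path(request)
    return serve_with_existing_path(request)
```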
While testing, the fix showed notable performance improvement and resolved some known customer issues with Azure Table storage performance. Given these improvements, we decided to deploy the fix broadly across the production environment. During this deployment, there were two operational errors:
1. The standard flighting deployment policy of incrementally deploying changes across small slices was not followed.
The engineer fixing the Azure Table storage performance issue believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling it across the entire infrastructure was low risk. Unfortunately, the configuration tooling did not adequately enforce the policy of deploying the change incrementally across the infrastructure.
2. Although validation in test and pre-production had been done against Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends.
Enabling this change on the Azure Blob storage Front-Ends exposed a bug that caused some Azure Blob storage Front-Ends to enter an infinite loop and become unable to service requests.
Automated monitoring alerts notified our engineering team within minutes of the incident. We reverted the change globally within 30 minutes of the start of the issue, which protected many Azure Blob storage Front-Ends from experiencing the problem. However, the Azure Blob storage Front-Ends that had already entered the infinite loop were unable to accept any configuration changes, so reverting the configuration did not recover them; they required a restart, extending the time to recover.
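The need for a restart can be illustrated with a simplified, hypothetical front-end loop: the process only observes configuration between requests, so once it is stuck inside a non-terminating code path it never sees the reverted switch. The function and flag names below are illustrative assumptions, not the actual front-end code.

```python
# Hypothetical illustration of why the stuck front-ends needed a restart.

def serve_request(request):
    return f"served:{request}"           # stand-in for the normal request path

def buggy_new_path(request):
    while True:                          # the exposed bug: this loop never exits
        pass

def frontend_main_loop(read_config, requests):
    for request in requests:
        config = read_config()           # a reverted switch would be picked up here...
        if config["enable_cpu_optimization"]:
            buggy_new_path(request)      # ...but once control enters this infinite
        else:                            # loop it never returns to re-read the
            serve_request(request)       # configuration; only a restart helps.
```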
Enforcing Deployment Policies
After the issue was discovered, all configuration changes were immediately halted so that we could examine gaps in the core deployment tooling. Once that analysis was complete, we released an update to our deployment system tooling to enforce compliance with the above testing and flighting policies for standard updates, whether code or configuration.
In summary, Microsoft Azure had clear operating guidelines, but there was a gap in the deployment tooling that relied on human decisions and protocol. With the tooling updates, the policy is now enforced by the deployment platform itself.
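As a rough illustration of what tooling-level enforcement can look like, the sketch below rejects a flight that is too broad or that targets both regions of a geo-pair at once. The policy limits, region pairs, and function names are assumptions for illustration, not the actual Azure deployment platform.

```python
# Hypothetical policy gate run by deployment tooling before any flight proceeds.

PAIRED_REGIONS = {"westeurope": "northeurope", "northeurope": "westeurope"}
MAX_SLICES_PER_FLIGHT = 2                       # illustrative limit on flight size

def validate_flight(target_slices, region_of):
    """Fail the deployment instead of relying on engineers to follow policy manually."""
    if len(target_slices) > MAX_SLICES_PER_FLIGHT:
        raise ValueError("Flight too broad: deploy in smaller incremental slices")
    regions = {region_of[s] for s in target_slices}
    for region in regions:
        if PAIRED_REGIONS.get(region) in regions:
            raise ValueError(f"Flight targets both regions of the {region} geo-pair")

# Example: this flight would be rejected because it spans a paired region set.
region_of = {"slice-a": "westeurope", "slice-b": "northeurope"}
# validate_flight(["slice-a", "slice-b"], region_of)  # raises ValueError
```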
Secondary Service Interruption for Virtual Machines
Because of an automated recovery mechanism in Azure Compute, almost every Virtual Machine recovered without manual intervention after the Azure Storage service interruption was resolved. However, we identified a subset of Virtual Machines that required manual recovery because they did not start successfully or were not accessible via RDP, SSH or other public endpoints. The underlying issues fell into three categories:
1. Disk Mount Timeout
During the recovery, a set of VMs experienced a disk mount timeout error during the boot process because of the high load on the Storage service while it was still recovering from the interruption. These customers could unblock their VMs by following the steps we published in the Azure troubleshooting support blog. For the majority of these VMs, the required recovery step was simply to reboot the Virtual Machine, since the load on the Azure Storage service had returned to normal levels once the interruption was fully resolved.
2. VMs Failed in Setup
Windows Virtual Machines that were provisioned and created during the Storage Service interruption were not able to successfully complete Windows Setup. These Virtual Machines were recreated by repeating the VM provisioning step. Linux Virtual Machines were not affected.
3. Network Programming Error
A small percentage of Virtual Machines deployed in a Virtual Network were unable to connect through their public IP address endpoints, including RDP and SSH connectivity to these VMs. Certain conditions during the storage outage caused the network programming for these VMs to fail after the Storage interruption was resolved. Within hours, we deployed a mitigation that restored service to the affected Virtual Machines without the need for a reboot.
Outage Event Timeline [UTC]
- 11/19 00:50 – Detected Multi-Region Storage Service Interruption Event
- 11/19 00:51 – 05:50 – Primary Multi-Region Storage Impact. The vast majority of customers would have experienced impact and recovery during this timeframe
- 11/19 05:51 – 11:00 – Storage impact isolated to a small subset of customers
- 11/19 10:50 – Storage impact completely resolved; identified continued impact to a small subset of Virtual Machines resulting from the Primary Storage Service Interruption Event
- 11/19 11:00 – Azure Engineering ran continued platform automation to detect and repair any remaining impacted Virtual Machines
- 11/21 11:00 – Automated recovery completed across the Azure environment. The Azure Team remained available to any customers with follow-up questions or requests for assistance
Customer Communication and Support Issues
In addition to the Service interruptions described above, we did not deliver high-quality communication and support during the incident. We fell short in these three areas:
1) Delays and errors in status information posted to the Service Health Dashboard.
2) Insufficient channels of communication (Tweets, Blogs, Forums).
3) Slow response from Microsoft Support.
Service Health Dashboard initial posting delays and errors
The Service Health Dashboard has a web component as well as publishing tools used to update status. The web component has both a primary and a secondary deployment to enable fail-over when required. When we failed over to the secondary web component due to the Storage interruption, the publishing tools were not able to successfully migrate from the primary to the secondary storage location. As a result, publishing updated status was delayed, which led to initial confusion about overall platform status.
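As a rough sketch of the resiliency gap described here, a publishing path that only knows about the primary storage location fails exactly when the dashboard has failed over; making it aware of both locations avoids that. The endpoints and the injected `post` function below are hypothetical, not the actual dashboard tooling.

```python
# Hypothetical status publisher that follows the dashboard's fail-over
# instead of being pinned to the primary storage location.

PRIMARY_ENDPOINT = "https://status-primary.example.invalid/publish"      # illustrative URLs
SECONDARY_ENDPOINT = "https://status-secondary.example.invalid/publish"

def publish_status(message, post):
    """Try the primary location first, then fall back to the secondary."""
    for endpoint in (PRIMARY_ENDPOINT, SECONDARY_ENDPOINT):
        try:
            post(endpoint, message)       # `post` is an injected HTTP call
            return endpoint
        except ConnectionError:
            continue                      # location unreachable: try the next one
    raise RuntimeError("Unable to publish status to either dashboard deployment")
```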
After we resolved these publishing and data inconsistency problems, three additional issues caused a lack of clarity on the Service Health Dashboard:
1) Due to a bug in the dashboard header, the top of the page inaccurately listed ‘All Good’ despite the dashboard having an active advisory. An update was deployed during the incident to correct this misleading status.
2) The service grid did not accurately reflect the status of every individual Azure Service. The Service impact details were only described in the summary post and not in the grid.
3) When reporting the secondary issue impacting Virtual Machines, the Service Health Dashboard incorrectly announced the impact to only a small subset of regions. The Service Health Dashboard was later updated to reflect the full scope of the impact.
Insufficient channels of communication (Tweets, Blogs, Forums)
The Service Health Dashboard is the primary source of status information. Because of the issues described above, tweets from the @Azure account were sent to acknowledge the ongoing service interruption. However, those tweets lacked the appropriate frequency and substance to keep our customers informed of the current status and the actions being taken to resolve the incident. We also did not post timely updates to the Azure Blog to provide ongoing updates on the service interruption.
Delayed response times by Microsoft Support
The cascading effects of the Service Health Dashboard failures also impacted the Microsoft Support ecosystem. Because the Service Health Dashboard did not report accurate status, support call volume was much higher than usual: customers impacted by the incident could not find accurate information about their services and relied on Microsoft Support instead. Unfortunately, due to the high support volume, a subset of customers experienced delayed responses to their support inquiries. To address this, the Azure engineering team supported the Azure Support team by engaging directly with customers to help them recover their services.
Improvements
We are committed to improving your experience with the Azure Platform and are making the following improvements:
Storage Service Interruption:
- Ensure that the deployment tools enforce the deployment protocol of applying standard production changes in incremental batches.
Virtual Machine Service Interruption:
- Improve resiliency and recovery to “slow-boot” scenarios for Windows and Linux VMs.
- Improve detection and recovery from Windows Setup provisioning failures due to storage incidents.
- Fix the Networking Service issue that caused network programming errors for a subset of customers.
Communications:
- Fix the Service Health Dashboard misconfiguration that led to the incorrect header status.
- Implement new social media communication procedures to effectively communicate status using multiple mechanisms.
- Improve resiliency of the Service Health Dashboard and authoring tools.
Support:
- Improve resiliency of Microsoft Support automation tooling and infrastructure.