Root Cause Analysis for recent Windows Azure Service Interruption in Western Europe

Posted on August 2, 2012

Corporate Vice President, Azure Infrastructure and Management

On July 26, Windows Azure’s compute service hosted within one cluster in our West Europe sub-region lost external connectivity to the Internet and to other parts of Windows Azure. No other regions or services were affected at any point during the interruption. The incident began at 11:09 AM GMT and lasted 2 hours and 24 minutes. Below is a more detailed analysis of the service disruption and its resolution.

Windows Azure’s network infrastructure uses a safety valve mechanism to protect against potential cascading networking failures by limiting the scope of connections that can be accepted by our datacenter network hardware devices. Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in the corresponding devices was not adjusted during the validation process to match this new capacity. When usage in this cluster increased rapidly, the threshold was exceeded, which generated a large volume of network management messages. The increased management traffic, in turn, triggered bugs in some of the cluster’s hardware devices, driving them to 100% CPU utilization and disrupting data traffic.
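To illustrate the kind of mismatch described above, here is a minimal sketch of a validation check that compares each device's configured limit against the load implied by the capacity behind it. It is not our actual tooling: the `DeviceConfig` fields, the `find_undersized_limits` helper, the device names, the numbers, and the 20% headroom factor are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeviceConfig:
    """Hypothetical view of one network device's settings."""
    name: str
    connection_limit: int      # safety-valve threshold configured on the device
    expected_connections: int  # connections implied by the capacity behind it


def find_undersized_limits(devices, headroom=1.2):
    """Return names of devices whose limit does not cover expected load plus headroom."""
    return [
        d.name
        for d in devices
        if d.connection_limit < d.expected_connections * headroom
    ]


# Illustrative values only: device-b had capacity added without a limit change.
devices = [
    DeviceConfig("device-a", connection_limit=12_000, expected_connections=8_000),
    DeviceConfig("device-b", connection_limit=10_000, expected_connections=9_500),
]
print(find_undersized_limits(devices))
```

Running a check of this shape whenever capacity is added would flag a device whose limit was left at its old value before traffic growth pushes it past the threshold.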

We resolved the issue by increasing the limit settings in the affected cluster. We also increased the limit settings and improved automated validation across all Windows Azure datacenters. Additionally, we are applying fixes for the identified bugs to the device software. We have also improved our network monitoring systems to detect and mitigate connectivity issues before they affect running services. We sincerely apologize for the impact and inconvenience this caused our customers.

  - Mike Neil, General Manager, Windows Azure