Azure status history

August, 2017

17-8

Storage - West US

Summary of impact: Between 18:50 and 20:45 UTC on 17 Aug 2017, a subset of customers using Storage in West US may have experienced intermittent difficulties connecting to Storage resources hosted in this region. During this time, only storage services utilizing Premium Disks were impacted; other storage services experienced no impact. Other services that leverage these Storage resources in this region may also have experienced impact related to this issue. Engineers identified a very limited set of Virtual Machines with residual impact; these customers will be contacted via their management portal.

Preliminary root cause: Engineers identified a software error as the potential root cause.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

Next steps: A detailed post incident report will be published within approximately 72 hours.

17-8

HDInsight - Japan East

Summary of impact: Between 05:30 and 08:18 UTC on 17 Aug 2017, a subset of customers using HDInsight in Japan East may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in this region.

Preliminary root cause: At this stage engineers do not have a definitive root cause.

Mitigation: Engineers performed a refresh operation to the service configuration to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

17-8

Microsoft Azure portal - Multi-Region

Summary of impact: Between 18:09 UTC on 16 Aug 2017 and 06:07 UTC on 17 Aug 2017, users deploying new management certificates or viewing existing ones in the portal may have seen management certificates listed as "Expired." The certificates were not actually expired; this was an error in rendering the certificate information back to the user.
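
Because only the portal display was wrong, the actual expiry of a certificate could be confirmed locally. The following is a minimal sketch of such a check using Python's cryptography package, assuming the management certificate has been exported to a local PEM file; the file name is illustrative only.

# Minimal sketch: confirm a certificate's actual expiry locally instead of
# relying on the portal's rendered status. The PEM path is illustrative.
from datetime import datetime
from cryptography import x509

def is_expired(pem_path):
    with open(pem_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    # not_valid_after is the certificate's expiry timestamp (UTC).
    return cert.not_valid_after < datetime.utcnow()

if __name__ == "__main__":
    print("expired:", is_expired("management-cert.pem"))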

Preliminary root cause: Engineers identified a software error as the potential root cause.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

16-8

App Service - South Central US

Summary of impact: Between 00:46 UTC and 01:31 UTC on 16 Aug 2017, a subset of customers using App Service in South Central US may have experienced timeouts or high latency when accessing App Service deployments hosted in this region.

Mitigation: Engineers determined that the issue was self-healed by the Azure platform.

4-8

Visual Studio Team Services - Multiple Regions

Summary of impact: Between 10:30 and 12:34 UTC on 04 Aug 2017, a subset of customers using Visual Studio Team Services (VSTS) may have experienced '500 Internal Server Error' while accessing their VSTS portal, in addition to build failures.

Preliminary root cause: Engineers determined that some recently upgraded network devices experienced a fault and that network traffic was not automatically rerouted.

Mitigation: Engineers took the faulty network devices out of rotation and rerouted network traffic to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. More information can be found on 

4-8

RCA - SQL Database, SQL Data Warehouse, and Virtual Machines – West US

Summary of impact: Between 07:31 UTC on 03 Aug 2017 and 05:30 UTC on 04 Aug 2017, a subset of customers using SQL Database in West US may have experienced intermittent issues accessing services. New connections to existing databases in this region may have resulted in an error or timeout, and existing connections may have been terminated. Additionally, customers using Virtual Machines in West US may have experienced connection failures when trying to access some Virtual Machines hosted in the region. These Virtual Machines may have also restarted unexpectedly. Approximately 2.5% of the SQL databases and 5.4% of the Virtual Machines in the region were impacted by this incident. Approximately 2.5% of the SQL databases or data warehouses in the region experienced unplanned failovers lasting an average of 8 seconds, during which all in-flight queries were canceled and new requests rejected. Approximately 0.01% of SQL databases in the region experienced a long period of continuous unavailability.

Root cause and mitigation: A subset of nodes in the West US region entered an unhealthy state due to a loss of communication between the chassis controller and the management and cooling interface. Because of this loss of communication, the node management system inside each affected node became unresponsive. This prevented the affected nodes from being adequately cooled and caused a thermal trip. Additional root cause analysis is underway to better understand the cause of the loss of communication between the chassis controller and the management interface. As nodes went unhealthy and were queued for reboot, most of the node reboots proceeded without extended impact to Virtual Machines (VMs) hosted on those nodes. Azure SQL DB is a stateful tenant built on Azure Service Fabric and blocks reboots that would take all replicas of a database offline at the same time, since that would leave the database unavailable if a node did not reboot successfully. Holding up some reboots led to an accumulation of unhealthy nodes, and databases that had more than 2 out of 4 replicas on those nodes became unavailable. The unavailability of those databases persisted until the nodes were manually restarted. The databases that were unavailable were different from the databases that were blocking the reboots. In this case, because rebooting fixed the cause of the unhealthy state and no nodes failed to reboot successfully, the system would ideally have rebooted the nodes automatically. Once the high rate of new unhealthy nodes was detected, Azure engineers contained the impact of the incident by 1) moving all remaining replicas of databases off the nodes in the parts of the datacenter experiencing the issue, which led to additional short failovers but prevented any further databases from having long unavailability, and 2) manually rebooting the unhealthy nodes.
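
As a minimal sketch of the availability rule described above (illustrative only, not Service Fabric's actual placement or quorum logic; node and database names are hypothetical): a database with four replicas becomes unavailable once more than two of its replicas sit on unhealthy nodes.

# Illustrative availability check: a database becomes unavailable once more
# than two of its four replicas are hosted on unhealthy nodes.
# Node and database names are hypothetical.

def is_available(replica_nodes, unhealthy_nodes):
    down = [n for n in replica_nodes if n in unhealthy_nodes]
    return len(down) <= len(replica_nodes) // 2

db_replicas = {
    "db-A": ["node1", "node2", "node3", "node4"],
    "db-B": ["node2", "node5", "node6", "node7"],
}
unhealthy = {"node1", "node2", "node3"}

for db, nodes in db_replicas.items():
    state = "available" if is_available(nodes, unhealthy) else "unavailable"
    print(db, state)
# db-A: 3 of 4 replicas on unhealthy nodes -> unavailable
# db-B: 1 of 4 replicas on an unhealthy node -> available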

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1) Complete further analysis on loss of communication to the chassis controller.
2) Improve alerting for number of nodes becoming unhealthy and rebooting.
3) Improve alerting for number of nodes pending reboot.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

1-8

RCA - Connectivity Issues in Asia-Pacific

Summary of impact: Between 23:26 and 23:29 UTC on 01 Aug 2017, a subset of customers may have experienced connectivity latency, timeouts, or failures in the Asia-Pacific region due to a network event. An Azure network device in Japan encountered an unexpected firmware defect during a routine maintenance operation. The defect caused routing to other network destinations on the device to stop working. The Azure network automatically began to redirect traffic via a healthy route to mitigate any further connectivity failures. Azure has multiple routes to Asia-Pacific from North America, including a route from Los Angeles to Tokyo. During the impacted window, one third of the traffic going through this route would have been impacted. This route accounts for approximately one ninth of the routes that serve traffic between Asia-Pacific and North America. The impact potentially spanned multiple Azure services and has been correlated to connectivity issues with Virtual Machines, VPN Gateways, Web Apps, Application Insights, Traffic Manager, Media Services and ExpressRoute.

Root cause and mitigation: Routine maintenance work was being carried out on a core router. A firmware issue triggered by a maintenance operation caused the router to drop all LDP and BGP sessions. Monitoring triggered immediately and all traffic started re-routing through healthy routes at 23:27. By 23:29 all dropped packets had been flushed out of the system and traffic was flowing correctly between Asia-Pacific and North America. In very rare circumstances, some customers may have experienced issues outside of the impact window; this would have been due to the way their applications are designed to ingest and process data and then resume once connectivity is re-established.
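
As a rough illustration of the re-convergence behaviour described above (a hedged sketch with hypothetical path names and costs, not the actual Azure backbone topology or router configuration): traffic follows the preferred healthy path and falls back to the next-best path when the primary's sessions drop.

# Illustrative path failover: pick the lowest-cost path that is still up, so
# traffic re-converges onto a healthy route when the primary path fails.
# Path names and costs are hypothetical.

paths = [
    {"name": "LosAngeles-Tokyo",  "cost": 10, "up": True},
    {"name": "Seattle-Osaka",     "cost": 20, "up": True},
    {"name": "SanJose-Singapore", "cost": 30, "up": True},
]

def best_path(paths):
    healthy = [p for p in paths if p["up"]]
    if not healthy:
        raise RuntimeError("no healthy path to destination")
    return min(healthy, key=lambda p: p["cost"])

print(best_path(paths)["name"])   # LosAngeles-Tokyo
paths[0]["up"] = False            # the firmware defect drops the router's sessions
print(best_path(paths)["name"])   # traffic re-converges onto Seattle-Osaka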

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure that such incidents do not occur in the future. In this case, this includes (but is not limited to):
1) Pausing all non-critical maintenance on this router until the root cause is better understood. [COMPLETE]
2) Continuing the investigation into the network router with the vendor. [IN PROGRESS]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

July, 2017

28-7

RCA - Storage - South Central US

Summary of impact: Between 16:49 UTC on 28 July 2017 and 01:04 UTC on 29 July 2017, a subset of customers using Storage in South Central US may have experienced difficulties connecting to resources hosted in this region. In addition, a subset of customers using Backup in South Central US may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in this region. Engineers determined that multiple hardware failures impacted the availability of the control plane for the load balancer of a single storage cluster, which in turn impacted the availability of the control endpoint for that storage cluster. A new server was added to restore quorum to the control plane and mitigate the incident.

Customer impact: Create/update or delete operations would fail for storage accounts hosted in the impacted storage cluster.

Root cause and mitigation: The load balancer needs a certain number of healthy servers for its control plane to work correctly. In this case, a subset of servers were already offline and being replaced due to hardware failures when another node experienced a hardware failure. To mitigate the issue, services were migrated off another server so that it could be added to the load balancer control plane. Adding the new server required a series of manual steps that replicated data and then restored quorum to the control plane.
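
As a minimal sketch of the quorum behaviour described above (illustrative only; member names and counts are hypothetical, not the actual load balancer control plane implementation): the control plane remains functional only while a majority of its members are healthy, and adding a replacement server restores quorum.

# Illustrative quorum check: the control plane works only while a majority of
# its member servers are healthy; adding a replacement restores quorum.
# Member names and counts are hypothetical.

def has_quorum(members, offline):
    healthy = [m for m in members if m not in offline]
    return len(healthy) > len(members) // 2

members = {"cp1", "cp2", "cp3", "cp4", "cp5"}
offline = {"cp2", "cp4"}               # already out for hardware replacement
print(has_quorum(members, offline))    # True  (3 of 5 healthy)

offline.add("cp5")                     # one more hardware failure
print(has_quorum(members, offline))    # False (2 of 5 healthy, quorum lost)

members = (members - {"cp5"}) | {"cp6"}  # replacement added after data replication
print(has_quorum(members, offline))      # True  (3 of 5 healthy, quorum restored)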

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): accelerating the RMA process for servers hosting load balancer services [IN PROGRESS].

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

21-7

RCA - Network Infrastructure - UK South

Summary of impact: Between 21:41 UTC on 20 Jul 2017 and 01:40 UTC on 21 Jul 2017, a subset of customers may have encountered connectivity failures for their resources deployed in the UK South region. Customers would have experienced errors or timeouts while accessing their resources. Upon investigation, the Azure Load Balancing team found that the data plane for one of the instances of the Azure Load Balancing service in the UK South region was down. A single instance of the Azure Load Balancing service has multiple data plane instances; all of these data plane instances went down in quick succession and failed repeatedly while trying to self-recover. The team immediately started working on mitigation by failing over from the affected Azure Load Balancing instance to another instance of the Azure Load Balancing service. This failover was delayed because the VIP address of the Azure authentication service, which is used to secure access to any Azure production service in that region, was also served by the Azure Load Balancing service instance that went down. The engineering teams resolved the access issue and then recovered the impacted Azure Load Balancing service instance by failing over the impacted customers to another instance of the Azure Load Balancing service. Dependent services recovered gradually once the underlying load balancing service instance was recovered. Full recovery of all affected services was confirmed by 01:40 UTC on 21 Jul 2017.

Workaround: Customers who had deployed their services across multiple regions could fail over out of the UK South region.

Root cause and mitigation: The issue occurred when one of the instances of the Azure Load Balancing service went down in the UK South region. The root cause was a bug in the Azure Load Balancing service, exposed by a specific combination of configurations on this load balancing instance together with a deployment specification that caused the data plane of the load balancing service to crash. There are multiple data plane instances within a particular instance of the Azure Load Balancing service; however, due to this bug, the crash cascaded through multiple instances. The issue was mitigated by failing over from the specific load balancing instance to another load balancing instance. The software bug was not detected in earlier regional deployments because it only manifests under specific combinations of configuration in the Azure Load Balancing service. The combination of configurations that exposed this bug was addressed by recovering the Azure Load Balancing service instance.
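
As a minimal sketch of the failover decision described above (illustrative only; instance names and health values are hypothetical, not the actual service implementation): a load balancing service instance is treated as down only when every one of its data plane instances is unhealthy, at which point its customers are failed over to a healthy instance.

# Illustrative failover decision: a load balancing service instance is down
# only when all of its data plane instances are unhealthy; its customers are
# then moved to another healthy instance. Names and values are hypothetical.

lb_instances = {
    "lb-uksouth-01": {"dataplanes": [False, False, False]},  # all crashed
    "lb-uksouth-02": {"dataplanes": [True, True, True]},
}

def instance_is_down(inst):
    return not any(inst["dataplanes"])

def pick_failover_target(instances, failed_name):
    for name, inst in instances.items():
        if name != failed_name and not instance_is_down(inst):
            return name
    raise RuntimeError("no healthy load balancing instance available")

for name, inst in lb_instances.items():
    if instance_is_down(inst):
        target = pick_failover_target(lb_instances, name)
        print(f"failing over customers from {name} to {target}")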

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, we will:
1) Roll out a fix for the bug that caused the Azure Load Balancing instance data plane to crash. In the interim, a temporary mitigation has been applied to prevent this bug from resurfacing in any other region.
2) Improve test coverage for the specific combination of configuration that exposed the bug.
3) Address operational issues for Azure Authentication services break-glass scenarios.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

20-7

App Service/Web apps - West Europe

Summary of impact: Between 12:00 and 15:09 UTC on 20 Jul 2017, a subset of customers using App Service / Web Apps in West Europe may have experienced timeouts or high latency when accessing Web App deployments hosted in this region.

Preliminary root cause: Engineers determined that a spike in traffic was preventing requests from completing.

Mitigation: Engineers determined that the issue was self-healed by the Azure platform.

Next steps: Engineers will review testing procedures to prevent future recurrences.

4-7

Azure Active Directory - Germany Central and Germany Northeast

Summary of impact: Between approximately 16:15 and 18:55 UTC on 04 Jul 2017, a subset of customers using Azure Active Directory in Germany Central or Germany Northeast may have experienced difficulties when attempting to authenticate into resources which are dependent on Azure Active Directory.

Preliminary root cause: Engineers determined that instances of a backend service reached an operational threshold.

Mitigation: Engineers manually added additional resources to alleviate the traffic to the backend service and allow healthy connections to resume.
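
As a rough, hypothetical illustration of the kind of capacity decision involved (the threshold, request rate, and per-instance capacity figures are illustrative only, not the actual Azure Active Directory capacity model): once utilization of the backend service exceeds its operational threshold, instances are added until per-instance load falls back below it.

# Hypothetical capacity sketch: add backend instances until utilization drops
# back below the operational threshold. All figures are illustrative only.

THRESHOLD = 0.80               # utilization above this is treated as unhealthy
PER_INSTANCE_CAPACITY = 1000   # requests/sec one backend instance can serve

def instances_needed(request_rate, current_instances):
    utilization = request_rate / (current_instances * PER_INSTANCE_CAPACITY)
    while utilization > THRESHOLD:
        current_instances += 1
        utilization = request_rate / (current_instances * PER_INSTANCE_CAPACITY)
    return current_instances

print(instances_needed(request_rate=4500, current_instances=4))  # -> 6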

Next steps: Engineers will continue to validate full recovery and understand the cause of the initial spike in traffic.