Azure status history

April 2017

4/11

IoT Suite - Failures Provisioning New Solutions - Germany

Summary of impact: Between 07:15 UTC on 08 Apr 2017 and 04:00 UTC on 11 Apr 2017, customers using Azure IoT Suite may have been unable to provision solutions. As a workaround, engineers recommended deploying from an MSBuild prompt using the code at . Existing resources were not impacted.

Preliminary root cause: Engineers identified a recent change to backend systems as the preliminary root cause.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

4/11

IoT Suite - Failures Provisioning New Solutions

Summary of impact: Between 07:15 UTC on 08 Apr 2017 and 00:00 UTC on 11 Apr 2017, customers using Azure IoT Suite may have been unable to provision solutions. As a workaround, engineers recommended deploying from an MSBuild prompt using the code at . Existing resources were not impacted.

Preliminary root cause: Engineers identified a recent change to backend systems as the preliminary root cause.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

4/6

App Services - North Europe

Summary of impact: Between 17:45 and 19:42 UTC on 06 Apr 2017, a subset of customers using App Services in North Europe may have received HTTP 500 errors or experienced high latency when accessing App Service deployments hosted in this region.

Preliminary root cause: A backend node went into an unhealthy state.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

4/6

RCA - App Service \ Web Apps – North Europe

Summary of impact: From 10:00 to 15:25 UTC on 6 April 2017, a subset of customers using App Service \ Web Apps on a single scale unit in North Europe may have experienced intermittent HTTP 500 errors, connection timeouts, or latency issues while using their App Service applications.

Root cause: The root cause of this issue was a bug in the infrastructure layer that only manifests while scaling out front ends under high load. This bug introduced incompatible metadata on the front ends in the scale unit and prevented them from servicing requests successfully. The issue was automatically detected, and engineers mitigated it by updating the impacted infrastructure components. A long term fix to prevent future recurrences is also being deployed.

Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):

1) Deployment of a long term fix to the infrastructure layer.
2) Updated threshold to enable posting to the Azure Service Health Dashboard in case of significant impact to a single scale unit.

March 2017

3/31

RCA – Cloud Services and Virtual Machines – East US 2

Summary of impact: Between 14:00 UTC on 30 Mar 2017 and 22:05 UTC on 31 Mar 2017, a subset of customers using Cloud Services and Virtual Machines in the East US 2 region may have experienced slowness, degraded performance, or connection failures when accessing Azure resources in this region. The root cause was due to an unhealthy aggregation layer network device in the East US 2 region. The aggregation layer device was removed from service to mitigate the issue. This alleviated network congestion and restored network health to customer Virtual Machines.

Root cause and mitigation: An aggregation layer network device in the East US 2 region experienced link error issues on 20 Mar 2017. The affected link was removed from service. Due to the configuration on an upstream router, an aggregated bundle of links was removed from service, reducing total available uplink bandwidth by 50%. On 30 Mar 2017, this reduction in bandwidth combined with an increase in normal traffic to cause congestion and packet loss. Mitigation was achieved by removing the unhealthy aggregation layer network device from service.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1) Repair congestion alerting services for aggregation devices in East US 2.
2) Update the link error process for aggregation devices to ensure that bundles remain in service.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

3/31

RCA – Cooling Event – Japan East – Additional Information

Summary of impact: Between 11:28 UTC and 22:16 UTC on 31 Mar 2017, a subset of customers in the Japan East region may have experienced unavailability of Virtual Machines (VMs), VM reboots, degraded performance, or connectivity failures when accessing those resources and/or services dependent upon the Storage service in this region.

As part of standard monitoring, Azure engineers received alerts for availability drops in this region. Engineers identified the underlying cause as an error in the safe power recovery procedure that followed a failure within the power distribution system, which was running at N+2. One RUPS (rotary uninterruptible power supply) in the N+2 parallel line-up failed, leaving it unable to supply power to the cooling system in this datacenter. As a consequence of the cooling system going down, some resources were automatically shut down to avoid overheating and to ensure data integrity and resilience. The first failure within the power distribution system occurred at 11:28 UTC, and the Facility Service Provider promptly responded and initiated the safe power recovery procedure.
There was an error in the safe power recovery procedure, and one of the cooling systems was incorrectly shut down at 12:40 UTC. As a result, some areas of the facility lost cooling, and temperatures inside the facility rose past safe thresholds.

Between 12:45 UTC and 13:12 UTC, Azure engineers and the Facility Service Provider received multiple alerts due to the overheating event at the facility and started using outside airflow to force-cool the datacenter.

At 13:46 UTC, Microsoft site services personnel were onsite with the Facility Service Provider and restarted the cooling system air handlers while continuing to use outside airflow to force-cool the datacenter. At the same time, Azure engineers prepared to bring systems back online once cooling was restored to the datacenter.

At 15:24 UTC, the Facility Service Provider confirmed that the cooling systems were restored successfully. Temperatures in some of the impacted areas inside the datacenter returned to safe operational thresholds.

At 16:08 UTC, a thorough health check was completed following the restoration of the RUPS and cooling systems at 15:24 UTC; any suspect or failed components damaged by overheating were isolated and replaced. Suspect and failed components are being sent for analysis.

At 16:53 UTC, engineers confirmed that approximately 95% of all switches/network devices had been restored successfully. Power-up processes began on the impacted scale units that host Software Load Balancing (SLB) services and the control plane.

At 17:16 UTC, the majority of the core infrastructure was brought online, and networking engineers began restoring Software Load Balancing (SLB) services in a controlled process to help the SLB programming establish a quorum promptly.

Once SLB was up and running, engineers confirmed at 18:51 UTC that the majority of services had recovered automatically and successfully. Residual impact to some Virtual Machines was found; engineers investigated and continued working to bring the impacted Virtual Machines back online. In parallel, engineers notified customers who had experienced residual Virtual Machine impact about recovery.

At 22:16 UTC, engineers confirmed that Storage and all Storage-dependent services had recovered successfully.

Customer impact: Customers with resources and/or impacted services in this region may have experienced unavailability of those resources during the impacted timeframe noted above.

Workaround: Virtual Machines using Managed Disks in an Availability Set could have minimized downtime if the VMs were redeployed during this incident. For information on how to migrate to Managed Disks, please visit . Customers using Azure Redis Cache: although caches are region-sensitive for latency and throughput, pointing applications to a Redis Cache in another region could have provided business continuity. SQL Database customers who had SQL Database configured with active geo-replication could have reduced downtime by performing a failover to the geo-secondary; this would have caused a loss of less than 5 seconds of transactions. Another workaround is to perform a geo-restore, with a loss of less than 5 minutes of transactions. Please visit .
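For illustration, a minimal sketch of the SQL Database failover step mentioned above, assuming active geo-replication is already configured and the statement is run against the geo-secondary server. The server name, database name, and credentials are placeholders, not values from this incident:

```python
# Hypothetical sketch of initiating a geo-replication failover from code.
# Server, database, and credentials are placeholders, not incident values.
import pyodbc

SECONDARY_SERVER = "myserver-secondary.database.windows.net"  # placeholder
DATABASE = "mydb"                                             # placeholder

# ALTER DATABASE cannot run inside a transaction, so autocommit is required.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={SECONDARY_SERVER};DATABASE=master;"
    "UID=admin_user;PWD=placeholder_password",
    autocommit=True,
)

# Promotes the geo-secondary to primary. If the old primary is unreachable,
# the forced variant FORCE_FAILOVER_ALLOW_DATA_LOSS applies, with the bounded
# transaction loss described above.
conn.cursor().execute(f"ALTER DATABASE [{DATABASE}] FAILOVER;")
conn.close()
```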

During this incident, the Japan West region remained fully available. Customers' applications with geo-redundancy (for example, using Traffic Manager to direct requests to a healthy region) would have been able to continue without impact, or to minimize the impact. For further information, please visit for Best Practices for Cloud Applications and Design Patterns.
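For illustration, a minimal client-side sketch of the geo-redundancy pattern described above: try the primary regional endpoint and fall back to a secondary if it is unhealthy. The endpoint URLs are placeholders; in practice, Traffic Manager performs this health-based routing at the DNS level rather than in application code.

```python
# Minimal sketch of client-side regional failover. Endpoint URLs are
# placeholders; Traffic Manager normally handles this routing at the DNS level.
import requests

ENDPOINTS = [
    "https://myapp-japaneast.example.net",  # hypothetical primary region
    "https://myapp-japanwest.example.net",  # hypothetical secondary region
]

def get_with_failover(path: str, timeout: float = 3.0) -> requests.Response:
    """Return the first usable response, trying each regional endpoint in order."""
    last_error = None
    for base in ENDPOINTS:
        try:
            response = requests.get(base + path, timeout=timeout)
            if response.status_code < 500:
                return response           # endpoint answered; not a server failure
        except requests.RequestException as error:
            last_error = error            # connection failure or timeout; try next
    raise RuntimeError(f"all regional endpoints failed: {last_error}")

# Example: response = get_with_failover("/api/health")
```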

Root cause and mitigation: The investigation revealed that one RUPS system (a rotary uninterruptible power supply system) failed in a manner that impacted the power distribution that powers the air handler units (AHUs) in the Japan East datacenter. An error in the recovery procedures resulted in a loss of cooling functions, causing a thermal runaway. The cooling system is designed for N+1 redundancy, and the power distribution design was running at N+2.

Next steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):

1) The RUPS unit is being sent off for analysis. Root cause analysis will continue with site operations, facility engineers, and equipment manufacturers to further mitigate the risk of recurrence.
2) Review the Azure services impacted by this incident to help them tolerate this sort of incident with minimal disruption, by maintaining service resources across multiple scale units or implementing a geo-redundancy strategy.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

3/27

RCA - Multiple Azure Services - Japan West

Summary of impact: Between 18:04 and 21:16 UTC on 27 Mar 2017, a subset of customers in Japan West experienced Virtual Machine reboots, degraded performance, or connection failures when accessing their Azure resources hosted in the Japan West region. During the addition of a storage scale unit to the Japan West region, the scale unit announced routes that blocked some network connectivity between two data centers in the region. Unfortunately, automated recovery did not mitigate the issue, and the manual health checks conducted around all new node additions were not performed correctly. Automated alerting detected the drop in availability, and engineering teams correlated the issue with the scale unit addition. The new scale unit was re-isolated, which fixed the route advertisement and restored connectivity between the two data centers. A limited subset of customers may have experienced residual impact.

Customer impact: Virtual Machines may have experienced reboots or connectivity issues. App Service \ Web Apps customers may have received HTTP 503 errors or experienced higher latency when accessing App Service deployments hosted in this region. Approximately 5% of App Service \ Web Apps customers may have experienced issues until 03:30 UTC on 28 Mar 2017. Azure Search resources may have been unavailable, and attempts to provision new Azure Search services in the region may have failed. Redis Cache customers may have been unable to connect to their resources. Azure Monitor customers may have been unable to auto scale, and alerting functionality may have failed. Azure Stream Analytics jobs may have experienced failures when attempting to start; all existing Stream Analytics jobs that were in a running state would have been unaffected. VPN Gateway customers may have experienced disconnections during this incident.

Workaround: During this incident, the Japan East region remained fully available, customers’ applications with geo-redundancy (for example, using Traffic Manager to direct requests to a healthy region) would have allowed the application to continue without impact or would have been able to minimize the impact. For further information, please visit  for Best Practices for Cloud Applications and Design Patterns, and  for Traffic Manager.

Root cause and mitigation: A newly built storage scale unit in one data center in the Japan West region was assigned IP addresses that should only have been used in a second data center in the region. The announcement of these IP addresses activated an aggregate statement on the border routers of the first data center, which then interrupted communication between all VMs in the first data center and VMs with IP addresses belonging to the aggregate in the second data center. The IP addresses were not duplicated, but were sub-prefixes of the same aggregate. The validation code designed to prevent issuance of overlapping IP addresses did not check whether the IP addresses were sub-prefixes of the same aggregate, and so did not block assignment of these IP addresses to the new storage scale unit.

The addition of new capacity to a region has been the cause of outages in the past, so both automated systems and manual checks are used to verify that there is no impact. The tool used in this turn-up did not gate its activity on the presence of a continued healthy signal from the region, so it continued the turn-up and failed to roll back even after the availability signal fell. The manual health checks after the turn-up were performed incorrectly, and as a result the rollback was not initiated.

There was concurrent maintenance in the Japan West region, part of a network reliability improvement project to remove the aggregates involved in this incident. We have a change advisory board in place to avoid introducing concurrent maintenance in the environment; in this case, a process error resulted in a failure to identify the conflicting changes. Had that maintenance completed ahead of the storage scale unit turn-up, this incident would not have occurred. Unfortunately, the proximity in time between the incident start and the maintenance to remove the aggregates misled the engineers responding to availability alerts into rolling back the aggregate maintenance. This increased the time taken to correctly root-cause the incident and roll back the scale unit turn-up, delaying mitigation.
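To make the "sub-prefixes of the same aggregate" failure mode concrete, the sketch below shows the kind of additional validation described in the next steps: two prefixes can be distinct and non-overlapping while both falling under the same advertised aggregate. The prefixes are illustrative only and are not the addresses involved in this incident.

```python
# Illustrative only: the validation gap described above. The prefixes here are
# examples, not the actual addresses involved in the incident.
from ipaddress import ip_network

aggregate = ip_network("10.20.0.0/16")        # aggregate advertised by the border routers
existing_prefix = ip_network("10.20.1.0/24")  # already in use in the second data center
new_prefix = ip_network("10.20.2.0/24")       # proposed for the new storage scale unit

# The existing check: are the prefixes literally overlapping?
# (They are not, so the assignment was allowed.)
print(new_prefix.overlaps(existing_prefix))   # False

# The missing check: do both prefixes fall under the same aggregate?
# (They do, so the assignment should have been blocked.)
print(new_prefix.subnet_of(aggregate) and existing_prefix.subnet_of(aggregate))  # True
```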

Next steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):

1) Remove the aggregate statements from the Japan West region that led to this incident.
2) [Complete] Remove all similar aggregate statements from all regions.
3) [In progress] Improve validation checks for IP address assignment to enforce correct behavior even in the presence of aggregate statements.
4) Improve the tooling used to turn up new scale units so that it gates progress on a continuous healthy signal from the region.
5) Azure will continue to automate processes, as well as ensure that manual health checks are performed correctly.

3/27

Visual Studio Team Services

Summary of impact: Between 06:08 and 10:15 UTC on 27 Mar 2017, a subset of customers using Visual Studio Team Services in the Azure Management Portal () may have experienced difficulties connecting to the following services: VSTS Team Projects, Team Services, Load Testing, Team Service Accounts and Release Management - Continuous Delivery. This would have manifested in the form of continuous loading in the service blade. As a workaround, customers could continue to access these services via their Visual Studio accounts (). More information is available at .

Preliminary root cause: Engineers identified a recent configuration change as the potential root cause.

Mitigation: Engineers rolled back the recent configuration change to mitigate the issue.

3/25

RCA - Intermittent Authentication Failures due to Underlying Azure Active Directory Issue

Summary of impact: Between 21:23 UTC on 24 Mar 2017 and 00:35 UTC on 25 Mar 2017, a subset of customers using Azure Active Directory (AAD) to authenticate to their Azure resources, or using services with dependencies on AAD, might have experienced failures. This included authentications using the Management Portal, PowerShell, command-line interfaces (CLI), and other authentication providers. The incident started with a high number of failing authentication requests across multiple regions. Failure rates were significantly higher in the US West and US South Central regions than in other Azure regions. Engineering teams identified the cause as a class of requests that resulted in expensive backend queries (long-running queries), causing timeouts and dropped requests. Engineering teams worked on isolating the tenants causing this behavior to restore the service.

Customer impact: Customers using AAD, or services with dependencies on AAD authentication, such as Power BI Embedded, Visual Studio Team Services, Log Analytics, Azure Data Lake Analytics, Azure Data Lake Store, Azure Data Catalog, Application Insights, Stream Analytics, Key Vault, and Azure Automation would have seen login failures while accessing their resources.

Root cause and mitigation: Azure Active Directory (AAD) is a comprehensive identity and access management cloud solution. The security token service, which is a significant part of AAD, supports authentication for all modern authentication protocols. During the time of the incident, a specific behavior in the backend of the AAD Security Token Service (STS) resulted in high latency when processing certain requests, as it issued expensive queries against the backend database. This caused the backend of the STS to become overwhelmed and resulted in timeouts and dropped requests. The engineering team identified and blocked the requests that were causing these expensive queries, which resulted in full restoration of the service at 00:35 UTC on 25 Mar 2017.
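As an illustration of the throttling and isolation measures listed in the next steps, the sketch below shows one simple way a backend can bound the cost of any single query: stop waiting once a per-request time budget is exceeded and reject the request, rather than letting long-running queries accumulate. The budget value and the query callable are hypothetical, not AAD internals.

```python
# Hypothetical sketch of a per-request time budget; not AAD's implementation.
import concurrent.futures

QUERY_BUDGET_SECONDS = 2.0  # hypothetical per-query budget
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def run_with_budget(run_query, *args):
    """Run a backend query, but stop waiting and reject it once the budget is exceeded."""
    future = _pool.submit(run_query, *args)
    try:
        return future.result(timeout=QUERY_BUDGET_SECONDS)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; the caller treats the query as rejected
        raise RuntimeError("query exceeded its time budget and was rejected")
```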

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
- Improve telemetry mechanism for identifying the expensive queries to the AAD STS.
- Implement isolation mechanisms to identify and help stop processing expensive queries beyond a certain threshold.
- Enforce back end throttling based on various parameters (like CPU for example) to help protect against inefficient queries.
- Implement Fault isolation where a problem in one fault unit is prevented from spilling over beyond that fault unit.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

3/23

RCA – Data Lake Analytics and Data Lake Store – East US 2

Summary of impact:

Between approximately 15:02 UTC and 17:25 UTC on 22 Mar 2017, a subset of customers experienced intermittent failures when accessing Azure Data Lake Store (ADLS) in East US 2. This may have caused Azure Data Lake Analytics job failures and/or ADLS request failures. The incident was caused by a misconfiguration that resulted in one of the ADLS microservices not serving customer requests.

Azure Monitoring detected the event and an alert was triggered. Azure engineers engaged immediately and mitigated the issue by reverting the configuration that triggered the incident.

Customer impact: A subset of customers in East US 2 experienced intermittent failures when accessing their Azure Data Lake services during the timeframe mentioned above.

Workaround:

Retry the failed operation. If a customer used the Azure Data Lake Store Java SDK 2.1.4 to access the service, the SDK would have automatically retried, and some of the requests may have succeeded with higher latencies.
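For clients without built-in retries, the retry behavior described above can be approximated with a small wrapper. The sketch below uses exponential backoff; the request callable, retry count, and delays are placeholders and not the Data Lake Store SDK's actual policy.

```python
# Minimal retry-with-backoff sketch approximating the workaround above.
# The callable, retry count, and delays are placeholders, not the SDK's policy.
import time

def retry_with_backoff(send_request, max_attempts=4, base_delay=1.0):
    """Retry a call that may fail transiently, backing off exponentially."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send_request()
        except Exception:                 # narrow this to transient errors in real code
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...

# Example (hypothetical helper): retry_with_backoff(lambda: read_adls_file("/data/input.csv"))
```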

Root cause and mitigation:

A misconfiguration in one of the microservices that ADLS depends on caused the ADLS microservices to stop serving requests after they, or the instances they resided on, were restarted (such restarts are part of the regular maintenance process). As more instances were restarted, fewer instances of the affected ADLS microservice remained to serve requests, which resulted in high latencies and, consequently, failing requests. This caused the intermittent Azure Data Lake Analytics job and ADLS request failures. Azure engineers identified and reverted the configuration at fault. As the configuration fix was propagated as an expedited fix and the services were restarted, the affected service instances started to serve requests again, which mitigated the issue. Jobs that were submitted during the incident are expected to have been queued and run after the incident, which would have delayed the start of those jobs.


Was the issue detected?
The issue was detected by our telemetry. An alert was raised, which prompted Azure engineers to investigate the Azure Data Lake Store issue and mitigate it.


To achieve the quickest possible notification of service events to our customers, the Azure infrastructure has a framework that automates the stream from alert to the Service Health Dashboard and/or Azure Portal notifications. Unfortunately, this class of alert does not yet contain the correlation needed for that automation. We did surface this outage via the Resource Health feature for customers' ADLA and ADLS account(s) in this region. We will continue to implement notification automation, as well as ensure that manual communications protocols are followed as quickly as possible. In this incident, the issue was announced on the Service Health Dashboard manually.

Next steps:

We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

• Fix the misconfiguration. [Completed]
• Improve testing by adding more validations for this sort of configuration error to prevent future occurrences. [Completed]
• Separate the defined configuration container such that it limits the impact of future occurrences to smaller slices of the service. [In progress]
• Improve the format of the configuration file to make the code review more efficient in the future. [In progress]
• Add alerts deeper in the stack such that there are earlier/redundant indications of a similar problem, for faster debugging. [In progress]
• Rearchitect the initialization phase of the affected service to reduce the dependency on the service that was originally impacted. [Long term]

Provide feedback:

Please help us improve the Azure customer communications experience by taking our survey

3/23

Azure Active Directory

Summary of impact: Between 22:00 UTC on 22 Mar 2017 and 00:30 UTC on 23 Mar 2017, Azure Active Directory encountered an issue that affected service-to-service authentication and impacted multiple Azure services. Azure Resource Manager, Logic Apps, Azure Monitor, Azure Data Lake Analytics, and Azure Data Lake Store customers may have experienced intermittent service management issues as downstream impact. AAD customers would not have seen impact.

Preliminary root cause: Engineers identified a recent configuration change as the potential root cause.

Mitigation: Engineers rolled back the configuration change to mitigate the issue.