Azure status history

September 2017

15.9

App Service - North Europe

Summary of impact: Between 19:30 and 20:15 UTC on 15 Sep 2017, a subset of customers using App Service in North Europe may have received HTTP 500-level response codes, or experienced timeouts or high latency, when accessing App Service deployments hosted in this region.

Preliminary root cause: At this stage engineers do not have a definitive root cause.

Mitigation: Engineers determined that the issue was self-healed by the Azure platform.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

7.9

Azure Active Directory Domain Services - Issues enabling and disabling AAD

Our investigation of alerts for Azure Active Directory Domain Services in multiple regions is complete. Due to the extremely limited number of customers impacted by this issue, we are providing direct communication to those experiencing an issue via the Azure Management Portal.

4.9

RCA - ExpressRoute \ ExpressRoute Circuits - Washington DC

Summary of impact: Between approximately 09:07 and 14:56 UTC on 04 Sep 2017, a subset of customers using ExpressRoute Services with circuits terminating in the Washington DC region may have experienced difficulties connecting to Microsoft Azure resources, Dynamics 365 services or Office 365 services. Customers with backup Express Route Service circuits in other regions or with internet failover paths should not have been impacted. Customers using Azure Virtual Network services were not impacted during this time.

Customer impact: Connectivity between customer sites and Microsoft Express Route Service Endpoints was interrupted in the Washington DC region.

Workaround: Customers with a failover path would not have been impacted. More information can be found here: aka.ms/s3w930

Root cause and mitigation: Routine maintenance was being conducted on the Microsoft Network in the Washington DC area. As part of the change, a legacy configuration was applied that did not include the required routing policy statements. As a result, multiple routes in the Washington DC location were withdrawn, which resulted in the connectivity failures.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): updating the Standard Operational Procedures (SOP) for this class of change worldwide, including health validation signals at the time of change; applying additional rigor to SOP reviews and changes; and ExpressRoute Engineering adding monitoring to generate alerts on all routes that are withdrawn.
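
As an illustration of the route-withdrawal alerting mentioned above, the sketch below compares successive snapshots of advertised prefixes and alerts when any prefix disappears. It is a simplified, hypothetical example (the prefix values, function names, and alert hook are assumptions), not the actual ExpressRoute monitoring.

```python
# Hypothetical sketch of route-withdrawal alerting: compare successive
# snapshots of advertised prefixes and alert on anything that disappears.
# Prefixes, names, and the alert hook are illustrative assumptions, not
# the actual ExpressRoute monitoring implementation.

def find_withdrawn_routes(previous, current):
    """Return the prefixes that were advertised before but are now missing."""
    return set(previous) - set(current)

def check_snapshot(previous, current, alert):
    withdrawn = find_withdrawn_routes(previous, current)
    if withdrawn:
        # In a production system this would page an engineer or open an incident.
        alert("%d route(s) withdrawn: %s" % (len(withdrawn), sorted(withdrawn)))

if __name__ == "__main__":
    before = {"10.1.0.0/16", "10.2.0.0/16", "192.0.2.0/24"}
    after = {"10.1.0.0/16"}  # two prefixes withdrawn after a configuration change
    check_snapshot(before, after, alert=print)
```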

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

4.9

Azure Active Directory - North Europe

Summary of impact: Between 07:23 and 09:32 UTC on 04 Sep 2017, a subset of customers using Azure Active Directory may have experienced difficulties when attempting to authenticate into My App resources which are dependent on Azure Active Directory IAM services.

Preliminary root cause: Engineers determined that a recent My App deployment task impacted the authentication process.

Mitigation: Engineers rerouted authentication traffic to a different backend, mitigating the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

August 2017

29.8

RCA: Azure Active Directory - Europe

Summary of impact: Between 07:33 and 10:10 UTC and again between 13:08 and 14:31 UTC on 29 Aug 2017, a subset of customers using Azure Active Directory may have experienced intermittent client-side authorization request failures when connecting to resources. This may have manifested in the form of 504 errors. Impact may also have extended to additional services dependent upon AAD, including Azure Active Directory B2C, Azure Backup, Data Lake, Site Recovery, Visual Studio Team Services, and the ability to view support requests created in the Azure Management Portal.

Root cause and mitigation: While optimizing the performance of the AAD Graph service, a code bug was introduced which resulted in high latency and CPU exhaustion for AAD Graph roles, under high traffic load, in two European data centers. This led to intermittent failures for clients calling into AAD Graph in these data centers. The change had been rolled out gradually via safe deployment practices; however, this was the first time during the rollout that it encountered high traffic volume. The engineering team isolated the issue to this change, rolled the change back, and mitigated the incident. We will further improve our safe deployment practices to ensure a variety of traffic shapes are exercised across deployment rings, and we will monitor for leading indicators of failure such as increased latencies and resource exhaustion on VMs.
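
The leading-indicator check described above can be pictured as a health gate evaluated before a change is promoted to the next deployment ring. The sketch below is purely illustrative (the metric names and threshold values are assumptions made for the example), not Azure's internal deployment tooling.

```python
# Illustrative health gate for a ring-based rollout: halt promotion to the
# next ring if leading indicators (request latency, CPU) regress beyond a
# threshold. Metric names and threshold values are assumptions made for
# the example, not Azure's internal deployment system.

def ring_is_healthy(metrics, max_p99_latency_ms=500.0, max_cpu_percent=80.0):
    """Return True when the ring shows no leading indicators of failure."""
    return (metrics["p99_latency_ms"] <= max_p99_latency_ms
            and metrics["cpu_percent"] <= max_cpu_percent)

def promote_if_healthy(ring_name, metrics):
    if ring_is_healthy(metrics):
        print("%s: healthy, promoting change to the next ring" % ring_name)
        return True
    print("%s: leading indicators regressed, halting rollout" % ring_name)
    return False

if __name__ == "__main__":
    # A ring that receives much higher traffic than earlier rings can surface
    # a regression (high latency, CPU exhaustion) that earlier rings missed.
    promote_if_healthy("ring-2-europe", {"p99_latency_ms": 2400.0, "cpu_percent": 97.0})
```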

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): improving our safe deployment practices to ensure a variety of traffic shapes are exercised across deployment rings [in progress], and monitoring for leading indicators of failure such as increased latencies and resource exhaustion on VMs [in progress].

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

23.8

Service Fabric, SQL Database, Azure IoT Hub, HDInsight and SQL Data Warehouse - West Central US

Summary of impact: Between 22:38 UTC on 22 Aug 2017 and 00:40 UTC on 23 Aug 2017, a subset of customers using Azure IoT Hub, HDInsight, Service Fabric, SQL Data Warehouse or SQL Database in West Central US may have experienced difficulties connecting to resources hosted in this region.

Preliminary root cause: Engineers identified an underlying network configuration change as the potential root cause.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue.

22.8

azure.microsoft.com - Multi-Region

Summary of impact: Between 23:59 UTC on 21 Aug 2017 and 00:42 UTC on 22 Aug 2017, customers might have experienced errors while accessing azure.microsoft.com.

Preliminary root cause: Engineers identified a backend configuration error.

Mitigation: Engineers reconfigured the setting and confirmed mitigation.

21.8

SQL Database - West Europe

Summary of impact: Between 10:13 and 11:09 UTC on 21 Aug 2017, a subset of customers using SQL Database in West Europe may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in this region.

Mitigation: Engineers determined that the issue was self-healed by the Azure platform.

17.8

RCA - Storage - West US

Summary of impact: Between 18:45 and 20:45 UTC on 17 Aug 2017, a subset of customers using Storage in West US may have experienced intermittent difficulties connecting to Storage resources hosted in this region. Other services that leveraged Storage resources in this region may also have experienced impact, including Virtual Machines (VMs) and Site Recovery.

Workaround: VMs deployed using Managed Disks and placed in Availability Sets of two or more instances would have maintained availability. Additional information on Managed Disks can be found at the following link:

Root cause and mitigation: A Storage scale unit hosting the disks for a subset of customers' VMs became unavailable because the backend system that manages the metadata for the scale unit hit a bug that caused its role instances to crash. The metadata system is managed using a Paxos ring and relies on a leader to make progress. In this case, a maintenance operation was being performed on the scale unit, and as part of that operation a command was issued to the metadata system which triggered the bug and caused the specific role instance to crash. Because the command was persisted in the transaction log of the Paxos ring, each new instance that became leader tried to execute the same command and failed, eventually leading to loss of quorum and unavailability of the storage stamp. To restore the service, the engineering team had to apply a hotfix. The same scale unit was affected again (in a very limited fashion) the next day due to a side effect of the command that had led to the first issue and was not addressed by the hotfix. The maintenance operation had queued an operation on the replicas of the metadata server to sync with the data nodes (the nodes where the data is actually stored), with the goal of reconciling state between the two roles. This sync command was issued to the data nodes when the metadata server primary changed. On receiving the sync command, the data nodes failed because they saw an inconsistency between their own state and the state on the metadata server that they could not reconcile. The failed data nodes were subsequently able to reconcile their state with other nodes and the system recovered. Engineering teams were engaged and took additional steps to restore the system to a clean state.
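
The failure mode described above, where a command persisted in the replicated log crashes each new leader that replays it, can be sketched in simplified form. The model below is purely illustrative (the command and replica names are invented for the example); it is not the storage metadata service itself.

```python
# Simplified model of the failure mode: a command is durably appended to the
# replicated log, the handler crashes while applying it, and every newly
# elected leader crashes again when it replays the same log, eventually
# costing the ring its quorum. Purely illustrative.

class ReplicaCrash(Exception):
    pass

def apply_command(command):
    if command == "poison-maintenance-command":  # hypothetical command that hits the bug
        raise ReplicaCrash("bug triggered while applying command")
    # normal commands would update the metadata state here

def replay_log_as_new_leader(log):
    """A newly elected leader replays the persisted log before serving requests."""
    for command in log:
        apply_command(command)

if __name__ == "__main__":
    log = ["create-partition", "poison-maintenance-command"]
    crashed = 0
    for replica in ("replica-1", "replica-2", "replica-3"):
        try:
            replay_log_as_new_leader(log)
        except ReplicaCrash:
            crashed += 1
            print("%s crashed while replaying the log" % replica)
    if crashed >= 2:
        print("quorum lost: metadata service unavailable until a hotfix addresses the command")
```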

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1) Harden the backend roles to handle the command that caused the first failure.
2) Execute the maintenance workflow as part of service validation.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

17.8

HDInsight - Japan East

Summary of impact: Between 05:30 and 08:18 UTC on 17 Aug 2017, a subset of customers using HDInsight in Japan East may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in this region.

Mitigation: Engineers performed a refresh operation to the service configuration to mitigate the issue.

July 2017

4.7

Azure Active Directory - Germany Central and Germany Northeast

Summary of impact: Between approximately 16:15 and 18:55 UTC on 04 Jul 2017, a subset of customers using Azure Active Directory in Germany Central or Germany Northeast may have experienced difficulties when attempting to authenticate into resources which are dependent on Azure Active Directory.

Preliminary root cause: Engineers determined that instances of a backend service reached an operational threshold.

Mitigation: Engineers manually added additional resources to alleviate the traffic to the backend service and allow healthy connections to resume.

Next steps: Engineers will continue to validate full recovery and understand the cause of the initial spike in traffic.