
Azure status history


August 2018

1-8

Azure Service Management Failures

Customer Impact: Between 21:14 UTC on 1 Aug and 00:00 UTC on 2 Aug 2018, a subset of customers may have experienced timeouts or failures when attempting to perform service management operations for Azure resources dependent on management.core.windows.net (Azure's v1 API endpoint). In addition, some customers may have experienced intermittent latency and connection failures to their Classic Cloud Service resources.

Preliminary Root Cause and Mitigation: The underlying Storage service experienced larger-than-normal load from our internal services; engineers manually reduced the load to mitigate the issue.
1-8

RCA - Networking Connectivity and Latency - East US

Summary of impact: Between 13:18 UTC and 22:40 UTC on 01 Aug 2018, a subset of customers in the East US region may have experienced difficulties or increased latency when connecting to some resources. 

Customer impact: This incident affected a single region, East US, and services deployed with geo-redundancy in multiple regions would not have been impacted. 

Root cause and mitigation: A fiber cut caused by construction approximately 5 km from Microsoft data centers resulted in multiple line breaks, impacting separate splicing enclosures and reducing capacity between two Azure data centers in the East US region. As traffic ramped up over the course of the day, congestion occurred and caused packet loss, particularly for services hosted in or dependent on one data center in the region.

Microsoft engineers reduced the impact to Azure customers by reducing internal and discretionary traffic to free up network capacity. Microsoft had already been preparing to add capacity between data centers in the region, and completed mitigation by accelerating the activation of that additional capacity. Repair of the damaged fiber continued after the mitigation, and the fiber will be returned to service when fully ready.

Next steps:
Microsoft regrets the impact to our customers and their services. We design our networks to survive the failure of any single component; however, in this instance there was not enough capacity left after the failure to handle all offered traffic. To prevent recurrence, we are taking the following steps:

1) Reducing the utilization threshold for capacity augmentation - in progress.
2) Improving our systems for automatically reducing discretionary traffic during congestion events to preserve performance for customer traffic - in progress. 
3) Improving our ability to identify high-rate flows that can cause impact to other traffic during congestion events and create appropriate ways to handle this traffic - in progress. 

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

July 2018

31-7

RCA - Storage - North Central US

Summary of impact: Between 17:52 UTC and 18:40 UTC on 31 Jul 2018, a subset of customers using Storage in North Central US may have experienced difficulties connecting to resources hosted in this region. Other services that leverage Storage in this region may also have experienced related impact.

Root cause and mitigation: During a planned power maintenance activity, a breaker tripped, transferring the IT load to its secondary source. A subset of devices then saw a second breaker trip, resulting in the restart of a subset of Storage nodes in one of the availability zones in the North Central US region. Maintenance was completed with no further issues or impact to services. The majority of Storage resources self-healed. Engineers manually recovered the remaining unhealthy Storage nodes and performed additional mitigating actions to fully mitigate the incident.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Engineers have sent failed components to the OEM for further testing and analysis [in progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey
28-7

App Service - East US - Mitigated

Summary of impact: Between 08:00 and 12:30 UTC on 28 Jul 2018, a subset of customers using App Services in East US may have received HTTP 500-level response codes, or experienced timeouts or high latency, when accessing App Service (Web, Mobile and API Apps) deployments hosted in this region.

Preliminary root cause: Engineers identified an internal configuration conflict as part of an ongoing deployment as the potential root cause.

Mitigation: Engineers made a manual adjustment of the deployment configuration to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

26-7

RCA - Azure Service Management Failures

Summary of impact: Between 22:15 UTC on 26 Jul 2018 and 06:20 UTC on 27 Jul 2018, a subset of customers may have experienced timeouts or failures when attempting to perform service management operations for Azure resources dependent on management.core.windows.net (Azure's v1 API endpoint). In addition, some customers may have experienced intermittent latency and connection failures to the Azure Management Portal. Some services that rely on triggers from service management calls may have seen failures for running instances.

Root cause and mitigation: The Azure Service Management layer (ASM) is composed of multiple services and components. During this event, a platform service update introduced a dependency conflict between two backend components. 

Engineers established that one of the underlying components was incompatible with the newer version of a dependent component. When this incompatibility occurs, internal communications slow down and may fail. As a result, above a certain load threshold, new connections to ASM would have a higher fault rate. Azure Service Management has a dependency on the impacted components and was consequently partially impacted.

The ASM update went through the standard, extensive testing, and the fault did not manifest itself as the update passed first through the pre-production environments and then through the first production slice, where it baked for three days. There was no indication of such a failure in those environments during this time.

When the update rolled out to the main production environment, there was no immediately visible impact. Hours later, engineers started seeing alerts on failed connections. This was challenging to diagnose because ASM appeared internally healthy, and the intermittent connection failures could have stemmed from multiple causes in the network or other parts of the infrastructure. The offending components were identified through a review of their detailed logs, and the incident was mitigated by removing the dependency on the faulted code path.

A secondary, coincidental issue prevented notifications from reaching the Azure Service Health Dashboard or the Azure Portal Service Health blade. Engineers determined that a bug had been introduced into the communications tooling as part of an earlier deployment. The issue was discovered during the event, and a rollback was performed immediately. While it was not related to the incident above, it delayed our notifications to customers.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Refine validation and service upgrade processes to prevent future occurrences with customer impact [In Progress]
2. Enhance telemetry for detecting configuration anomalies of the backend components post service upgrade to mitigate customer impact scenarios faster [In Progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

 

24-7

Delays in ARM generated Activity Logs

Summary of impact: Between 23:00 UTC on 24 Jul 2018 and 07:00 UTC on 27 Jul 2018, customers may not have received activity logs for ARM resources. All service management operations for ARM resources would have been successful; however, the corresponding activity logs would not have been generated. As a result, any actions or workflows dependent on ARM-generated activity logs would not have been initiated. Engineers have confirmed that customers can receive ARM-generated activity logs as of 07:00 UTC on 27 Jul 2018. However, at this time customers may not be able to access the activity logs that would have been generated during the impact time frame.

Preliminary root cause: Engineers determined that a recent Azure Resource Manager deployment task impacted instances of a backend service which became unhealthy, preventing requests from completing.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue. 

Next steps: This incident remains active as engineers assess options for the missing logs that would have been generated during the impact window. Any customers who remain impacted will receive communications via their management portal. This is being tracked under tracking ID FLLH-MRG.
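
For illustration, a minimal sketch of how a customer might confirm that activity-log events are being returned again for the impact window is shown below. It assumes the Azure CLI ("az") is installed and authenticated; the command, its flags, and the time range are illustrative assumptions, not part of the original guidance.

# Minimal sketch: list ARM activity-log events for the impact window via the
# Azure CLI. Assumes "az login" has already been run and that the CLI exposes
# "az monitor activity-log list" with --start-time/--end-time flags.
import json
import subprocess

START = "2018-07-24T23:00:00Z"  # start of the impact window (UTC)
END = "2018-07-27T07:00:00Z"    # end of the impact window (UTC)

result = subprocess.run(
    [
        "az", "monitor", "activity-log", "list",
        "--start-time", START,
        "--end-time", END,
        "--output", "json",
    ],
    capture_output=True, text=True, check=True,
)

events = json.loads(result.stdout)
print(f"{len(events)} activity-log events returned for the impact window")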

23-7

Error notifications in Microsoft Azure Portal

Summary of impact: Between approximately 20:00 and 22:56 UTC on 23 Jul 2018, a subset of customers may have received timeout errors or failure notifications when attempting to load multiple blades in the Microsoft Azure Portal. Customers may also have experienced slowness or difficulties logging into the Portal. 

Preliminary root cause: Engineers determined that instances of a backend service became unhealthy, preventing these requests from completing. 

Mitigation: Engineers performed a change to the service configuration to return the backend service to a healthy state and mitigate the issue. 

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.
16-7

RCA - Networking - Service Availability

Summary of impact: Between 14:55 and 16:15 UTC on 16 July 2018, a subset of Azure customers in West US, West Central US, India South, India Central, and Australia East may have experienced difficulties connecting to Azure endpoints, which in turn may have caused errors when accessing Microsoft services in the impacted regions.

Root cause and mitigation: Azure utilizes the industry-standard Border Gateway Protocol (BGP) to advertise network paths (routes) for all Azure services. As part of the monthly route prefix update process for ExpressRoute, a misconfigured ExpressRoute device began advertising a subset of Microsoft routes which conflicted with the Microsoft global network. Traffic destined for the public endpoints via these routes, whether from outside or from within an impacted region, could not reach its destination. This included Azure Virtual Machines and other services with a dependency on impacted Azure storage accounts. Engineers responded to the alerts, identified the cause of the routing anomaly, and isolated the affected portion of the network to mitigate the issue.
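
To illustrate the kind of safeguard described in the next steps below, here is a minimal, hypothetical sketch of validating candidate prefixes against an allow-list before an advertisement is accepted; the prefixes and allow-list are invented for the example, and this is not the tooling Microsoft uses.

# Hypothetical sketch: reject a route advertisement if any candidate prefix
# falls outside an approved allow-list. Prefixes shown are examples only.
import ipaddress

ALLOWED_PREFIXES = [ipaddress.ip_network("198.51.100.0/24")]  # example allow-list

def advertisement_is_valid(prefixes):
    """Return True only if every candidate prefix is covered by the allow-list."""
    for p in prefixes:
        candidate = ipaddress.ip_network(p)
        if not any(candidate.subnet_of(allowed) for allowed in ALLOWED_PREFIXES):
            return False
    return True

print(advertisement_is_valid(["198.51.100.0/25"]))  # True: within approved space
print(advertisement_is_valid(["203.0.113.0/24"]))   # False: conflicts with other address space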

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Enhanced configuration validation at the time of deployment and automatic detection of configuration anomalies post deployment – in progress

2. Monitoring and alerting of incorrect route advertisement into the Microsoft global network for faster identification and mitigation – in progress

3. Enforce network route protection to ensure that route advertisements are more limited in scope and radius – in progress

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

13-7

RCA - Virtual Machines - UK South

Summary of impact: Between 22:13 UTC on 12 July 2018 and 16:20 UTC on 13 July 2018, a subset of customers using Virtual Machines in UK South may have experienced difficulties connecting to some of their resources hosted in the region. Customers may have been unable to establish RDP or SSH sessions to their resources. This issue impacted customers with resources hosted on one of the compute scale units in the region.

Root cause and mitigation: The impacted scale unit was introduced into rotation a few days before the impact window. Prior to its release to production, during the build-out process, an incorrect network parameter was set which impacted virtual network (VNET) connectivity on this scale unit. This connectivity issue triggered the expected monitoring alerts, which had been temporarily muted in the network monitoring tool while the build-out process progressed. Unfortunately, these alerts were not unmuted before the unit began accepting workloads. Engineers partially mitigated the issue by asking impacted customers to deallocate and restart impacted workloads, and by taking the scale unit out of rotation. The issue was fully mitigated by adjusting the incorrect parameter to the expected value.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Scan of recently built scale units to validate that the correct network parameter is set [Done]
2. Scan of recently built scale units to validate that monitoring alerts are unmuted [Done]
3. Roll out automation to remove the manual setting of the network parameter [Done]
4. Add a check to ensure that no network alerts are suppressed before scale units are introduced into rotation [In Progress] (a minimal sketch of such a gate follows this list)
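
The following is a minimal, hypothetical sketch of the pre-rotation gate described in item 4, assuming a monitoring API that can report which alerts are currently muted for a scale unit; the function names are invented for illustration and do not reflect internal tooling.

# Hypothetical sketch: block a scale unit from entering rotation while any of
# its monitoring alerts are still muted.

def get_muted_alerts(scale_unit):
    """Placeholder for a monitoring API call; returns names of muted alerts."""
    raise NotImplementedError("stand-in for internal monitoring tooling")

def ready_for_rotation(scale_unit):
    """Refuse to put a scale unit into rotation while any alert is muted."""
    muted = get_muted_alerts(scale_unit)
    if muted:
        print(f"{scale_unit}: blocked from rotation; alerts still muted: {muted}")
        return False
    return True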

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

11-7

RCA - SQL Database - Intermittent Database Login Failures - South Central US

Summary of impact: Between 18:38 UTC on 11 Jul 2018 and 01:13 UTC on 12 Jul 2018, a subset of customers using SQL Database in South Central US may have experienced intermittent issues accessing services. New connections to existing databases in this region may have resulted in an error or timeout, and existing connections may have been terminated. Engineers determined that the initial issue was mitigated at 20:54 UTC, but login failures recurred at 22:25 UTC. Overall, less than 10% of SQL Database traffic in the region would have been affected; the impact to individual customers should have been substantially lower, as retries would have been successful in most cases.

Root cause and mitigation: The South Central US region has two clusters of gateway machines (connectivity routers) serving login traffic. Engineers determined that a set of network devices serving the gateway machines malfunctioned, causing packet drops that led to intermittent connectivity to Azure SQL DB in the region. One of the two gateway clusters serving the region was impacted, and 37 of the 100 gateway nodes in the region experienced packet drops. The login success rate in the region dropped to 90%, though retries would have been successful in most cases. The triggering cause was an increased load in the region. The issue was mitigated by load balancing across the two clusters in the region, thus reducing the load on gateway machines in the impacted region.
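
Because retries succeeded in most cases, a minimal sketch of client-side retry with exponential backoff for transient SQL Database login failures is shown below; it assumes the pyodbc driver, and the connection string is illustrative only.

# Minimal sketch: retry transient Azure SQL Database login failures with
# exponential backoff. The connection string is illustrative.
import time
import pyodbc

CONN_STR = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:example.database.windows.net,1433;"
    "Database=mydb;UID=user;PWD=<password>"
)

def connect_with_retry(conn_str, attempts=5, base_delay=1.0):
    for attempt in range(attempts):
        try:
            return pyodbc.connect(conn_str, timeout=30)
        except pyodbc.Error as exc:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Login attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

conn = connect_with_retry(CONN_STR)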

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future. In this case work is underway to increase gateway capacity in the region as well as upgrading network devices.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

1-7

RCA - IoT Hub - Connectivity Issues

Summary of impact: Between 01:30 UTC on 01 Jul 2018 and 18:00 UTC on 02 Jul 2018, a subset of customers using Azure IoT Hub in West Europe, North Europe, East US, and West US may have experienced difficulties connecting to resources hosted in these regions.

Customer impact: During the impact, symptoms may have included failures when creating IoT Hub, connection latencies, and disconnects in the impacted regions.

Root cause and mitigation: A subset of IoT Hub scale units experienced a significant increase in load due to high volume of operational activity and recurring import jobs across multiple instances, which ultimately throttled the backend system. The issue was exacerbated as the loads were unevenly distributed across the clusters.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Optimize throttling of queries to prevent system load – In Progress 
2. Fine-tune load balancing algorithm to provide quicker detection and alerting – In Progress
3. Improve failover strategy to quickly mitigate – In Progress

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

June 2018

27-6

RCA - App Service - West Europe

Summary of impact: Between 16:00 UTC on 27 Jun 2018 and 13:00 UTC on 28 Jun 2018, a subset of customers using App Service in West Europe may have received HTTP 500-level response codes, timeouts or high latency when accessing App Service (Web, Mobile and API Apps) deployments hosted in this region.

Root cause and mitigation: During a recent platform deployment, several App Service scale units in West Europe encountered a backend performance regression due to a modification in the telemetry collection systems. Due to this regression, customers with .NET applications running large workloads may have encountered application slowness. The root cause was an inefficiency in the telemetry collection pipeline which caused overall virtual machine performance degradation and slowdown. The issue was detected automatically, and the engineering team was engaged. A mitigation to remove the inefficiency causing the issue was applied at 10:00 UTC on June 28. After further review, a secondary mitigation was applied to a subset of VMs at 22:00 UTC on June 28. More than 90% of the impacted customers saw mitigation at this time. After additional monitoring, a final mitigation was applied to a single remaining scale unit at 15:00 UTC on June 29. All customers were mitigated at this time.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
• Removing the changes that caused the regression
• Reviewing and if necessary adjusting the performance regression detection and alerting

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

25-6

RCA - Multiple Services - South Central US

Summary of impact: Between 19:40 and 20:52 UTC on 25 Jun 2018, a subset of customers in South Central US may have experienced difficulties connecting to resources hosted in this region and/or received 500-level errors. Virtual Machines may have rebooted unexpectedly. Impacted services included: Storage, Virtual Machines, Key Vault, Site Recovery, Machine Learning, Cloud Shell, Logic Apps, Redis Cache, Visual Studio Team Services, Service Bus, ExpressRoute, Application Insights, Backup, Networking, API Management, App Service (Linux) and App Service.

Root cause and mitigation: One storage scale unit in South Central US experienced a load imbalance. This load imbalance caused network congestion on the backend servers. Due to the congestion, many read requests started to time out because the servers hosting the data could not complete the requests in time. Azure Storage uses erasure coding to store data more efficiently and durably. Since the servers hosting the primary copy of the data did not respond in time, an alternative read mode (a reconstruct read) was triggered, which puts more load on the network. The network load on the backend system became high enough that many customer requests, including those from VMs, started timing out. Our load balancing system normally spreads load across servers to prevent hot spots, but in this case it was not sufficient. The system recovered on its own, and the Storage engineering team applied additional configuration changes to spread the load more evenly and removed some of the load on the scale unit.
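
To illustrate the reconstruct-read behavior described above, here is a toy sketch of recovering a missing fragment from a single-parity erasure-coded stripe; Azure Storage's actual coding scheme is far more sophisticated, and the point is only that reconstruction must read the surviving fragments, which multiplies network load during congestion.

# Toy sketch of a reconstruct read with single XOR parity: when the primary
# fragment is unavailable, the other fragments in the stripe must be read,
# which is why reconstruct reads add network load.
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

data_fragments = [b"AAAA", b"BBBB", b"CCCC"]   # example stripe
parity = reduce(xor_bytes, data_fragments)      # parity fragment

# Fragment 1 times out; reconstruct it from the parity and the survivors.
survivors = [parity] + [f for i, f in enumerate(data_fragments) if i != 1]
reconstructed = reduce(xor_bytes, survivors)
assert reconstructed == data_fragments[1]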

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
 - Load balancing system changes to handle such load imbalances better [in progress]
 - Controlling the rate and resource usage of the reconstruct read [in progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

20-6

RCA - Azure Data Factory Services - Multiple Regions

Summary of impact: Between 12:25 and 15:00 UTC on 20 Jun 2018, customers using Data Factory V1, Data Factory V2, Data Movement, SSIS Integration Runtime, and Data Movement & Dispatch may have experienced errors, including, but not limited to, pipeline execution errors, copy activity and stored procedure activity errors, manual and scheduled SSIS activity failures, and data movement and dispatch failures.

Workaround: Self-hosted Integration Runtime customers had to manually restart their Integration Runtime environments.

Root cause and mitigation: The underlying cause of this incident was a faulty service deployment, which resulted in a configuration error being deployed alongside the front-end microservice. This configuration is leveraged for all communications to the underlying data movement microservices. As a result of the configuration error, no API calls succeeded. While the deployment followed all documented safe-deployment processes, an earlier update in the configuration rollout caused a faulty configuration to be deployed unexpectedly. Because the service deployment automatically loads the latest configuration, the configuration error blocked the microservice API path. Engineers rolled back the faulty deployment, which reverted to the correct configuration and allowed API calls to the underlying microservices to succeed.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): Engineers will modify the configuration issuing and authorization process, so that faulty configurations cannot be deployed in future. This will ensure that any deployment activity with these configurations will not cause an interruption to downstream services.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

19-6

RCA - Service Availability Issue in North Europe

Summary of impact: From 17:44 UTC on 19 Jun 2018 to 04:30 UTC on 20 Jun 2018, customers using Azure services in North Europe may have experienced connection failures when attempting to access resources hosted in the region. Customers leveraging a subset of Azure services may have experienced residual impact for a sustained period after mitigation of the underlying issue.

Final root cause and mitigation: On 19 Jun 2018, Datacenter Critical Environments systems in one of our datacenters in the North Europe region experienced an increase in outside air temperature and humidity, which triggered a confluence of events that contributed to this outage.

The humidity increase resulted in a datacenter control system calling for evaporative cooling across the data center campus. When the call for cooling occurred, it uncovered issues within the control system which led to failures within the mechanical systems in the data centers. Manual intervention was required to help remediate the escalating temperature and humidity issues within the data center.

Because of the control system issues, the team's manual intervention did not immediately remediate the environmental conditions. However, the team was eventually able to clear and reset the control systems, bringing the data center cooling systems back under control and lowering the temperature and humidity.

This unexpected rise in humidity levels in the operational areas caused multiple Top of Rack (ToR) network devices and hard disk drives supporting two Storage scale units in the region to experience hardware component failures. These failures caused significant latency and/or intra-scale-unit communication issues between servers, which led to availability issues for customers with data hosted on the affected scale units.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Correct issues for datacenter evaporative cooling systems in the affected data center [complete]
2. Audit controls system and test functionalities [in progress]
3. Update standard and emergency operational procedures for environmental excursion [complete]
4. Engineers continue to analyze detailed event data to determine if additional environmental systems modifications are required [in progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

14-6

Azure Bot Services - Mitigated

Summary of impact: Between 07:15 and 10:06 UTC on 14 Jun 2018, a subset of customers using Azure Bot Service may have experienced difficulties while trying to connect to bot resources.

Preliminary root cause: Engineers identified a code defect with a recent deployment task as the potential root cause.

Mitigation: Engineers performed a rollback of the recent deployment task to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

13-6

RCA - Multiple Azure Services availability issues and Service Management issues for a subset of Classic Azure resources - South Central US

Summary of impact: Between 15:57 UTC and 19:00 UTC on 13 Jun 2018, a subset of customers in South Central US may have experienced difficulties connecting to resources hosted in this region. Engineers determined that this was caused by an underlying Storage availability issue. Other services that leverage Storage in this region may also have experienced related impact. These services may include: Virtual Machines, App Service, Visual Studio Team Services, Logic Apps, Azure Backup, Application Insights, Service Bus, Event Hub, Site Recovery, Azure Search, and Media Services. In addition, customers may have experienced failures when performing service management operations on their resources. Communications regarding the service management operations issue were published to the Azure Portal.

Root cause and mitigation: One storage scale unit in the South Central US region experienced a significant increase in load, which caused increased resource utilization on the backend servers in the scale unit. The increased resource utilization caused several backend roles to become temporarily unresponsive, resulting in timeouts and other errors, which impacted VMs and other storage-dependent services. The Storage service utilizes automatic load balancing to help mitigate this type of incident; however, in this case automatic load balancing was not sufficient, and engineer intervention was required. To stabilize the scale unit and backend roles, engineers rebalanced the load. Impacted services recovered shortly thereafter. Engineers are continuing to investigate the cause of the increase in load and to implement additional steps to help prevent a recurrence.
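
As a purely illustrative sketch of the kind of rebalancing engineers performed, the following greedy heuristic shifts load from the hottest server toward the coolest one until no server exceeds a threshold; it is a generic pattern, not Azure Storage's load balancer, and the numbers are invented.

# Illustrative greedy rebalance: repeatedly shift one unit of load from the
# most loaded server to the least loaded one until no server exceeds the
# threshold (or no further improvement is possible).
def rebalance(load_by_server, threshold):
    load = dict(load_by_server)
    while max(load.values()) > threshold:
        hottest = max(load, key=load.get)
        coolest = min(load, key=load.get)
        if load[hottest] - load[coolest] <= 1:
            break  # nothing more to gain from moving load
        load[hottest] -= 1
        load[coolest] += 1
    return load

print(rebalance({"srv1": 95, "srv2": 40, "srv3": 30}, threshold=70))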

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Improve the load balancing system to better manage load across servers – pending
2. Improve resource usage management to better handle low-resource situations – pending
3. Improve server restart time after an incident that results in degraded Storage availability – in progress
4. Improve failover strategy to help ensure that impacted services can recover more quickly – in progress

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

12-6

Visual Studio App Center

Summary of impact: Between 03:37 and 08:06 UTC on 12 Jun 2018, a subset of Visual Studio App Center customers may have experienced impact to build operations such as failure to start. 

Preliminary root cause: Engineers determined that a recent deployment task impacted instances of a backend service which became unhealthy, preventing requests from completing.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue and advised that the actions taken also mitigate the issue for the limited subset of Visual Studio Team Services customers who may have experienced delays and failures of VSTS builds using Hosted Agent pools in Central US.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

10-6

RCA - Multiple Azure Services impacted in West Europe

Summary of impact: Between approximately 02:24 and 06:20 UTC on 10 Jun 2018, a subset of customers in the West Europe region may have experienced difficulties connecting to their resources due to a storage issue in this region. Multiple Azure Services with a dependency on Storage and/or Virtual Machines also experienced secondary impact to their resources for some customers. Impacted services included: Storage, Virtual Machines, SQL Databases, Backup, Azure Site Recovery, Service Bus, Event Hub, App Service, Logic Apps, Automation, Data Factory, Log Analytics, Stream Analytics, Azure Map, Azure Search, Media Services.

Root cause and mitigation: In the West Europe region, two storage scale units were affected by an unexpected increase in inter-cluster network latency. This network latency caused geo-replication delays and triggered increased resource utilization on the storage scale units. The increased resource utilization caused several backend roles to become temporarily unresponsive, resulting in timeouts and other errors that impacted VMs and other storage-dependent services. To stabilize the backend roles, engineers rebalanced the load, which stabilized the scale units. The service typically handles this load balancing automatically, but in this case it was not sufficient and engineer involvement was required.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
- Improve load balancing system to better manage the load across servers - in progress.  
- Improve resource usage management to better handle low-resource situations - in progress.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

May 2018

31-5

RCA - App Service - Service Management Issues

Summary of impact: Between 21:14 and 22:50 UTC on 30 May 2018, and between 06:00 and 08:45 UTC on 31 May 2018, a subset of customers using App Service, Logic Apps, and Functions may have experienced intermittent errors viewing App Service resources, or may have seen the error message "This resource is unavailable" when attempting to access resources via the Azure portal. In addition, a subset of customers may have received failure notifications when performing service management operations - such as create, update, and delete - for resources. Existing runtime resources were not impacted.

Root cause and mitigation: Due to the removal of some resource types, the platform triggered a subscription-syncing workflow in Resource Manager. This caused an extremely high volume of update calls to be issued across multiple App Service subscriptions. The high call volume caused a buildup of subscription update operation objects in the backend database and resulted in sub-optimal performance of high-frequency queries. Database usage reached a critical point and led to resource starvation for other API calls in the request pipeline. As a result, some customer requests encountered long processing latencies or failures. Engineers stopped the workflow to mitigate the issue.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Ensure that subscription update operation objects are cleaned up immediately [complete]

2. Better optimization implemented for high frequency queries [complete]

3. Additional monitoring to maintain effective throttling from resource manager [pending]

30-5

RCA - Virtual Machines - Service Management Issues - USGov Virginia

Summary of impact: Between 05:30 and 17:46 EST on 30 May 2018, a subset of customers in USGov Virginia may have been unable to manage some Virtual Machines hosted in the region. Restart attempts may have failed, or machines may have appeared to be stuck in a starting state. Other dependent Azure services experienced downstream impact; impacted services included Media Services, Redis Cache, Log Analytics, and Virtual Networks.

Root cause and mitigation: Engineers determined that one of the platform services was using high CPU due to an unexpected scenario. Engineers were aware of the potential issue with this platform service and already had a fix in place; unfortunately, this incident occurred before engineers rolled out the fix to the environment. The issue caused the core platform service responsible for service management requests to become unhealthy, preventing requests from completing. Engineers used platform tools to disable the high-CPU-consuming service. In addition, engineers applied a service configuration change to expedite recovery and return the core platform service to a healthy state.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Apply a patch to the service that was using high CPU and complete its rollout to all environments [Completed].
2. Actively monitor resource utilization (CPU, memory, and hard drive) of all platform services that could degrade the core platform [Completed].
3. Define strict resource utilization thresholds for platform services [In Progress].

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

25-5

SQL DB - West Europe - Mitigated

Summary of impact: Between 05:00 UTC on 25 May 2018 and 01:36 UTC on 26 May 2018, a subset of customers using SQL Database in West Europe may have experienced intermittent increases in latency, timeouts and/or connectivity issues when accessing databases in this region. New connections to existing databases in this region may also have resulted in an error or timeout, and existing connections may have been terminated.

Preliminary root cause: Engineers identified that this issue was related to a recent deployment.

Mitigation: Engineers rolled back the deployment, performed a configuration change and redeployed to mitigate the issue.

23-5

RCA - App Service, Logic Apps, Functions - Service Management Issues

Summary of impact: Between 18:20 and 21:15 UTC on 23 May 2018, a subset of customers using App Service, Logic Apps, and Azure Functions may have experienced latency or timeouts when viewing resources in the Azure Portal. Some customers may also have seen errors when performing service management operations such as resource creation, update, delete, move and scaling within the Azure Portal or through programmatic methods. Application runtime availability was not impacted by this event.

Root cause and mitigation: The root cause of this issue was that some instances of a backend service in App Service, which are responsible for processing service management requests, experienced a CPU usage spike. This spike was due to a large number of data-intensive queries being processed simultaneously across multiple instances, causing high load in the backend service. These queries required processing very large data sets, which consumed higher-than-normal CPU and prevented other requests from completing.
The issue was automatically detected by our internal monitoring, and engineers deployed a hotfix to throttle the CPU-intensive queries and free up resources. The incident was fully mitigated by 21:15 UTC on 23 May 2018.
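
A minimal sketch of the general throttling idea, using a semaphore to cap how many expensive queries run concurrently, is shown below; it is a generic pattern written for illustration, not the App Service hotfix itself.

# Generic sketch: cap concurrent CPU-intensive queries with a semaphore so a
# burst of expensive requests cannot starve the rest of the request pipeline.
import threading

MAX_EXPENSIVE_QUERIES = 4
_expensive_slots = threading.BoundedSemaphore(MAX_EXPENSIVE_QUERIES)

def run_expensive_query(execute, timeout=5.0):
    """Run an expensive query only if a slot frees up within the timeout."""
    if not _expensive_slots.acquire(timeout=timeout):
        raise RuntimeError("throttled: too many expensive queries in flight")
    try:
        return execute()
    finally:
        _expensive_slots.release()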

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Implement permanent throttling of queries to prevent causing system load – Completed.
2. Improve throughput and overall performance by adding processing optimization for CPU intensive queries – In Progress
3. Enhance incident notification to provide quicker and more accurate customer communications – In Progress

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

22-5

Visual Studio Team Service - Availability and Performance Issues

Summary of impact: Between 14:55 and 16:55 UTC on 22 May 2018, a subset of customers using Visual Studio Team Services may have experienced degraded performance and latency when accessing accounts or navigating through workspaces. Additional information can be found on the VSTS blog.

Preliminary root cause: Engineers determined that the underlying root cause was a recent configuration settings change which conflicted with prior settings.

Mitigation: Engineers rolled back the configuration settings to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

22-5

RCA - Azure Active Directory - Issues Authenticating to Azure Resources via Microsoft Account (MSA)

Summary of impact: Between 02:19 and 03:35 UTC on 22 May 2018, a subset of customers attempting to sign in to their Microsoft Account (MSA) using Azure Active Directory (AAD) may have experienced intermittent difficulties when attempting to authenticate to resources which are dependent on AAD. In addition, Visual Studio Team Services (VSTS) customers may have experienced errors attempting to sign in to the VSTS portal through MSA using https://<Accountname>.visualstudio.com. Additional experiences that rely on Microsoft Account sign-in may have had intermittent errors.

Root cause and mitigation: Engineers determined that a recent maintenance task introduced a misconfiguration which impacted instances of a backend cache service. This resulted in some authentication requests failing to complete. The issue was immediately detected, and engineers initiated a manual failover of the service to healthy data centers to mitigate the impact.
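
The next steps below mention automating failovers; a minimal, hypothetical sketch of a health-probe-driven failover decision is shown here, with the probe and failover calls standing in for internal tooling and the data center names invented for the example.

# Hypothetical sketch: fail traffic over to the first data center whose health
# probe succeeds. probe_health() and fail_over_to() are placeholders.

DATA_CENTERS = ["dc-primary", "dc-secondary", "dc-tertiary"]  # example names

def probe_health(dc):
    """Placeholder: return True if the cache service in this data center is healthy."""
    raise NotImplementedError("stand-in for internal health probes")

def fail_over_to(dc):
    """Placeholder: redirect authentication traffic to the given data center."""
    raise NotImplementedError("stand-in for internal traffic management")

def automated_failover():
    for dc in DATA_CENTERS:
        if probe_health(dc):
            fail_over_to(dc)
            return dc
    raise RuntimeError("no healthy data center available")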

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Increase isolation between service tiers to remove the risk of component-level destabilization affecting the service and to reduce the fault domain – in progress.
2. Automate failovers to mitigate customer impact scenarios faster – in progress.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey: