
Azure status history


July 2018

7/16

RCA - Networking - Service Availability

Summary of impact: Between 14:55 and 16:15 UTC on 16 July 2018, a subset of Azure customers in West US, West Central US, India South, India Central, and Australia East may have experienced difficulties connecting to Azure endpoints, which in turn may have caused errors when accessing Microsoft services in the impacted regions.

Root cause and mitigation: Azure uses the industry-standard Border Gateway Protocol (BGP) to advertise network paths (routes) for all Azure services. As part of the monthly route prefix update process for ExpressRoute, a misconfigured ExpressRoute device began advertising a subset of Microsoft routes that conflicted with the Microsoft global network. Traffic destined for the public endpoints via these routes, whether from outside or from within an impacted region, could not reach its destination. This included Azure Virtual Machines and other services with a dependency on impacted Azure storage accounts. Engineers responded to the alerts, identified the cause of the routing anomaly, and isolated the affected portion of the network to mitigate the issue.
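
One of the repair items below is configuration validation at deployment time. As a rough, hypothetical illustration of that idea (not Microsoft's actual tooling), a deployment gate could check every prefix a device intends to advertise against an allow-list and against protected address space before the advertisement is accepted; the prefixes and function below are made up for the sketch.

    import ipaddress

    # Hypothetical allow-list of prefixes this ExpressRoute device may advertise.
    ALLOWED_PREFIXES = [ipaddress.ip_network("203.0.113.0/24")]
    # Hypothetical prefixes owned by the global network that a customer-facing
    # device must never advertise.
    PROTECTED_PREFIXES = [ipaddress.ip_network("198.51.100.0/22")]

    def validate_advertisement(prefixes):
        """Return the subset of prefixes that are safe to advertise."""
        safe = []
        for raw in prefixes:
            net = ipaddress.ip_network(raw)
            # Reject anything that overlaps protected address space.
            if any(net.overlaps(p) for p in PROTECTED_PREFIXES):
                print(f"rejected {net}: conflicts with a protected prefix")
                continue
            # Require the prefix to fall inside the device's allow-list.
            if not any(net.subnet_of(allowed) for allowed in ALLOWED_PREFIXES):
                print(f"rejected {net}: not in the allow-list")
                continue
            safe.append(net)
        return safe

    print(validate_advertisement(["203.0.113.0/25", "198.51.100.0/24"]))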

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Enhanced configuration validation at the time of deployment and automatic detection of configuration anomalies post deployment – in progress

2. Monitoring and alerting of incorrect route advertisement into the Microsoft global network for faster identification and mitigation – in progress

3. Enforce network route protection to ensure that route advertisements are more limited in scope and radius – in progress

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

7/11

RCA - SQL Database - Intermittent Database Login Failures - South Central US

Summary of impact: Between 18:38 UTC on 11 Jul 2018 and 01:13 UTC on 12 Jul 2018, a subset of customers using SQL Database in South Central US may have experienced intermittent issues accessing services. New connections to existing databases in this region may have resulted in an error or timeout, and existing connections may have been terminated. Engineers determined that the initial issue was mitigated at 20:54 UTC, but login failures recurred at 22:25 UTC. The overall impact to the region was less than 10% of the traffic served to SQL Databases, and the impact to customers in the region would have been substantially lower, as retries would have been successful in most cases.

Root cause and mitigation: The South Central US region has two clusters of gateway machines (connectivity routers) serving login traffic. Engineers determined that a set of network devices serving the gateway machines malfunctioned, causing packet drops that led to intermittent connectivity to Azure SQL DB in the region. One of the two gateway clusters serving the region was impacted: 37 out of 100 gateway nodes in the region experienced packet drops, and the login success rate in the region dropped to 90%. Retries would have been successful in most cases. The triggering cause was increased load in the region. The issue was mitigated by load balancing across the two clusters in the region, reducing the load on the gateway machines in the impacted region.
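
Since retries were expected to succeed for most connections, a client-side retry with exponential backoff is the usual way to ride out this kind of intermittent gateway failure. The sketch below is generic and hedged: it assumes a caller-supplied connect() callable rather than any specific SQL driver, and the attempt counts and delays are illustrative.

    import random
    import time

    def connect_with_retry(connect, attempts=5, base_delay=0.5):
        """Call a caller-supplied connect() callable, retrying transient
        failures with exponential backoff plus jitter. Re-raises the last
        error if every attempt fails."""
        for attempt in range(attempts):
            try:
                return connect()
            except Exception as err:  # real code should catch only the driver's transient errors
                if attempt == attempts - 1:
                    raise
                delay = base_delay * (2 ** attempt) + random.uniform(0, 0.25)
                print(f"connect failed ({err}); retrying in {delay:.2f}s")
                time.sleep(delay)

    # Example with a stand-in connect function; a real caller would pass a
    # lambda wrapping its SQL driver's connect call.
    connection = connect_with_retry(lambda: object())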

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future. In this case, work is underway to increase gateway capacity in the region and to upgrade network devices.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

7/1

RCA - IoT Hub - Connectivity Issues

Summary of impact: Between 01:30 UTC on 01 Jul 2018 and 18:00 UTC on 02 Jul 2018, a subset of customers using Azure IoT Hub in West Europe, North Europe, East US, and West US may have experienced difficulties connecting to resources hosted in these regions.

Customer impact: During the impact, symptoms may have included failures when creating IoT Hubs, connection latency, and disconnects in the impacted regions.

Root cause and mitigation: A subset of IoT Hub scale units experienced a significant increase in load due to a high volume of operational activity and recurring import jobs across multiple instances, which ultimately throttled the backend system. The issue was exacerbated because the load was unevenly distributed across the clusters.
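
The first repair item below is to throttle queries before they can overload the backend. A token bucket is a common way to implement such a throttle; the sketch below only illustrates the pattern, and the rate and burst limits are hypothetical.

    import time

    class TokenBucket:
        """Minimal token-bucket throttle: allow up to `rate` operations per
        second, with bursts of up to `capacity`."""
        def __init__(self, rate, capacity):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    bucket = TokenBucket(rate=100, capacity=20)  # hypothetical limits
    if not bucket.allow():
        print("throttled: reject or queue the query")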

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Optimize throttling of queries to prevent system load – In Progress 
2. Fine-tune load balancing algorithm to provide quicker detection and alerting – In Progress
3. Improve failover strategy to quickly mitigate – In Progress

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

June 2018

6/27

RCA - App Service - West Europe

Summary of impact: Between 16:00 UTC on 27 Jun 2018 and 13:00 UTC on 28 Jun 2018, a subset of customers using App Service in West Europe may have received HTTP 500-level response codes, timeouts or high latency when accessing App Service (Web, Mobile and API Apps) deployments hosted in this region.

Root cause and mitigation: During a recent platform deployment, several App Service scale units in West Europe encountered a backend performance regression due to a modification in the telemetry collection systems. Due to this regression, customers with .NET applications running large workloads may have encountered application slowness. The root cause was an inefficiency in the telemetry collection pipeline, which caused overall virtual machine performance degradation and slowdown. The issue was detected automatically, and the engineering team was engaged. A mitigation to remove the inefficiency causing the issue was applied at 10:00 UTC on June 28. After further review, a secondary mitigation was applied to a subset of VMs at 22:00 UTC on June 28. More than 90% of the impacted customers saw mitigation at this time. After additional monitoring, a final mitigation was applied to a single remaining scale unit at 15:00 UTC on June 29. All customers were mitigated at this time.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
• Removing the changes that caused the regression
• Reviewing and, if necessary, adjusting the performance regression detection and alerting

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

6/25

RCA - Multiple Services - South Central US

Summary of impact: Between 19:40 and 20:52 UTC on 25 Jun 2018, a subset of customers in South Central US may have experienced difficulties connecting to resources hosted in this region and/or received 500-level errors. Virtual Machines may have rebooted unexpectedly. Impacted services included: Storage, Virtual Machines, Key Vault, Site Recovery, Machine Learning, Cloud Shell, Logic Apps, Redis Cache, Visual Studio Team Services, Service Bus, ExpressRoute, Application Insights, Backup, Networking, API Management, App Service (Linux) and App Service.

Root cause and mitigation: One storage scale unit in South Central US experienced a load imbalance. This load imbalance caused network congestion on the backend servers. Due to the congestion, many read requests started to time out because the servers hosting the data couldn't complete the requests in time. Azure Storage uses erasure coding to store data more efficiently and durably. Since the servers hosting the primary copy of the data didn't respond in time, an alternative read mode (known as a reconstruct read) was triggered, which puts more load on the network. The network load on the backend system became high enough that many customer requests, including those from VMs, started timing out. Our load balancing system normally spreads load across the servers to prevent hot spots from being created, but in this case it was not sufficient. The system recovered on its own. The Storage engineering team applied additional configuration changes to spread the load more evenly and removed some of the load on the scale unit.
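
Azure Storage's production erasure codes are more sophisticated than this, but a single-parity XOR code is enough to illustrate what a reconstruct read is: when the primary fragment doesn't answer in time, the data is rebuilt from the remaining fragments plus parity, at the cost of extra network traffic. Everything below (fragment layout, helper names, timeouts) is a hypothetical sketch, not the Storage implementation.

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def read_with_reconstruct(read_fragment, fragment_ids, parity_id, primary_id, timeout_s):
        """Try the primary fragment first; on timeout, rebuild it by XOR-ing
        the remaining data fragments with the parity fragment."""
        try:
            return read_fragment(primary_id, timeout=timeout_s)
        except TimeoutError:
            # Reconstruct read: more network traffic, because every other
            # fragment plus the parity block must be fetched.
            result = read_fragment(parity_id, timeout=timeout_s)
            for fid in fragment_ids:
                if fid != primary_id:
                    result = xor_bytes(result, read_fragment(fid, timeout=timeout_s))
            return result

    # Tiny in-memory demo with two data fragments and one parity fragment.
    fragments = {0: b"\x01\x02", 1: b"\x03\x04"}
    store = {**fragments, "p": xor_bytes(fragments[0], fragments[1])}

    def read_fragment(fid, timeout):
        if fid == 0:
            raise TimeoutError("primary server overloaded")  # simulate congestion
        return store[fid]

    print(read_with_reconstruct(read_fragment, [0, 1], "p", 0, timeout_s=1.0))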

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
 - Load balancing system changes to handle such load imbalances better [in progress]
 - Controlling the rate and resource usage of the reconstruct read [in progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

6/20

RCA - Azure Data Factory Services - Multiple Regions

Summary of impact: Between 12:25 and 15:00 UTC on 20 Jun 2018, customers using Data Factory V1, Data Factory V2, Data Movement, SSIS Integration Runtime, and Data Movement & Dispatch may have experienced errors, including, but not limited to, pipeline execution errors, copy activity and store procedure activity errors, manual and scheduled SSIS activity failures, and data movement and dispatch failures.

Workaround: Self-hosted Integration Runtime customers had to manually restart their self-hosted Integration Runtime environments.

Root cause and mitigation: The underlying cause of this incident was a faulty service deployment, which resulted in a configuration error being deployed alongside the front-end micro service. This configuration is used for all communications to the underlying data movement micro services, and as a result of the configuration error, no API calls succeeded. While the deployment did follow all documented safe-deployment processes, an earlier update in the configuration rollout caused a configuration to be deployed unexpectedly. Because the service deployment automatically loads the latest configuration, the configuration error blocked the micro service API path. Engineers rolled back the faulty deployment, which reverted to the correct configuration and allowed API calls to the underlying micro services to succeed.
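
The stated follow-up is to prevent faulty configurations from being deployed at all. A minimal pre-deployment validation gate might look like the sketch below; the schema keys and the example configuration are hypothetical, and a real gate would validate against the service's actual contract.

    # Hypothetical required keys and types for a front-end service configuration.
    CONFIG_SCHEMA = {
        "api_endpoint": str,
        "timeout_seconds": int,
        "data_movement_enabled": bool,
    }

    def validate_config(config):
        """Return a list of problems; an empty list means the config may be deployed."""
        problems = []
        for key, expected_type in CONFIG_SCHEMA.items():
            if key not in config:
                problems.append(f"missing key: {key}")
            elif not isinstance(config[key], expected_type):
                problems.append(f"{key} should be {expected_type.__name__}")
        return problems

    candidate = {"api_endpoint": "https://example.invalid", "timeout_seconds": "30"}
    issues = validate_config(candidate)
    if issues:
        print("blocking deployment:", issues)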

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): engineers will modify the configuration issuing and authorization process so that faulty configurations cannot be deployed in the future. This will ensure that any deployment activity with these configurations does not cause an interruption to downstream services.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

6/19

RCA - Service availability issue in North Europe

Summary of impact: From 17:44 UTC on 19 Jun 2018 to 04:30 UTC on 20 Jun 2018, customers using Azure services in North Europe may have experienced connection failures when attempting to access resources hosted in the region. Customers leveraging a subset of Azure services may have experienced residual impact for a sustained period after the underlying issue was mitigated.

Preliminary root cause: On 19 Jun 2018, Data Center Critical Environments systems in one of our datacenters in the North Europe region experienced an increase in outside air temperature. While maintaining inside operating temperatures within operational specifications, we experienced a control systems failure in a limited area of the datacenter that triggered an unexpected rise in humidity levels within two of our colocation rooms. This rise in humidity in the operational areas caused multiple Top of Rack (ToR) network devices and hard disk drives supporting two Storage scale units in the region to experience hardware component failures. These failures caused significant latency and/or intra-scale-unit communication issues between the servers, which led to availability issues for customers with data hosted on the affected scale units.

Mitigation: The control systems failure and humidity rise were quickly resolved by datacenter engineers restoring normal operating conditions inside the facility; however, the server node reboots and hard drive failures required Storage, Compute, and downstream services to perform additional structured recovery operations to fully restore service. Once network communication to the affected Storage scale units was restored, Storage availability was restored for most customers. A limited subset of customers experienced an extended recovery while engineers worked to restore failed disk drives.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

 - Engineers continue to analyze detailed event data to determine if additional environmental systems modifications are required.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

6/14

Azure Bot Services - Mitigated

Summary of impact: Between 07:15 and 10:06 UTC on 14 Jun 2018, a subset of customers using Azure Bot Service may have experienced difficulties while trying to connect to bot resources.

Preliminary root cause: Engineers identified a code defect with a recent deployment task as the potential root cause.

Mitigation: Engineers performed a rollback of the recent deployment task to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

6/13

RCA - Multiple Azure Services availability issues and Service Management issues for a subset of Classic Azure resources - South Central US

Summary of impact: Between 15:57 and 19:00 UTC on 13 Jun 2018, a subset of customers in South Central US may have experienced difficulties connecting to resources hosted in this region. Engineers determined that this was caused by an underlying Storage availability issue. Other services that leverage Storage in this region may also have experienced related impact, including: Virtual Machines, App Service, Visual Studio Team Services, Logic Apps, Azure Backup, Application Insights, Service Bus, Event Hub, Site Recovery, Azure Search, and Media Services. In addition, customers may have experienced failures when performing service management operations on their resources. Communications for the service management operations issue were published to the Azure Portal.

Root cause and mitigation: One storage scale unit in the South Central US region experienced a significant increase in load, which caused increased resource utilization on the backend servers in the scale unit. The increased resource utilization caused several backend roles to become temporarily unresponsive, resulting in timeouts and other errors that impacted VMs and other storage-dependent services. The Storage service uses automatic load balancing to help mitigate this type of incident, but in this case automatic load balancing was not sufficient and engineer intervention was required. To stabilize the scale unit and backend roles, engineers rebalanced the load. Impacted services recovered shortly thereafter. Engineers are continuing to investigate the cause of the increase in load and to implement additional steps to help prevent a recurrence.
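
When automatic balancing cannot keep up, rebalancing comes down to moving load off the hottest servers until every server is back under a target threshold. The toy greedy rebalancer below illustrates only that idea; the utilization numbers and threshold are hypothetical and unrelated to the actual Storage balancer.

    def rebalance(load_by_server, threshold):
        """Greedily shift load from the hottest server to the coolest until no
        server exceeds `threshold`. Returns the list of moves performed."""
        moves = []
        while True:
            hot = max(load_by_server, key=load_by_server.get)
            cool = min(load_by_server, key=load_by_server.get)
            if load_by_server[hot] <= threshold or hot == cool:
                break
            shift = min(load_by_server[hot] - threshold,
                        threshold - load_by_server[cool])
            if shift <= 0:
                break  # no headroom anywhere; more capacity is needed, not rebalancing
            load_by_server[hot] -= shift
            load_by_server[cool] += shift
            moves.append((hot, cool, shift))
        return moves

    servers = {"s1": 95, "s2": 40, "s3": 55}   # percent utilization, hypothetical
    print(rebalance(servers, threshold=70), servers)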

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Improve the load balancing system to better manage the load across servers – pending
2. Improve resource usage management to better handle low-resource situations – pending
3. Improve server restart time after an incident that results in degraded Storage availability – in progress
4. Improve failover strategy to help ensure that impacted services can more quickly recover – in progress

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

6/12

Visual Studio App Center

Summary of impact: Between 03:37 and 08:06 UTC on 12 Jun 2018, a subset of Visual Studio App Center customers may have experienced impact to build operations, such as builds failing to start.

Preliminary root cause: Engineers determined that a recent deployment task impacted instances of a backend service which became unhealthy, preventing requests from completing.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue and have advised that the actions taken also provide mitigation for the limited subset of Visual Studio Team Services customers that may have experienced delays and failures of VSTS builds using Hosted Agent pools in Central US.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

6/10

RCA - Multiple Azure Services impacted in West Europe

Summary of impact: Between approximately 02:24 and 06:20 UTC on 10 Jun 2018, a subset of customers in the West Europe region may have experienced difficulties connecting to their resources due to a storage issue in this region. Multiple Azure services with a dependency on Storage and/or Virtual Machines also experienced secondary impact to their resources for some customers. Impacted services included: Storage, Virtual Machines, SQL Database, Backup, Azure Site Recovery, Service Bus, Event Hub, App Service, Logic Apps, Automation, Data Factory, Log Analytics, Stream Analytics, Azure Maps, Azure Search, and Media Services.

Root cause and mitigation: In the West Europe region, two storage scale units were affected by an unexpected increase in inter-cluster network latency. This network latency caused geo-replication delays and triggered increased resource utilization on the storage scale units. The increased resource utilization caused several backend roles to become temporarily unresponsive, resulting in timeouts and other errors that impacted VMs and other storage-dependent services. To stabilize the backend roles, engineers rebalanced the load, which stabilized the scale units. The service normally handles this load balancing automatically, but in this case it was not sufficient and required engineer involvement.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
- Improve the load balancing system to better manage the load across servers – in progress.
- Improve resource usage management to better handle low-resource situations – in progress.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

May 2018

5/31

RCA - App Service - Service Management Issues

Summary of impact: Between 21:14 and 22:50 UTC on 30 May 2018, and between 06:00 and 08:45 UTC on 31 May 2018, a subset of customers using App Service, Logic Apps, and Functions may have experienced intermittent errors viewing App Service resources, or may have seen the error message, "This resource is unavailable" when attempting to access resources via the Azure portal. In addition, a subset of customers may have received failure notifications when performing service management operations - such as create, update, delete - for resources. Existing runtime resources were not impacted.

Root cause and mitigation: Due to the removal of some resource types, the platform triggered a subscription syncing workflow in Resource Manager. This caused an extremely high volume of update calls to be issued across multiple App Service subscriptions. The high call volume caused a buildup of subscription update operation objects in the backend database and resulted in sub-optimal performance of high-frequency queries. Database usage reached a critical point and led to resource starvation for other API calls in the request pipeline. As a result, some customer requests encountered long processing latencies or failures. Engineers stopped the workflow to mitigate the issue.
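
One common way to keep this kind of per-subscription update storm from piling up in a backend store is to coalesce pending operations, so that only the most recent update per subscription stays queued. The queue below is a generic, hypothetical sketch of that pattern, not the Resource Manager or App Service implementation.

    from collections import OrderedDict

    class CoalescingQueue:
        """Keep at most one pending update per subscription; a newer update
        replaces the older one instead of piling up behind it."""
        def __init__(self):
            self._pending = OrderedDict()

        def enqueue(self, subscription_id, payload):
            # Drop any older pending update for the same subscription.
            self._pending.pop(subscription_id, None)
            self._pending[subscription_id] = payload

        def dequeue(self):
            if not self._pending:
                return None
            return self._pending.popitem(last=False)  # oldest pending item first

    q = CoalescingQueue()
    for i in range(3):
        q.enqueue("sub-123", {"version": i})   # only the last update survives
    q.enqueue("sub-456", {"version": 0})
    print(q.dequeue(), q.dequeue(), q.dequeue())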

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Ensure that subscription update operation objects are cleaned up immediately [complete]

2. Better optimization implemented for high frequency queries [complete]

3. Additional monitoring to maintain effective throttling from resource manager [pending]

5/30

RCA - Virtual Machines - Service Management Issues - USGov Virginia

Summary of impact: Between 05:30 and 17:46 EST on 30 May 2018, a subset of customers in USGov Virginia may have been unable to manage some Virtual Machines hosted in the region. Restart attempts may have failed, or machines may have appeared to be stuck in a starting state. Other dependent Azure services experienced downstream impact, including Media Services, Redis Cache, Log Analytics, and Virtual Networks.

Root cause and mitigation: Engineers determined that one of the platform services was consuming high CPU due to an unexpected scenario. Engineers were aware of the potential issue with this platform service and already had a fix in place; unfortunately, this incident occurred before engineers rolled out the fix to the environment. The issue caused the core platform service responsible for service management requests to become unhealthy, preventing requests from completing. Engineers used platform tools to disable the service that was consuming high CPU. In addition, engineers applied a service configuration change to expedite recovery and return the core platform service to a healthy state.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Apply a fix patch on the service that was using high CPU and complete its rollout to all environments [Completed].
2. Actively monitor resource utilization (CPU, Memory and Hard Drive) from all platform services that could impact core platform degradation [Completed].
3. Define strict resource utilization thresholds for platform services [In Progress].

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

5/25

SQL DB - West Europe - Mitigated

Summary of impact: Between 05:00 UTC on 25 May 2018 and 1:36 UTC on 26 May 2018, a subset of customers using SQL Database in West Europe may have experienced intermittent increases in latency, timeouts and/or connectivity issues when accessing databases in this region. New connections to existing databases in this region may also have resulted in an error or timeout, and existing connections may have been terminated.

Preliminary root cause: Engineers identified that this issue was related to a recent deployment.

Mitigation: Engineers rolled back the deployment, performed a configuration change and redeployed to mitigate the issue.

5/23

RCA - App Service, Logic Apps, Functions - Service Management Issues

Summary of impact: Between 18:20 and 21:15 UTC on 23 May 2018, a subset of customers using App Service, Logic Apps, and Azure Functions may have experienced latency or timeouts when viewing resources in the Azure Portal. Some customers may also have seen errors when performing service management operations such as resource creation, update, delete, move and scaling within the Azure Portal or through programmatic methods. Application runtime availability was not impacted by this event.

Root cause and mitigation: The root cause of this issue was that some instances of a backend service in App Service, which are responsible for processing service management requests, experienced a CPU usage spike. The spike was due to a large number of data-intensive queries being processed simultaneously across multiple instances, causing high load in the backend service. These queries required processing of very large data sets, which consumed higher than normal CPU and prevented other requests from completing.
The issue was automatically detected by our internal monitoring and engineers deployed a hotfix to throttle the CPU intensive queries and free up resources. The incident was fully mitigated by 21:15 UTC on 23 May 2018.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Implement permanent throttling of queries to prevent excessive system load – Completed.
2. Improve throughput and overall performance by adding processing optimization for CPU intensive queries – In Progress
3. Enhance incident notification to provide quicker and more accurate customer communications – In Progress

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

5/22

Visual Studio Team Services - Availability and Performance Issues

Summary of impact: Between 14:55 and 16:55 UTC on 22 May 2018, a subset of customers using Visual Studio Team Services may have experienced degraded performance and latency when accessing accounts or navigating through workspaces. Additional information can be found on the VSTS blog.

Preliminary root cause: Engineers determined that the underlying root cause was a recent configuration settings change which conflicted with prior settings.

Mitigation: Engineers rolled back the configuration settings to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

5/22

RCA - Azure Active Directory - Issues Authenticating to Azure Resources via Microsoft Account (MSA)

Summary of impact: Between 02:19 and 03:35 UTC on 22 May 2018, a subset of customers attempting to log in to their Microsoft Account (MSA) using Azure Active Directory (AAD) may have experienced intermittent difficulties when attempting to authenticate to resources which are dependent on AAD. In addition, Visual Studio Team Services (VSTS) customers may have experienced errors attempting to log in to the VSTS portal through MSA using https://<Accountname>.visualstudio.com. Additional experiences that rely on Microsoft Account sign-in may have had intermittent errors.

Root cause and mitigation: Engineers determined that a recent maintenance task introduced a configuration error which impacted instances of a backend cache service. This resulted in some authentication requests failing to complete. The issue was immediately detected, and engineers initiated a manual failover of the service to healthy datacenters to mitigate the impact.
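
The second repair item below is automating this kind of failover. Stripped to its essentials, the pattern is to probe the active region's health and shift traffic to a secondary once a failure threshold is crossed; the sketch below uses made-up region names, a stand-in probe, and an arbitrary threshold purely for illustration.

    def choose_active(regions, probe, failure_threshold=3):
        """Return the first region that answers a health probe; a region is
        abandoned after `failure_threshold` consecutive probe failures."""
        for region in regions:
            for _ in range(failure_threshold):
                if probe(region):
                    return region
            # threshold reached; fail over to the next region in the list
        raise RuntimeError("no healthy region available")

    # Stand-in probe: pretend the primary datacenter is unhealthy.
    health = {"primary-dc": False, "secondary-dc": True}
    print(choose_active(["primary-dc", "secondary-dc"], lambda region: health[region]))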

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Increase isolation between service tiers to remove the risk of component-level destabilization affecting the service and to reduce the fault domain – in progress.
2. Automate failovers to mitigate customer impact scenarios faster – in progress.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

5/17

RCA - Japan East - Service Management Operations Issues with Virtual Machines

Summary of impact: Between 23:06 UTC on 17 May 2018 and 01:30 UTC on 18 May 2018, a subset of customers using Virtual Machines in Japan East may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in this region.

Root cause and mitigation: The Azure Load Balancing service was in the process of an operating system upgrade when a hardware issue was encountered on a small number of scale units in the region. These scale units were running older hardware with a compatibility problem in which the network hardware would stop forwarding packets after a period of time and required an OS reimage to continue working. The time between failures was related to the load on the physical network. The load balancer was not directly servicing customer traffic; instead, it was supporting a subset of the Azure management services, and as a result these services were unavailable, causing failures in service management operations that interacted with them. To mitigate the issue, the operating system was rolled back to the previous version.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Roll out a fix for the network hardware issue that caused Azure Load Balancing instances to become unavailable. In the interim, a temporary mitigation has been applied to prevent this from resurfacing in any other region.

5/15

Visual Studio App Center - Distribution Center

Summary of impact: Between 17:15 and 20:00 UTC on 15 May 2018, a subset of customers using Visual Studio App Center may have received error notifications when attempting to use the distribute feature in Distribution Center.

Mitigation: Engineers performed a manual restart of a backend service to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

5/3

RCA - Multiple Services - West Central US

Summary of impact: Between 19:47 and 22:05 UTC on 03 May 2018, customers in West Central US may have experienced difficulties connecting to resources hosted in this region. The incident was caused by a configuration change deployed to update Management Access Control Lists (ACLs). The change was deployed to network switches in a subset of clusters in the West Central US Region. The configuration was rolled back to mitigate the incident.

Root cause and mitigation: Azure uses Management ACLs to limit access to the management plane of network switches to a small number of approved network management services and must occasionally update these ACLs. Management ACLs should normally have no effect on the flow of customer traffic through the router. In this incident, the Management ACLs were incorrectly applied in some network switches due to the differences in how ACLs are interpreted between different operating system versions for the network switches. In the impacted switches, this led to the blocking of critical routing Border Gateway Protocol (BGP) traffic, which led to the switches being unable to forward traffic into a subset of the storage clusters in this region. This loss of connectivity to a subset of storage accounts resulted in impact to VMs and services with a dependence on those storage accounts. Engineers responded to the alerts and mitigated the incident by rolling back the configuration changes. 
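
Since the same ACL text was interpreted differently across switch operating system versions, one defensive check (in the spirit of repair item 1 below) is to render the ACL for each target OS version and verify that control-plane traffic such as BGP (TCP port 179) is still permitted before deployment. The sketch below is hypothetical: the rule format, OS version names, and render callback are placeholders, not a real switch configuration API.

    BGP_PORT = 179

    def renders_bgp_permit(acl_rules):
        """Return True if, reading the rules top-down, BGP (TCP/179) is
        explicitly permitted before any rule that would deny it."""
        for rule in acl_rules:
            matches_bgp = rule["proto"] == "tcp" and rule["port"] in (BGP_PORT, "any")
            if matches_bgp:
                return rule["action"] == "permit"
        return False  # implicit deny at the end of the ACL

    def validate_acl_for_all_versions(acl, render):
        """`render(acl, os_version)` stands in for per-OS-version rendering."""
        for os_version in ("os-v1", "os-v2"):       # hypothetical versions
            if not renders_bgp_permit(render(acl, os_version)):
                raise ValueError(f"ACL blocks BGP on {os_version}; refusing to deploy")

    # Example: an ACL whose os-v2 rendering drops the explicit BGP permit.
    renderings = {
        "os-v1": [{"proto": "tcp", "port": 179, "action": "permit"},
                  {"proto": "tcp", "port": "any", "action": "deny"}],
        "os-v2": [{"proto": "tcp", "port": "any", "action": "deny"}],
    }
    try:
        validate_acl_for_all_versions("mgmt-acl", lambda acl, v: renderings[v])
    except ValueError as err:
        print(err)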

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Enhancing configuration deployment testing / validation processes and automation to account for variances in network switch operating systems - in progress
2. Reviewing and enhancing the deployment methods, procedures, and automation by incorporating additional network health signals - in progress
3. Monitoring and alerting improvements for Border Gateway Protocol (BGP) - in progress

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

April 2018

4/26

RCA - App Service - Canada Central

Summary of impact: Between 02:45 and 03:28 UTC and between 11:53 and 13:30 UTC on 26 Apr 2018, customers using App Service in Canada Central may have intermittently received HTTP 500-level response codes, or experienced timeouts or high latency when accessing App Service deployments hosted in this region.

Root cause and mitigation: The root cause of the issue was a significant increase in HTTP traffic to certain sites deployed in this region. The request rate was so much higher than usual that it exceeded the capacity of the load balancers in that region. Load balancer throttling rules were applied as an initial mitigation; however, beyond a certain threshold, the existing throttling rules were unable to keep up with the continued increase in request rate. A secondary mitigation was applied to the load balancer instances to further throttle the incoming requests, which fully mitigated the issue.
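
Request throttling of this kind is often implemented as a per-client counter over a time window, rejecting requests beyond a limit with HTTP 429. The fixed-window limiter below is only a sketch of that idea; the limit, window length, and client key are hypothetical and unrelated to the actual App Service front ends.

    import time
    from collections import defaultdict

    class FixedWindowLimiter:
        """Allow at most `limit` requests per client within each `window_s` window."""
        def __init__(self, limit, window_s=1.0):
            self.limit = limit
            self.window_s = window_s
            self.counts = defaultdict(int)
            self.window_start = time.monotonic()

        def allow(self, client_id):
            now = time.monotonic()
            if now - self.window_start >= self.window_s:
                self.counts.clear()          # start a new counting window
                self.window_start = now
            self.counts[client_id] += 1
            return self.counts[client_id] <= self.limit

    limiter = FixedWindowLimiter(limit=100)  # hypothetical per-second limit
    if not limiter.allow("site-A"):
        print("HTTP 429: request throttled")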

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Adding aggressive automated throttling to handle unusual increases in request rate
2. Adding network layer protection to prevent malicious spikes in traffic

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

4/26

Traffic Manager - Connectivity Issues

Summary of impact: Between 12:46 and 18:01 UTC on 26 Apr 2018, a subset of customers using Traffic Manager may have encountered sub-optimal traffic routing or may have received alerts relating to degraded endpoints. Customers were provided a workaround during the incident.

Preliminary root cause: A configuration issue with a backend network route prevented Traffic Manager health probes from reaching customer endpoints, which led to those endpoints being marked as unhealthy and traffic being routed away from them.
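
For context on the probing behavior involved, the sketch below shows a simplified health-probe loop that marks an endpoint degraded only after a full cycle of failed probes, which is the general shape of what happened when probes could not reach otherwise-healthy endpoints. The URL, probe count, and timeout are hypothetical, and this is not Traffic Manager's actual implementation.

    import urllib.request

    def probe(url, timeout_s=5):
        """Return True if the endpoint answers with HTTP 200."""
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.status == 200
        except Exception:
            return False

    def classify_endpoints(urls, probes_per_endpoint=3):
        """An endpoint is 'Degraded' only if every probe in the cycle fails."""
        status = {}
        for url in urls:
            failures = sum(0 if probe(url) else 1 for _ in range(probes_per_endpoint))
            status[url] = "Degraded" if failures == probes_per_endpoint else "Online"
        return status

    print(classify_endpoints(["https://example.com/health"]))  # hypothetical endpoint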

Mitigation: Engineers made mapping updates to the network route which mitigated the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.