Azure status history

May, 2019

13-5

Network Connectivity - Increased Latency

Summary of impact: Between 09:05 and 14:00 UTC on 13 May 2019, a subset of customers may have experienced latency or intermittent connectivity issues when accessing Azure resources. Retries would have worked for most customers during this time. 
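
As noted above, retries typically succeeded during the impact window. For callers that do not already use a retry policy, a minimal sketch of exponential backoff with jitter is shown below; the endpoint, timeout, and retry limits are illustrative assumptions, not values prescribed by Azure.

    import random
    import time
    import urllib.request
    import urllib.error

    def call_with_backoff(url, attempts=5, base_delay=1.0, timeout=10):
        """Call an HTTP endpoint, retrying transient failures with exponential backoff and jitter."""
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()
            except (urllib.error.URLError, TimeoutError):
                if attempt == attempts - 1:
                    raise  # out of retries; surface the error to the caller
                # Exponential backoff with jitter to avoid synchronized retry storms.
                delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                time.sleep(delay)

    # Example call (hypothetical endpoint):
    # data = call_with_backoff("https://myservice.example.com/health")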

Preliminary root cause: Engineers have determined that this was caused by a transient networking issue related to traffic management protocols.

Mitigation: Engineers revised the internal traffic management thresholds to ensure that all inbound traffic was correctly handled, and this mitigated the issue. Engineers performed proactive monitoring after this event to ensure no other unexpected traffic impact was experienced by customers.

Next steps: Engineers will continue to investigate to establish the underlying cause and to confirm that the related traffic management thresholds are fully understood, to prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

7-5

SQL Services - West Europe

Summary of impact: Between 10:57 and 12:48 UTC on 07 May 2019, a subset of customers using SQL Database, SQL Data Warehouse, Azure Database for PostgreSQL, Azure Database for MySQL, or Azure Database for MariaDB in West Europe may have experienced issues performing service management operations (such as create, update, rename, and delete) for resources hosted in this region.
In addition, customers may have been unable to see their list of databases in SQL Server Management Studio (SSMS). However, as this was a service management issue, the databases themselves would not have been impacted (despite not being visible from SSMS).
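
Because the impact was limited to service management (the control plane), direct data-plane connections to the databases continued to work even while tooling such as SSMS could not enumerate them. A hedged sketch of that kind of direct check is below; it assumes the third-party pyodbc package and an installed ODBC driver, and the server, database, and credential values are placeholders.

    import pyodbc  # assumption: pyodbc package plus an installed SQL Server ODBC driver

    # Placeholder values: substitute your own server, database, and credentials.
    conn_str = (
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=tcp:myserver.database.windows.net,1433;"
        "Database=mydb;Uid=myuser;Pwd=<password>;"
        "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
    )

    # A successful SELECT confirms the data plane is healthy even while
    # management operations (listing, create/update/delete) are failing.
    with pyodbc.connect(conn_str) as conn:
        row = conn.cursor().execute("SELECT 1").fetchone()
        print("data plane reachable:", row[0] == 1)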

Preliminary root cause: Engineers identified that a back-end database service responsible for processing service management requests in the region became unhealthy, preventing those requests from completing.

Mitigation: Engineers performed a manual restart of the affected back-end service, which restored its capacity to process requests and mitigated the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

2-5

RCA - Network Connectivity - DNS Resolution

Summary of impact: Between 19:29 and 22:35 UTC on 02 May 2019, customers may have experienced connectivity issues with Microsoft cloud services including Azure, Microsoft 365, Dynamics 365 and Azure DevOps. Most services were recovered by 21:40 UTC with the remaining recovered by 22:35 UTC.

Root cause: As part of planned maintenance activity, Microsoft engineers executed a configuration change to update one of the name servers for DNS zones used to reach several Microsoft services, including Azure Storage and Azure SQL Database. A failure in the change process resulted in one of the four name server records for these zones pointing to a DNS server that had blank zone data and returned negative responses. As a result, approximately 25% of queries for domains used by these services (such as database.windows.net) produced incorrect results, and reachability to these services was degraded. Consequently, multiple other Azure and Microsoft services that depend upon these core services were also impacted to varying degrees.

More details: This incident resulted from the coincidence of two separate errors.  Either error by itself would have been non-impacting:

1) Microsoft engineers executed a name server delegation change to update one name server for several Microsoft zones including Azure Storage and Azure SQL Database. Each of these zones has four name servers for redundancy, and the update was made to only one name server during this maintenance. A misconfiguration in the parameters of the automation being used to make the change resulted in an incorrect delegation for the name server under maintenance.
2) As an artifact of automation from prior maintenance, empty zone files existed on servers that were not the intended target of the assigned delegation. This by itself was not a problem as these name servers were not serving the zones in question.

Due to the configuration error in the change automation in this instance, the name server delegation made during the maintenance targeted a name server that had an empty copy of the zones. As a result, this name server replied with negative (NXDOMAIN) answers to all queries in the zones. Since only one of the four name server records for the zones was incorrect, approximately one in four queries for the impacted zones would have received an incorrect negative response.

DNS resolvers may cache negative responses for some period of time (negative caching), so even though the erroneous configuration was promptly fixed, customers continued to be impacted for varying lengths of time.
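
The arithmetic above can be illustrated with a small simulation (an illustrative sketch only, not Microsoft tooling): resolvers pick one of the four delegated name servers per query, so roughly a quarter of uncached queries hit the empty zone, and any resolver that cached the resulting NXDOMAIN keeps failing until that negative-cache entry expires.

    import random

    NAMESERVERS = ["ns1", "ns2", "ns3", "ns4"]   # ns4 represents the misconfigured, empty-zone server
    QUERIES = 100_000

    nxdomain = sum(1 for _ in range(QUERIES)
                   if random.choice(NAMESERVERS) == "ns4")
    print(f"uncached queries answered NXDOMAIN: {nxdomain / QUERIES:.1%}")  # roughly 25%

    # Negative caching: a resolver that cached the NXDOMAIN keeps returning it
    # until the negative TTL expires, even after the delegation is fixed.
    negative_ttl = 300          # seconds (example value, not the real TTL)
    fix_applied_at = 120        # seconds after the bad answer was cached
    residual_impact = max(0, negative_ttl - fix_applied_at)
    print(f"residual failures for up to {residual_impact} s after the fix")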

Mitigation: To mitigate the issue, Microsoft engineers corrected the delegation issue by reverting the name server value to the previous setting. Engineers verified that all responses were then correct, and the DNS resolvers began returning correct results within 5 minutes. Some applications and services that accessed the incorrect values and cached the results may have experienced longer restoration times until the expiration of the incorrect cached information.
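
One of the repair items listed below is per-zone, per-nameserver monitoring. A minimal sketch of that kind of check, using the third-party dnspython package (an assumption, not the tooling Microsoft uses), queries every delegated name server for the same record and flags any server whose answer disagrees with the others.

    import dns.resolver  # third-party package: dnspython

    def check_nameserver_consistency(zone, record, rtype="A"):
        """Query every delegated name server of `zone` directly and compare answers."""
        answers = {}
        for ns in dns.resolver.resolve(zone, "NS"):
            ns_host = str(ns.target)
            ns_ip = str(dns.resolver.resolve(ns_host, "A")[0])
            r = dns.resolver.Resolver(configure=False)
            r.nameservers = [ns_ip]
            try:
                answers[ns_host] = sorted(str(a) for a in r.resolve(record, rtype))
            except dns.resolver.NXDOMAIN:
                answers[ns_host] = ["NXDOMAIN"]
        if len({tuple(v) for v in answers.values()}) > 1:
            print("DRIFT DETECTED:", answers)   # one server disagrees with the others
        return answers

    # Example call (hypothetical names):
    # check_nameserver_consistency("windows.net", "example.database.windows.net")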

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1) Additional checks in the code that performs nameserver updates to prevent unintended changes [in progress].
2) Pre-execution modeling to accurately predict the outcome of the change and detect potential problems before execution [in progress].
3) Improve per-zone, per-nameserver monitors to immediately detect changes that cause one nameserver to drift from the others [in progress].
4) Improve DNS namespace design to better allow staged rollouts of changes with lower incremental impact [in progress].

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

2-5

Azure Maps - Mitigated

Summary of impact: Between 04:35 and 11:00 UTC on 02 May 2019, a subset of customers using Azure Maps may have experienced 500 errors when attempting to make calls to Azure Maps Rest APIs. 

Preliminary root cause: Engineers identified that some instances of a front-end service responsible for routing customer requests contained an incorrect software configuration which caused requests to fail. 

Mitigation: Engineers performed a change to the configuration, ensuring that requests routed successfully.

Next steps: Engineers will perform a full root cause analysis to prevent future occurrences.

1-5

Issue signing in to https://shell.azure.com

Summary of impact: Between 18:00 UTC on 30 Apr 2019 and 23:20 UTC on 01 May 2019, customers may have experienced issues signing in to https://shell.azure.com.
During this time, customers were still able to access Cloud Shell through the Azure portal.

Preliminary root cause: Engineers identified a mismatch between a recently updated configuration file and its corresponding code in shell.azure.com.

Mitigation: The Cloud Shell team developed, tested, and rolled out a new build which addressed and corrected the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. 

April, 2019

19-4

RCA - Availability degradation for Azure DevOps

Summary of impact: Between 03:30 and 15:20 UTC, and then again between 17:00 and 17:32 UTC on 19 Apr 2019, a subset of customers experienced issues connecting to Azure DevOps. These issues primarily affected customers physically located on the US East Coast and those whose Azure DevOps organizations are hosted there.

Root cause: During a planned maintenance event for Azure Front Door (AFD), a configuration change caused network routes to be incorrectly advertised. The AFD ring impacted by this maintenance hosted Azure DevOps and other Microsoft internal tenants. This may have resulted in timeouts and 500 errors for customers of Azure DevOps.
The maintenance event started at 03:30 UTC, at which point around 5-10% of requests began to be dropped. When the environment severely degraded at 14:44 UTC, engineers observed the start of the major impact. The maintenance event was on a ToR (Top of Rack) switch. The standard operating procedure is to take the environment offline by removing edge machines; by design, the MUXes stop advertising BGP (Border Gateway Protocol) routes so that traffic is no longer routed through them. Within this environment, one of the MUX load balancers was in an unhealthy state, but the BGP session between the load balancer and the ToR was still active. Consequently, the MUX was still active in the environment and the ToR was still advertising routes for it incorrectly.
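
The failure here was that a MUX which should have been drained was still advertising routes through the ToR. A hedged sketch of the kind of pre-maintenance gate this implies (illustrative only; the data sources, names, and prefixes are assumptions) compares the routes the ToR is still advertising against the routes that should have been withdrawn, and blocks the maintenance if any remain.

    def safe_to_proceed(advertised_routes, routes_expected_withdrawn):
        """Return True only if none of the routes scheduled for withdrawal
        are still being advertised by the ToR."""
        still_advertised = set(advertised_routes) & set(routes_expected_withdrawn)
        if still_advertised:
            print("BLOCK maintenance: routes still advertised:", sorted(still_advertised))
            return False
        return True

    # Illustrative data: the MUX being drained should no longer advertise 10.0.0.0/24.
    tor_advertisements = {"10.0.0.0/24", "10.0.1.0/24"}   # observed from the ToR
    expected_withdrawn = {"10.0.0.0/24"}                   # MUX under maintenance
    assert not safe_to_proceed(tor_advertisements, expected_withdrawn)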

Mitigation: The first impact window was mitigated by withdrawing the invalid route so that traffic would be routed correctly. The recurrence was caused by the maintenance process resetting the configuration back to its previous state, publishing the invalid route again. The second impact window was mitigated by re-applying the corrective change.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • reviewing and implementing more stringent measures for when we take environments offline for maintenance events.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

16-4

RCA - Networking Degradation - Australia Southeast / Australia East

Summary of impact: Between 07:12 and 08:02 UTC on 16 Apr 2019, a subset of customers with resources in Australia Southeast / Australia East may have experienced difficulties connecting to Azure endpoints, which in-turn may have caused errors when accessing Microsoft Services in the impacted regions. 

Root cause: Microsoft received automated notification alerts that the Australia East and Australia Southeast regions were experiencing degraded network availability from a select number of Internet Service Providers (ISPs). During this time, a subset of network prefix paths changed for these ISPs, which manifested in traffic not reaching destinations within the Australia East and Australia Southeast regions. The issue stemmed from a routing anomaly due to an erroneous advertisement of prefixes, received via an ExpressRoute circuit, to an Internet Exchange (IX).
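
The anomaly was a set of prefixes learned over an ExpressRoute circuit being re-advertised toward an Internet Exchange. A minimal sketch of the kind of prefix-filter check the repair items below describe (illustrative only; the prefixes and names are made up) compares what a peer announces against an explicit allow-list.

    import ipaddress

    def find_unauthorized_announcements(announced, allowed):
        """Flag announced prefixes that are not covered by any allowed prefix."""
        allowed_nets = [ipaddress.ip_network(p) for p in allowed]
        leaks = []
        for prefix in announced:
            net = ipaddress.ip_network(prefix)
            if not any(net.subnet_of(a) for a in allowed_nets):
                leaks.append(prefix)
        return leaks

    # Illustrative data only.
    allowed_from_peer = ["203.0.113.0/24"]
    announced_by_peer = ["203.0.113.0/24", "198.51.100.0/24"]   # second prefix is a leak
    print(find_unauthorized_announcements(announced_by_peer, allowed_from_peer))
    # -> ['198.51.100.0/24']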

Mitigation: Microsoft disabled the incorrect ExpressRoute peering. The IX also identified a high volume of traffic and automatically mitigated the issue by bringing down its peering. Once the peerings were brought down by Microsoft and the IX, availability was restored to the Australia East and Australia Southeast regions.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): 

  • Engage Internet Service Providers to add additional policies/protections to Internet facing routing infrastructure to block future routing anomalies [Complete] 
  • Add additional automated route mitigation steps within the Azure platform to reduce mitigation time [In Progress]
  • Investigate further route optimizations in the Azure/Microsoft ecosystem to inherently block future routing anomalies [In Progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

12-4

RCA - Cognitive Services

Summary of impact: Between 01:50 and 11:30 UTC on 12 Apr 2019, a subset of customers using Cognitive Services, including Computer Vision, Face, and Text Analytics, in West Europe and/or West Central US may have experienced 500-level response codes, high latency, and/or timeouts when connecting to resources hosted in these regions.

Root cause: Engineers determined a recent deployment introduced a software regression, manifesting in increased latency across two regions.

Mitigation: The issue was not detected in pre-deployment testing; however, once it was detected manually, engineers rolled back the recent deployment to mitigate the issue.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Improve pre-deployment tests to catch this kind of issue in the future [In Progress]
  • Improve monitoring to more closely represent production traffic patterns [In Progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

9-4

Virtual Machines - North Central US

Summary of impact: Between 21:39 UTC on 9 Apr 2019 and 06:20 UTC on 10 Apr 2019, a subset of customers using Virtual Machines in North Central US may have experienced connection failures when trying to access some Virtual Machines hosted in the region. These Virtual Machines may have also restarted unexpectedly. Some residual impact was detected, affecting connectivity between a small subset of recovered Virtual Machines and their underlying disk storage.

Root cause: The Azure Storage team made a configuration change to our back-end infrastructure in North Central US on 9 April 2019 at 21:30 UTC to improve performance and latency consistency for Azure Disks running inside Azure Virtual Machines. This change was designed to be transparent to customers. It was enabled following our normal deployment process, first to our test environment and then to lower-impact scale units, before being rolled out to the North Central US region. However, in this region the change hit bugs that impacted customer VM availability. Due to a bug, VM hosts were able to establish a session with the storage scale unit but hit issues when trying to receive or send data from or to the storage scale unit. This situation was designed to be handled by falling back to our existing data path, but an additional bug led to a failure in the fallback path and resulted in VM reboots.
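
The description above involves two code paths: a new storage session and a fallback to the existing data path, where a second bug broke the fallback. A simplified sketch of that structure (entirely illustrative; the function and object names are hypothetical) shows why the fallback path needs the same test coverage as the primary path.

    def read_block(block_id, new_path, legacy_path):
        """Try the new data path first; fall back to the legacy path on failure.
        If the fallback itself is broken, the error escalates to a host-level fault."""
        try:
            return new_path.read(block_id)          # bug #1: session established, but I/O fails
        except IOError:
            try:
                return legacy_path.read(block_id)   # bug #2: fallback path also fails
            except IOError as exc:
                # With both paths failing, the host cannot reach its disks,
                # which is what surfaced to customers as VM reboots.
                raise RuntimeError("storage unreachable on both paths") from exc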

Mitigation: The system automatically recovered. Some customer VMs that did not auto-recover required an additional manual recovery step.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • We have paused further deployment of this configuration change until the underlying bugs are fixed [complete].
  • Fix bugs that caused the background operation to have customer-facing impact [in progress]. 
  • Additional validation rigor to cover the scenario that caused the bugs to be missed in test environment [in progress].


March, 2019

29-3

RCA - SQL Database

Summary of impact: Between 16:45 and 22:05 UTC on 29 Mar 2019, a subset of customers may have experienced the following: 

  • Difficulties connecting to SQL Database resources in the East US, UK South, and West US 2 regions 
  • Difficulties connecting to Service Bus and Event Hubs resources in the East US and UK South regions 
  • Failures when attempting service management operations for App Service resources in the UK South and East US regions 
  • Failures when attempting service management operations for Azure IoT Hub resources 

Root cause: Azure SQL DB supports VNET service endpoints for connecting specific databases to specific VNETs. A component used in this functionality, called the virtual network plugin, runs on each VM used by Azure SQL DB, and is invoked at VM restart or reboot. A deployment of the virtual network plugin was rolling out worldwide. Deployments in Azure follow the Safe Deployment Practice (SDP), which aims to ensure deployment related incidents do not occur in many regions at the same time. SDP achieves this in part by limiting the rate of deployment for any one change. Prior to the start of the incident this particular deployment had already successfully occurred across multiple regions and for multiple days such that the deployment had reached the later stages of SDP, where changes are deployed to several regions at once. This deployment was using a VM restart capability, which occurs without impact to running workloads on those VMs.  

On 5 capacity units across 3 regions, an error in the plugin load process caused the VM to fail to restart. The virtual network plugin is configured as 'required to start', as its absence prevents key VNET service endpoint functionality from being used on that VM. The error led to repeated restart attempts, causing the VMs to continuously cycle. This occurred on enough VMs across those 5 capacity units that there were not enough resources available to provide placement for all databases in those units, causing those databases to become unavailable. The plugin error was specific to the hardware types and configurations on the impacted capacity units.
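
Because the plugin was marked 'required to start', each failed load triggered another restart attempt and the VMs cycled continuously. A hedged sketch of a restart guard (illustrative only, not the Azure SQL DB implementation) caps the number of attempts and escalates instead of cycling indefinitely.

    import time

    def start_with_guard(start_plugin, max_attempts=3, backoff_seconds=30):
        """Attempt to start a required plugin, but stop cycling after a few failures
        and escalate instead of restarting the VM indefinitely."""
        for attempt in range(1, max_attempts + 1):
            try:
                start_plugin()
                return True
            except Exception as exc:
                print(f"plugin start failed (attempt {attempt}/{max_attempts}): {exc}")
                time.sleep(backoff_seconds * attempt)   # linear backoff between attempts
        # Escalate to operators / automation rather than continuing to recycle the VM.
        print("plugin could not start; taking node out of rotation and raising an alert")
        return False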

The 5 affected capacity units included some of the databases used by Service Bus, Event Hubs, and App Services in those regions, which led to the impact to those services. An impacted database in East US held the global service management state for Azure IoT Hub, hence the broader impact to that service.

Mitigation: Impacted databases using the Azure SQL DB AutoDR capability were failed over to resources in other regions. Some impacted databases were moved to healthy capacity within the region. Full recovery occurred when a sufficient number of affected VMs were manually rebooted on the impacted capacity units. This brought enough healthy capacity online for all databases to become available.

Next steps: 
We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): 

  • Fix the error in deployment, which led to continuous recycling on the specific hardware types and configurations [in progress].
  • Repair deployment block system - it stopped the deployment in each capacity unit before the entire unit became unhealthy, but not soon enough [in progress].
  • Improve detection mechanism - it detected correlated impact at region level, but would have detected faster if each capacity unit was treated separately [in progress].
  • Improve service resiliency for IoT Hub [in progress].

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

28-3

RCA - Data Lake Storage / Data Lake Analytics

Summary of impact: Between 22:10 on 28 Mar 2019 and 03:23 UTC on 29 Mar 2019, a subset of customers using Data Lake Storage and/or Data Lake Analytics may have experienced impact in three regions:

  • East US 2 experienced impact from 23:40 UTC on 28 Mar to 03:23 UTC on 29 Mar 2019.
  • West Europe and Japan East experienced impact from 22:10 to 23:50 UTC on 28 Mar 2019.


Impact symptoms would have been the same for all regions:

  • Customers using Azure Data Lake Storage may have experienced difficulties accessing Data Lake Storage accounts hosted in the region. In addition, data ingress or egress operations may have timed out or failed.
  • Customers using Azure Data Lake Analytics may have seen U-SQL job failures.

Root cause:

Background: ADLS Gen1 uses a microservice to manage the metadata related to the placement of data. This is a partitioned microservice where each partition serves a subset of the metadata, and each partition is served by a fault-tolerant group of servers. Load across the various partitions is managed by an XML config file called partition.config. This is a master file containing information about all instances of the microservice; a per-region file is generated from it by a tool. (This tool is applied to all config files, not just partition.config.) Load-balancing actions are taken in response to the overall load in the region and the load on specific partitions, and their frequency depends on the overall load in the region. Currently, these load-balancing actions are not automated.

All (code and config) microservice deployments are staged and controlled such that a deployment goes to a few machines in a region, then to all the machines in that region, before moving to the next region. Software components called watchdogs are responsible for testing the service continually and raising errors, which will stop a deployment after the first scale unit or two and revert the bad deployment. The watchdogs can also raise alerts that result in engineers being paged. Moving to the next region requires both a successful deployment in the current region and the approval of the engineer.

What happened: Some of the microservice instances across different regions needed load balancing to continue providing the best experience and availability. An engineer made changes to the global partition.config file for the identified regions and triggered a deployment using the process described above. After observing success in a canary region, the engineer approved deployment to all remaining regions. After the deployment completed successfully, the engineer received alerts in two regions: Japan East and West Europe.

Investigation revealed a syntax error in partition.config. The tool that generates the per-region config file deleted the previous version of the region-specific partition.config file and failed to generate a new one. This did not cause any immediate problem for the metadata service, and the deployments succeeded. But later, when a new metadata service Front End (FE) process started for unrelated reasons, the missing partition.config caused the FE to crash. The deployment succeeded in the canary region and the other regions because no FE starts occurred there at the time, so the errors were not seen.
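
The first repair item under next steps below is a mandatory sanity check of partition.config at submit time. A minimal sketch of such a check (illustrative only; the file layout is an assumption, since the real schema is not described here) verifies that the per-region file exists and parses as XML before a deployment is allowed to proceed.

    import os
    import sys
    import xml.etree.ElementTree as ET

    def validate_partition_config(path):
        """Fail the deployment if the generated per-region partition.config is
        missing or not well-formed XML."""
        if not os.path.isfile(path):
            raise FileNotFoundError(f"{path} was not generated; failing the deployment")
        try:
            ET.parse(path)          # raises ParseError on a syntax error
        except ET.ParseError as exc:
            raise ValueError(f"{path} is not valid XML: {exc}") from exc
        # Additional semantic checks (e.g. every partition appears exactly once)
        # would go here to catch logic errors as well as syntax errors.

    if __name__ == "__main__":
        validate_partition_config(sys.argv[1])   # e.g. python check_config.py partition.config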

Mitigation: The engineer reverted the syntax error in the partition.config file. The new version of partition.config fixed the syntax error, mitigating those two regions as the FEs stopped crashing. However, this revealed a logic error in partition.config specific to the East US 2 region, which then caused failures in that region until the engineer fixed that error as well, restoring service availability.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Add a mandatory test that runs automatically at submit time and sanity-checks partition.config. This test would catch both the syntax error and the logic error.
  • Hardening the config deployment mechanism, so that it has built-in delay between regions instead of manual approvals.
  • Enhance the watchdogs so that they catch more errors and cause deployments to fail automatically and revert.
  • Enhance microservice logic to deal more gracefully with errors in partition.config.
  • Fix the tool that generates the per-region config file so that it no longer deletes the output file on failure; instead, have it raise an error to fail the deployment.
  • Move partition.config to a data folder with separate file for each region, so that an error in one region doesn’t affect other regions.


Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

27-3

RCA - Service Management Failures - West Europe

Summary of impact: Between approximately 15:20 UTC on 27 Mar 2019 and 17:30 UTC on 28 Mar 2019, a subset of customers may have received failure notifications when performing service management operations such as create, update, deploy, scale, and delete for resources hosted in the West Europe region.

Root cause and mitigation:

Root Cause: Regional Network Manager (RNM) is a core component of the network control plane in Azure. RNM is an infrastructure service that works with another component called the Network Resource Provider (NRP) to orchestrate the network control plane and drive the networking goal state on host machines. In the days leading up to the incident, peak load in RNM’s partition manager sub-component had been increasing steadily due to organic growth and load spikes. In anticipation of this, the engineering team had prepared a code improvement to the lock acquisition logic to make queue draining more efficient and improve performance. On the day of the incident, before the change could be deployed, the load increased sharply and was concentrated on a few subscriptions, pushing RNM to a tipping point. The load caused operations to time out, resulting in failures. Because most of the load was concentrated on a few subscriptions, lock contention arose where one thread was waiting on another, causing a slow drain of operations. The gateway component in RNM started to aggressively add the failures back into the queue as retries, leading to a rapid snowball effect. Higher layers in the stack, such as ARM and the Compute Resource Provider (CRP), further aggravated the load with their own retries.
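
The snowball described above is retry amplification: failed operations were re-queued aggressively, and upstream layers retried on top of that. A hedged sketch of a retry budget (illustrative only, not the RNM gateway code) stops adding retries to the queue once the recent failure rate crosses a threshold.

    import collections
    import time

    class RetryBudget:
        """Allow retries only while the recent failure rate stays under a threshold,
        so a struggling backend is not flooded with re-queued work."""
        def __init__(self, window_seconds=60, max_failure_ratio=0.2):
            self.window = window_seconds
            self.max_ratio = max_failure_ratio
            self.events = collections.deque()        # (timestamp, succeeded) pairs

        def record(self, succeeded):
            now = time.monotonic()
            self.events.append((now, succeeded))
            while self.events and now - self.events[0][0] > self.window:
                self.events.popleft()

        def allow_retry(self):
            if not self.events:
                return True
            failures = sum(1 for _, ok in self.events if not ok)
            return failures / len(self.events) <= self.max_ratio

    # Usage sketch: instead of unconditionally re-queuing a failed job, a gateway
    # would check budget.allow_retry() first and drop or park the job otherwise.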

Mitigation: To mitigate the situation and restore RNM to its standard operating levels, the retries had to be stopped. A hotfix to stop the gateway component in RNM from adding retry jobs to the queue was successfully applied. In addition, the few subscriptions that were generating peak load were blocked from sending control plane requests to West Europe, and the timeout value for obtaining locks was extended to help operations succeed. As a result, RNM recovered steadily and the load returned to operating levels. Finally, the originally planned code change was rolled out to all replicas of RNM, bringing it back to its standard operating levels and improving its ability to handle higher loads.

Next steps:
We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. This includes (but is not limited to):

  • Reduce lock contention in RNM
  • Improve RNM performance and scale out capacity with enough headroom
  • Add mechanisms to throttle workload before the RNM service hits its tipping point

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

February, 2019

27-2

RCA - USGov Virginia - Service Availability

Summary of impact: Between 07:38 and 09:50 EST on 27 Feb 2019, a subset of customers may have experienced degraded performance or timeouts while accessing Azure resources.

Root cause: During routine electrical equipment maintenance at a datacenter, the equipment responsible for transferring load to our redundant power source failed, causing a temporary power loss to a subset of racks and devices within the USGov Virginia datacenter. This resulted in cascading impact to dependent Azure services.

During this event, an STS (static transfer switch) failed during a load transfer, causing the load to rapidly shift back to its primary source and tripping a circuit breaker to prevent damage to the equipment. The dual failure resulted in a drop in power on both feeds powering the server equipment in part of the datacenter.

Mitigation: Site engineers were able to bring up the redundant power system and restore power to the affected racks and devices while repairs were made to the defective component, which was then brought back online. Recovery of dependent services was performed manually, and engineers subsequently confirmed mitigation once connectivity was fully restored. Engineers actively monitored the restoration process, and full service restoration was confirmed at 09:50 EST, although most services would have recovered before this time.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. This includes, but is not limited to:

  • Review the pre-checks and validation used on electrical equipment prior to any maintenance and add steps to validate the equipment functionality [in progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

22-2

RCA - Virtual Machines (Classic)

Summary of impact: Between 07:18 and 10:50 UTC on 22 Feb 2019, a subset of customers using Virtual Machines (Classic) or Cloud Services (Classic) may have experienced failures or high latency when attempting service management operations. Retries would have been successful for customers.

Some customers may also have experienced downstream impact to API Management and Backup.

Root cause and mitigation: The Azure Service Management layer (ASM), which manages Classic VMs and Classic Cloud Services, is composed of multiple services and components. During this event, the ASM front ends were running low on available resources, causing some incoming requests to time out. Engineers established that a platform service update had introduced a regression that surfaced only at high incoming traffic volumes. The ASM update went through the standard extensive testing, and the fault did not manifest itself when the update transited first through the pre-production environments and later through the first production slice, where it baked for multiple days. There was no indication of such a failure in those environments during this time.

As a mitigation, the front ends were scaled out, which restored service health.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Refine validation and service upgrade processes to prevent future occurrences with customer impact [complete].
  • Enhance telemetry for detecting machine-level resource exhaustion to prevent impact to customer functionality [complete].

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

  
20-2

RCA - SQL Services - West Europe

Summary of impact: Between 09:40 UTC and 17:15 UTC on 20 Feb 2019 a subset of customers using SQL Services (inclusive of Azure DB for MariaDB, MySQL, and PostgreSQL, SQL DB, and SQL Data Warehouse) in West Europe experienced issues performing service management operations and/or experienced service availability issues following scaling operations. Symptoms may have included but were not limited to:
  • Service management operations returning failure notifications
  • Server and database create, drop, and scale operations resulting in "deployment failed" errors
  • Failures when creating databases through SQL script
  • Databases becoming unavailable after performing scaling operations

Note: This issue was impacting all types of SQL service deployments (e.g. Elastic Pool, Single and Managed Instances).

Root cause and mitigation: Engineers observed that the SQL Control Plane reached an operational threshold, causing service management failures for dependent services and service availability failures for scale operations. The logic that selects a ring (a unit of capacity within a region) to place a service instance during creation or an SLO change could incorrectly pick the same overloaded ring for incoming placement operations. It would correctly recognize rings close to full capacity and initiate a placement retry; however, it would forget the prior selection, restart the whole process from the beginning, and reconsider the previously rejected ring again. Engineers manually offloaded a backlog of operations, which allowed traffic to resume and mitigated the issue. The mitigation involved removing rings close to full capacity from the selection list, first manually and eventually through automation.
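
The repair items below include fixing the selection logic to remember rejected rings and picking the next ring stochastically. A minimal sketch of that idea (illustrative only, not the SQL Control Plane implementation) is shown here.

    import random

    def place_instance(rings, capacity_threshold=0.9):
        """Pick a ring for placement, excluding rings already found to be near
        capacity instead of reconsidering them on every retry."""
        rejected = set()
        candidates = list(rings)
        while candidates:
            ring = random.choice(candidates)       # stochastic pick avoids herding on one ring
            if rings[ring] < capacity_threshold:
                return ring
            rejected.add(ring)                     # remember the rejection ...
            candidates = [r for r in rings if r not in rejected]   # ... and never retry it
        raise RuntimeError("no ring with available capacity in this region")

    # Illustrative data: ring utilization as a fraction of capacity.
    utilization = {"ring-1": 0.95, "ring-2": 0.97, "ring-3": 0.60}
    print(place_instance(utilization))   # -> 'ring-3'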

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
  • Introduce alerts for this type of issue
  • Fix the selection logic to avoid in-order ring traversal
  • Introduce a stochastic scheme to pick the next available ring for placement
  • Create stress tests simulating this situation to verify the solution and avoid regressions going forward

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey