Azure status history

December 2017

13/12

Service Bus and Event Hubs - Australia East

Summary of impact: Between 04:00 and 06:40 UTC on 13 Dec 2017, a limited subset of customers using Service Bus and Event Hubs in Australia East may have experienced intermittent issues when connecting to resources from the Azure Management Portal or programmatically. Services offered within Service Bus, including Service Bus Queue and Service Bus Topics may have also been affected.

Preliminary root cause: Engineers determined that certain instances of a front-end service responsible for processing Service Bus and Event Hub operation requests had exceeded an operational threshold, preventing requests from completing in a timely manner.

Mitigation: Engineers scaled out additional instances to improve request processing, and monitored front-end telemetry to confirm mitigation.
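
As a rough illustration of this kind of threshold-driven mitigation (the names, threshold and step size below are hypothetical placeholders, not the actual Service Bus tooling), a scale-out loop can watch front-end latency telemetry and add instances once an operational threshold is crossed:

```python
# Hypothetical sketch: scale out when p99 latency telemetry crosses a threshold.
# The threshold, step size and telemetry source are illustrative placeholders.
import random

LATENCY_THRESHOLD_MS = 500   # assumed operational threshold
SCALE_OUT_STEP = 2           # instances added per scaling action

def get_p99_latency_ms() -> float:
    # Stand-in for reading real front-end telemetry.
    return random.uniform(100, 900)

def scale_out(current_instances: int, step: int) -> int:
    # Stand-in for provisioning additional front-end instances.
    return current_instances + step

instances = 4
for _ in range(10):  # a few evaluation cycles
    p99 = get_p99_latency_ms()
    if p99 > LATENCY_THRESHOLD_MS:
        instances = scale_out(instances, SCALE_OUT_STEP)
        print(f"p99={p99:.0f} ms above threshold; scaled out to {instances} instances")
    else:
        print(f"p99={p99:.0f} ms within threshold; {instances} instances")
```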

Next steps: Engineers will continue to investigate to establish the full root cause.

6/12

App Service - West Europe

Starting at 11:58 UTC on 04 Dec 2017, engineers identified that a limited subset of customers using App Service in West Europe may have received HTTP 500-level response codes, experienced timeouts, or seen high latency when accessing App Service (Web, Mobile and API Apps) deployments hosted in this region. Engineers have observed improvements after applying extended mitigation steps. Due to the limited scope of the current impact, further communications will be provided directly to the affected customers via their management portal.

November 2017

21/11

RCA - Storage Service Management - Multiple Regions

Summary of impact: From approximately 18:28 to 19:45 UTC and again from 21:09 to 21:27 UTC on 21 Nov 2017, a subset of customers may have received failure notifications when performing service management operations on their Storage accounts. Retries may have been successful. Existing storage account data operations, such as read, write, update, and delete, were not impacted. Azure Monitor customers may have also seen failures in API calls to enable diagnostic settings. Attempts to list keys for Storage accounts using Azure Key Vault may have also failed during this time.

Root cause and mitigation: The Storage Resource Provider (SRP) service handles storage account management operations for all regions (create/update/delete/list). As part of service management, SRP also runs a number of background jobs ("scrubbers") which validate the consistency of our metadata and data structures. These scrubbers run periodically and validate the service state. One of the scrubbers, used to validate the information stored in the service and its geo-replicated backup, had a bug which slowed down the main request processing in SRP. This slowdown caused the concurrent request count to build up in SRP and resulted in aggressive throttling of customer requests. During the first throttling event, between 18:28 and 19:45 UTC, the engineering team turned down a number of other background operations, which stabilized the service temporarily. Since the faulty scrubber was still enabled, throttling relapsed between 21:09 and 21:27 UTC. The engineering team then identified the source of the problem and disabled the faulty scrubber, which restored service health.
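
A generic way to keep background validation jobs from starving foreground request processing is to cap the share of capacity they may consume. The sketch below is a minimal illustration of that idea only; it is not the actual SRP implementation.

```python
# Generic sketch: bound the capacity a background scrubber may use so that
# foreground request processing is never starved. Not the actual SRP code.
import threading
import time

SCRUBBER_SLOTS = 1                       # background work deliberately capped
scrubber_gate = threading.BoundedSemaphore(SCRUBBER_SLOTS)

def handle_request(request_id: int) -> None:
    time.sleep(0.01)                     # stand-in for real request processing

def scrub_metadata_batch(batch_id: int) -> None:
    # A scrubber batch runs only when it can acquire one of the limited slots,
    # so a slow or buggy scrubber cannot consume all processing capacity.
    with scrubber_gate:
        time.sleep(0.05)                 # stand-in for consistency validation

workers = [threading.Thread(target=handle_request, args=(i,)) for i in range(8)]
workers += [threading.Thread(target=scrub_metadata_batch, args=(b,)) for b in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```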

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Transform SRP into a regional service so that the impact of bugs is limited to a single region rather than multiple regions [IN PROGRESS].
2. Fix the bug which caused the issue [COMPLETE].


19/11

RCA - Network Infrastructure - North Central US

Summary of impact: Between 05:37 and 11:45 UTC on 19 Nov 2017, a subset of customers using Virtual Machines (VMs), Storage, HDInsight, Visual Studio Team Services (VSTS) or Azure Search in North Central US may have experienced high latency or degraded performance when accessing resources hosted in this region. Engineers investigated and determined the root cause was due to increased packet drops in our network infrastructure, which manifested in increased network retries and subsequent impact to downstream services.

Workaround: Customers may choose to leverage Availability Sets to provide additional redundancy for their applications. To learn more about this and other best practices for application resiliency, please refer to the Resiliency checklist in the Azure documentation. Azure Advisor also provides personalized High Availability recommendations to improve application availability.
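
For reference, the sketch below shows one way to create an availability set with the Azure Python SDK (azure-identity and azure-mgmt-compute); the subscription ID, resource group, region and domain counts are placeholders to adapt to your environment.

```python
# Sketch: create an availability set with the Azure Python SDK
# (pip install azure-identity azure-mgmt-compute). All names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "<subscription-id>"            # placeholder
client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

avset = client.availability_sets.create_or_update(
    resource_group_name="my-resource-group",     # placeholder
    availability_set_name="my-availability-set", # placeholder
    parameters={
        "location": "northcentralus",
        "platform_fault_domain_count": 2,        # spread VMs across fault domains
        "platform_update_domain_count": 5,       # and across update domains
        "sku": {"name": "Aligned"},              # required for managed disks
    },
)
print(avset.id)
# VMs created in this availability set are placed across fault and update
# domains, which limits the impact of a single hardware fault.
```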

Root cause and mitigation: To achieve redundancy, Microsoft builds networks with multiple routers in each layer of the network. When one router in a layer drops packets, a subset of the flows through the network is affected. Applications that retry failed requests or failed sessions on a new TCP or UDP connection, using different port numbers, will likely see success on the retry, as the new flow will likely be handled by another router in the affected layer. For this issue, a newly activated router that provides connectivity into and out of the region experienced a hardware fault, potentially affecting up to 12% of traffic flow. Auto-remediation did not occur due to incorrect automatic severity categorization of the hardware fault. Microsoft has automatic systems that use synthetic traffic to locate sources of packet loss, but in this instance the systems did not succeed in identifying the router that had a hardware fault. Engineers were required to use manual methods to identify the faulted router and isolate it, which increased the time to mitigate the incident.
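
The retry behaviour described above can be illustrated with a small sketch: each attempt opens a brand-new TCP connection, so the flow gets a new ephemeral source port and is likely to hash onto a different network path. The endpoint below is a placeholder.

```python
# Sketch: retry on a brand-new TCP connection so each attempt uses a new
# ephemeral source port (and therefore a likely different ECMP path).
# The endpoint is a placeholder.
import socket
import time

HOST, PORT = "example.contoso.com", 443   # placeholder endpoint
MAX_ATTEMPTS = 4

def connect_with_retry() -> socket.socket:
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            # A fresh socket per attempt changes the flow's 5-tuple, so the
            # network will likely hash it to a different router.
            return socket.create_connection((HOST, PORT), timeout=5)
        except OSError as error:
            last_error = error
            time.sleep(0.1 * (2 ** attempt))  # simple exponential backoff
    raise ConnectionError(f"all {MAX_ATTEMPTS} attempts failed") from last_error
```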

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Updating alarming to account for this new class of device fault scenarios. [In Progress]
2. Enabling auto-mitigation for this class of device fault. [In Progress]
3. Improving synthetic telemetry to maintain and leverage accurate state of Networking devices. [In Progress]

11/11

RCA - SQL Database - North Europe and West Europe

Summary of impact: Between 04:45 and 18:51 UTC on 11 Nov 2017, a subset of customers using SQL Database in North and West Europe may have experienced issues provisioning new databases and managing existing databases via the Azure Management Portal and PowerShell. During this time, customers may have also experienced high latency resulting in errors or timeouts ("Gateway Timeout"). Downstream impact may also have been experienced by HDInsight customers, who may have seen deployment failure notifications for the creation of new HDInsight clusters in these regions. This did not affect the availability of existing resources hosted in these regions.

Root cause and mitigation: A race condition exhausted shared resources on SQL scale units in North Europe and West Europe, which may have caused increased latency when customers using Azure SQL Database performed service management operations in these regions. Azure SQL Database has announced a preview feature, Transparent Data Encryption (TDE) with Bring Your Own Key (BYOK), which enables customers to control the keys used for encryption at rest with TDE by storing these master keys in Azure Key Vault. Engineers discovered that this feature created a cycle which caused a cross-scale-unit validation workflow to continuously repeat the same validation between two scale units. This resulted in a large number of additional service requests, preventing service management operations from completing successfully in the regions. Azure engineers mitigated the impact of the incident by: 1) disabling the specific check, which also prevents any further impact from this defect across the platform, and 2) removing the excessive duplicate requests to free up shared resources and allow service management operations to succeed.
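
The second mitigation step can be illustrated with a generic deduplication sketch: a validation for the same source, target and key is only issued once per cooldown window, so a cycle cannot generate excessive duplicate requests. This illustrates the idea only; it is not the actual Azure SQL Database workflow.

```python
# Generic sketch: suppress duplicate cross-scale-unit validation requests by
# issuing a given (source, target, key) validation at most once per cooldown
# window. Illustration only; not the actual Azure SQL Database workflow.
import time

VALIDATION_COOLDOWN_SECONDS = 300.0
_recent_validations: dict[tuple[str, str, str], float] = {}

def should_validate(source_unit: str, target_unit: str, key_id: str) -> bool:
    now = time.monotonic()
    last = _recent_validations.get((source_unit, target_unit, key_id))
    if last is not None and now - last < VALIDATION_COOLDOWN_SECONDS:
        return False                      # duplicate request suppressed
    _recent_validations[(source_unit, target_unit, key_id)] = now
    return True

# A cycle that keeps requesting the same validation only issues it once
# per cooldown window (prints True, then False four times).
for _ in range(5):
    print(should_validate("scale-unit-a", "scale-unit-b", "tde-byok-key"))
```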

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Improve alerting and monitoring to detect abnormal traffic. [In progress]
2. Improve automated throttling to prevent the workflow from causing excessive requests. [In progress]

10/11

RCA - Virtual Machines and Redis Cache - North Europe and West Europe

Summary of impact: Between 04:30 and 09:45 UTC on 10 Nov 2017, a subset of customers using Virtual Machines, Redis Cache, or Azure Load Balancer (Internal Load Balancer VIPs) in North Europe and West Europe may have experienced difficulties connecting to resources hosted in these regions.

Root cause and mitigation: As part of our Safe Deployment Principles, a change was being made to one of our Early User Acceptance Environments. This had already been performed in other testing and development environments without issues. The change being made was to migrate this EUAP environment to use the next generation of Azure Load Balancer service (ALB). The migration worked as expected for the EUAP environment, but due to a tooling bug, two production regions were partially impacted at the time of the change. ALB hosts both external Public IP VIPs and Azure Internal Load Balancer VIPs which consist of reusable ranges of IP addresses that are only available within a region. The bug resulted in withdrawing an IP range from the North Europe and West Europe regions, which resulted in some previously active Internal Load Balancer (ILB) VIPs becoming unrouteable as they were no longer advertised via Border Gateway Protocol (BGP) inside those regions. Once detected, the IPs were added back to the regions, and steps were taken to restore the subset of VIPs impacted in the regions.

Next steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure that such incidents do not occur in the future, and in this case it includes (but is not limited to):

1. Investigate the tooling bug and remediate. [In Progress]
2. Improve monitoring and alerting when such events arise. [In Progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

9/11

DevTestLab - Multi Region

Summary of impact: Between 18:00 UTC on 06 Nov 2017 and 02:09 UTC on 09 Nov 2017, a subset of customers using Azure DevTest Labs may have been unable to view a subset of existing Labs resources in their subscription. This issue did not impact the availability of these existing Labs.

Preliminary root cause: Engineers determined that instances of a backend service became unhealthy, preventing view requests from completing.

Mitigation: Engineers manually restored the backend service which has mitigated the issue.

Next steps: Engineers will continue to investigate to prevent future occurrences. If customers experience residual impact, please refer to the additional details in the Azure Management Portal for guidance.

6/11

RCA - Network Infrastructure - North Central US

Summary of impact: Between 19:33 UTC and 21:07 UTC on 6 Nov 2017, a subset of customers in North Central US experienced degraded performance, network drops, or timeouts when accessing Azure resources hosted in this region. A subset of customers based in the region may have experienced issues accessing non-regional resources, such as Azure Active Directory and the Azure portal. In addition, a subset of customers using App Service / Web Apps in South Central US would have experienced similar issues due to a bug discovered in how Web Apps infrastructure components utilize North Central US resources for geo-redundancy purposes. During a planned capacity addition maintenance window in the North Central US region, a tag was introduced to aggregate routes for a subset of the North Central region. This tagging caused the affected prefixes to be withdrawn from regions outside of North Central US. Automated alerting detected the drop in availability, and engineering teams correlated the issue with the capacity addition maintenance; however, automated rollback did not mitigate the issue. The rollback steps were manually validated, which fixed the route advertisement and restored connectivity between the North Central US facilities.

Root cause and mitigation: A capacity augmentation workflow in a datacenter in the North Central US region contained an error that resulted in too many routing announcements being assigned a Border Gateway Protocol (BGP) community tag that should have only been used for routes that stay within the North Central US region. When aggregate IP prefixes intended for distribution to the Microsoft core network were incorrectly tagged for local distribution only, communication between this datacenter and all our other Azure regions was interrupted. The addition of new capacity to a region has been the cause of incidents in the past, so both automated systems and manual checks are used to verify that there is no impact. The tool used in this turn-up did attempt a rollback, but due to a second error the rollback workflow also applied the incorrectly tagged prefix policy, so the rollback did not resolve the problem. Because rollback did not resolve the incident, engineers lost time looking for other potential causes before determining the true root cause and fully rolling back the capacity addition work. This increased the time required to mitigate the issue.
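
The rollback pitfall described above can be illustrated with a generic sketch: after a change is applied, reachability is verified from probes in other regions, and the rollback itself is re-verified so that a faulty rollback is also caught. Probe endpoints and the apply/rollback callables are placeholders, not Microsoft tooling.

```python
# Sketch: verify a routing change from probes in other regions and re-verify
# after rollback so that a faulty rollback is caught too. Probe endpoints and
# the apply/rollback callables are placeholders.
import socket

PROBE_TARGETS = [
    ("probe-from-eastus.contoso.com", 443),   # placeholder probe endpoints
    ("probe-from-westus.contoso.com", 443),
]

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def verified_from_other_regions() -> bool:
    return all(reachable(host, port) for host, port in PROBE_TARGETS)

def apply_routing_change(apply, rollback) -> bool:
    apply()
    if verified_from_other_regions():
        return True
    rollback()
    # If verification still fails after rollback, the rollback itself is
    # suspect and must be escalated rather than assumed to have worked.
    if not verified_from_other_regions():
        raise RuntimeError("rollback did not restore reachability; escalate")
    return False
```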

Workaround: During this incident, applications designed with geo-redundancy would have been able to continue operating with no impact, or with reduced impact. For further information, please refer to the Azure guidance on best practices for cloud applications and design patterns.

Next steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure that such incidents do not occur in the future, and in this case it includes (but is not limited to):

1. Model any scheduled work that changes routing policy in CrystalNet. [In progress]
2. Update verification steps for routing policy changes to include checks from other regions. [In progress]
3. Isolate the dependency upon the geo-redundant Web Apps resources such that they cannot impact the health of the primary resources. [In progress]

5/11

SQL Database - UK West

Summary of impact: Between 04:07 and 08:46 UTC on 05 Nov 2017, a subset of customers using SQL Database in UK West may have experienced difficulties connecting to resources hosted in this region. New connections to existing databases in this region may have resulted in errors or timeouts. Customers may have also received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in this region.

Preliminary root cause: Engineers determined that a backend storage resource became unhealthy, causing connectivity and Service Management issues for some customers.

Mitigation: Engineers performed an update to restore the backend storage resource, thus mitigating the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

2/11

RCA - Storage - Service Management Operations

Summary of impact: Between 11:40 and 16:48 UTC on 02 Nov 2017, a subset of customers may have experienced issues with service management functions (Create, Update, Delete, GetAccountProperties, etc.) for their Azure Storage resources. Storage customers may have been unable to provision new Storage resources or perform service management operations on existing resources. Other services with dependencies on Storage may have also experienced impact, such as Virtual Machines, Cloud Services, Event Hubs, Backup, Azure Site Recovery, Azure Search and VSTS Load Testing. The impact of this issue was limited to service management functions; service availability for existing resources was not affected. Engineers received alerts and investigated the issue. The issue was traced to high disk space utilization, triggered by the staging of OS updates under an unexpected circumstance, which impacted the checkpoint processes of the Storage Resource Provider (SRP) service. The incident was mitigated by removing the unexpected additional staged images, allowing the SRP checkpoint processes to succeed in time.

Root cause and mitigation: As part of the scheduled monthly Guest OS update for the Azure infrastructure, OS images were staged to each scale unit. During the October OS update cycle, engineers detected that the initial scheduled OS build had a .NET application compatibility issue after the build was staged to a subset of scale units across production. The staging of this build was subsequently stopped, but the image was not removed from the staging list. This resulted in additional images being staged to each node in the scale unit, leading to low disk space on some nodes. The Storage Resource Provider (SRP) service handles storage account management operations for all regions (create/update/delete/list). SRP is a Paxos-based service and uses disk for state checkpointing. Due to the increase in disk space utilization triggered by the OS update staging described above, the checkpointing process, which is critical for service operation, failed. To mitigate the issue, engineers freed the required disk space, allowing state checkpointing to succeed and normal processing to resume for all service management operation requests. Updates for this incident were communicated regularly. Due to a process error, a subset of customers relying on service health alerts from Azure Monitor or using the Azure Service Health experience in the Azure Management portal would not have been notified. The Azure service communications team sincerely apologizes for any inconvenience this may have caused. Customers are encouraged to enable Azure Monitor service health alerts for improved incident notifications.
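
As a generic illustration of this failure mode (not the actual SRP code), a checkpoint write can be guarded by a free-disk-space check so that the condition is surfaced before the write fails, giving time to clean up non-essential files such as staged images. Paths and the safety margin below are placeholders.

```python
# Generic sketch (not the actual SRP code): guard a state-checkpoint write
# with a free-disk-space check so the condition is surfaced before the write
# fails. Paths and the safety margin are placeholders.
import shutil

MIN_FREE_BYTES = 10 * 1024**3            # assumed safety margin: 10 GiB

def can_checkpoint(directory: str) -> bool:
    return shutil.disk_usage(directory).free >= MIN_FREE_BYTES

def write_checkpoint(state: bytes, directory: str = "/var/checkpoints") -> None:
    if not can_checkpoint(directory):
        # Raising early gives operators a chance to remove non-essential
        # files (such as unneeded staged OS images) before impact occurs.
        raise RuntimeError("insufficient free disk space for checkpoint")
    with open(f"{directory}/state.bin", "wb") as f:
        f.write(state)
```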

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Implement automation for greater resiliency of the state checkpointing process. [In progress]
2. Storage Resource Provider resiliency improvements, including regional resiliency to limit future impact scope. [In progress]
3. Review the OS staging process and add reverting scenarios to minimize impact during OS staging. [In progress]
4. Enhance tooling to provide more reliable customer communications across all channels. [In progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.