Azure status history

November 2018

14/11

Azure DevTest Labs - Mitigated

Summary of impact: Between 20:29 and 22:28 UTC on 14 Nov 2018, a subset of customers using Azure DevTest Labs may have received failure notifications when attempting to access their Labs via the Azure Portal.

Preliminary root cause: Engineers determined that a recent deployment task contained an update which caused calls to an internal API to fail.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue. 

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

13/11

App Service & Function Apps

Summary of impact: Between 11:10 and 12:25 UTC on 13 Nov 2018, a subset of customers using App Service and/or Function Apps may have received intermittent HTTP 500-level response codes, or experienced timeouts or high latency, when accessing App Service (Web, Mobile and API Apps) deployments hosted in the affected regions. Impacted customers may have also seen issues with their Azure App Service Scaling settings.

Preliminary root cause: Engineers determined that unhealthy instances of a backend application caused a subset of servers to become unstable, preventing requests from completing.

Mitigation: Engineers performed a hotfix to patch these servers, in order to mitigate the issue. 

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

8/11

Azure App Service and Functions App - Mitigated

Summary of impact: Between 22:48 UTC on 08 Nov 2018 and 13:30 UTC on 09 Nov 2018, a subset of customers using App Services may have experienced errors when accessing the Azure Functions blade, or experienced issues when accessing app settings in the portal.

Preliminary root cause: Engineers identified that a recent deployment to a specific regional instance of the Functions portal was the root cause.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue. 
 

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

2/11

Azure Portal Timeouts and Latency

Summary of impact: Between 23:05 UTC on 02 Nov 2018 and 00:21 UTC on 03 Nov 2018, some customers may have experienced high latency or timeouts when viewing resources or loading blades through the Azure Portal.

Preliminary root cause: Engineers determined that a recent deployment task introduced an updated DNS record that caused the backend service hosting portal blades to become unhealthy, preventing requests from completing.

Mitigation: Engineers performed a configuration change to revert the impacting update.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

October 2018

27/10

Virtual Machines/VM Scale Sets - West US 2

Summary of impact: Between 02:50 and 10:37 UTC on 27 Oct 2018, a subset of customers using Virtual Machines and/or Virtual Machine Scale Sets in West US 2 may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in this region. 

Preliminary root cause: Engineers determined that some instances of a backend service responsible for interservice communication became unhealthy, which in turn caused requests between Storage and dependent resources to fail.

Mitigation: Engineers performed a change to the backend service to achieve mitigation. Platform telemetry was then observed and service team engineers from affected services confirmed all requests completed successfully. 

Next steps: Engineers will continue to investigate to establish the full root cause to prevent future occurrences.

24/10

RCA - Networking in West US

Summary of impact: Between 22:40 UTC on 24 Oct 2018 and 00:03 UTC on 25 Oct 2018, a subset of customers may have experienced degraded network performance and/or difficulties connecting to resources in the West US region. 

Root cause: A network device connecting a datacenter in the West US region experienced a fault during routine fiber maintenance. Azure Networking lost a subset of capacity between the affected data center and other facilities in the West US region.  The failed network device also began silently dropping a portion of the flows that traversed it. 

Mitigation: The incident was mitigated by rebalancing traffic across the remaining links.  The incident was resolved via restoration of the fiber and the optical system.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. Steps specific to this incident include:

• Evaluate faster link level bidirectional failure detection [in progress]
• Evaluate escalated timeline for higher capacity links for this data center [in progress]
• Expand existing black hole detection scenarios [in progress]
• Review process and validations after fiber plant maintenance [in progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

17/10

Latency between North Europe and North America

Summary of impact: Between 14:30 and 16:35 UTC on 17 Oct 2018, a subset of customers with traffic going between the North Europe datacenter and North America regions may have intermittently experienced degraded performance, and in some rare instances network drops or timeouts, when accessing Azure resources.

Preliminary root cause and mitigation: A fiber cut was determined as the underlying root cause, and engineers balanced traffic to minimize any customer impact. The affected fiber was primarily responsible for replication traffic, so customer impact would have been minimal.

Next steps: Engineers are engaged as the fiber repair is underway. Once the site is stable and thorough testing has been completed, engineers will route traffic back through the existing paths.

16/10

RCA - Azure Service Availability - France Central

Summary of impact: Between 13:57 and 16:45 UTC on 16 Oct 2018, a subset of customers may have experienced difficulties connecting to resources in the France Central region. This impact was related to a number of Storage and Compute resources that experienced an unexpected power-down during this time.

Root cause and mitigation: A corrective maintenance activity related to the fire alarm system was taking place in one of the isolated zones (Colos) in the datacenter at the start of this event. While working on the system, the maintenance engineer inadvertently triggered the fire alarm for this Colo, which in normal circumstances would have alerted staff, but would not have caused any impact to running services in the Colo. However in this instance a separate legacy fire suppression system, with an automated power shutdown process, was also triggered by the fire alarm. The activation of this system triggered a general power shutdown process, and a number of storage and compute resources in this single Colo became unavailable as a result.

This legacy system was configured, pursuant to previous French insurance requirements, to automatically shut down power in the event of a fire suppression activation being triggered, and also to prevent backup power supplies from kicking in. This system was not supposed to be active, or able to trigger the shutdown of the power systems in this Colo.

The Engineers onsite performed a manual restoration of power in the Colo, and then brought the impacted Storage and Compute resources back online in a structured manner to ensure full data resilience, and complete service restoration. Engineers actively monitored the restoration process, and full service restoration was confirmed at 16:45 UTC, although most services would have recovered before this time.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. Steps specific to this incident include:

1. A complete review of the maintenance processes for fire systems in this region to eliminate false alarm triggers during these processes.
2. A full audit of all datacenters in this region, looking for legacy fire suppression systems which may still be active or in existence.
3. The removal of the links to the power supply control systems from fire suppression systems, removing the bridge that allows an alert on one system to impact another.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

13/10

Storage - East US

Summary of impact: Between 01:00 and 06:50 UTC on 13 Oct 2018, a subset of customers in the East US region may have experienced difficulties connecting to resources hosted in this region. This would include Storage accounts and Virtual Machines. Some customers would have seen improvement starting at around 03:30 UTC, with the vast majority of impact mitigated by 06:00 UTC. Full Storage recovery occurred by 06:50 UTC. Limited impact may have been observed for a small number of customers using Azure Log Analytics, App Services, Logic Apps, HDInsight, Azure Site Recovery, Application Insights, Azure Data Factory, Azure Automation, PostgreSQL, MySQL and Azure Backup in East US.

Root cause and mitigation: At 00:10 UTC, a single busy Storage cluster in the East US region experienced a series of storage software role failures due to abnormally high resource utilization. Azure Storage has redundancy designed in, and normally roles will recover without customer impact. Unexpectedly, a small number of the failed roles did not recover automatically. This increased the load on the remaining servers in the cluster, causing further increased resource use and more failures. Customer impact began at around 01:00 UTC. Eventually the number of failed roles reached a critical point where the storage cluster was no longer able to sustain most customer traffic. Engineers made software configuration changes on the affected cluster to reduce resource utilization, and took recovery actions on Storage role instances that were failing to recover automatically.
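
The feedback loop described above (failed roles shift load onto the remaining servers, which then fail in turn) can be illustrated with a toy model. This is a minimal sketch and not a model of Azure Storage itself: the function name, the even redistribution of load, and the capacity figures are all assumptions chosen for illustration.

```python
def simulate_cascade(total_load, servers, capacity, initially_failed):
    """Toy model of a cascading overload (hypothetical, for illustration only).

    Load is spread evenly over the healthy servers. Whenever the per-server
    load exceeds `capacity`, one more server is assumed to fail.

    Returns the number of healthy servers once the system stabilizes;
    0 means the cluster can no longer serve traffic.
    """
    healthy = servers - initially_failed
    while healthy > 0:
        per_server = total_load / healthy
        if per_server <= capacity:
            break  # the remaining servers can absorb the load
        healthy -= 1  # overloaded: another role fails
    return healthy


if __name__ == "__main__":
    for failed in (0, 1, 2):
        left = simulate_cascade(total_load=90, servers=10, capacity=10,
                                initially_failed=failed)
        print(f"initially failed: {failed}, healthy after cascade: {left}")
```

With these made-up numbers, the loss of one server is absorbed, while the loss of two pushes every remaining server over capacity and the cluster collapses. This is the kind of critical point the root cause describes.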

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform to help ensure such incidents do not occur in the future. In this case it includes, but is not limited to: 

1. A software fix has been made to better limit resource use in scenarios where several storage roles fail.  This will be rolled out following our normal safe deployment practices.

2. Configuration changes have been made to other similar storage clusters in the fleet, prior to the code fix above being deployed.

3. Investigation into the failed automatic storage role recoveries is ongoing.

4. Additional alerting will be added to alert the engineering team to situations where roles are not recovering as expected.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

11/10

Azure Portal and DevOps Accessibility

Summary of impact: Between 15:59 UTC and 18:35 UTC on 11 Oct 2018, a subset of customers may have experienced intermittent issues when attempting to sign in to the Azure Portal and DevOps services.

Preliminary root cause: Engineers determined that a CDN endpoint that is responsible for processing authentication requests became unreachable.

Mitigation: Engineers responded to alerts and were in the process of deploying new CDN endpoints for authentication requests. This had been deployed in staging environments to validate mitigation before this was deployed to production. During the validation, engineers became aware that some customers may not have had these new endpoints whitelisted and so engineers began to troubleshoot this issue before any deployment into production. During the investigation, it was validated that the existing endpoints had become accessible and stable. Once engineers validated the stability and tested the existing CDN endpoints, engineers routed traffic back to the original CDN endpoints.

Next steps: Engineering teams are closely monitoring the existing CDN endpoints for any unusual activity. They will work to understand why the CDN endpoint became unreachable and why redundant mechanisms were not invoked automatically.

11/10

App Service - East Asia

Summary of impact: Between 09:32 and 09:58 UTC on 11 Oct 2018, a subset of customers using App Service in East Asia may have received HTTP 500-level response codes and experienced timeouts or high latency when accessing App Service (Web, Mobile and API Apps) deployments hosted in this region. 

Root cause: The Microsoft Azure team identified that the issue was due to a dependency that the Web App service takes on Azure Storage. Application site content hosted within the Web App platform is backed by Azure Storage in a durable manner. The application accesses the site contents as file shares (this model maximizes application compatibility). Additionally, the Web App platform employs a feature to reduce the impact to customers when the platform detects latency issues within the hosted primary storage volume. This is accomplished by making use of read-only (R/O) replicas (standby volumes) hosted on Azure Storage. Web App engineers collaborated with the Storage team and discovered a previously unknown issue caused by regional load balancing of the storage cluster that hosted the application's primary volume and secondary R/O replicas.

Mitigation: This issue was detected and automatically resolved by the Azure platform's self-healing process. 

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform to help ensure such incidents do not occur in the future. In this case it includes, but is not limited to:

• Fine tuning the self-heal mechanism to pro-actively identify changes due to load balancing and heal faster. [in progress]

8/10

Availability Issues with Application Insights Portal

Summary of impact: Between 23:20 UTC on 08 Oct 2018 and 00:20 UTC on 09 Oct 2018, a subset of customers using Application Insights may have experienced issues connecting to their Application Insights resources. Application Insights portal blades may not have loaded for some customers.

Preliminary root cause: Engineers identified a configuration change that caused a backend service to become unhealthy, impacting the availability of the Application Insights portal for some customers.

Mitigation: Engineers failed over to a healthy backend service to mitigate impact for customers.

Next steps: Engineers will continue to investigate the full root cause to prevent future occurrences.

8/10

RCA - Service Bus - East US and West US

Summary of impact: Between 21:00 UTC on 08 Oct 2018 and 01:00 UTC on 10 Oct 2018, a subset of customers using Service Bus in East US 2 and West US may have experienced intermittent timeouts on runtime and management operations. Retries may have been successful for some customers. 

Root cause and mitigation: Service Bus and Event Hubs recently added support for Service Endpoints to support VNet scenarios. For this support to work, a VM Extension which provides service tunneling for Azure Services was added to the Service Bus and Event Hubs gateway VMs. The Service Bus and Event Hubs gateway VMs are also configured with Direct Server Return (DSR) endpoints.

Recently, the service tunneling extension was upgraded to v4.1. Under certain conditions, the interaction between the service that configures DSR and the service tunneling extension can cause the Service Bus and Event Hubs gateways to lose the ability to receive load-balanced traffic. In this incident, that condition was triggered when the Service Bus VMs received the monthly OS patch. Once a VM was impacted, load-balanced traffic did not recover until the VM was manually patched to fix the misconfiguration.

Next steps: We sincerely apologize for the impact to our affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Fix - Modify Azure Guest Agent to pick a predictable loopback adapter to configure direct server return
2. Detect - Increase priority of alerting from outside monitoring
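
The summary for this incident notes that retries may have been successful for some customers. A common client-side pattern for intermittent timeouts is retrying with exponential backoff and jitter. The sketch below is a generic illustration and not Service Bus SDK code: `call_with_retries`, its parameters, and the choice of `TimeoutError` as the retryable error are all assumptions.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(TimeoutError,), sleep=time.sleep):
    """Call `operation()` and retry on transient errors.

    Waits a random time between 0 and an exponentially growing cap
    ("full jitter") before each retry. The last error is re-raised once
    `max_attempts` calls have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_send():
        # Hypothetical operation that times out twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise TimeoutError("simulated timeout")
        return "sent"

    print(call_with_retries(flaky_send, base_delay=0.01))
```

The jitter spreads retries from many clients over time, so they do not all hit a recovering gateway at the same moment.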

8/10

Azure DevOps

Summary of impact: Between 11:40 and 13:30 UTC on 08 Oct 2018, a subset of Azure DevOps customers may have experienced error messages, latency, and/or authentication issues. For more information regarding this issue, please refer to:

In addition, a small subset of Azure Portal customers may have experienced issues logging into the Azure portal.

Preliminary root cause: Engineers determined that a network device experienced a hardware fault and that network traffic was not automatically rerouted.

Mitigation: Engineers took the faulty network device out of rotation and rerouted network traffic to mitigate the issue.

4/10

Azure DevOps

Summary of impact: Between 17:47 and 18:40 UTC on 04 Oct 2018, users with Azure DevOps organizations in South Central US may have experienced errors while using the service.

Preliminary root cause: Engineers determined that this was caused by a networking event that impacted communication between one Azure DevOps scale unit in South Central US and the rest of the world.

Mitigation: After the network event self-recovered, Azure DevOps performed a manual action to reset our web servers, which expedited the recovery from the network incident.

Next steps: Azure DevOps and Networking teams will continue to investigate to establish the full root cause and prevent future occurrences.

3/10

Azure DevOps

Summary of impact: Between 16:45 and 18:15 UTC on 03 Oct 2018, a subset of customers using Azure DevOps may have been unable to access DevOps resources. Additional information can be found at

Preliminary root cause: Engineers determined that a network device experienced a hardware fault, causing a DevOps backend service to become unavailable.

Mitigation: Engineers took the faulty network device out of rotation and rerouted network traffic to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

2/10

ExpressRoute Circuit - Australia Southeast

Summary of impact: Between 01:00 UTC and 04:45 UTC on 02 Oct 2018, a subset of customers began facing intermittent packet loss in the Melbourne ExpressRoute location for Australia Southeast region. During the period of degradation, customers may have experienced intermittent packet loss when using an ExpressRoute circuit hosted on the Microsoft Enterprise Edge device (MSEE) experiencing this issue. Access to a Virtual Network and the corresponding resources using a Virtual Network Gateway and services hosted on public and Microsoft peering were also affected when connecting using the ExpressRoute service.

Root cause and mitigation: ExpressRoute, as with other Azure services, is designed and implemented in a highly available manner and architecture. ExpressRoute has multiple redundancies implemented in both software and hardware to ensure maximum uptime for customers. It was identified that an upstream physical port experienced a partial failure, causing the connection between the ExpressRoute routers and the Microsoft global network to intermittently drop packets. Due to the partial failure, the physical port was not removed from rotation successfully, causing impact to customers consuming services.

Automatic detection mechanisms are in place to detect failure modes in the platform. The detection mechanisms triggered an internal alert and engineers engaged to begin troubleshooting. An unrelated issue had previously arisen that did not immediately make this issue apparent, causing a slight delay in mitigation. When engineers pinpointed the issue, the port was removed from rotation and services were immediately restored.

Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case, it includes (but is not limited to):

• Add additional monitoring parameters in our automated detection platform to better classify and alert on intermittent failures [In Progress].
• Modify existing processes to localize on faults and mitigate faster. Add automation where applicable [In Progress].
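
A partial failure like the one described above drops only some packets, so a simple up/down check can miss it. One way to classify intermittent loss is to track the loss rate over a sliding window of probes. The sketch below is a generic illustration, not the detection platform referred to in the next steps: the class name, window size, and threshold are assumptions.

```python
from collections import deque


class LossMonitor:
    """Track probe results in a sliding window and flag sustained loss."""

    def __init__(self, window=100, threshold=0.02):
        self._results = deque(maxlen=window)  # True = delivered, False = lost
        self._threshold = threshold

    def record(self, delivered):
        self._results.append(bool(delivered))

    def loss_rate(self):
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)

    def should_alert(self):
        # Only alert once the window is full, to avoid noise from a few probes.
        window_full = len(self._results) == self._results.maxlen
        return window_full and self.loss_rate() > self._threshold


if __name__ == "__main__":
    monitor = LossMonitor(window=10, threshold=0.2)
    for delivered in [True, True, False, True, False, True, True, False, True, True]:
        monitor.record(delivered)
    print(monitor.loss_rate(), monitor.should_alert())
```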

September 2018

30/9

Multiple Services - Korea South

Summary of impact: Between 13:50 and 19:10 UTC on 30 Sep 2018, a subset of Storage customers in Korea South may have experienced difficulties connecting to resources hosted in this region. Some services with dependencies on Storage in the region were also impacted. 

Preliminary root cause: Engineers determined that a single storage scale unit experienced impact to a limited subset of its storage resources, and engineers are still investigating the underlying cause. 

Mitigation: Engineers manually restarted the impacted Storage nodes to mitigate the issue. 

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

26/9

RCA - Multiple Services - Southeast Asia

Summary of impact: Between 09:52 and 12:28 UTC on 26 Sep 2018, a subset of customers in Southeast Asia may have experienced latency or difficulties connecting to Virtual Machines and/or Cloud Service resources hosted in this region. Customers using a number of related services including: App Services, Redis Cache, Application Insights, Stream Analytics, and API Management may have also experienced latency or difficulties connecting to these services in the region.

Workaround: This incident impacted a single scale unit in the Southeast Asia region.  Customers could have reduced the impact of this incident by using VM Scale Sets, or by deploying in more than one region and using Azure Traffic Manager to direct requests to a healthy deployment. 
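
The multi-region workaround above relies on directing requests to a healthy deployment. The selection logic behind priority-based failover can be sketched as follows. This is a simplified illustration and not the Azure Traffic Manager API: the `Endpoint` type, the `pick_endpoint` function, and the region names in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    priority: int  # lower value = preferred
    healthy: bool  # result of the most recent health probe


def pick_endpoint(endpoints):
    """Return the healthy endpoint with the lowest priority value, or None."""
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.priority)


if __name__ == "__main__":
    deployments = [
        Endpoint("app-southeastasia", priority=1, healthy=False),
        Endpoint("app-eastasia", priority=2, healthy=True),
    ]
    chosen = pick_endpoint(deployments)
    print(chosen.name if chosen else "no healthy endpoint")
```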

Root cause and mitigation: In Azure's network architecture, each rack of servers in a data center row connects to 8 network switches located in the middle of the row. This provides 8-way network redundancy at the row level, and makes it very unlikely that failures of the row-level switches will impact customer traffic.

In this incident, there was a failure of the RS-232 serial line aggregator that provides connectivity to the console ports of all 8 row-level switches. The flapping behavior of the console ports triggered a bug in the console line driver of the Operating System kernel running on the 8 row-level network switches, and caused a correlated failure of all 8 row-level switches. The failure of all 8 row-level switches led to loss of network connectivity to all servers in the racks located in that row. That, in turn, caused impact to all services hosted on the servers and caused the reboot of most VMs running on the servers.

Engineers mitigated this incident by restarting the affected row-level switches and recovering impacted downstream services. 
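
The difference between independent and correlated failures can be made concrete with a small calculation. The sketch below assumes each switch fails independently with probability `p_switch`, and that a shared dependency (such as the serial line aggregator in this incident) takes down every switch at once with probability `p_common`. The function name and all numbers are illustrative assumptions, not measured failure rates.

```python
def row_outage_probability(p_switch, n_switches=8, p_common=0.0):
    """Probability that all row-level switches are down at the same time.

    The row is lost either through the common-cause event, or (if that
    does not happen) through all switches failing independently.
    """
    all_fail_independently = p_switch ** n_switches
    return p_common + (1.0 - p_common) * all_fail_independently


if __name__ == "__main__":
    print(row_outage_probability(0.01))                  # independent failures only
    print(row_outage_probability(0.01, p_common=0.001))  # with a shared dependency
```

Under these assumptions, even a rare common-cause failure dominates the result, because the chance of eight independent failures coinciding is many orders of magnitude smaller.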

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. We are taking the following steps to prevent a recurrence of this type of issue:

1. Repair the tty communications subsystem in the operating system running on the switch to be resilient to console line state changes [In Progress]
2. Add code to the switch to recover itself even if the tty communications subsystem fails [In Design]
3. Add alerting to detect switches failing in this manner [Completed]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:


19/9

Log Analytics - Latency, Timeouts and Service Management Failures

Summary of impact:  Between 10:54 and 18:45 UTC on 19 Sep 2018, a subset of customers using Log Analytics and/or other downstream services may have experienced latency, timeouts or service management failures.

Impacted services and experience: 

Log Analytics - Difficulties accessing data, high latency and timeouts when getting workspace information, running Log Analytics queries, and other operations related to Log Analytics workspaces.
Service Map - Ingestion delays and latency.
Automation - Difficulties accessing the Update management, Change tracking, Inventory, and Linked workspace blades.
Network Performance Monitor - Incomplete data in tests configured.

Preliminary root cause: A backend Log Analytics service experienced a large number of requests which caused failures in several other dependent services.

Mitigation: Engineers applied a backend scaling adjustment to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

15/9

Virtual Machines - Metrics Unavailable

Summary of impact: Between 06:30 UTC on 15 Sep 2018 and 22:35 UTC on 17 Sep 2018, a subset of customers may have experienced difficulties viewing Virtual Machines and/or other compute resource metrics. Metric graphs including CPU (average), Network (total), Disk Bytes (total) and Disk operations/sec (average) may have appeared empty within the Management Portal. Auto-scaling operations were potentially impacted as well. 

Root cause: Scheduled maintenance was being completed on a gateway component within the metric ingestion pipeline. During this time one instance was taken down, leaving another instance up to service traffic. When this transition occurred, it was not properly handled by the publishing agent, and as a result the agent failed to continue publication for a subset of customers' Virtual Machine metrics. In parallel, the agent uses a VIP to communicate with the ingestion gateway at the frontend. When the VIP was failed over, the agent did not properly detect and re-establish connectivity to the functioning instance during the maintenance period. The gateway did not use reserved IPs, and received a new endpoint upon maintenance completion. These two triggers caused the agent to reach an invalid state from which it could not recover without manual mitigation.

Mitigation: A change was made to the agent which forced it to recycle and re-establish a proper connection to the gateway component.  Once this occurred, metric publication began flowing as expected.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.  Steps specific to this incident include:

1. Further localization of the publication pipeline to minimize the potential of broad impact, when future maintenance is performed [In progress].
2. Key monitoring gaps in the pipeline have been identified to more rapidly allow the issues of this nature to be detected and resolved without impact in the future. [In progress]
3. Failure recovery has been improved in the client to avoid long periods of publication to a non-healthy gateway. [In progress]
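
The failure mode in this root cause is a client that keeps using a stale gateway address after a failover. A client can guard against that by looking the endpoint up again whenever a send fails. The sketch below is a generic illustration and not the actual metrics agent: `MetricPublisher`, `resolve`, and `send` are hypothetical names, and `ConnectionError` stands in for whatever error the real transport raises.

```python
class MetricPublisher:
    """Publish payloads to a gateway, re-resolving its address on failure."""

    def __init__(self, resolve, send):
        self._resolve = resolve    # callable returning the current gateway address
        self._send = send          # callable(address, payload); raises ConnectionError
        self._address = resolve()  # cached address, may go stale after a failover

    def publish(self, payload, max_attempts=3):
        for _ in range(max_attempts):
            try:
                self._send(self._address, payload)
                return self._address
            except ConnectionError:
                # The cached address may point at a retired instance: look it up again.
                self._address = self._resolve()
        raise ConnectionError("gateway unreachable after re-resolving")


if __name__ == "__main__":
    current = {"address": "10.0.0.1"}

    def fake_send(address, payload):
        if address != current["address"]:
            raise ConnectionError(f"{address} is no longer serving traffic")

    publisher = MetricPublisher(lambda: current["address"], fake_send)
    current["address"] = "10.0.0.2"  # simulated failover to a new endpoint
    print(publisher.publish({"cpu": 0.42}))
```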

 
15/9

Metrics Missing - Germany

Summary of impact: Between 06:30 UTC on 15 Sep 2018 and 22:35 UTC on 17 Sep 2018, a subset of customers may have experienced difficulties viewing Virtual Machines and/or other compute resource metrics. Metric graphs including CPU (average), Network (total), Disk Bytes (total) and Disk operations/sec (average) may have appeared empty within the Management Portal. Auto-scaling operations were potentially impacted as well. 

Root cause: Scheduled maintenance was being completed on a gateway component within the metric ingestion pipeline. During this time one instance was taken down, leaving another instance up to service traffic. When this transition occurred, it was not properly handled by the publishing agent, and as a result the agent failed to continue publication for a subset of customers' Virtual Machine metrics. In parallel, the agent uses a VIP to communicate with the ingestion gateway at the frontend. When the VIP was failed over, the agent did not properly detect and re-establish connectivity to the functioning instance during the maintenance period. The gateway did not use reserved IPs and received a new endpoint upon maintenance completion. These two triggers caused the agent to reach an invalid state from which it could not recover without manual mitigation.

Mitigation: A change was made to the agent which forced it to recycle and re-establish a proper connection to the gateway component. Once this occurred, metric publication began flowing as expected. 

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.  Steps specific to this incident include:

1. Further localization of the publication pipeline to minimize the potential of broad impact, when future maintenance is performed [In progress].
2. Key monitoring gaps in the pipeline have been identified to more rapidly allow the issues of this nature to be detected and resolved without impact in the future. [In progress]
3. Failure recovery has been improved in the client to avoid long periods of publication to a non-healthy gateway

5/9

RCA - Azure Active Directory - Multiple Regions

Summary of impact: Between as early as 09:00 UTC on 05 Sept 2018 and as late as 05:50 UTC on 10 Sept 2018, a small subset of Azure Active Directory (AAD) customers may have experienced intermittent authentication failures when connecting to resources in the following regions: Japan, India, Australia, South Brazil, and East US 2.

Root cause and mitigation: A recent change to the Azure VM platform resulted in interference with the HTTP settings used by the Azure Active Directory (AAD) proxy service. This led to the failure of a subset of HTTP requests arriving at the affected AAD regions. The issue was initially mitigated by rebooting the proxy servers in the affected regions and reverting to the required AAD settings. The issue was finally resolved by deploying an AAD proxy configuration change that prevented the interference from the platform change.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future:
1. Continuous improvement to the communication and change management processes as related to the Azure platform updates.
2. Continuous improvement to AAD proxy monitoring and alerting mechanisms in order to reduce incidents’ impact and duration.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

4/9

RCA - South Central US

Summary of event:
In the early morning of September 4, 2018, high energy storms hit southern Texas in the vicinity of Microsoft Azure’s South Central US region.  Multiple Azure datacenters in the region saw voltage sags and swells across utility feeds.  At 08:42 UTC, lightning caused a large swell which was immediately followed by a large sag in one of the region’s datacenters, which fell below the required voltage specification for the chiller plant. By design this sag triggered the chillers to power down and lock out, to protect this equipment.  
The expected behavior in this occurrence is for the Mechanical Electrical Plumbing (MEP) management control system to invoke one of the redundant cooling systems in the facility until the chiller plant is able to recover. This is an automated operating procedure and is the standard design in many Microsoft datacenters around the world, having withstood similar situations without any issues.  In the impacted datacenter, the redundant cooling system had been switched to manual mode for invoking the failover of cooling. It was set to manual failover following the installation of new equipment in the facility, for which not all testing had been completed.
The manual procedure for failing over the cooling mechanisms is part of our statement of operations and this was executed successfully in one of the other datacenters in the South Central US region, 30 minutes prior to this event. Onsite engineers in the impacted datacenter received multiple alarms and so were investigating several situations reported in the facility. Cooling alerts which should have been manually actioned were not. Consequently, temperatures began to rise, eventually reaching levels that triggered infrastructure devices and servers to shut down. This shutdown mechanism is intended to protect infrastructure from overheating but, in this instance, temperatures increased so quickly in parts of the datacenter that a number of storage servers were damaged, as well as a small number of network devices.
While storms were still active in the area, onsite teams took a series of actions to prevent further issues – including transferring the datacenter to generators, thereby stabilizing the power supply.  Initial focus of recovery was the Azure networking infrastructure.  The second focus was to recover the storage servers and the data on these servers.  The decision was made to work towards recovery of data and not fail over to another datacenter, since a failover would have resulted in limited data loss due to the asynchronous nature of geo replication.  Recovering the storage services involved replacing failed infrastructure components, migrating customer data from damaged servers to healthy servers, and validating the integrity of the recovered data. This process took time due to the number of servers that required manual intervention, and the need to work carefully to maintain customer data integrity above all else.

Customer impact:
[1] Impact to resources in South Central US
Customer impact began at approximately 09:29 UTC when infrastructure devices and servers in the datacenter began shutting down as a result of the rising temperatures. This impacted a subset of customers using Azure services that depended on this infrastructure, specifically: Storage, Virtual Machines, Application Insights, Cognitive Services & Custom Vision API, Backup, App Service (and App Services for Linux and Web App for Containers), Azure Database for MySQL, SQL Database, Azure Automation, Site Recovery, Redis Cache, Cosmos DB, Stream Analytics, Media Services, Azure Resource Manager, Azure VPN gateways, PostgreSQL, Azure Machine Learning Studio, Azure Search, Data Factory, HDInsight, IoT Hub, Analysis Services, Key Vault, Log Analytics, Azure Monitor, Azure Scheduler, Logic Apps, Databricks, ExpressRoute, Container Registry, Application Gateway, Service Bus, Event Hub, Azure Portal IaaS Experiences, Bot Service, Azure Batch, Service Fabric and Visual Studio Team Services (VSTS). Although impact was mitigated for the majority of affected customers by 11:00 UTC on 5 September, full mitigation did not occur until approximately 08:40 UTC on 7 September, when the final storage accounts were brought back online. The extended Storage recovery was due to damaged infrastructure, which required repairs while Storage disks were brought back online. In this incident, customers that had enabled RA-GRS still had read-only access to the geo-replicated copy of the impacted storage accounts.
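
For readers unfamiliar with RA-GRS, the read-only fallback mentioned above can be sketched roughly as follows. This is an illustration only and does not use the Azure SDK: the function names and the caller-supplied `fetch` callable are hypothetical. The one Azure-specific detail relied on is that the RA-GRS secondary blob endpoint is formed by appending `-secondary` to the storage account name.

```python
from typing import Callable, Tuple


def blob_endpoints(account: str) -> Tuple[str, str]:
    """Return the (primary, secondary) blob endpoints of an RA-GRS account."""
    primary = f"https://{account}.blob.core.windows.net"
    secondary = f"https://{account}-secondary.blob.core.windows.net"
    return primary, secondary


def read_with_fallback(account: str, path: str,
                       fetch: Callable[[str], bytes]) -> bytes:
    """Read a blob from the primary region, falling back to the secondary.

    `fetch` is a hypothetical caller-supplied function that downloads a URL
    and raises OSError (for example ConnectionError) when the endpoint is
    unreachable.
    """
    primary, secondary = blob_endpoints(account)
    try:
        return fetch(f"{primary}/{path}")
    except OSError:
        # The secondary is read-only and, because geo-replication is
        # asynchronous, it may not yet contain the most recent writes.
        return fetch(f"{secondary}/{path}")
```

A fallback of this kind only helps read paths; writes still depend on the primary region unless the account itself is failed over.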

[2] Impact to Azure Service Manager (ASM)
Insufficient resiliency for Azure Service Manager (ASM) led to the widest impact for customers outside of South Central US. ASM performs management operations for all ‘classic’ resource types. This is often used by customers who have not yet adopted Azure Resource Manager (ARM) APIs which have been made available in the past few years. ARM provides reliable, global resiliency by storing data in every Azure region. ASM stores metadata about these resources in multiple locations, but the primary site is South Central US. Although ASM is a global service, it does not support automatic failover. As a result of the impact in South Central US, ASM requests experienced higher call latencies, timeouts or failures when performing service management operations. To alleviate failures, engineers routed traffic to a secondary site. This helped to provide temporary relief, but the impact was fully mitigated once the associated South Central US storage servers were brought back online at 01:10 UTC on 5 September.

[3] Impact to Azure Resource Manager (ARM)
The ARM service instances in regions outside of South Central US were impacted due to the dependencies called out in [8] below, which include subscription and RBAC role assignment operations. In addition, because ARM is used as a routing and orchestration layer (for example, when using ARM templates to interact with ‘classic’ resources), customers may have experienced latency, failures or timeouts when using PowerShell, CLI or the portal to manage resources through ARM that had underlying dependencies on ASM or other impacted services mentioned in this RCA.

[4] Impact to Azure Active Directory (AAD)
As infrastructure shut down from 09:29 UTC, Azure AD authentication requests were routed as expected to sites outside of South Central US. Requests were handled successfully until 11:00 UTC, when the first signs of degradation occurred for customers located in North America, predominantly due to three factors. First, the AAD autoscale operation that is designed to scale ahead of demand did not complete. The autoscale failure was due to a dependency on the Azure ASM API which, as discussed earlier, was unable to process service management operations. Second, as AAD sites neared safe utilization thresholds, Quality of Service throttling mechanisms engaged, resulting in authentication failures and timeouts. Third, the engineering team discovered a bug that triggered unexpected behavior in Office clients, which resulted in aggressive retry logic and exacerbated the increased load on the service.
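
The third factor illustrates why clients are generally advised to space out retries. The sketch below shows one common technique, exponential backoff with full jitter. It is not the fix that was shipped for Office clients, which this RCA does not describe; the function names, default values and the choice of `ConnectionError` as the retryable error are assumptions made for illustration.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one delay per retry, drawn uniformly from
    [0, min(cap, base * 2**n)] for retry number n ("full jitter")."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(retries)]


def call_with_backoff(op: Callable[[], T], attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `op` up to `attempts` times, sleeping between failed attempts."""
    for delay in backoff_delays(attempts - 1):
        try:
            return op()
        except ConnectionError:
            # Wait a randomized, growing interval rather than retrying
            # immediately, so that many clients do not retry in lockstep.
            sleep(delay)
    return op()  # final attempt; any error propagates to the caller
```

Randomizing the delay matters as much as growing it: without jitter, clients that failed at the same moment would all retry at the same moment.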

[5] Impact to Visual Studio Team Services (VSTS)
VSTS organizations hosted in South Central US were down. Some VSTS services in this region provide capabilities used by services in other regions, which led to broader impact: slowdowns, errors in the VSTS Dashboard functionality, and inability to access user profiles stored in South Central US, to name a few. Customers with organizations hosted in the US were unable to use Release Management and Package Management services. Build and release pipelines using the Hosted macOS queue failed. To avoid data loss, VSTS services did not fail over and waited for the recovery of the Storage services. After VSTS services had recovered, customers experienced additional issues in Git, Release Management, and some Package Management feeds due to Azure Storage accounts that had an extended recovery. This VSTS impact was fully mitigated by 00:05 UTC on 6 September. The full RCA for VSTS can be found at .

[6] Impact to Azure Application Insights
Application Insights resources across multiple regions experienced impact. This was caused by a dependency on Azure Active Directory and platform services that provide data routing. This impacted the ability to query data, to update/manage some types of resources such as Availability Tests, and significantly delayed ingestion. Engineers scaled out the services to more quickly process the backlog of data and recover the service.  During recovery, customers experienced gaps in data, as seen in the Azure portal; Log Search alerts firing, based on latent ingested data; latency in reporting billing data to Azure commerce; and delays in results of Availability Tests. Although 90% of customers were mitigated by 16:16 UTC on 6 September, full mitigation was not until 22:12 UTC on 7 September.

[7] Impact to the Azure status page
The Azure status page (status.azure.com) is a web app that uses multiple Azure services in multiple Azure regions. During the incident, the status page received a significant increase in traffic, which should have caused a scale out of resources as needed. The combination of the increased traffic and non-optimized auto-scale configuration settings prevented the web app from scaling properly to handle the increased load. This resulted in intermittent 500 errors for a subset of page visitors, starting at approximately 12:30 UTC. Once these configuration issues were identified, our engineers adjusted the auto-scale settings, so the web app could expand organically to support the traffic. The impact to the status page was fully mitigated by 23:05 UTC on 4 September.

[8] Impact to Azure subscription management
From 09:30 UTC on 4 September, Azure subscription management experienced five separate issues related to the South Central US datacenter impact. First, provisioning of new subscriptions experienced long delays as requests were queued for processing. Second, the properties of existing subscriptions could not be updated. Third, subscription metadata could not be accessed, and as a result billing portal requests failed. Fourth, customers may have received errors in the Azure management and/or billing portals when attempting to create support tickets or contact support. Fifth, a subset of customers were incorrectly shown a banner asking them to contact support to discuss a billing system issue. Impact to subscription management was fully mitigated by 08:30 UTC on 5 September.

Next steps:
We sincerely apologize for the impact to affected customers. Investigations have assessed the best ways to improve architectural resiliency and mitigate the issues that contributed to this incident and its widespread impact. A detailed forensic analysis of the impacted datacenter hardware and systems is complete, and thorough reviews of datacenter recovery procedures are ongoing. Several changes are underway:

1. Failure of the cooling redundancy via MEP automation is understood in this incident, and has been remediated – cooling systems in all datacenters in South Central US, and around the world, have been reviewed with no other systems requiring remediation.

2. A review with every Azure service to identify dependencies on the ASM API continues.  Service teams are exploring paths to move these services from ASM to the newer ARM platform. Immediate efforts are to prioritize autoscale operations succeeding in the unlikely event of ASM degradation or failure.

3. Additional improvements to AAD resiliency that would prevent customer impact in the case of a surge in authentication rates are in progress. This is particularly focused on ensuring that the ASM API responsible for the autoscale operation is highly available.

4. Necessary fixes have been released for Microsoft Office clients, to prevent the unexpected surge in authentication traffic.

5. An evaluation of the future hardware design of storage scale units is underway, to increase resiliency to environmental factors. In addition, for scenarios in which impact is unavoidable, we are determining software changes to automate and accelerate recovery.

6. Microsoft takes the responsibility of preserving your data seriously. If there is any chance of recovering the data in the primary region, Microsoft will delay the failover and focus on recovering your data. However, as announced at Ignite 2018, Azure Storage services are planning to allow customers using Geo-Redundant Storage (GRS) to make their own decision about failing over a Storage account to its geo-replicated region. This will give customers the option of bringing Storage accounts back online for both read and write access for workloads that can accept a potential risk of data loss due to the nature of asynchronous replication.

7. Impacted customers will receive a credit pursuant to the Microsoft Azure Service Level Agreement, in their October billing statement.

August 2018

20/8

RCA - Storage - West Europe

Summary of impact:
Between 08:40 and 13:43 UTC on 20 Aug, 2018, a subset of customers using Azure Storage may have been unable to create ARM storage accounts in West Europe. Customers may have received 400 HTTP Bad Request responses with an error code of “RequestIsNotValidJson” even for valid, well-formed requests.  Customers attempting to deploy a new VM in the affected regions may have also encountered failures due to this issue. Other services and operations which rely on the creation of storage accounts would have similarly been impacted.  Other storage service management operations were not impacted. Reading and writing data within existing storage accounts was also not affected.

Customer impact workaround:
Customers may have chosen to use an existing account or create a new account in an unaffected region or create classic storage accounts.

Root cause and mitigation:
The incident was caused by a bug in the Storage Resource Provider (SRP). The bug was triggered in the create account path, when evaluating which scale unit to create the account in. The bug manifests only under specific configuration settings, which were present only in West Europe. Due to the bug, customers would have experienced account creation failures. The bug was not found during testing, and the service had been deployed to other production regions without issue before the incident occurred in West Europe.
Engineers mitigated the issue by making a configuration update. 

Next steps:
We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.
1) We are working on a fix to make account creation more robust should a similar bug occur in the future [in progress].
2) We are also working on completing implementation of per-region flighting of new deployments for Storage with automatic rollback whenever core scenarios are impacted [in progress].
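
As a rough illustration of the second item, per-region flighting with automatic rollback can be sketched as a loop that deploys to one region at a time and stops at the first failed health check. This is not the Storage team's deployment tooling: `deploy`, `healthy` and `rollback` are hypothetical callables supplied by the caller, and the choice to roll back only the failing region and halt the rollout is an assumption.

```python
from typing import Callable, List, Sequence


def flight_by_region(regions: Sequence[str],
                     deploy: Callable[[str], None],
                     healthy: Callable[[str], bool],
                     rollback: Callable[[str], None]) -> List[str]:
    """Deploy to one region at a time and return the regions left on the
    new version. Stops at the first region whose health check fails."""
    live: List[str] = []
    for region in regions:
        deploy(region)
        if not healthy(region):
            # Core scenario impacted: undo this region and halt the rollout
            # so that later regions never receive the change.
            rollback(region)
            break
        live.append(region)
    return live
```

Under this scheme, a bug that only manifests with one region's configuration, as in this incident, would be contained to that region and reverted without waiting for a manual mitigation.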

20/8

IoT Hub - West Europe and North Europe

Summary of impact: Between 17:30 UTC on 20 Aug 2018 and 16:42 UTC on 21 Aug 2018, a subset of IoT Hub customers in West Europe and North Europe may have experienced errors, such as 'InternalServerError', when trying to access IoT Hub devices hosted in these regions. Customers may also have experienced issues when attempting to delete resources/devices.

Root cause and mitigation: A change was deployed that introduced a possibility of threads entering a locked condition and waiting indefinitely, thus failing all subsequent calls to the partition. The problem went undetected in tests as it was very infrequent and was encountered only in specific traffic patterns. Engineers performed a manual reboot of the affected backend service nodes to mitigate the issue. Furthermore, a permanent fix has been identified and deployed to the entire service. 
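
The failure mode described above, a thread waiting indefinitely on a lock so that every later call to the partition also fails, is often guarded against by bounding the wait. The sketch below shows that general technique using Python's `threading.Lock`. It is not the permanent fix deployed to IoT Hub, which this summary does not describe; `update_partition`, `work` and the default timeout are hypothetical.

```python
import threading
from typing import Callable


def update_partition(lock: threading.Lock, work: Callable[[], None],
                     timeout_s: float = 2.0) -> bool:
    """Run `work` while holding `lock`, waiting at most `timeout_s` seconds.

    Returns False if the lock could not be acquired in time, so the caller
    can fail this one call quickly (and emit telemetry) instead of blocking
    forever behind a stuck thread.
    """
    if not lock.acquire(timeout=timeout_s):
        return False
    try:
        work()
        return True
    finally:
        # Always release, even if `work` raises, so later calls can proceed.
        lock.release()
```

A bounded wait does not remove the underlying deadlock, but it turns a silent hang into an observable, countable error.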

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, such as: Improve telemetry to detect and more quickly mitigate issues that arise during deployment activities [in progress].
