Azure status history

October 2018

10-13

Storage - East US

Summary of impact: Between 01:22 and 05:50 UTC on 13 Oct 2018, a subset of customers using Storage in East US may have experienced intermittent difficulties connecting to resources hosted in this region. Other services leveraging Storage in the region may have also experienced impact related to this incident.

Preliminary root cause: Availability impact was caused by multiple backend storage nodes going into an offline state due to increased resource usage.

Mitigation: Engineers brought the storage nodes back online to restore availability. While engineers continue to investigate the root cause, all Storage deployments have been paused and traffic from background jobs has been stopped to reduce the overall load on the scale unit.

Next steps: The investigation into the root cause continues, and a full root cause will be published in approximately 72 hours.

10-11

Azure Portal and DevOps Accessibility

Summary of impact: Between 15:59 UTC and 18:35 UTC on 11 Oct 2018, a subset of customers may have experienced intermittent issues when attempting to sign in to the Azure Portal and DevOps services.

Preliminary root cause: Engineers determined that a CDN endpoint responsible for processing authentication requests became unreachable.

Mitigation: Engineers responded to alerts and began deploying new CDN endpoints for authentication requests. These were first deployed to staging environments to validate the mitigation before any rollout to production. During validation, engineers became aware that some customers might not have whitelisted the new endpoints, so they began troubleshooting that issue before deploying to production. During this investigation, engineers confirmed that the existing endpoints had become accessible and stable again. Once the stability of the existing CDN endpoints was validated and tested, engineers routed traffic back to them.

Next steps: Engineering teams are closely monitoring the existing CDN endpoints for any unusual activity. They will work to understand why the CDN endpoint became unreachable and why redundant mechanisms were not invoked automatically.

10-8

Availability Issues with Application Insights Portal

Summary of impact: Between 23:20 UTC on 08 Oct 2018 and 00:20 UTC on 09 Oct 2018, a subset of customers using Application Insights may have experienced issues connecting to their Application Insights resources. Application Insights portal blades may not have loaded for some customers.

Preliminary root cause: Engineers identified a configuration change that caused a backend service to become unhealthy, impacting the availability of the Application Insights portal for some customers.

Mitigation: Engineers failed over to a healthy backend service to mitigate impact for customers.

Next steps: Engineers will continue to investigate the full root cause to prevent future occurrences.

10-8

Azure DevOps

Summary of impact: Between 11:40 and 13:30 UTC on 08 Oct 2018, a subset of Azure DevOps customers may have experienced error messages, latency, and/or authentication issues. For more information regarding this issue, please refer to:

In addition, a small subset of Azure Portal customers may have experienced issues logging into the Azure portal.

Preliminary root cause: Engineers determined that a network device experienced a hardware fault and that network traffic was not automatically rerouted.

Mitigation: Engineers took the faulty network device out of rotation and rerouted network traffic to mitigate the issue.

10-4

Azure DevOps

Summary of impact: Between 17:47 and 18:40 UTC on 04 Oct 2018, users with Azure DevOps organizations in South Central US may have experienced errors while using the service.

Preliminary root cause: Engineers determined that this was caused by a networking event that disrupted communication between one Azure DevOps scale unit in South Central US and the rest of the world.

Mitigation: After the network event self-recovered, Azure DevOps engineers manually reset the web servers, which expedited recovery from the network incident.

Next steps: Azure DevOps and Networking teams will continue to investigate to establish the full root cause and prevent future occurrences.

10-3

Azure DevOps

Summary of impact: Between 16:45 and 18:15 UTC on 03 Oct 2018, a subset of customers using Azure DevOps may have been unable to access DevOps resources. Additional information can be found at

Preliminary root cause: Engineers determined that a network device experienced a hardware fault, causing a DevOps backend service to become unavailable.

Mitigation: Engineers took the faulty network device out of rotation and rerouted network traffic to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

September 2018

9-30

Multiple Services - Korea South

Summary of impact: Between 13:50 and 19:10 UTC on 30 Sep 2018, a subset of Storage customers in Korea South may have experienced difficulties connecting to resources hosted in this region. Some services with dependencies on Storage in the region were also impacted. 

Preliminary root cause: Engineers determined that a single storage scale unit experienced impact to a limited subset of its storage resources, and engineers are still investigating the underlying cause. 

Mitigation: Engineers manually restarted the impacted Storage nodes to mitigate the issue. 

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

9-26

RCA - Multiple Services - Southeast Asia

Summary of impact: Between 09:52 and 12:28 UTC on 26 Sep 2018, a subset of customers in Southeast Asia may have experienced latency or difficulties connecting to Virtual Machines and/or Cloud Service resources hosted in this region. Customers using a number of related services, including App Services, Redis Cache, Application Insights, Stream Analytics, and API Management, may have also experienced latency or difficulties connecting to these services in the region.

Workaround: This incident impacted a single scale unit in the Southeast Asia region.  Customers could have reduced the impact of this incident by using VM Scale Sets, or by deploying in more than one region and using Azure Traffic Manager to direct requests to a healthy deployment. 
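As an illustration of the multi-region approach described above, the sketch below shows a client that tries each regional deployment in turn and uses the first healthy one; Azure Traffic Manager performs a comparable health-based selection at the DNS layer. The endpoint URLs and function name are hypothetical, and this is a minimal sketch rather than a prescribed implementation.

```python
import requests

# Hypothetical regional deployments of the same application.
ENDPOINTS = [
    "https://myapp-southeastasia.example.com",
    "https://myapp-eastasia.example.com",
]

def call_first_healthy(path, timeout=5):
    """Try each regional deployment in order and return the first successful response."""
    for base in ENDPOINTS:
        try:
            response = requests.get(base + path, timeout=timeout)
            if response.ok:
                return response
        except requests.RequestException:
            continue  # this region is unreachable; fall through to the next deployment
    raise RuntimeError("no healthy regional deployment reachable")
```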

Root cause and mitigation: In Azure's network architecture, each rack of servers in a data center row connects to 8 network switches located in the middle of the row. This provides 8-way network redundancy at the row level, and makes it very unlikely that failures of the row-level switches will impact customer traffic.

In this incident, there was a failure of the RS-232 serial line aggregator that provides connectivity to the console ports of all 8 row-level switches. The flapping behavior of the console ports triggered a bug in the console line driver of the Operating System kernel running on the 8 row-level network switches, and caused a correlated failure of all 8 row-level switches. The failure of all 8 row-level switches led to loss of network connectivity to all servers in the racks located in that row. That, in turn, caused impact to all services hosted on the servers and caused the reboot of most VMs running on the servers.

Engineers mitigated this incident by restarting the affected row-level switches and recovering impacted downstream services. 

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. We are taking the following steps to prevent a recurrence of this type of issue:

1. Repair the tty communications subsystem in the operating system running on the switch to be resilient to console line state changes [In Progress]
2. Add code to the switch to recover itself even if the tty communications subsystem fails [In Design]
3. Add alerting to detect switches failing in this manner [Completed]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:


9-19

Log Analytics - Latency, Timeouts and Service Management Failures

Summary of impact:  Between 10:54 and 18:45 UTC on 19 Sep 2018, a subset of customers using Log Analytics and/or other downstream services may have experienced latency, timeouts or service management failures.

Impacted services and experience: 

Log Analytics - Difficulties accessing data, high latency and timeouts when getting workspace information, running Log Analytics queries, and other operations related to Log Analytics workspaces.
Service Map - Ingestion delays and latency.
Automation - Difficulties accessing the Update management, Change tracking, Inventory, and Linked workspace blades.
Network Performance Monitor - Incomplete data in tests configured.

Preliminary root cause: A backend Log Analytics service experienced a large number of requests which caused failures in several other dependent services.

Mitigation: Engineers applied a backend scaling adjustment to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

9-15

Virtual Machines - Metrics Unavailable

Summary of impact: Between 06:30 UTC on 15 Sep 2018 and 22:35 UTC on 17 Sep 2018, a subset of customers may have experienced difficulties viewing Virtual Machines and/or other compute resource metrics. Metric graphs including CPU (average), Network (total), Disk Bytes (total) and Disk operations/sec (average) may have appeared empty within the Management Portal. Auto-scaling operations were potentially impacted as well. 

Root cause: Scheduled maintenance was being completed on a gateway component within the metric ingestion pipeline. During this time one instance was taken down, leaving another instance up to service traffic. This transition was not properly handled by the publishing agent; as a result, it failed to continue publishing metrics for a subset of customers' Virtual Machines. In parallel, the agent uses a VIP to communicate with the ingestion gateway at the frontend. When the VIP was failed over, the agent did not properly detect this and re-establish connectivity to the functioning instance during the maintenance period. The gateway did not use reserved IPs, and received a new endpoint upon maintenance completion. These two triggers caused the agent to reach an invalid state from which it was not able to recover without manual mitigation.

Mitigation: A change was made to the agent which forced it to recycle and re-establish a proper connection to the gateway component.  Once this occurred, metric publication began flowing as expected.
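The reconnect behavior described in the mitigation can be illustrated with the toy sketch below: when publishing fails, the client drops its connection entirely and re-resolves/reconnects to the gateway rather than reusing a stale endpoint. The host, port, and function names are hypothetical; this is a minimal sketch of the pattern, not the actual agent code.

```python
import socket
import time

def publish_with_reconnect(gateway_host, gateway_port, payloads, retry_delay=5):
    """Publish payloads; on any failure, tear down the connection and reconnect
    so a gateway whose address changed is picked up on the next attempt."""
    conn = None
    for payload in payloads:
        while True:
            try:
                if conn is None:
                    # fresh name resolution + TCP connect each time we reconnect
                    conn = socket.create_connection((gateway_host, gateway_port), timeout=10)
                conn.sendall(payload)
                break
            except OSError:
                if conn is not None:
                    conn.close()
                conn = None  # force a fresh resolve and connect on the next attempt
                time.sleep(retry_delay)
    if conn is not None:
        conn.close()
```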

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.  Steps specific to this incident include:

1. Further localization of the publication pipeline to minimize the potential of broad impact, when future maintenance is performed [In progress].
2. Key monitoring gaps in the pipeline have been identified to more rapidly allow the issues of this nature to be detected and resolved without impact in the future. [In progress]
3. Failure recovery has been improved in the client to avoid long periods of publication to a non-healthy gateway. [In progress]

 
9-15

Metrics Missing - Germany

Summary of impact: Between 06:30 UTC on 15 Sep 2018 and 22:35 UTC on 17 Sep 2018, a subset of customers may have experienced difficulties viewing Virtual Machines and/or other compute resource metrics. Metric graphs including CPU (average), Network (total), Disk Bytes (total) and Disk operations/sec (average) may have appeared empty within the Management Portal. Auto-scaling operations were potentially impacted as well. 

Root cause: Scheduled maintenance was being completed on a gateway component within the metric ingestion pipeline. During this time one instance was taken down, leaving another instance up to service traffic. This transition was not properly handled by the publishing agent; as a result, it failed to continue publishing metrics for a subset of customers' Virtual Machines. In parallel, the agent uses a VIP to communicate with the ingestion gateway at the frontend. When the VIP was failed over, the agent did not properly detect this and re-establish connectivity to the functioning instance during the maintenance period. The gateway did not use reserved IPs and received a new endpoint upon maintenance completion. These two triggers caused the agent to reach an invalid state from which it was not able to recover without manual mitigation.

Mitigation: A change was made to the agent which forced it to recycle and re-establish a proper connection to the gateway component. Once this occurred, metric publication began flowing as expected. 

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.  Steps specific to this incident include:

1. Further localization of the publication pipeline to minimize the potential of broad impact, when future maintenance is performed [In progress].
2. Key monitoring gaps in the pipeline have been identified to more rapidly allow the issues of this nature to be detected and resolved without impact in the future. [In progress]
3. Failure recovery has been improved in the client to avoid long periods of publication to a non-healthy gateway. [In progress]

9-5

RCA - Azure Active Directory - Multiple Regions

Summary of impact: Between as early as 09:00 UTC on 05 Sept 2018 and as late as 05:50 UTC on 10 Sept 2018, a small subset of Azure Active Directory (AAD) customers may have experienced intermittent authentication failures when connecting to resources in the following regions: Japan, India, Australia, South Brazil, and East US 2.

Root cause and mitigation: A recent change to the Azure VM platform resulted in interference with the HTTP settings used by the Azure Active Directory (AAD) proxy service. This led to the failure of a subset of HTTP requests arriving at the affected AAD regions. The issue was initially mitigated by rebooting the proxy servers in the affected regions and reverting to the required AAD settings. The issue was finally resolved by deploying an AAD proxy configuration change that prevented the interference from the platform change.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future:
1. Continuous improvement to the communication and change management processes as related to the Azure platform updates.
2. Continuous improvement to AAD proxy monitoring and alerting mechanisms in order to reduce incidents’ impact and duration.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

9-4

RCA - South Central US

Summary of event:
In the early morning of September 4, 2018, high energy storms hit southern Texas in the vicinity of Microsoft Azure’s South Central US region.  Multiple Azure datacenters in the region saw voltage sags and swells across utility feeds.  At 08:42 UTC, lightning caused a large swell which was immediately followed by a large sag in one of the region’s datacenters, which fell below the required voltage specification for the chiller plant. By design this sag triggered the chillers to power down and lock out, to protect this equipment.  
The expected behavior in this occurrence is for the Mechanical Electrical Plumbing (MEP) management control system to invoke one of the redundant cooling systems in the facility until the chiller plant is able to recover. This is an automated operating procedure and is the standard design in many Microsoft datacenters around the world, having withstood similar situations without any issues. In the impacted datacenter, however, the redundant cooling system had been switched to manual mode for invoking the failover of cooling. It had been set to manual failover following the installation of new equipment in the facility for which testing had not yet been completed.
The manual procedure for failing over the cooling mechanisms is part of our statement of operations, and it was executed successfully in one of the other datacenters in the South Central US region 30 minutes prior to this event. Onsite engineers in the impacted datacenter received multiple alarms and were investigating several situations reported in the facility; the cooling alerts that should have been manually actioned were not. Consequently, temperatures began to rise, eventually reaching levels that triggered infrastructure devices and servers to shut down. This shutdown mechanism is intended to protect infrastructure from overheating but, in this instance, temperatures increased so quickly in parts of the datacenter that a number of storage servers were damaged, as well as a small number of network devices.
While storms were still active in the area, onsite teams took a series of actions to prevent further issues – including transferring the datacenter to generators, thereby stabilizing the power supply.  Initial focus of recovery was the Azure networking infrastructure.  The second focus was to recover the storage servers and the data on these servers.  The decision was made to work towards recovery of data and not fail over to another datacenter, since a failover would have resulted in limited data loss due to the asynchronous nature of geo replication.  Recovering the storage services involved replacing failed infrastructure components, migrating customer data from damaged servers to healthy servers, and validating the integrity of the recovered data. This process took time due to the number of servers that required manual intervention, and the need to work carefully to maintain customer data integrity above all else.

Customer impact:
[1] Impact to resources in South Central US
Customer impact began at approximately 09:29 UTC when infrastructure devices and servers in the datacenter began shutting down as a result of the rising temperatures. This impacted a subset of customers using Azure services that depended on this infrastructure, specifically: Storage, Virtual Machines, Application Insights, Cognitive Services & Custom Vision API, Backup, App Service (and App Services for Linux and Web App for Containers), Azure Database for MySQL, SQL Database, Azure Automation, Site Recovery, Redis Cache, Cosmos DB, Stream Analytics, Media Services, Azure Resource Manager, Azure VPN gateways, PostgreSQL, Azure Machine Learning Studio, Azure Search, Data Factory, HDInsight, IoT Hub, Analysis Services, Key Vault, Log Analytics, Azure Monitor, Azure Scheduler, Logic Apps, Databricks, ExpressRoute, Container Registry, Application Gateway, Service Bus, Event Hub, Azure Portal IaaS Experiences, Bot Service, Azure Batch, Service Fabric and Visual Studio Team Services (VSTS). Although the majority of customers impacted by these services were mitigated by 11:00 UTC on 5 September, full mitigation was not until approximately 08:40 UTC on 7 September when the final storage accounts were brought back online. The extended Storage recovery was due to damaged infrastructure which required repairs while bringing Storage disks back online. In this incident, customers that had enabled RA-GRS still had read-only access to the geo-replicated copy of the impacted storage accounts.
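For context on the RA-GRS note above, the sketch below shows how a client can read from an RA-GRS storage account's read-only secondary endpoint while the primary region is unavailable. It assumes the azure-storage-blob Python package; the account name, credential, and container name are hypothetical placeholders.

```python
from azure.storage.blob import BlobServiceClient

ACCOUNT = "examplestorageacct"          # hypothetical storage account name
CREDENTIAL = "<account-key-or-sas>"     # placeholder credential

# RA-GRS exposes a read-only secondary at "<account>-secondary.blob.core.windows.net".
secondary = BlobServiceClient(
    account_url=f"https://{ACCOUNT}-secondary.blob.core.windows.net",
    credential=CREDENTIAL,
)

container = secondary.get_container_client("backups")   # hypothetical container
for blob in container.list_blobs():                      # reads work; writes are rejected on the secondary
    print(blob.name)
```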

[2] Impact to Azure Service Manager (ASM)
Insufficient resiliency for Azure Service Manager (ASM) led to the widest impact for customers outside of South Central US. ASM performs management operations for all ‘classic’ resource types. This is often used by customers who have not yet adopted Azure Resource Manager (ARM) APIs which have been made available in the past few years. ARM provides reliable, global resiliency by storing data in every Azure region. ASM stores metadata about these resources in multiple locations, but the primary site is South Central US. Although ASM is a global service, it does not support automatic failover. As a result of the impact in South Central US, ASM requests experienced higher call latencies, timeouts or failures when performing service management operations. To alleviate failures, engineers routed traffic to a secondary site. This helped to provide temporary relief, but the impact was fully mitigated once the associated South Central US storage servers were brought back online at 01:10 UTC on 5 September.

[3] Impact to Azure Resource Manager (ARM)
The ARM service instances in regions outside of South Central US were impacted due to the dependencies called out in [8] below, which includes subscription and RBAC role assignment operations.  In addition, because ARM is used as a routing and orchestration layer (for example, when using ARM templates to interact with ‘classic’ resources) customers may have experienced latency, failures or timeouts when using PowerShell, CLI or the portal to manage resources through ARM that had underlying dependencies on ASM or other impacted services mentioned in this RCA.

[4] Impact to Azure Active Directory (AAD)
As infrastructure shut down from 09:29 UTC, Azure AD authentication requests routed as expected to sites outside of South Central US. Requests were successfully being managed until 11:00 UTC, when the first signs of degradation occurred for customers located in North America, predominantly due to three factors. First, the AAD autoscale operation that is designed to scale ahead of demand did not complete. The autoscale failure was due to a dependency on the Azure ASM API which, as discussed earlier, was unable to process service management operations. Second, as AAD sites neared safe utilization thresholds, Quality of Service throttling mechanisms engaged, resulting in authentication failures and timeouts. Third, the engineering team discovered a bug that triggered an unexpected behavior in Office clients which resulted in aggressive retry logic, exacerbating the increased load on the service.
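As a contrast to the aggressive retry behavior described in the third factor, the sketch below shows the standard capped-exponential-backoff-with-jitter pattern that keeps client retries from amplifying load on an already-degraded service. The function name and parameters are illustrative assumptions, not taken from the report.

```python
import random
import time

def call_with_backoff(do_request, max_attempts=6, base=0.5, cap=30.0):
    """Retry a transient failure with capped exponential backoff and full jitter,
    rather than retrying immediately in a tight loop."""
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # sleep a random amount, up to an exponentially growing (but capped) ceiling
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
```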

[5] Impact to Visual Studio Team Services (VSTS)
VSTS organizations hosted in South Central US were down. Some VSTS services in this region provide capabilities used by services in other regions, which led to broader impact – slowdowns, errors in the VSTS Dashboard functionality, and inability to access user profiles stored in South Central US, to name a few. Customers with organizations hosted in the US were unable to use Release Management and Package Management services. Build and release pipelines using the Hosted macOS queue failed. To avoid data loss, VSTS services did not fail over and waited for the recovery of the Storage services. After VSTS services had recovered, additional issues for customers occurred in Git, Release Management, and some Package Management feeds due to Azure Storage accounts that had an extended recovery. This VSTS impact was fully mitigated by 00:05 UTC on 6 September. The full RCA for VSTS can be found at .

[6] Impact to Azure Application Insights
Application Insights resources across multiple regions experienced impact. This was caused by a dependency on Azure Active Directory and platform services that provide data routing. This impacted the ability to query data, to update/manage some types of resources such as Availability Tests, and significantly delayed ingestion. Engineers scaled out the services to more quickly process the backlog of data and recover the service.  During recovery, customers experienced gaps in data, as seen in the Azure portal; Log Search alerts firing, based on latent ingested data; latency in reporting billing data to Azure commerce; and delays in results of Availability Tests. Although 90% of customers were mitigated by 16:16 UTC on 6 September, full mitigation was not until 22:12 UTC on 7 September.

[7] Impact to the Azure status page
The Azure status page (status.azure.com) is a web app that uses multiple Azure services in multiple Azure regions. During the incident, the status page received a significant increase in traffic, which should have caused a scale out of resources as needed. The combination of the increased traffic and non-optimized auto-scale configuration settings prevented the web app from scaling properly to handle the increased load. This resulted in intermittent 500 errors for a subset of page visitors, starting at approximately 12:30 UTC. Once these configuration issues were identified, our engineers adjusted the auto-scale settings, so the web app could expand organically to support the traffic. The impact to the status page was fully mitigated by 23:05 UTC on 4 September.

[8] Impact to Azure subscription management
From 09:30 UTC on 4 September, Azure subscription management experienced five separate issues related to the South Central US datacenter impact. First, provisioning of new subscriptions experienced long delays as requests were queued for processing. Second, the properties of existing subscriptions could not be updated. Third, subscription metadata could not be accessed, and as a result billing portal requests failed. Fourth, customers may have received errors in the Azure management and/or billing portals when attempting to create support tickets or contacting support. Fifth, a subset of customers were incorrectly shown a banner asking that they contact support to discuss a billing system issue. Impact to subscription management was fully mitigated by 08:30 UTC on 5 September.

Next steps:
We sincerely apologize for the impact to affected customers. Investigations have assessed the best ways to improve architectural resiliency and mitigate the issues that contributed to this incident and its widespread impact. A detailed forensic analysis of the impacted datacenter hardware and systems is complete, and thorough reviews of datacenter recovery procedures are ongoing. Several changes are underway:

1. Failure of the cooling redundancy via MEP automation is understood in this incident, and has been remediated – cooling systems in all datacenters in South Central US, and around the world, have been reviewed with no other systems requiring remediation.

2. A review with every Azure service to identify dependencies on the ASM API continues.  Service teams are exploring paths to move these services from ASM to the newer ARM platform. Immediate efforts are to prioritize autoscale operations succeeding in the unlikely event of ASM degradation or failure.

3. Additional improvements to AAD resiliency that would prevent customer impact in the case of a surge in authentication rates are in progress. This is particularly focused on ensuring that the ASM API responsible for the autoscale operation is highly available.

4. Necessary fixes have been released for Microsoft Office clients, to prevent the unexpected surge in authentication traffic.

5. An evaluation of the future hardware design of storage scale units is underway, to increase resiliency to environmental factors. In addition, for scenarios in which impact is unavoidable, we are determining software changes to automate and accelerate recovery.

6. Microsoft takes the responsibility of preserving your data seriously. If there is any chance of recovering the data in the primary region, Microsoft will delay the failover and focus on recovering your data. However, as announced at Ignite 2018, Azure Storage services are planning to allow customers using Geo-Redundant Storage (GRS) to make their own decision about failing over a Storage account to its geo-replicated region. This will give customers the option of bringing Storage accounts back online for both read and write access for workloads that can accept a potential risk of data loss due to the nature of asynchronous replication.

7. Impacted customers will receive a credit pursuant to the Microsoft Azure Service Level Agreement, in their October billing statement.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

August 2018

8-20

RCA - Storage - West Europe

Summary of impact:
Between 08:40 and 13:43 UTC on 20 Aug 2018, a subset of customers using Azure Storage may have been unable to create ARM storage accounts in West Europe. Customers may have received 400 HTTP Bad Request responses with an error code of “RequestIsNotValidJson” even for valid, well-formed requests. Customers attempting to deploy a new VM in the affected region may have also encountered failures due to this issue. Other services and operations which rely on the creation of storage accounts would have similarly been impacted. Other storage service management operations were not impacted. Reading and writing data within existing storage accounts was also not affected.

Customer impact workaround:
Customers may have chosen to use an existing account or create a new account in an unaffected region or create classic storage accounts.
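As a hedged illustration of this workaround, the sketch below creates a new ARM storage account in an unaffected region using the azure-identity and azure-mgmt-storage Python packages; the subscription ID, resource group, account name, and region are hypothetical, and exact parameters may differ by SDK version.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create the account in an unaffected region (North Europe here) instead of West Europe.
poller = client.storage_accounts.begin_create(
    "example-rg",            # hypothetical resource group
    "examplestorage001",     # hypothetical, globally unique account name
    {
        "location": "northeurope",
        "sku": {"name": "Standard_LRS"},
        "kind": "StorageV2",
    },
)
account = poller.result()
print(account.provisioning_state)
```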

Root cause and mitigation:
The incident was caused by a bug in the Storage Resource Provider (SRP). The bug was triggered in the create account path when evaluating which scale unit to create the account in. The bug only manifests under specific configuration settings which were only present in West Europe. Due to the bug, customers would have experienced account creation failures. The bug was not found during testing, and the service had been deployed to other production regions without issue before encountering the incident in West Europe.
Engineers mitigated the issue by making a configuration update. 

Next steps:
We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.
1) We are working on a fix to make account creation more robust in the event that a similar bug recurs in the future [in progress].
2) We are also working on completing implementation of per-region flighting of new deployments for Storage with automatic rollback whenever core scenarios are impacted [in progress].

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

8-20

IoT Hub - West Europe and North Europe

Summary of Impact: Between 17:30 UTC on 20 Aug 2018 and 16:42 UTC on 21 Aug 2018, a subset of IoT Hub customers in West Europe and North Europe may have experienced errors, such as 'InternalServerError', when trying to access IoT Hub devices hosted in these regions. Customers may also have experienced issues when attempting to delete resources/devices.

Root cause and mitigation: A change was deployed that introduced a possibility of threads entering a locked condition and waiting indefinitely, thus failing all subsequent calls to the partition. The problem went undetected in tests as it was very infrequent and was encountered only in specific traffic patterns. Engineers performed a manual reboot of the affected backend service nodes to mitigate the issue. Furthermore, a permanent fix has been identified and deployed to the entire service. 

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, such as: Improve telemetry to detect and more quickly mitigate issues that arise during deployment activities [in progress].

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

8-13

RCA - Latency and Slow I/O issues in East US

Summary of impact: Between 18:45 and 21:35 UTC on 13 Aug 2018, a subset of customers in East US would have experienced some of the below issues with their Azure services:
• Latency and potentially connectivity loss to Virtual Machines and Storage accounts
• Latency or failed requests for traffic going over an Express Route circuit
• Latency or failures when attempting to connect to their Virtual Machines via Remote Desktop Protocol (RDP).

Customer impact: This incident impacted a single region.  Services deployed with geographic redundancy may have been able to operate out of other regions. 
For services located in East US, retrying requests or failed operations on a new TCP or UDP connection would have had a 93% chance of success.
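To illustrate the retry guidance above, the sketch below retries a request and opens a brand-new TCP connection on each attempt (a fresh Session avoids reusing a pooled connection that may be pinned to the congested path). The function name and URL handling are illustrative assumptions, not part of the report.

```python
import requests

def get_with_fresh_connections(url, attempts=5, timeout=10):
    """Retry a GET, using a new Session (and therefore a new TCP connection) each attempt."""
    last_error = None
    for _ in range(attempts):
        session = requests.Session()
        try:
            response = session.get(url, timeout=timeout)
            if response.status_code < 500:
                return response
            last_error = RuntimeError(f"server error {response.status_code}")
        except requests.RequestException as exc:
            last_error = exc
        finally:
            session.close()  # drop the pooled connection so the next attempt starts clean
    raise last_error
```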

Root cause and mitigation: This incident resulted from one network router that carries traffic between data centers in the East US region being put back into service without sufficient uplink capacity available to handle its share of the load. There are multiple routers filling this role in the East US region, and the correct behavior would have been to leave this network router out of service until it had the amount of uplink capacity required. Once engineers discovered the congested link, the network router was removed from service and network operations returned to normal.

Next steps: Azure Networking understands that this issue caused significant impact to our customers, and we are taking the following steps to prevent a recurrence:
1. Reinstate alerting for high utilization links. [Completed]
2. Create a system that continually verifies that critical alerts, like high utilization alarms, are functioning properly. [In Planning]
3. Create a separate alert to warn on-call engineers about links with high rates of packet discard. [In Progress]
4. Extend the faulty link repair system to perform the health checks for the engineer, removing human action and judgement from the process. [In Progress]
5. Update the troubleshooting guide to unambiguously specify how to check that there is sufficient capacity available. [Completed]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

8-1

Azure Service Management Failures

Customer Impact: Between 21:14 UTC on 1st Aug and 00:00 UTC on 2nd Aug 2018, a subset of customers may have experienced timeouts or failures when attempting to perform service management operations for Azure resources dependent on management.core.windows.net, which is Azure’s v1 API endpoint. In addition, some customers may have experienced intermittent latency and connection failures to their Classic Cloud Service resources.

Preliminary Root Cause and Mitigation: Underlying Storage experienced larger than normal loads from our internal services; engineers manually reduced the load to mitigate the issue.

8-1

RCA - Networking Connectivity and Latency - East US

Summary of impact: Between 13:18 UTC and 22:40 UTC on 01 Aug 2018, a subset of customers in the East US region may have experienced difficulties or increased latency when connecting to some resources. 

Customer impact: This incident affected a single region, East US, and services deployed with geo-redundancy in multiple regions would not have been impacted. 

Root cause and mitigation: A fiber cut caused by construction approximately 5 km from Microsoft data centers resulted in multiple line breaks, impacting separate splicing enclosures that reduced capacity between 2 Azure data centers in the East US region. As traffic ramped up over the course of the day, congestion occurred and caused packet loss, particularly to services hosted in or dependent on one data center in the region.

Microsoft engineers reduced the impact to Azure customers by reducing internal and discretionary traffic to free up network capacity. Microsoft had been in preparations to add capacity between data centers in the region, and it completed the mitigation of the issue by pulling in the activation of the additional capacity. Repair of the damaged fiber continued after the mitigation, and it will be returned to service when fully ready.

Next steps:
Microsoft regrets the impact to our customers and their services. We design our networks to survive the failure of any single component; however, in this instance there was not enough capacity left after the failure to handle all offered traffic. To prevent recurrence, we are taking the following steps:

1) Reducing the utilization threshold for capacity augmentation - in progress.
2) Improving our systems for automatically reducing discretionary traffic during congestion events to preserve performance for customer traffic - in progress. 
3) Improving our ability to identify high-rate flows that can cause impact to other traffic during congestion events and create appropriate ways to handle this traffic - in progress. 

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

July 2018

7-31

RCA - Storage - North Central US

Summary of impact: Between 17:52 UTC and 18:40 UTC on 31 Jul 2018, a subset of customers using Storage in North Central US may have experienced difficulties connecting to resources hosted in this region. Other services that leverage Storage in this region may also have experienced impact related to this incident.

Root cause and mitigation: During a planned power maintenance activity, a breaker tripped, transferring the IT load to its second source. A subset of the devices saw a second breaker trip, resulting in the restart of a subset of Storage nodes in one of the availability zones in the North Central US region. Maintenance was completed with no further issues or impact to services. A majority of Storage resources self-healed. Engineers manually recovered the remaining unhealthy Storage nodes. Additional manual mitigating actions were performed by Storage engineers to fully mitigate the incident.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Engineers have sent failed components to the OEM manufacturer for further testing and analysis [in progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

7-28

App Service - East US - Mitigated

Summary of impact: Between 08:00 and 12:30 UTC on 28 Jul 2018, a subset of customers using App Services in East US may have received HTTP 500-level response codes, or experienced timeouts or high latency, when accessing App Service (Web, Mobile and API Apps) deployments hosted in this region.

Preliminary root cause: Engineers identified an internal configuration conflict as part of an ongoing deployment as the potential root cause.

Mitigation: Engineers made a manual adjustment of the deployment configuration to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

7-26

RCA - Azure Service Management Failures

Summary of impact: Between 22:15 UTC on 26 Jul 2018 and 06:20 UTC on 27 Jul 2018, a subset of customers may have experienced timeouts or failures when attempting to perform service management operations for Azure resources dependent on management.core.windows.net, which is Azure’s v1 API endpoint. In addition, some customers may have experienced intermittent latency and connection failures to the Azure Management Portal. Some services with a reliance on triggers from service management calls may have seen failures for running instances.

Root cause and mitigation: The Azure Service Management layer (ASM) is composed of multiple services and components. During this event, a platform service update introduced a dependency conflict between two backend components. 

Engineers established that one of the underlying components was incompatible with the newer version of a dependent component. When this incompatibility occurs, internal communications slow down and may fail. The impact of this is that, above a certain load threshold, newer connections to ASM would have a higher fault rate. Azure Service Management has a dependency on the impacted components and, consequently, was partially impacted.

The ASM update went through extensive standard testing, and the fault did not manifest itself as the update transited first through the pre-production environments and later through the first Production slice, where it baked for three days. There was no indication of such a failure in those environments during this time.

When the update rolled out to the main Production environment, there was no immediately visible impact.  Hours later, engineers started seeing alerts on failed connections. This was challenging to diagnose as ASM appeared internally healthy and the intermittent connection failures could have been due to multiple reasons related to network or other parts of the infrastructure. The offending components were identified based on a review of detailed logging of the impacted components, and the incident was mitigated by removing the dependency on the faulted code path.

A secondary issue occurred coincidentally which prevented notifications from reaching the Azure Service Health Dashboard or the Azure Portal Service Health blade. Engineers determined that, as part of an earlier deployment, a bug had been introduced in the communications tooling. The issue was discovered during the event, and a rollback was performed immediately. While it was not related to the incident above, it delayed our notifications to our customers.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Refine validation and service upgrade processes to prevent future occurrences with customer impact [In Progress]
2. Enhance telemetry for detecting configuration anomalies of the backend components post service upgrade to mitigate customer impact scenarios faster [In Progress]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

 

7-24

Delays in ARM generated Activity Logs

Summary of impact: Between 23:00 UTC on 24 Jul 2018 and 07:00 UTC on 27 Jul 2018, customers may not have received activity logs for ARM resources. All service management operations for ARM resources would have been successful; however, the corresponding activity logs would not have been generated. As a result, any actions or workflows that depend on ARM-generated activity logs would not have been initiated. Engineers have confirmed that customers can receive ARM-generated activity logs as of 07:00 UTC on 27 July 2018. However, at this time customers may not be able to access the activity logs that would have been generated during this impact time frame.

Preliminary root cause: Engineers determined that a recent Azure Resource Manager deployment task impacted instances of a backend service which became unhealthy, preventing requests from completing.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue. 

Next steps: This incident remains active as engineers assess options for the missing logs that would have been generated during the impact window. Any customers who remain impacted will receive communications via their management portal. This is being tracked under the tracking id: FLLH-MRG.

7-23

Error notifications in Microsoft Azure Portal

Summary of impact: Between approximately 20:00 and 22:56 UTC on 23 Jul 2018, a subset of customers may have received timeout errors or failure notifications when attempting to load multiple blades in the Microsoft Azure Portal. Customers may also have experienced slowness or difficulties logging into the Portal. 

Preliminary root cause: Engineers determined that instances of a backend service became unhealthy, preventing these requests from completing. 

Mitigation: Engineers performed a change to the service configuration to return the backend service to a healthy state and mitigate the issue. 

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.