Azure status history

February 2018

2/15

Infrastructure - West Europe

Summary of impact: Between 15:48 and 17:01 UTC on 15 Feb 2018, a subset of customers in West Europe may have experienced difficulties connecting to resources hosted in this region. This may have included SQL Database, IoT Central, App Services, Azure Search, Virtual Machines, Azure Redis Cache, Azure Cosmos DB, Logic Apps, and Storage.

Preliminary root cause: Engineers determined that this was caused by a power event.

Mitigation: The issue self-healed when the affected hardware automatically restarted.

Next steps: Engineers will continue to investigate to establish the full root cause and a comprehensive Root Cause Analysis report will be posted on this Status History page once completed.

January 2018

1/27

Service Bus - West US

Summary of impact: Between 07:10 and 11:42 UTC, and again between 13:05 and 17:40 UTC, on 27 Jan 2018, a subset of customers using Service Bus in West US may have experienced difficulties connecting to resources hosted in this region.

Preliminary root cause: Engineers determined that instances of a backend service responsible for processing service management requests became unhealthy, preventing requests from completing.

Mitigation: Engineers performed a manual restart of a backend service to mitigate the issue. 

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.
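
As a hedged illustration of how client applications typically absorb brief backend restarts like this one, the sketch below configures the built-in retry policy of the azure-servicebus Python SDK before sending a message. The connection string, queue name, and retry values are placeholder assumptions, not values taken from this incident.

    # Hedged sketch: configuring client-side retries for Service Bus.
    # The connection string, queue name, and retry values below are placeholders.
    from azure.servicebus import ServiceBusClient, ServiceBusMessage

    CONN_STR = "<service-bus-connection-string>"
    QUEUE_NAME = "<queue-name>"

    # retry_total, retry_backoff_factor, and retry_backoff_max tune the SDK's
    # built-in exponential retry, which helps ride out brief backend unavailability.
    client = ServiceBusClient.from_connection_string(
        CONN_STR,
        retry_total=5,             # attempts per operation
        retry_backoff_factor=0.8,  # base delay, in seconds, between attempts
        retry_backoff_max=60,      # cap on the delay between attempts
    )

    with client:
        with client.get_queue_sender(QUEUE_NAME) as sender:
            sender.send_messages(ServiceBusMessage("ping"))

With settings along these lines, a send that fails during a short backend restart is retried with exponential backoff rather than surfacing an error immediately.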

1/24

App Service - East US

Summary of impact: Between 08:02 and 15:21 UTC on 24 Jan 2018, a subset of customers using App Service in East US may have experienced intermittent latency, timeouts, or HTTP 500-level response codes while performing service management operations - such as creating, deleting, or moving resources - on App Service deployments. Auto-scaling and loading of site metrics may also have been impacted.

Preliminary root cause: Engineers determined that a single scale unit had reached an operational threshold, which manifested as increased latency and timeouts for impacted customers.

Mitigation: Engineers performed a change to the service configuration to optimize traffic, which mitigated the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.
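
Intermittent HTTP 500-level responses on service management calls are commonly handled by retrying with exponential backoff and jitter. The sketch below is a generic, hedged example of that pattern; call_management_api and TransientHttpError are hypothetical placeholders and are not part of any Azure SDK.

    # Hedged sketch of retry with exponential backoff and jitter for transient
    # HTTP 500-level failures. call_management_api and TransientHttpError are
    # hypothetical placeholders, not Azure SDK members.
    import random
    import time

    class TransientHttpError(Exception):
        """Raised by the caller when a 500-level response is received."""

    def with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0):
        """Run operation(), retrying transient failures with growing, jittered delays."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except TransientHttpError:
                if attempt == max_attempts:
                    raise
                # Exponential backoff with jitter to avoid synchronized retries.
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                time.sleep(delay * random.uniform(0.5, 1.5))

    # Usage (hypothetical):
    # with_backoff(lambda: call_management_api("create", site="my-site"))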

1/23

App Service - Service management issues

Summary of impact: Between 20:00 UTC on 22 Jan 2018 and 21:30 UTC on 23 Jan 2018, a subset of customers may have encountered the error message "The resource you are looking for has been removed, had its name changed, or is temporarily unavailable" when viewing the "Application Settings" blade under App Services in the Management Portal. This error would have prevented customers from performing certain service management operations on their existing App Service plan. During the impact window, existing App Services resources should have remained functioning in the state they were in.

Preliminary root cause: Engineers determined that a recently deployed update led to these issues.

Mitigation: Engineers deployed a hotfix to address the impact.

Next steps: Engineers will review deployment procedures to prevent future occurrences.
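
When a portal blade is unavailable, application settings can usually still be read through the management API. The sketch below is a hedged example using the azure-mgmt-web and azure-identity Python packages; the subscription ID, resource group, and app name are placeholders, and the exact method shape should be verified against the SDK version in use.

    # Hedged sketch: reading App Service application settings outside the portal.
    # The subscription ID, resource group, and app name below are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.web import WebSiteManagementClient

    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    APP_NAME = "<app-name>"

    client = WebSiteManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    # Returns the app's application settings as a name/value dictionary.
    settings = client.web_apps.list_application_settings(RESOURCE_GROUP, APP_NAME)
    for name, value in settings.properties.items():
        print(name, "=", value)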

1/22

RCA – Resources using IPv4 addressing – West and South India

Summary of impact: Between 21:40 and 21:55 UTC on 22 Jan 2018, customers may have been unable to reach West India and South India from the Internet. Customers in these locations may also have been unable to reach the Internet from their services.

Root cause and mitigation: Azure engineers were performing network maintenance to improve routing stability in West India and South India by removing static configurations in network devices. These static routes had been the root cause of network outages during previous incidents, and a planned maintenance was scheduled to remove them. Due to a missing set of filters on the routing devices, which was unknown to the maintenance team and undetected by the maintenance software, default routing was removed from South India and West India for a few minutes. Azure engineers received numerous notifications from monitoring and immediately remediated the issue by deploying a hotfix.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): repair of the software used to perform routing changes to detect this condition and additional network control plane simulation to locate any other similar conditions in the network [in progress].

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

1/19

RCA - Virtual Machines - West Europe

Summary of impact: Between 08:01 and 13:40 UTC on 19 Jan 2018, a subset of customers with Virtual Machines hosted in West Europe may have experienced intermittent connection failures, timeouts, or higher-than-expected latency when accessing their resources. Customers may also have experienced unexpected Virtual Machine restarts or failures when performing service management operations - such as create, update, and delete. Creation failures may have intermittently impacted the creation of new HDInsight clusters or of any other resource dependent on Virtual Machines.

Customer impact: Customers may have experienced higher than expected latency when connecting to existing resources or may have been unable to provision new resources.

Root cause and mitigation: A single aggregation router in the West Europe region experienced congestion and packet loss to a single data center during the incident, due to a combination of erroring links and links already being out of service. This congestion presented itself as difficulty accessing storage resources for a small set of VMs in the West Europe region. Azure uses hardware link encryptors to encrypt packets traveling between data centers in the West Europe region, and this event exposed a monitoring gap on these hardware link encryptors. During the initial conditions for this incident, a subset of the links terminated on these encryptors experienced errors and were removed from service for cleaning by the lossy link service. When the service determined it was no longer safe to proceed, due to aggregate bandwidth loss to a single device, the lossy link service raised a human-investigate status for these alarms. The engineers who performed the investigation noted that traffic levels were not high when they checked, and manually removed more links from service on the 18th and the 19th. When organic traffic rates rose on the 19th, congestion and packet loss were observed. The time to mitigate this incident was extended because the automation handling the congestion and traffic-imbalance alarms did not raise a human-investigate status quickly enough.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Continue to improve the telemetry for monitoring of our critical infrastructure [in progress].
2. Expedite timeline for existing link capacity upgrade schedule [in progress].
3. Update existing alarm signal on traffic imbalance across a set of devices [in progress].

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

1/15

Multiple Azure Services - West US

Summary of impact: Between 22:58 UTC on 14 Jan 2018 and 02:13 UTC on 15 Jan 2018, a limited subset of customers utilizing storage services in West US may have experienced latency or failures connecting to certain resources in this region. In addition to the Storage service itself, impacted services that leverage Storage included: App Services (Web, Mobile, and API Apps), Site Recovery, Azure Search, Redis Cache, Service Bus, Event Hubs, and Azure Active Directory Gateway.

Preliminary root cause: Engineers determined that a portion of instances of a single storage scale unit became unhealthy, preventing requests from completing as expected.

Mitigation: For short term mitigation, engineers isolated the unhealthy instances of the storage scale unit, allowing healthy storage components to process requests for these services.

Next steps: Engineers continue to investigate the unhealthy storage instances to establish the full root cause and any required fix.
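
Transient storage latency or connection failures of this kind are usually smoothed over by the storage SDK's retry policy. As a hedged illustration, the sketch below tunes the retry behavior of an azure-storage-blob client in Python; the account URL and container name are placeholders, and the retry count is illustrative rather than a recommendation.

    # Hedged sketch: tuning client-side retries for transient storage failures.
    # The account URL and container name below are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    ACCOUNT_URL = "https://<storage-account>.blob.core.windows.net"

    # retry_total caps the number of retries attempted by the SDK's built-in
    # exponential retry policy when a request times out or fails transiently.
    service = BlobServiceClient(
        ACCOUNT_URL,
        credential=DefaultAzureCredential(),
        retry_total=5,
    )

    container = service.get_container_client("<container-name>")
    for blob in container.list_blobs():
        print(blob.name)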

December 2017

12/13

Service Bus and Event Hubs - Australia East

Summary of impact: Between 04:00 and 06:40 UTC on 13 Dec 2017, a limited subset of customers using Service Bus and Event Hubs in Australia East may have experienced intermittent issues when connecting to resources from the Azure Management Portal or programmatically. Services offered within Service Bus, including Service Bus Queue and Service Bus Topics may have also been affected.

Preliminary root cause: Engineers determined that certain instances of a front-end service responsible for processing Service Bus and Event Hub operation requests had reached an operational threshold, preventing requests from completing as quickly as expected.

Mitigation: Engineers scaled out new instances to optimize request processing, and monitored front-end telemetry to confirm mitigation.

Next steps: Engineers will continue to investigate to establish the full root cause.

12/6

RCA - App Service - West Europe

Summary of impact: Between 10:12 UTC on 4 Dec 2017 and 15:14 UTC on 6 Dec 2017, a subset of customers in the West Europe region may have experienced latencies, timeouts, or 5xx errors while accessing their App Service applications. They may also have seen storage latencies during this time. The root cause of the issue was increased CPU load on the affected storage scale unit. The issue was automatically detected, and the engineering team mitigated it by rebalancing resources to reduce the load on the storage scale unit.

Customer impact: A subset of customers in the West Europe region may have experienced latencies, timeouts, or 5xx errors while accessing their App Service applications.

Root cause and mitigation: This incident was caused by increased CPU load on the affected storage scale unit. The CPU increase was detected via regular storage capacity monitoring as well as App Service monitoring. As part of initial mitigation, background processes were halted to free up resources for customer traffic. We also attempted to load balance more aggressively across the region by targeting high-CPU customers and distributing them more evenly to reduce the load on the storage unit; however, because the load grew faster than it could be rebalanced, load balancing was not able to keep up. The increased CPU load caused resource contention for some storage operations, leading to increased latency for storage services and the App Service applications associated with them. Additional mitigation steps were taken at this point, including an ad hoc regional load-balancing operation that moved Web Apps accounts hosted on the scale unit to storage hardware with more CPU capacity. After these mitigation steps were applied, availability and latencies returned to normal ranges.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to) investments to increase the speed of regional load balancing so that it can better handle large organic growth spikes.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey