Azure status history

February 2017

2/23

SQL Database - South Central US

Summary of impact: Between 17:57 and 23:10 UTC on 23 Feb 2017, a subset of customers using SQL Database in South Central US may have received intermittent failure notifications or timeouts when performing create server and database operations. Availability (connecting to and using existing databases) was not impacted.

Preliminary root cause: Engineers determined that a back end service responsible for processing service management requests became unhealthy when CPU loads reached thresholds, which prevented requests from completing.

Mitigation: Engineers performed steps to reduce CPU loads on the back end service to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

2/23

HDInsight and SQL Database - East US and South Central US

Summary of impact: Between 17:57 and 21:23 UTC on 23 Feb 2017, a subset of customers using HDInsight in South Central US may have received deployment failure notifications when creating new HDInsight clusters in this region.

Preliminary root cause: Engineers determined that a back end service responsible for processing service management requests became unhealthy when CPU loads reached thresholds, which prevented requests from completing.

Mitigation: Engineers performed steps to reduce CPU loads on the back end service to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

2/19

RCA - Multiple Services | Resolved - West US 2

Summary of impact: Between 09:50 UTC and 22:00 UTC on 19 Feb 2017, a subset of customers in West US 2 may have experienced either partial or complete loss of connectivity to their affected resources in this region.

Customer impact: Due to the loss of connectivity between Virtual Machines and storage resources, Virtual Machines would have been shut down to maintain data integrity. When connectivity was restored, these Virtual Machines restarted. A subset of customers may have experienced longer recovery times for their Virtual Machines and Storage services.

Azure SQL Database customers may have seen errors connecting to their databases.
Azure IoT Hub customers may have faced deployment failures while deploying in the affected region.
Backup and Site Recovery customers may have observed Vault creation failures.
Azure Insight customers may have seen no logs captured during the incident.

Other services, including Activity Logs, App Service \ Web Apps, Azure Monitor, Azure Scheduler, Cloud Services, DevTest Labs, Event Hub, DocumentDB, Logic Apps, Redis Cache, and Service Bus, may have observed connectivity failures.

Root cause and mitigation: This issue was caused by a manual shutdown of a small number of compute and storage scale units in the West US 2 datacenter. The shutdown was a proactive response to an overheating alert. Although the scale unit temperatures were still within their operational thresholds, the shutdown sequence was executed by design to maintain hardware integrity, ultimately ensuring data durability and resiliency for our customers.

Starting at 9:00 UTC, the overheating scale units began to cool, and normal operating temperatures were restored approximately 30 minutes later. At 9:30 UTC, engineers initiated a structured power restoration plan, and power was restored to all scale units within the next hour.

Nodes in all Azure scale units are managed by the Platform Fabric Controller (PFC), which must be operational before dependent services can recover. During the power-on procedure, the automated recovery of the Platform Fabric Controller failed due to an internal timeout. This required a fallback to operator-assisted recovery steps, which took an additional two hours to complete. Once the Platform Fabric Controller was fully recovered, the remaining impacted services were recovered per our thoroughly tested power-down recovery plan to further guarantee customer data resiliency.
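
The recovery flow described above follows a common pattern: attempt the automated path under a bounded timeout, and escalate to operator-assisted steps if it does not complete. The sketch below is purely illustrative of that pattern and is not Azure's actual recovery tooling; the function names and timeout value are placeholders.

```python
# Illustrative sketch only: automated recovery with a bounded timeout,
# falling back to an operator-assisted procedure on timeout.
import concurrent.futures
import time


def automated_recovery() -> bool:
    """Placeholder for an automated controller recovery sequence."""
    time.sleep(1)  # stand-in for the real recovery work
    return True


def operator_assisted_recovery() -> bool:
    """Placeholder for the slower, manually driven recovery steps."""
    return True


def recover_fabric_controller(timeout_seconds: float = 900.0) -> bool:
    """Try the automated path first; escalate to operators if it times out."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(automated_recovery)
        try:
            return future.result(timeout=timeout_seconds)
        except concurrent.futures.TimeoutError:
            # The automated path did not finish in time (as happened in this
            # incident), so escalate to the operator-assisted plan.
            return operator_assisted_recovery()
```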

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Review and strengthen the cooling system architecture to prevent future overheating events
2. Address timeout issue in the automated recovery software
3. Improve monitoring and tooling to expedite diagnosis and operator-assisted recovery of the Platform Fabric Controller

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

2/17

Network Infrastructure - Central US

Summary of impact: Between 13:52 and 14:36 UTC on 17 Feb 2017, a subset of customers may have experienced latency or timeouts when accessing services in or via the Central US region.

Preliminary root cause: Engineers determined that a peering router experienced a hardware fault and that an automatic failover did not occur.

Mitigation: Engineers performed a manual failover of a back end service to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

2/16

Logic Apps | Resolved

Summary of impact: Between 00:00 and 16:10 UTC on 16 Feb 2017, a subset of customers using Logic Apps connecting to Azure Functions may have received intermittent timeouts or failures when connecting to resources. Customers with continuously running Function Apps will require a restart for mitigation to take effect.

Preliminary root cause: Engineers identified a recent deployment task containing a software error as the potential root cause.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue.

Next steps: Engineers will review deployment procedures to prevent future occurrences.

2/14

Cognitive Services | Resolved

Summary of impact: Between as early as 22:00 UTC on 13 Feb 2017 and 4:00 UTC on 14 Feb 2017, a subset of customers using Cognitive Services may have received intermittent timeouts or errors when making API requests or generating tokens for their Cognitive Services. 

Preliminary root cause: At this stage, engineers do not have a definitive root cause.

Mitigation: Engineers scaled out the service in order to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause.

2/11

Microsoft Azure portal - Disk Blades not Loading for Custom VHDs

Summary of impact: Between 06:21 UTC on 08 Feb 2017 and 02:27 UTC on 11 Feb 2017, a subset of customers using the Microsoft Azure portal may not have been able to load disk blades associated with Azure Resource Manager (ARM) Virtual Machine custom images. There was no impact to service availability or to service management operations.

Preliminary root cause: Engineers identified a recent deployment as the potential root cause.

Mitigation: Engineers deployed a hotfix to mitigate the issue.

Next steps: Engineers will review deployment procedures to prevent future occurrences.

2/9

RCA - Web Apps and other services connecting to SQL Databases in North Central US

Summary of impact: Between 14:54 UTC on 7 February 2017 and 11:20 UTC on 8 February 2017, a subset of customers' Azure services may have experienced intermittent issues or failure notifications while attempting to connect to their SQL Databases in North Central US. After further investigation, engineers concluded that this impact window was different from the times previously communicated during the incident. During this time, SQL Databases that had restricted access connections using Access Control Lists (ACLs) may have been impacted. A deployed change was incomplete and did not include several prefixes which had been previously allocated and were in service. Customers who had been allocated IPs from a particular range would have been blocked by databases configured to only allow access from that range. This change was rolled back as the mitigation. Customers would have had the option to work around this by changing their database ACL configuration to explicitly list the IPs of any connections that were being dropped.

Customer impact: Some customers using SQL Databases in that region who had selected the “All Azure VMs” ACL option would have been unable to connect to those databases from VMs in the missing prefixes.
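
For reference, the workaround mentioned above amounts to adding an explicit server-level firewall (ACL) rule for the source IPs whose connections were being dropped. The following is a minimal sketch assuming pyodbc, an installed SQL Server ODBC driver, and placeholder server, credential, and IP-range values; it uses the documented sp_set_firewall_rule procedure in the master database.

```python
# Minimal sketch of the workaround: explicitly allow a known source IP range at
# the server level so connections are accepted even if the "All Azure VMs"
# prefix list is missing that range. Server, credentials, and the IP range
# below are placeholders, not values from the incident.
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"  # any installed SQL Server ODBC driver
    "SERVER=myserver.database.windows.net;"    # placeholder logical server
    "DATABASE=master;"                         # server-level firewall rules live in master
    "UID=serveradmin;PWD=<password>"           # placeholder credentials
)

conn = pyodbc.connect(conn_str, autocommit=True)
cursor = conn.cursor()
# sp_set_firewall_rule creates or updates a server-level firewall rule.
cursor.execute(
    "EXECUTE sp_set_firewall_rule "
    "@name = N'allow-dropped-vm-range', "
    "@start_ip_address = '10.0.0.0', "         # placeholder start of the dropped range
    "@end_ip_address = '10.0.0.255'"           # placeholder end of the dropped range
)
conn.close()
```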

Root cause and mitigation: The list of prefixes identifying Azure resources connecting to the SQL Databases was being produced by a new mechanism designed to provide better aggregation and to take advantage of pre-allocation policies, improving safe operational thresholds with less effort and delay. This new mechanism was missing some prefixes that had previously been added through a temporary process. The validation applied to this prefix list was incomplete and failed to catch that these prefixes had been removed unintentionally.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Re-add the missing prefixes and verify that there are no other such cases (complete)
2. Add a secondary validation to ensure that all currently allocated IPs are represented in the list
3. Improve our detection mechanisms in this area so we can more quickly alert when there are database connection drops

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

2/8

Azure Active Directory B2C | Requests Failing

Summary of impact: Between 21:13 and 21:50 UTC on 07 Feb 2017, a subset of customers using Azure Active Directory B2C may have experienced client-side authorization request failures when connecting to resources. Customers attempting to access services would have received a client-side error - "We can't sign you in" - when attempting to log in.

Preliminary root cause: Engineers identified a recent deployment task as the potential root cause.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue.

Next steps: Engineers will review deployment procedures to prevent future occurrences.

2/7

Resolved: Visual Studio Team Services Issue: Multiple Regions

Summary of impact: Between 20:43 UTC on 06 Feb 2017 and 10:20 UTC on 07 Feb 2017, a subset of customers using Visual Studio Team Services may have received the following error when creating PaaS deployment packages via Visual Studio: "Value cannot be null". This error may have been returned after the "Apply Diagnostics Extension" step when provisioning Team Foundation version control projects. Customers using the Azure Management Portal were unaffected.

Preliminary root cause: Engineers identified a recent deployment task as the potential root cause.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.