Azure status history

February 2017

19.2

Multiple Services | Resolved - West US 2

Summary of impact: Between 09:50 UTC and 22:00 UTC on 19 Feb 2017, a subset of customers using Virtual Machines, Cloud Services, and SQL Database in West US 2 may have experienced timeouts or errors when accessing their services in this region. Additional services experienced downstream impact, which included App Service \ Web Apps, Azure IoT Hub, Backup, Site Recovery, Azure Monitor, Activity Logs, Redis Cache, DocumentDB, Service Bus, Event Hub, DevTest Labs, Azure Scheduler, Logic Apps, and Azure Batch.

Preliminary root cause: This issue was caused by an automated shutdown of a small number of Compute and Storage clusters in one of the West US 2 datacenters. This shutdown was in response to an infrastructure monitoring alert. The automated shutdown sequence is executed by design to ensure hardware integrity and data resiliency.

Mitigation: The infrastructure alert was mitigated, and engineers then began the structured restart sequence to restore all impacted resources.

Next steps: A detailed root cause analysis will be provided within 72 hours.

17.2

Network Infrastructure - Central US

Summary of impact: Between 13:52 and 14:36 UTC on 17 Feb 2017, a subset of customers may have experienced latency or timeouts when accessing services in or via the Central US region.

Preliminary root cause: Engineers determined that a peering router experienced a hardware fault and that an automatic failover did not occur.

Mitigation: Engineers performed a manual failover of a back end service to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

16.2

Logic Apps | Resolved

Summary of impact: Between 00:00 and 16:10 UTC on 16 Feb 2017, a subset of customers using Logic Apps connecting to Azure Functions may have received intermittent timeouts or failures when connecting to resources. Customers with continuously running Function Apps will require a restart for mitigation to take effect.
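
The restart called out above can also be requested programmatically through the Azure Resource Manager REST API. The sketch below is a hedged illustration rather than an official mitigation script; the subscription, resource group, app name, and bearer token are placeholders, and the api-version shown is an assumption.

```python
# Hedged sketch: restart a continuously running Function App so that the rolled-back
# deployment takes effect. Subscription, resource group, app name, bearer token, and
# the api-version are placeholders/assumptions, not values from the incident.
import requests

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
APP_NAME = "<function-app-name>"
TOKEN = "<azure-ad-bearer-token>"  # obtained separately, e.g. via Azure AD

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.Web/sites/{APP_NAME}"
    "/restart?api-version=2016-08-01"
)

response = requests.post(url, headers={"Authorization": f"Bearer {TOKEN}"}, timeout=30)
response.raise_for_status()  # a 2xx status indicates the restart request was accepted
print(f"Restart requested for {APP_NAME}: HTTP {response.status_code}")
```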

Preliminary root cause: Engineers identified a recent deployment task containing a software error as the potential root cause.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue.

Next steps: Engineers will review deployment procedures to prevent future occurrences.

14.2

Cognitive Services | Resolved

Summary of impact: From as early as 22:00 UTC on 13 Feb 2017 until 04:00 UTC on 14 Feb 2017, a subset of customers using Cognitive Services may have received intermittent timeouts or errors when making API requests or generating tokens for their Cognitive Services.
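
For context on the client-side symptom, the hedged sketch below shows one common way callers absorb intermittent timeouts such as these: retrying token generation with exponential backoff. The token endpoint, header name, and key are assumptions drawn from the classic Cognitive Services token service, not details from this incident.

```python
# Hedged sketch: retry Cognitive Services token generation with exponential backoff
# when requests time out or the service returns a transient 5xx error. The endpoint,
# header name, and key below are assumptions, not incident data.
import time
import requests

TOKEN_ENDPOINT = "https://api.cognitive.microsoft.com/sts/v1.0/issueToken"  # assumed endpoint
SUBSCRIPTION_KEY = "<cognitive-services-key>"


def issue_token(max_attempts: int = 4) -> str:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(
                TOKEN_ENDPOINT,
                headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
                timeout=10,
            )
            if resp.status_code < 500:
                resp.raise_for_status()  # surface 4xx errors immediately, no retry
                return resp.text         # the token is returned as a plain string
        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
            pass  # transient failure: fall through to the backoff below
        if attempt == max_attempts:
            raise RuntimeError("token request still failing after retries")
        time.sleep(delay)
        delay *= 2  # back off: 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```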

Preliminary root cause: At this stage, engineers do not have a definitive root cause.

Mitigation: Engineers scaled out the service to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause.

11.2

Microsoft Azure portal - Disk Blades not Loading for Custom VHDs

Summary of impact: Between 06:21 UTC on 08 Feb 2017 and 02:27 UTC on 11 Feb 2017, a subset of customers using the Microsoft Azure portal may not have been able to load disk blades associated with Azure Resource Manager (ARM) Virtual Machine custom images. There was no impact to service availability or to service management operations.

Preliminary root cause: Engineers identified a recent deployment as the potential root cause.

Mitigation: Engineers deployed a hotfix to mitigate the issue.

Next steps: Engineers will review deployment procedures to prevent future occurrences.

9.2

RCA - Web Apps and other services connecting to SQL Databases in North Central US

Summary of impact: Between 14:54 UTC on 7 February 2017 and 11:20 UTC on 8 February 2017, a subset of customers' Azure services may have experienced intermittent issues or failure notifications while attempting to connect to their SQL Databases in North Central US. After further investigation, engineers concluded that this impact duration was different from the times previously communicated during the incident. During this time, SQL Databases that had restricted access connections using Access Control Lists (ACLs) may have been impacted. A deployed change was incomplete and did not include several prefixes which had been previously allocated and were in service. Customers who had been allocated IPs from a particular range would have been blocked by databases configured to only allow access from that particular range. This change was rolled back as the mitigation. Customers would have had the option to work around this by changing their database ACL configuration to explicitly list the IPs of any connections that were being dropped.
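
As a hedged illustration of that workaround, the sketch below adds a database-level firewall rule that explicitly allows a client IP being dropped. The server, database, credentials, ODBC driver string, and example IP are placeholders; sp_set_database_firewall_rule is the documented T-SQL procedure for database-level rules on Azure SQL Database.

```python
# Hedged sketch of the described workaround: explicitly allow the public IP of a
# connection that is being dropped by adding a database-level firewall rule.
# Server, database, credentials, the ODBC driver name, and the example IP are
# placeholders; adjust them for your environment.
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"  # driver version is an assumption
    "SERVER=<your-server>.database.windows.net;"
    "DATABASE=<your-database>;"
    "UID=<admin-user>;PWD=<password>"
)

# autocommit so the rule takes effect immediately, outside an explicit transaction
with pyodbc.connect(conn_str, autocommit=True) as conn:
    conn.cursor().execute(
        "EXECUTE sp_set_database_firewall_rule "
        "N'allow-dropped-client', '203.0.113.10', '203.0.113.10';"
    )
```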

Customer impact: Some customers using SQL Databases in that region who had selected the “All Azure VMs” ACL option would have been unable to connect to those databases from VMs in the missing prefixes.

Root cause and mitigation: The list of prefixes identifying Azure resources connecting to SQL Databases was being produced by a new mechanism intended to provide better aggregation and to take advantage of pre-allocation policies, improving safe operational thresholds with less effort and delay. This new mechanism was missing some prefixes that had previously been added through a temporary process. The validation applied to the prefix list was incomplete and did not detect that prefixes were being removed unintentionally.
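
The validation gap described above could, for instance, be closed by checking that every currently allocated IP is still covered by the generated prefix list before it ships. The sketch below is a hedged illustration with hypothetical inputs, not the validation Azure actually runs.

```python
# Hedged sketch of a secondary validation: confirm that every currently allocated IP
# is covered by at least one prefix in the newly generated list before deployment.
# The prefix and IP lists here are hypothetical examples.
import ipaddress

generated_prefixes = [ipaddress.ip_network(p) for p in ("13.65.0.0/16", "40.78.0.0/17")]
allocated_ips = [ipaddress.ip_address(a) for a in ("13.65.12.4", "40.78.3.9", "52.162.8.21")]

missing = [
    str(ip)
    for ip in allocated_ips
    if not any(ip in prefix for prefix in generated_prefixes)
]

if missing:
    # Fail the rollout: these allocated IPs would be dropped by ACLs built from the
    # new prefix list, which is exactly the gap behind this incident.
    raise SystemExit(f"Prefix list is missing coverage for: {', '.join(missing)}")
print("All allocated IPs are covered by the generated prefix list.")
```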

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Re-add the missing prefixes and verify that there are no other such cases (complete).
2. Add a secondary validation to ensure that all currently allocated IPs are represented in the list (as sketched above).
3. Improve our detection mechanisms in this area so we can more quickly alert when there are database connection drops.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

8.2

Azure Active Directory B2C | Requests Failing

Summary of impact: Between 21:13 UTC and 21:50 UTC on 07 Feb 2017, a subset of customers using Azure Active Directory B2C may have experienced client-side authorization request failures when connecting to resources. Customers attempting to access services would have received a client-side error - "We can't sign you in" - when attempting to log in.

Preliminary root cause: Engineers identified a recent deployment task as the potential root cause.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue.

Next steps: Engineers will review deployment procedures to prevent future occurrences.

7.2

Resolved: Visual Studio Team Services Issue: Multiple Regions

Summary of impact: Between 20:43 UTC on 06 Feb 2017 and 10:20 UTC on 07 Feb 2017, a subset of customers using Visual Studio Team Services may have received the following error when creating PaaS deployment packages via Visual Studio: "Value cannot be null". This error may have been returned after the "Apply Diagnostics Extension" step when provisioning Team Foundation version control projects. Customers using the Azure Management Portal were unaffected.

Preliminary root cause: Engineers identified a recent deployment task as the potential root cause.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

6.2

SQL Database - North Europe | Mitigated

Summary of impact: Between 18:00 and 21:00 UTC on 06 Feb 2017, a subset of customers using SQL Database in North Europe may have experienced intermittent control plane unavailability. Retrieving information about servers and databases in the region through the portal may have failed. Server and database create, drop, rename, and change edition or performance tier operations may have resulted in an error or timeout. Availability (connecting to and using existing databases) was not impacted. Customers would also have seen deployment failure notifications when creating new HDInsight clusters in this region.
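
To make the control plane / data plane distinction above concrete, the hedged sketch below contrasts a management (control plane) call that lists databases via Azure Resource Manager with a direct (data plane) connection to an existing database, which remained available. All names, the token, the ODBC driver string, and the api-version are placeholders or assumptions.

```python
# Hedged sketch: control plane vs. data plane for SQL Database. Listing databases
# goes through ARM (the control plane that was intermittently unavailable), while
# connecting to an existing database is a data-plane operation and was not impacted.
import pyodbc
import requests

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
SERVER_NAME = "<sql-server-name>"
TOKEN = "<azure-ad-bearer-token>"

# Control plane: enumerate databases on the server via the ARM REST API.
arm_url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.Sql/servers/{SERVER_NAME}"
    "/databases?api-version=2014-04-01"
)
arm_resp = requests.get(arm_url, headers={"Authorization": f"Bearer {TOKEN}"}, timeout=30)
print("Control plane (list databases):", arm_resp.status_code)

# Data plane: connect to an existing database directly; this path stayed available.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={SERVER_NAME}.database.windows.net;"
    "DATABASE=<existing-database>;UID=<user>;PWD=<password>"
)
with pyodbc.connect(conn_str, timeout=30) as conn:
    row = conn.cursor().execute("SELECT 1;").fetchone()
    print("Data plane (query existing database):", row[0])
```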

Preliminary root cause: Engineers identified a recent configuration change as the potential root cause.

Mitigation: Engineers rolled back the recent configuration change to mitigate the issue.

Next steps: Engineers will review deployment procedures to prevent future occurrences.

3.2

RCA - Unable to view metrics in Azure portal

Summary of impact: Between 11:40 UTC and 18:45 UTC on 03 Feb 2017, customers may have been unable to view metrics for their resources hosted in multiple regions. Monitoring and metrics tiles for resources in the portal may have appeared broken and shown a grey cloud icon.

Impacted services may have included:

- Virtual Machines
- Cognitive Services
- App Service \ Web Apps
- Logic Apps
- Azure Batch
- Virtual Machine Scale Sets
- Azure IoT Hub
- Azure Search
- Service Bus
- Stream Analytics
- Application Gateway
- Azure Analysis Services
- Event Hubs

Directly calling the Azure Insights API may also have resulted in failures.

Root cause: The root cause was identified as a backend configuration change that caused certain backend resources to fail to load, which resulted in failures when retrieving monitoring metrics. Azure engineers deployed a fix to revert the change and verified that the backend resources were loading successfully. In addition, engineers confirmed that monitoring metrics were displayed correctly in the portal and could be retrieved via the Azure Insights API. The incident was confirmed as resolved at 18:45 UTC.

A misconfiguration in a downstream service exposed a dormant bug within Azure Monitor which prevented the service from loading the metadata configuration for a number of different resource types on-boarded to Azure Monitor. This resulted in a subset of our APIs returning Bad Request (400) during the incident, since the system did not recognize these resource types. There was no loss of metric data during this time; only retrieval through the APIs was impacted.
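
As a hedged illustration of the symptom described above, the sketch below issues a direct call to the metrics API and shows where the 400 (Bad Request) would have surfaced; the resource ID, token, and api-version are placeholders and assumptions, not values from the incident.

```python
# Hedged sketch: call the metrics API directly for a resource and handle the
# Bad Request (400) failure mode described in the RCA. The resource ID, token,
# and api-version are placeholders/assumptions.
import requests

RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Compute/virtualMachines/<vm-name>"
)
TOKEN = "<azure-ad-bearer-token>"

url = (
    f"https://management.azure.com{RESOURCE_ID}"
    "/providers/microsoft.insights/metrics?api-version=2016-09-01"
)
resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"}, timeout=30)

if resp.status_code == 400:
    # During the incident, unrecognized resource types produced this response even
    # though the underlying metric data itself was not lost.
    print("Bad Request from the metrics API:", resp.text)
else:
    resp.raise_for_status()
    print(resp.json())
```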

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

- Improve the resiliency of the metadata store in Azure Monitor.
- Improve our monitoring capability to detect this sort of issue more quickly.
- Harden the review process for rolling out configuration changes.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.