Azure status history

October 2017

11-10

App Service - South Central US

Summary of impact: Between 18:48 and 20:56 UTC on 11 Oct 2017, a subset of customers using App Service in South Central US may have received intermittent HTTP 500-level response codes, or experienced timeouts or high latency, when accessing App Service (Web, Mobile and API Apps) deployments hosted in this region.

Preliminary root cause: Engineers have determined that a backend node became unhealthy leading to a service error.

Mitigation: Engineers performed a manual restart of a backend node to mitigate the issue.

Next steps: Engineers will finalize their root cause analysis and implement any repairs required to prevent recurrence.

10-10

Visual Studio Team Services - Portal Access Issues

Summary of impact: Between 08:16 and 15:00 UTC on 10 Oct 2017, customers using Visual Studio Team Services may have experienced difficulties connecting to resources hosted by VisualStudio.com.

Preliminary root cause: Engineers suspected that a recent deployment increased the load on servers that handle requests to Shared Platform Services.

Mitigation: Engineers scaled out the number of web roles in the Shared Platform Service to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

9-10

RCA - Service management failures for Backup customers in UK South

Summary of impact: From 06:02 to 14:14 UTC on 09 October 2017, a subset of customers using Backup or Site Recovery in UK South may have received failure notifications when performing service management operations via PowerShell or the Azure Management Portal for resources hosted in this region. All scheduled backups, ongoing replication, and operations initiated from on-premises clients were not impacted during this time.

Root cause and mitigation: During a regular service update, a monitoring alert was triggered reporting service management failures. Engineers investigated and discovered that a corrupted service configuration file specific to the UK South region was preventing the service that handles management operations from starting up. The configuration file is generated by a deployment tool during service upgrades, and it was corrupted due to a software bug in that tool. A rollback to the previous configuration was not possible, so the corruption was corrected manually to mitigate the issue.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes, both to help ensure such incidents do not occur in the future and to enable faster mitigation of issues during service upgrades. In this case, we will fix the bug that caused this issue, and we are adding additional multi-stage detection for this specific scenario.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

7-10

Infrastructure issue impacting multiple Azure services - Australia East

Summary of impact: Between 22:19 and 22:46 UTC on 6 October 2017, a subset of customer resources and services were impacted by a loss of network availability in a portion of the Australia East region. Facility engineers quickly became aware of the issue and took steps to remediate it, and network availability was restored by 22:39 UTC. Azure infrastructure and customer resources largely recovered within the few minutes that followed.

Customer impact: Because network connectivity was lost between some Virtual Machines and storage resources, those VMs were shut down and then restarted once connectivity was restored. Additional impacted services include Cloud Services, Azure Search, Service Bus, Event Hub, DevTest Lab, Azure Site Recovery, Azure Key Vault, and Visual Studio Team Services, which may have experienced latency or loss of availability to/from portions of the Australia East region.

Root cause and mitigation: During a planned facility power maintenance activity, power was removed from a single feed. No impact was expected from this planned activity, as there are redundant power feeds in the facility. Facility engineers were in the datacenter monitoring the activity and detected unexpected breaker trips on the redundant feeds shortly after the start of maintenance. Engineers investigated immediately and determined that the datacenter spine network devices in the portion of the facility affected by the power maintenance had lost power. Engineers mitigated the loss of power to these devices by distributing load across additional circuits, restoring network connectivity. A review of the datacenter spine network devices revealed that they were not optimally power-striped across feeds to be resilient to the loss of one of the two power feeds. The devices have since been corrected in this facility, and Microsoft is reviewing the design and implementation of these devices worldwide. All other devices, servers, and infrastructure maintained availability during the maintenance, as expected, and the maintenance was completed without further impact. Azure networks are designed to be resilient to the loss of individual or even multiple datacenter spine devices; however, due to the unexpected breaker trips, all spine devices in this physical facility were impacted, resulting in a loss of connectivity within the facility and with other segments of the Australia East region.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1) Remediate known vulnerability in Australia East facility [COMPLETE]
2) Facility team continues to work with network engineering on a design review to audit and remediate worldwide. [PENDING]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey 

4-10

Azure Cloud Shell - Authentication Issues

Summary of impact: Between 02:30 and 08:30 UTC on 04 Oct 2017, a subset of customers using Cloud Shell may have experienced the following authentication error when running Azure CLI commands: "A Cloud Shell credential problem occurred. When you report the issue with the error below, please mention the hostname 'host-name'. Could not retrieve token from local cache."

Preliminary root cause: From initial investigations, engineers suspect an issue with the Cloud Shell images used in the backend to provision this service, which was causing logins to take longer than normal.

Mitigation: Engineers redeployed Cloud Shell with a newer image, which was verified as not exhibiting the same symptoms, to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

September 2017

29-9

RCA - Storage Related Incident - North Europe

Summary of impact: Between 13:27 and 20:15 UTC on 29 Sep 2017, a subset of customers in North Europe may have experienced difficulties connecting to or managing resources hosted in this region due to an availability loss of a storage scale unit. Dependent services that may have seen impact in this region include Virtual Machines, Cloud Services, Azure Backup, App Services\Web Apps, Azure Cache, Azure Monitor, Azure Functions, Time Series Insights, Stream Analytics, HDInsight, Data Factory, Azure Scheduler, and Azure Site Recovery.

Customer impact: A portion of storage resources were unavailable, resulting in dependent Virtual Machines shutting down to ensure data durability. Some Azure Backup vaults were not available for the duration, resulting in backup and restore operation failures. Azure Site Recovery may not have been able to fail over to the latest recovery points or replicate VMs. HDInsight, Azure Scheduler, and Functions may have experienced service management and job failures where resources were dependent on the impacted storage scale unit. Azure Monitor and Data Factory may have seen latency and errors in pipelines with dependencies on this scale unit. Azure Stream Analytics jobs stopped processing input and/or producing output for several minutes. Azure Media Services saw failures and latency for streaming requests, uploads, and encoding.

Workaround: Deploying Virtual Machines in Availability Sets with Managed Disks would have provided resiliency against significant service impact for VM-based workloads; a sketch of this configuration is shown below.
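
The sketch below illustrates this workaround using the Azure SDK for Python: it creates an availability set with the "Aligned" SKU, which is what lets member VMs with Managed Disks be spread across storage fault domains. This is a minimal, hedged example rather than a prescribed procedure; it assumes the azure-identity and azure-mgmt-compute packages, and the subscription, resource group, region, and availability set names are hypothetical placeholders.

    # Minimal sketch: create an availability set that supports Managed Disks ("Aligned" SKU).
    # Assumes the azure-identity and azure-mgmt-compute packages; all names are hypothetical.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    subscription_id = "<subscription-id>"  # placeholder
    compute_client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

    # An "Aligned" availability set spreads the members' managed disks across storage
    # fault domains, so the loss of a single storage scale unit does not affect every
    # VM in the set.
    compute_client.availability_sets.create_or_update(
        resource_group_name="example-rg",
        availability_set_name="example-avset",
        parameters={
            "location": "northeurope",
            "platform_fault_domain_count": 2,
            "platform_update_domain_count": 5,
            "sku": {"name": "Aligned"},  # required when member VMs use Managed Disks
        },
    )

Placing two or more VMs with Managed Disks into such an availability set (via the portal, CLI, or SDK) is what provides the resiliency described above; a single VM gains no additional protection from membership alone.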

Root cause and mitigation: During routine periodic fire suppression system maintenance, an unexpected release of inert fire suppression agent occurred. When suppression was triggered, it initiated the automatic shutdown of Air Handler Units (AHUs), as designed for containment and safety. While conditions in the data center were being reaffirmed and AHUs were being restarted, the ambient temperature in isolated areas of the impacted suppression zone rose above normal operational parameters. Some systems in the impacted zone performed automatic shutdowns or reboots triggered by internal thermal health monitoring to prevent overheating. The triggering of the inert fire suppression was immediately known, and within the following 35 minutes all AHUs were recovered and ambient temperatures had returned to normal operational levels. Facility power was not impacted during the event. All systems have been restored to full operational condition, and further system maintenance has been suspended pending investigation of the unexpected agent release. Due to the nature of this event and the variance in thermal conditions in isolated areas of the impacted suppression zone, some servers and storage resources did not shut down in a controlled manner, so additional time was required to troubleshoot and recover the impacted resources. Once the scale unit reached the required number of operational nodes, customers would have seen gradual but consistent improvement until full mitigation at 20:15 UTC, when storage and dependent services were able to fully recover.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1) Suppression system maintenance analysis continues with facility engineers to identify the cause of the unexpected agent release and to mitigate the risk of recurrence.
2) Engineering continues to investigate the failure conditions and recovery time improvements for storage resources in this scenario.
As important investigation and analysis are ongoing, an additional update to this RCA will be provided before Friday, 10/13.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

24-9

App Service - South Central US

Summary of impact: Between 23:18 and 23:37 UTC on 23 Sep 2017, a subset of customers using App Service in South Central US may have experienced intermittent latency, timeouts, or HTTP 500-level response codes while performing service management operations such as site create, delete, and move resources on their App Service applications.

Preliminary root cause: Engineers determined the preliminary root cause to be a backend network connectivity issue.

Mitigation: Engineers determined that the issue was self-healed by the Azure platform.

Next steps: Engineers are continuing to investigate to establish the full root cause.

22-9

App Service \ Web Apps - North Europe

Summary of impact: Between 10:00 and 12:09 UTC on 22 Sep 2017, a subset of customers using App Service \ Web Apps in North Europe may have received HTTP 500-level response codes, or experienced timeouts or high latency when accessing Web Apps deployments hosted in this region.

Preliminary root cause: At this stage engineers do not have a definitive root cause.

Mitigation: Engineers determined that the issue was self-healed by the Azure platform.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

21-9

Unable to Access Azure Management Portal

Summary of impact: Between approximately 12:45 and 16:15 UTC on 21 Sep 2017, a subset of customers may have received intermittent HTTP 503 errors or seen a blue error screen when loading the Azure Management Portal.

Preliminary root cause: At this stage engineers do not have a definitive root cause.

Mitigation: Engineers determined that the issue was self-healed by the Azure platform.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

15-9

App Service - North Europe

Summary of impact: Between 19:30 and 20:15 UTC on 15 Sep 2017, a subset of customers using App Service in North Europe may have received HTTP 500-level response codes, or experienced timeouts or high latency, when accessing App Service deployments hosted in this region.

Preliminary root cause: At this stage engineers do not have a definitive root cause.

Mitigation: Engineers determined that the issue was self-healed by the Azure platform.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.