Azure status history
Visual Studio Team Services - Unable To Access Accounts | Recovered
Summary of impact: Between 00:44 UTC and 03:50 UTC on 20 Jan 2017, customers using Visual Studio Team Services would have experienced latency or received error notifications when connecting to VSTS accounts.
Preliminary root cause: Engineers determined that a back end service responsible for processing commandlets became unhealthy and was preventing operations from completing.
Mitigation: Engineers manually recycled the back end service to mitigate the issue.
Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. More information at https://aka.ms/vstsblog.
Microsoft Azure Portal - Multiple Regions
Summary of impact: Between 19:00 UTC on 17 Jan 2017 and 06:15 UTC on 19 Jan 2017, a subset of customers using Microsoft Azure portal may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in this region.
Preliminary root cause: Engineers identified a recent deployment as the potential root cause.
Mitigation: Engineers rolled back the recent deployment to mitigate the issue.
Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.
Storage - Service Management Operation Failures | Recovered
Summary of impact: Between 17:47 and 18:03 UTC on 18 Jan 2017, a subset of customers using Storage in multiple regions may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in these regions.
Preliminary root cause: At this stage, Engineers do not have a definitive root cause.
Mitigation: The issue was self-healed by the Azure platform.
Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. Impacted customers will receive further communications for this mitigated issue via their Management Portal.
Power BI Embedded - North Europe | Investigating
At 11:21 UTC on 17 Jan 2017, Engineers received a monitoring alert for Power BI Embedded in North Europe. We have concluded our investigation of the alert and confirmed that Power BI Embedded is not affected. Further communication for Power BI can be found in the O365 management portal.
Management Portal - Virtual Machine Size Blades not visible
Summary of impact: Between 01:00 UTC on 11 Jan 2017 and 23:15 UTC on 12 Jan 2017, a subset of customers may have experienced issues viewing their ‘Virtual Machine Size’ blades. Additionally, impacted customers may have been unable to change the size of their Virtual Machines through the management portal (https://portal.azure.com). Engineers have determined that this only impacted customers who had created ARM Virtual Machines using a custom image. Customers who still experience this issue should log out of the management portal and log back in to have it resolved.
Preliminary root cause: Engineers have identified a recent deployment as the root cause.
Mitigation: Engineers have developed and applied a hot fix to mitigate this issue.
Next steps: Engineers will review testing procedures to prevent future occurrences.
Microsoft Azure Portal - Issues Viewing Resources
Summary of impact: Between 22:15 UTC on 10 Jan 2017 and 01:45 UTC on 12 Jan 2017, customers may have experienced high latency when viewing recently deployed resources, or intermittently received failure notifications when attempting to deploy new resource groups or resources to new resource groups in the Azure Portal, https://portal.azure.com. Customers may have seen these resources intermittently in their Portals and may have been able to intermittently manage them through PowerShell or the Command Line Interface. Additionally, customers may have encountered issues viewing existing resources in their portals. All resources continued to exist and were running as expected.
Preliminary root cause: Engineers identified an increase in backlog requests as the potential root cause.
Mitigation: Engineers added resources to handle the increased backlog, and reverted configuration values which may have contributed to the increase in backlog requests.
Next steps: Engineers will continue to investigate to establish an underlying root cause and prevent future occurrences.
RCA - Virtual Machines, Storage, SQL Database - West US 2
Summary of impact: From 22:09 UTC until 23:43 UTC on January 10th, a subset of customers with resources located in the West US 2 region may have experienced failures when attempting to connect to those Azure resources and platform services. Alerts were generated by Azure platform monitoring services, and engineers were able to correlate impact to a power issue in the region. To mitigate the potential loss of power, by design, generators automatically started, and began delivering power to the majority of the region. Manual mitigations were required to mitigate a power interruption in one portion of the region, and engineers successfully completed that mitigation restoring power at 22:43 UTC. After stabilizing services running on generator power, utility power returned. When utility power returned, power was switched back to UPS/utility power. During this transfer, a short power interruption occurred resulting in an additional reboot for another portion of the region from 23:22 UTC - 23:35 UTC. Services were confirmed healthy, and no further power issues were detected, but engineers continued to work to understand the cause of the two interruptions.
Customer impact: A subset of customers with resources in the West US 2 Datacenter may have experienced failures/timeouts when attempting to access their resources or perform operations. Virtual Machines may have rebooted once or twice throughout the impact period. Storage resources may have been unavailable for each of the impact periods.
Root cause: Inclement weather led to a utility power incident in the region. UPS and generator backups were able to prevent impact to most of the region, as designed. One of the generators in the region did not start automatically, and required manual intervention to start. Once it had started, the impacted portion of the datacenter powered up, services/resources were verified to have come back online, and the impact was determined to be mitigated. An investigation into the failure to start automatically identified that the breaker servicing the starter for the generator had weakened over time, and although periodic inspection and testing had been performed, on this occasion the breaker was unable to function. Later tests showed that the breaker was failing intermittently. The generator was brought online manually, as mentioned, to restore service. Shortly afterward, utility power was restored, and engineers worked to shift power back to utility using UPS as a bridge. The return shift of power is expected to be non-interrupting, but it unexpectedly resulted in a second brief interruption in a portion of the region where the UPS batteries were not yet fully recharged and could not sustain the transfer duration. Services were fully restored once all resources were on utility power and have remained stable.
Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
1. Review procedures and part lifecycle on the generator systems to consider modifications for greater reliability
2. Spot check power infrastructure to ensure optimal post-incident performance
3. Implement monitoring and health checks as conditionals to transfer back to utility after an outage
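The third step above, health checks as conditions for transferring back to utility power, could look roughly like the following sketch. It is illustrative only and does not describe Azure's actual tooling: the `PowerStatus` fields, the `transfer_blockers` function, and both thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PowerStatus:
    """Hypothetical snapshot of power telemetry for one portion of a region."""
    utility_stable_seconds: float  # how long utility power has been continuously healthy
    ups_charge_fraction: float     # UPS battery state of charge, from 0.0 to 1.0


def transfer_blockers(status, min_stable_seconds=600.0, min_ups_charge=0.8):
    """Return the reasons a transfer back to utility should wait.

    An empty list means every check passed. Both thresholds are
    assumptions for illustration, not values from the incident report.
    """
    blockers = []
    if status.utility_stable_seconds < min_stable_seconds:
        blockers.append("utility power not stable long enough")
    if status.ups_charge_fraction < min_ups_charge:
        blockers.append("UPS batteries not sufficiently recharged")
    return blockers


# Utility power is back and stable, but the UPS is still recharging,
# which resembles the condition behind the second interruption.
status = PowerStatus(utility_stable_seconds=900.0, ups_charge_fraction=0.35)
for reason in transfer_blockers(status):
    print("Hold transfer:", reason)
```

Returning the list of blocking reasons, rather than a single boolean, lets an operator see which condition is holding up the transfer.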
Azure Machine Learning - South Central US, Southeast Asia, and West Europe
Summary of impact: Between 18:45 and 23:00 UTC on 10 Jan 2017, a subset of customers using Machine Learning in South Central US, Southeast Asia, and West Europe may have received internal server errors when performing service operations such as dataset uploading, visualizing datasets and module outputs, and/or publishing web services in AzureML Studio from new experiments.
Preliminary root cause: Engineers identified a recent configuration change as the potential root cause.
Mitigation: Engineers rolled back the recent configuration change to mitigate the issue.
Next steps: Engineers will continue to investigate in order to establish an underlying root cause and prevent future occurrences.
Microsoft Azure portal - issues viewing new resources or subscriptions
Summary of impact: Between 08:00 and 22:15 UTC on 10 Jan 2017, customers may have experienced high latency when viewing recently deployed services in the Azure Portal, https://portal.azure.com. Customers may also have experienced difficulties viewing newly created subscriptions in the Azure Portal.
Workaround: Customers were able to view and manage newly provisioned services and subscriptions in the Azure Classic Portal, https://manage.windowsazure.com, or programmatically through the Command Line Interface.
Preliminary root cause: Engineers determined that this latency was due to an increased backlog of backend requests.
Mitigation: Engineers deployed a configuration change to mitigate the issue.
Next steps: Engineers will further investigate the root cause and develop ways to prevent recurrences of this issue.
Network Infrastructure – South India
Summary of impact: At approximately 05:33 UTC on 06 Jan 2017, concurrent fiber cuts occurred which resulted in a partial network failure in South India. Traffic then congested the single surviving link, causing packet loss for some customers. Azure engineers were automatically alerted to this condition and shut down the link to resolve the issue.
Customer impact: Customers may have seen intermittent failures (packet drops) while trying to connect to their Cloud end points (Compute, Storage/SQL) in the South India region. Impact duration was 05:33 UTC through 06:09 UTC on January 6th.
Root cause and mitigation: The Azure South India datacenter is connected to the network backbone through dual subsea fiber channels. Because of existing dual subsea cable failures between Chennai and Singapore, network traffic towards Chennai was entering via Mumbai links, causing more traffic to build up on a single segment. In this incident, Azure infrastructure lost 20 Gigabytes of a 30 Gigabyte bundle between segments, which caused 70% traffic drops in Azure South India. Following this loss, network traffic was not re-routed around the link and instead congested it, resulting in intermittent failures.
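The capacity figures above can be checked with a short calculation. The sketch below treats the bundle sizes as unitless capacity values taken from the report; the offered load of 25 units is purely an assumption for illustration, since the report does not state how much traffic was on the link.

```python
def capacity_after_loss(total, lost):
    """Return (remaining capacity, fraction of capacity lost)."""
    return total - lost, lost / total


def excess_fraction(offered, capacity):
    """Share of offered traffic that cannot fit on the surviving capacity."""
    if offered <= 0:
        return 0.0
    return max(0.0, (offered - capacity) / offered)


# Figures from the report: 20 of 30 units of the bundle were lost.
remaining, lost_fraction = capacity_after_loss(30, 20)
print(remaining)                      # 10
print(round(lost_fraction * 100, 1))  # 66.7

# Hypothetical offered load of 25 units (an assumption, not from the report).
print(excess_fraction(offered=25, capacity=remaining))
```

Losing 20 of 30 units removes about two thirds of the bundle's capacity, which is roughly in line with the reported 70% traffic drops if the bundle was heavily loaded at the time.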
Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case it includes (but is not limited to):
1. As an immediate corrective action, all three links have been put into the same shared risk link group. This will ensure that a partial failure will not cause impact.
2. Augment the network capacity in South India to handle this failure mode.
Provide feedback: Please help us improve the Azure customer communications experience by taking our survey https://survey.microsoft.com/218692