Azure status history

January 2017

17/1

Power BI Embedded - North Europe | Investigating

At 11:21 UTC on 17 Jan 2017, Engineers received a monitoring alert for Power BI Embedded in North Europe. We have concluded our investigation of the alert and confirmed that Power BI Embedded is not affected. Further communication for Power BI can be found in the O365 management portal.
12/1

Management Portal - Virtual Machine Size Blades not visible

Summary of impact: Between 01:00 UTC on 11 Jan 2017 and 23:15 UTC on 12 Jan 2017, a subset of customers may have experienced issues viewing their ‘Virtual Machine Size’ blades. Additionally, impacted customers may have been unable to make changes to the size of their Virtual Machines through the management portal. Engineers determined that this only impacted customers who had created ARM Virtual Machines using a custom image. Customers who still experience this issue should log out of the management portal and log back in to have it resolved.
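
As a purely illustrative sketch (not part of the original advisory), a resize of an ARM Virtual Machine can also be requested directly against the Azure Resource Manager REST API, bypassing the portal blade entirely. The subscription, resource group, VM name, target size, bearer token, and api-version below are placeholders and assumptions, and the sketch assumes programmatic access was unaffected by this incident.

```python
import requests

# Placeholders for illustration only; none of these values come from the advisory.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
VM_NAME = "<vm-name>"
TOKEN = "<azure-ad-bearer-token>"   # obtained separately via Azure AD
API_VERSION = "2016-03-30"          # assumed Microsoft.Compute api-version

url = (
    "https://management.azure.com"
    f"/subscriptions/{SUBSCRIPTION_ID}"
    f"/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.Compute/virtualMachines/{VM_NAME}"
    f"?api-version={API_VERSION}"
)

# PATCH only the hardware profile; ARM merges it into the existing VM definition.
# Note that resizing a running VM causes it to restart.
body = {"properties": {"hardwareProfile": {"vmSize": "Standard_DS2_v2"}}}

resp = requests.patch(url, json=body, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
print("Resize request accepted:", resp.status_code)
```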

Preliminary root cause: Engineers have identified a recent deployment as the root cause.

Mitigation: Engineers have developed and applied a hot fix to mitigate this issue.

Next steps: Engineers will review testing procedures to prevent future occurrences.

12/1

Microsoft Azure Portal - Issues Viewing Resources

Summary of impact: Between 22:15 UTC on 10 Jan 2017 and 01:45 UTC on 12 Jan 2017, customers may have experienced high latency when viewing recently deployed resources, or intermittently received failure notifications when attempting to deploy new resource groups or resources to new resource groups in the Azure Portal. Customers may have seen these resources intermittently in their portals and may have been able to intermittently manage them through PowerShell or the Command Line Interface. Additionally, customers may have encountered issues viewing existing resources in their portals. All resources continued to exist and were running as expected.
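
The PowerShell and Command Line Interface paths mentioned above talk to the same Azure Resource Manager API that backs the portal. As an illustrative sketch only (the advisory does not prescribe this exact call), the contents of a resource group can be enumerated straight from that API to confirm that resources still exist even when the portal fails to render them; the subscription, resource group, token, and api-version are assumptions.

```python
import requests

# Placeholders for illustration only.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
TOKEN = "<azure-ad-bearer-token>"
API_VERSION = "2016-09-01"  # assumed Microsoft.Resources api-version

url = (
    "https://management.azure.com"
    f"/subscriptions/{SUBSCRIPTION_ID}"
    f"/resourceGroups/{RESOURCE_GROUP}"
    f"/resources?api-version={API_VERSION}"
)

resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

# Each entry carries the resource id, name, and type, confirming the resource
# exists in Azure Resource Manager regardless of what the portal currently shows.
for resource in resp.json().get("value", []):
    print(resource["type"], resource["name"])
```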

Preliminary root cause: Engineers identified an increase in backlog requests as the potential root cause.

Mitigation: Engineers added additional resources to handle the increased backlog and reverted configuration values that may have contributed to the increase in backlog requests.

Next steps: Engineers will continue to investigate to establish an underlying root cause and prevent future occurrences.

10/1

RCA - Virtual Machines, Storage, SQL Database - West US 2

Summary of impact: From 22:09 UTC until 23:43 UTC on 10 Jan 2017, a subset of customers with resources located in the West US 2 region may have experienced failures when attempting to connect to those Azure resources and platform services. Alerts were generated by Azure platform monitoring services, and engineers were able to correlate the impact to a power issue in the region. To mitigate the potential loss of power, generators started automatically, by design, and began delivering power to the majority of the region. Manual intervention was required to address a power interruption in one portion of the region, and engineers completed that mitigation, restoring power at 22:43 UTC. After services running on generator power had stabilized, utility power returned and the load was switched back to UPS/utility power. During this transfer, a short power interruption occurred, resulting in an additional reboot for another portion of the region from 23:22 UTC to 23:35 UTC. Services were confirmed healthy and no further power issues were detected, but engineers continued to work to understand the cause of the two interruptions.

Customer impact: A subset of customers with resources in the West US 2 Datacenter may have experienced failures/timeouts when attempting to access their resources or perform operations. Virtual Machines may have rebooted once or twice throughout the impact period. Storage resources may have been unavailable for each of the impact periods.

Root cause: Inclement weather led to a utility power incident in the region. UPS and generator backups were able to prevent impact to most of the region, as designed. One of the generators in the region did not start automatically and required manual intervention to start. Once it was started, the impacted portion of the datacenter powered up, services and resources were verified to have come back online, and the impact was determined to be mitigated. An investigation into the failure to start automatically identified that the breaker servicing the generator's starter had weakened over time; although periodic inspection and testing had been performed, on this occasion the breaker was unable to function. Later tests showed that the breaker was failing intermittently. As noted above, the generator was brought online manually to restore service. Shortly afterwards, utility power was restored, and engineers worked to shift power back to utility using UPS as a bridge. This return shift of power was expected to be non-interrupting, but it unexpectedly resulted in a second brief interruption in a portion of the region where the UPS batteries were not yet fully recharged and could not sustain the transfer duration. Services were fully restored once all resources were back on utility power and have remained stable.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future, and in this case these include (but are not limited to):
1. Review procedures and part lifecycles on the generator systems to consider modifications for greater reliability.
2. Spot-check power infrastructure to ensure optimal post-incident performance.
3. Implement monitoring and health checks as conditions for transferring back to utility power after an outage.

10/1

Azure Machine Learning - South Central US, Southeast Asia, and West Europe

Summary of impact: Between 18:45 and 23:00 UTC on 10 Jan 2017, a subset of customers using Machine Learning in South Central US, Southeast Asia, and West Europe may have received internal server errors when performing service operations such as uploading datasets, visualizing datasets and module outputs, and/or publishing web services from new experiments in AzureML Studio.

Preliminary root cause: Engineers identified a recent configuration change as the potential root cause.

Mitigation: Engineers rolled back the recent configuration to mitigate the issue.

Next steps: Engineers will continue to investigate in order to establish an underlying root cause and prevent future occurrences.

10/1

Microsoft Azure Portal - Issues Viewing New Resources or Subscriptions

Summary of impact: Between 08:00 and 22:15 UTC on 10 Jan 2017, customers may have experienced high latency when viewing recently deployed services in the Azure Portal. Customers may also have experienced difficulties viewing newly created subscriptions in the Azure Portal.

Workaround: Customers were able to view and manage newly provisioned services and subscriptions in the Azure Classic Portal, or programmatically through the Command Line Interface.
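
The same programmatic route applies to subscriptions: a newly created subscription that had not yet surfaced in the Azure Portal could still be confirmed against the Azure Resource Manager subscriptions endpoint. This is an illustrative sketch only, not a step from the original workaround; the bearer token and api-version are assumptions.

```python
import requests

TOKEN = "<azure-ad-bearer-token>"  # placeholder, obtained separately via Azure AD
API_VERSION = "2016-06-01"         # assumed subscriptions api-version

resp = requests.get(
    f"https://management.azure.com/subscriptions?api-version={API_VERSION}",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

# Lists every subscription visible to the token, including recently created
# ones that may not yet appear in the portal.
for sub in resp.json().get("value", []):
    print(sub["subscriptionId"], sub["displayName"])
```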

Preliminary root cause: Engineers determined that this latency was due to an increased backlog of backend requests.

Mitigation: Engineers deployed a configuration change to mitigate the issue.

Next steps: Engineers will further investigate the root cause and develop ways to prevent recurrences of this issue.

6/1

Network Infrastructure – South India

Summary of impact: Between 05:42 and 06:55 UTC on 06 Jan 2017, a subset of customers using multiple services in South India may have intermittently experienced degraded performance, network drops, or timeouts when accessing their Azure resources hosted in this region. Engineers have determined that this was caused by an underlying Network Infrastructure event.

Preliminary root cause: Engineers determined that a single scale unit had reached its acceptable operational threshold.

Mitigation: The issue was self-healed by the Azure platform.

Next steps: Engineers will continue to investigate to establish an underlying root cause and prevent future occurrences.

December 2016

23/12

Azure Functions - Region Selection Issues in Portal | Recovered

Summary of impact: Between 00:35 UTC on 22 Dec 2016 and 02:30 UTC on 23 Dec 2016, customers using Azure Functions may not have been able to select regions in the Management Portal and the Functions dashboard when creating applications. Customers provisioning new App Service \ Web Apps in Visual Studio may also have been unable to select regions.

Preliminary root cause: Engineers identified a software issue within a recent deployment as the potential root cause.

Mitigation: Engineers performed manual patches and rolled out a deployment update to mitigate the issue.

Next steps: Engineers will continue to monitor and review deployment procedures to prevent future occurrences.

22/12

Visual Studio Team Services

Summary of impact: Between 23:55 UTC on 21 Dec 2016 and 01:15 UTC on 22 Dec 2016, customers using Visual Studio Team Services, Visual Studio Team Services \ Build & Deployment, and Visual Studio Team Services \ Load Testing may have intermittently experienced degraded performance and slowness while accessing accounts or navigating through Visual Studio Online workspaces.

Preliminary root cause: Engineers identified a recent configuration change as the potential root cause.

Mitigation: Engineers reverted the configuration change to mitigate the issue.

Next steps: Engineers will continue to investigate to establish an underlying root cause and prevent future occurrences.

11/12

RCA for Network Infrastructure in West Europe

Summary of impact: Between 22:29 and 23:45 UTC on 10 Dec 2016, customers in the West Europe region may have experienced intermittent connectivity issues, including elevated packet loss and latency to other Azure regions and for inbound/outbound Internet traffic. Network traffic within the region was unaffected during this time. The connectivity loss was the result of an issue in the traffic engineering software on our network routers, which failed to route traffic around a fiber issue in the network. Approximately 10% of the traffic failed to reroute to the redundant fiber path.

Customer impact: The software bug caused elevated packet drop (approximately 10%) and latency for traffic to other datacenters and to the Internet for a subset of customers in West Europe.

Workaround: There is no workaround for this network issue.

Root cause and mitigation: We encountered a software issue on our network routers that caused routing calculations to take longer than expected during a fiber issue in West Europe. The path computation slowdown caused traffic to be dropped instead of moving to the redundant fiber path. Our telemetry detected the issue and we were able to shift traffic to an unaffected path before the underlying fiber problem was resolved. During this impact period, traffic through the device in West Europe was impaired. To prevent a recurrence of this issue, we have added monitoring for the software issue to alert us before the constraint can impact traffic. In addition, there is continuous work towards installing a permanent software correction for the issue.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future, and in this case these include (but are not limited to):
1. Improve monitoring to more rapidly indicate when a router device is in this state. Monitoring did detect that the traffic issue was occurring, but did not point to the impacted device.
2. Create and load a patch on the impacted routers for the software bug.