Summary of impact: From 22:29 to 23:45 UTC on 10 Dec 2016, due to an underlying issue with the Azure Networking Infrastructure in West Europe, a subset of customers may have experienced latency or the inability to connect to Azure services hosted in the region. During this time the health of Azure resources hosted in the region was unaffected.
Preliminary root cause: Engineers are still investigating the root cause for this event. Early indications suggest the Azure platform detected a potential fault with one node that was hosted in the region.
Mitigation: The affected network resources automatically self-healed. Engineering confirmed that all network traffic has returned to a healthy state.
Next steps: Engineering is still assessing any residual customer impact during this time, as well as working to understand the cause of the initial event. Any customers with residual impact will be contacted via their management portal (https://portal.azure.com).
Summary of impact: Between 22:30 UTC on 09 Dec 2016 and 05:10 UTC on 10 Dec 2016, customers using HDInsight in North Central US may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in this region. Existing clusters were not affected.
Preliminary root cause: Engineers identified a recent change in backend systems as a possible underlying cause.
Mitigation: Engineers rolled back the change to backend systems to mitigate the issue.
Next steps: Continue to investigate the underlying root cause and develop steps to mitigate future recurrences.
Summary of impact: Between approximately 10:00 and 19:00 UTC on 09 Dec 2016, customers using SQL Database in North Central US may have experienced issues performing service management operations. Server and database create, drop, rename, and change edition or performance tier operations may have resulted in an error or timeout. Availability (connecting to and using existing databases) was not impacted.
Preliminary root cause: A backend system entered an unhealthy state.
Mitigation: Engineers manually recovered the backend system and confirmed that the service is now in a healthy state.
Next steps: Engineers will further investigate the root cause for this issue to prevent recurrences.
Summary of impact: Between approximately 20:00 UTC on 08 Dec 2016 and 00:47 UTC on 09 Dec 2016, customers using Virtual Machines (v2) in South Central US may have experienced failure notifications when attempting to perform networking-related update operations (such as network interfaces, NAT rules, or load balancing) to existing Virtual Machine (v2) configurations. This only affected networking-related operations on Virtual Machines (v2) as all other service management operations (such as Start, Stop, Create, Delete) on all Virtual Machines were fully functional.
Preliminary root cause: A software error in a recent deployment was determined to be the underlying root cause.
Mitigation: Engineers created a new software build and deployed the update to mitigate.
Next steps: Engineers will continue to monitor for operational stability and take steps to prevent future recurrences on deployments.
Summary of impact: Between 13:22 and 16:10 UTC on 07 Dec 2016, a subset of customers using Storage in North Europe may have experienced difficulties connecting to resources hosted in this region. Retries may have succeeded for some customers. Virtual Machines, App Service \ WebApps, and Azure Search customers that leveraged storage in this region may also have experienced issues.
Preliminary root cause: Initial investigations suggest the incident was caused by an underlying Storage issue, related to backend nodes incorrectly reporting resource consumption.
Mitigation: A failover operation was performed to return the system to a normal state.
Next steps: Engineers will continue to investigate to establish the root cause and prevent future occurrences. There will be no further updates, as this issue is mitigated. A detailed post incident report will be published within 48-72 hours. For the limited subset of customers that may experience any residual impact, we will provide direct communication via their management portal (https://portal.azure.com).
Summary of impact: Between 19:45 and 20:05 UTC on 30 Nov 2016, customers located primarily in the Asia-Pacific geographic region may have experienced failures attempting to connect to a subset of Azure customer resources and platform services. Alerts were received, and engineers were able to correlate failures to Azure Traffic Manager services. During this time, DNS resolution requests for domain records on the Azure Traffic Manager services in the Asia-Pacific region did not receive a response, resulting in timeouts. To mitigate the impact on customers and their services, engineers removed the Asia-Pacific Traffic Manager services from the Anycast advertisement. Traffic Manager services in other regions handled these requests until the Asia-Pacific regional services were fully restored.
Customer impact: Customers experienced failures or timeouts reaching resources, sites and services which relied on Traffic Manager DNS resolution.
Root cause and mitigation: Maintenance was being performed on network devices in the Asia-Pacific region, intended to increase the resiliency and scalability of our network infrastructure. A misconfiguration resulted in an inbound routing failure through one of the devices critical to the path of the Traffic Manager services. The telemetry used to monitor this change in real time was afterward determined to have been insufficient for this class of device. The engineers performing this maintenance activity received alerts for Traffic Manager and withdrew the devices in the Asia-Pacific region, mitigating the availability impact. Engineers identified the problem, reviewed the proposed fix, and implemented the fix, fully restoring Traffic Manager services in the region.
Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case these include (but are not limited to): 1. Automated configuration validation pre- and post-change. 2. Device-specific telemetry integration into change automation, to validate, pause, or roll back based on real-time health signals. 3. Add geographic redundancy at the DNS name server level – Complete.
Provide feedback: Please help us improve the Azure customer communications experience by taking our survey https://survey.microsoft.com/240331
Summary of impact: Between 05:15 and 05:53 on 26 Nov 2016, a subset of customers using Virtual Machines in Central US may have experienced connection failures when trying to access Virtual Machines and Storage resources hosted in this region. Queues, Tables, Blobs, Files, and Virtual Machines with VHDs backed by the impacted storage scale unit were unavailable for the duration of impact.
Preliminary root cause: Engineers identified an unhealthy storage component that impacted availability.
Mitigation: Systems were self-healed by the Azure platform, and engineers monitored metrics to ensure stability.
Next steps: Investigate the underlying cause and create mitigation steps to prevent future occurrences.
Summary of impact: Between 16:15 UTC and 22:20 UTC on 21 Nov 2016, customers attempting to use Azure Resource Manager (ARM) to create Virtual Machines (VMs) from the Microsoft Azure portal may have received errors and been unable to create ARM VMs. Customers could still provision new Virtual Machine resources through ARM by using PowerShell, the Azure Command-Line Interface, or the REST APIs. Azure Engineering investigated this incident and identified an issue with recent changes to the underlying code. Engineers deployed a hotfix which resolved the issue and ensured that Virtual Machine deployment processes returned to a healthy state.
Customer impact: Customers attempting to use ARM to create VMs from the Microsoft Azure portal may have received errors and been unable to create ARM VMs; provisioning through PowerShell, the Azure Command-Line Interface, or the REST APIs was unaffected. Customers would still have been able to use ARM within the Microsoft Azure portal to deploy other ARM-enabled resources. Some customers may have continued to experience issues after the mitigation of this incident; these were resolved by accessing the Microsoft Azure portal from a private browsing session, which bypassed the stale browser caches.
Workaround: ARM VM creation by using PowerShell, Azure Command-Line Interface or REST APIs could be used as a workaround during this incident.
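As a sketch of that workaround (resource group, VM name, image, and location below are illustrative examples, not values from this incident; shown with the current `az` CLI, though the same operation was available through the classic `azure` CLI or the `New-AzureRmVM` PowerShell cmdlet):

```shell
# Illustrative only: all names and values are hypothetical.
# Requires an authenticated Azure subscription (run "az login" first).

# Create a resource group, then an ARM VM, without going through the portal:
az group create --name exampleGroup --location westus

az vm create \
  --resource-group exampleGroup \
  --name exampleVM \
  --image UbuntuLTS \
  --admin-username azureuser \
  --generate-ssh-keys
```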
Root cause and mitigation: The incident was caused by a recent change to the underlying code. The change had an unintended side effect that caused a failure while validating the location of the VM at creation time. This was not detected during the testing phase due to an issue with the testing framework that did not catch this error scenario. We will review the testing framework to ensure it can catch this sort of failure in the future.
Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): 1. Fix the validation failure of locations while creating ARM VMs – Completed. 2. Review and improve the testing framework to help ensure this sort of failure is detected in the future. 3. Improve deployment methods for portal changes so that a hotfix can be applied much faster. 4. Improve telemetry by adding more alerting around failures in key scenarios.
Provide feedback: Please help us improve the Azure customer communications experience by taking our survey https://survey.microsoft.com/153975
Summary of impact: Between 15:20 and 16:57 UTC on 16 Nov 2016, a subset of customers utilizing resources in the East US region may have experienced failures connecting to those resources. Alerts were received indicating connectivity failures to portions of the East US region. Engineers performing maintenance were able to correlate those alerts with a premium storage scale unit that was undergoing a normal maintenance operation to route traffic to a redundant Software Load Balancer (SLB) instance. After routing the traffic, network connectivity failed, resulting in dependent Virtual Machines (VMs) shutting down and dependent Azure SQL Databases (DBs) becoming unavailable. Engineers identified the issue impacting the SLB, updated the configuration, and restored network connectivity to the premium storage scale unit. Dependent VMs, SQL DBs, and services recovered once connectivity was restored, and the impact was mitigated.
Customer impact: Customers in the East US region may have experienced timeouts or errors connecting to resources hosted in a portion of that region.
- VMs with system or data disks backed by impacted premium storage hosted virtual hard drives (VHDs) would have shut down after 2 minutes of IO failure, by design, and restarted after connectivity was restored.
- Azure SQL DBs may have been unavailable for some customers.
- App Services (Web Apps) which depend on Azure SQL DB may have been unable to connect to or query databases.
- Azure Search services may have experienced degraded performance and unavailability.
- Storage read operations would have been possible from the secondary cluster; however, no workarounds for VMs or write operations were possible during this time.
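For context on that read-only fallback: accounts configured for read-access geo-redundant storage (RA-GRS) expose a secondary endpoint whose host name appends a "-secondary" suffix to the account name. A minimal sketch, assuming a blob that permits anonymous read (the account, container, and blob names below are hypothetical):

```shell
# Hypothetical names: "mystorageacct", "backups", "config.json".
# Primary (read-write) endpoint:
#   https://mystorageacct.blob.core.windows.net
# RA-GRS secondary (read-only) endpoint:
#   https://mystorageacct-secondary.blob.core.windows.net
# Anonymous read shown; a private blob would need a SAS token appended.
curl -sf "https://mystorageacct-secondary.blob.core.windows.net/backups/config.json" -o config.json
```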
Root cause and mitigation: The Software Load Balancer (SLB) service for the premium storage scale unit depends on a configuration service for proper initialization. A prior incident, which impacted only the configuration service, left an incomplete configuration in place for this SLB instance. During a normal maintenance operation performed to shift traffic to a redundant SLB service, it was found that the redundant service was not able to properly initialize from the configuration service due to the incomplete configuration. This caused the initialization, and the traffic which should have shifted to this SLB, to fail. Specifically, the SLB services were unable to program forwarding routes correctly, which resulted in the Virtual IPs served by this SLB instance becoming unreachable. The services were recovered by fixing the configuration service data and triggering a configuration reload for the impacted SLB services by shifting traffic to the redundant SLB service.
Next step: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
- An improved configuration deployment mechanism is being rolled out, enabling additional validation before configuration is pushed to an SLB.
- Improvements to diagnosis and recovery to reduce impact duration.
Provide feedback: Please help us improve the Azure customer communications experience by taking our survey https://survey.microsoft.com/230176
Summary of impact: Between 19:38 and 21:05 UTC on 09 Nov 2016, a subset of customers using Visual Studio Team Services \ Build in multiple regions may have experienced longer than usual build times.
Preliminary root cause: Engineers identified a recent deployment which introduced latency in creating the Virtual Machines that process Build requests, resulting in slower than usual Build processing times.
Mitigation: The deployment was rolled back which mitigated the issue.
Next steps: Continue to investigate the underlying root cause of this issue and develop a solution to prevent recurrences.