Summary of impact: Between approximately 20:00 UTC on 08 Dec 2016 and 00:47 UTC on 09 Dec 2016, customers using Virtual Machines (v2) in South Central US may have experienced failure notifications when attempting to perform networking-related update operations (such as network interfaces, NAT rules, or load balancing) to existing Virtual Machine (v2) configurations. This only affected networking-related operations on Virtual Machines (v2) as all other service management operations (such as Start, Stop, Create, Delete) on all Virtual Machines were fully functional.
Preliminary root cause: A software error on a recent deployment was determined as the underlying root cause.
Mitigation: Engineers created a new software build and deployed the update to mitigate.
Next steps: Engineers will continue to monitor for operational stability and take steps to prevent future recurrences on deployments.
Summary of impact: Between 13:22 and 16:10 UTC on 07 Dec 2016, a subset of customers using Storage in North Europe may have experienced difficulties connecting to resources hosted in this region. Retries may have succeeded for some customers. Virtual Machines, App Service \ WebApps, and Azure Search customers that leveraged Storage in this region may also have experienced issues.
Preliminary root cause: Initial investigations suggest the incident was caused by an underlying Storage issue, related to backend nodes incorrectly reporting resource consumption.
Mitigation: A failover operation was performed to return the system to a normal state.
Next steps: Engineers will continue to investigate to establish root cause, and prevent future occurrences. There will be no further updates as this issue is mitigated. A detailed post incident report will be published in 48-72 hours. For a limited subset of customers that may experience any residual impact, we will provide direct communication via their management portal (https://portal.azure.com).
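Transient faults like this are why client-side retry policies matter; the note above that retries may have succeeded for some customers reflects that pattern. A minimal sketch in Python of retry with exponential backoff (the `flaky()` operation is a stand-in for a real storage call, not an Azure SDK API):

```python
import random
import time

def with_retries(operation, max_attempts=4, base_delay=0.5):
    """Retry a transient-failure-prone operation with exponential backoff.

    `operation` is any callable; a real storage client call would go here.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Back off exponentially, with a little jitter, before retrying.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Illustrative stand-in: an operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient storage failure")
    return "ok"

print(with_retries(flaky))  # "ok" after two retried failures
```

During an incident window like this one, such a policy turns short-lived connectivity gaps into latency rather than hard failures.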
Summary of impact: Between 19:45 and 20:05 UTC on 30 Nov 2016, customers located primarily in the Asia-Pacific geographic region may have experienced failures attempting to connect to a subset of Azure customer resources and platform services. Alerts were received, and engineers were able to correlate failures to Azure Traffic Manager services. During this time, DNS resolution requests for domain records on the Azure Traffic Manager services in the Asia-Pacific region did not receive a response, resulting in timeouts. To mitigate the impact on customers and their services, engineers removed the Asia-Pacific Traffic Manager services from the Anycast advertisement. Traffic Manager services in other regions handled these requests until the Asia-Pacific regional services were fully restored.
Customer impact: Customers experienced failures or timeouts reaching resources, sites and services which relied on Traffic Manager DNS resolution.
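As a rough client-side illustration of this failure mode, a resolver that enforces its own timeout treats a slow or missing DNS answer the same as a failure. A minimal Python sketch, assuming nothing about Traffic Manager internals (the hostnames are placeholders, not real profiles):

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def resolve_with_timeout(hostname, timeout_seconds=2.0):
    """Attempt DNS resolution, treating a slow answer as a failure.

    This mimics how clients of a DNS-based endpoint would have seen the
    incident: no answer within the client's timeout window.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(socket.gethostbyname, hostname)
        try:
            return future.result(timeout=timeout_seconds)
        except (FuturesTimeout, socket.gaierror):
            return None  # resolution failed or timed out

# "localhost" resolves locally, without a network round trip.
print(resolve_with_timeout("localhost"))  # an address such as 127.0.0.1
```

A `None` result here corresponds to the timeouts customers observed while the Asia-Pacific name servers were not answering.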
Root cause and mitigation: Maintenance was being completed on network devices in the Asia-Pacific region. The work was intended to increase the resiliency and scalability of our network infrastructure. A misconfiguration resulted in an inbound routing failure through one of the devices critical to the path of the Traffic Manager services. The telemetry used to monitor this change in real time was afterward determined to have been insufficient for this class of device. The engineers performing this maintenance activity received alerts for Traffic Manager and withdrew the devices in the Asia-Pacific region, mitigating the availability impact. Engineers identified the problem, reviewed the proposed fix, and implemented the fix, fully restoring Traffic Manager services in the region.
Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to): 1. Automated configuration validation before and after a change. 2. Device-specific telemetry integration into change automation, to validate, pause, or roll back based on real-time health signals. 3. Add geographic redundancy at the DNS name server level – Complete. Provide feedback: Please help us improve the Azure customer communications experience by taking our survey https://survey.microsoft.com/240331
Summary of impact: Between 05:15 and 05:53 UTC on 26 Nov 2016, a subset of customers using Virtual Machines in Central US may have experienced connection failures when trying to access Virtual Machines, as well as reduced Storage availability in this region. Queues, Tables, Blobs, Files, and Virtual Machines with VHDs backed by the impacted storage scale unit were unavailable for the duration of impact.
Preliminary root cause: Engineers identified an unhealthy storage component that impacted availability.
Mitigation: The affected systems were self-healed by the Azure platform, and engineers monitored metrics to ensure stability.
Next steps: Engineers will investigate the underlying cause and create mitigation steps to prevent future occurrences.
Summary of impact: Between 16:15 UTC and 22:20 UTC on 21 Nov 2016, customers attempting to use Azure Resource Manager (ARM) to create Virtual Machines (VMs) from the Microsoft Azure portal may have received errors and been unable to create ARM VMs. Customers who attempted to use ARM to provision new Virtual Machine resources may have been successful by using PowerShell, the Azure Command-Line Interface, or the REST APIs. Azure Engineering investigated this incident and identified an issue with recent changes to the underlying code. Engineers deployed a hotfix which resolved the issue and ensured that Virtual Machine deployment processes returned to a healthy state.
Customer impact: Customers attempting to use Azure Resource Manager (ARM) to create Virtual Machines (VMs) from the Microsoft Azure portal may have received errors and been unable to create ARM VMs. Customers who attempted to use ARM to provision new Virtual Machine resources may have been successful by using PowerShell, the Azure Command-Line Interface, or the REST APIs. Customers would have been able to use ARM within the Microsoft Azure portal to deploy other ARM-enabled resources. Some customers may have continued to experience issues after the mitigation of this incident; these were resolved by using a private browsing session to access the Microsoft Azure portal, which cleared the browser cache.
Workaround: ARM VM creation by using PowerShell, the Azure Command-Line Interface, or the REST APIs could be used as a workaround during this incident.
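For illustration, the REST path for an ARM VM create/update is a PUT against the Microsoft.Compute resource provider. A minimal Python sketch that only builds the request URL; the subscription, resource group, VM name, and api-version shown are assumptions for illustration, and authentication plus the JSON request body are omitted:

```python
def arm_vm_put_url(subscription_id, resource_group, vm_name,
                   api_version="2016-04-30-preview"):
    """Build the ARM REST endpoint for creating or updating a VM.

    The api-version here is an assumption; use whichever version your
    subscription supports. The actual request would be an authenticated
    HTTP PUT with the VM's JSON template as the body.
    """
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.Compute/virtualMachines"
        f"/{vm_name}?api-version={api_version}"
    )

# Hypothetical identifiers, for illustration only.
print(arm_vm_put_url("0000-0000", "my-rg", "my-vm"))
```

Because this path bypasses the portal's UI code, it was unaffected by the portal-side regression described above.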
Root cause and mitigation: The incident was caused by recent changes to the underlying code. The change had an unintended side effect that caused a failure while validating the location of a VM at creation time. This was not detected during the testing phase due to an issue with the testing framework that did not catch this error scenario. We will review the testing framework to be able to catch this sort of failure in the future.
Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): 1. Fix the validation failure of locations while creating ARM VMs – Completed. 2. Review and improve the testing framework to help ensure this sort of failure is detected in the future. 3. Improve deployment methods for portal changes to be able to apply a hotfix much faster. 4. Improve telemetry by adding more alerting around failures in key scenarios.
Provide feedback: Please help us improve the Azure customer communications experience by taking our survey https://survey.microsoft.com/153975
Summary of impact: Between 15:20 and 16:57 UTC on 16 Nov 2016, a subset of customers utilizing resources in the East US region may have experienced failures connecting to those resources. Alerts were received indicating connectivity failures to portions of the East US region. Engineers performing maintenance were able to correlate those alerts with a premium storage scale unit that was undergoing a normal maintenance operation to route traffic to a redundant Software Load Balancer (SLB) instance. After routing the traffic, network connectivity failed, resulting in dependent Virtual Machines (VMs) shutting down, and dependent Azure SQL Databases (DBs) being unavailable. Engineers identified the issue impacting the SLB, updated the configuration, and restored network connectivity to the premium storage scale unit. Dependent VMs, SQL DBs, and services recovered once connectivity was restored, and the impact was mitigated.
Customer impact: Customers in the East US region may have experienced timeouts or errors connecting to resources hosted in a portion of that region. VMs with system or data disks backed by impacted premium storage hosted virtual hard drives (VHDs) would have shutdown after 2 minutes of IO failure by design and restarted after connectivity was restored.
•Azure SQL DBs may have been unavailable for some customers.
•App Services (Web Apps) which depend on Azure SQL DB may have been unable to connect or query databases.
•Azure Search services may have experienced degraded performance and unavailability.
•Storage read operations would have been possible from the secondary cluster; however, no workarounds for VMs or write operations were possible during this time.
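For storage accounts with read-access geo-redundancy, the read-only secondary endpoint mentioned above is reachable at the account name suffixed with "-secondary". A minimal Python sketch of deriving it (the account name is a placeholder):

```python
def secondary_blob_endpoint(account_name):
    """Derive the read-only secondary Blob endpoint for a storage account.

    For read-access geo-redundant (RA-GRS) accounts, Azure exposes a
    read-only secondary at "<account>-secondary" on the same domain.
    """
    return f"https://{account_name}-secondary.blob.core.windows.net"

# "contoso" is a hypothetical account name for illustration.
print(secondary_blob_endpoint("contoso"))
# https://contoso-secondary.blob.core.windows.net
```

This only helps read workloads; as noted above, writes and VM disk traffic had no workaround during the impact window.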
Root cause and mitigation: The Software Load Balancer (SLB) service for the premium storage scale unit depends on a configuration service for proper initialization. A prior incident, which impacted only the configuration service, left an incomplete configuration in place for this SLB instance. As part of the normal maintenance operation performed to shift traffic to a redundant SLB service, it was found that the redundant service was not able to properly initialize from the configuration service due to the incomplete configuration. This caused the initialization to fail, and the traffic which should have shifted to this SLB failed as well. Specifically, the SLB services were unable to program forwarding routes correctly, which resulted in the Virtual IPs served by this SLB instance becoming unreachable. The services were recovered by fixing the configuration service's data and triggering a configuration reload for the impacted SLB services by shifting traffic to the redundant SLB service.
Next steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
- Improved configuration deployment mechanism is being rolled out enabling additional validation before configuration gets pushed to an SLB.
- Improvements to diagnosis and recovery to reduce impact duration.
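As a generic sketch of the first improvement, a deployment pipeline can reject an incomplete configuration before it is ever pushed to an SLB instance, rather than discovering the gap at initialization time. The field names below are hypothetical, not Azure's actual schema:

```python
# Hypothetical required fields for an SLB instance configuration.
REQUIRED_KEYS = {"vip_ranges", "forwarding_rules", "health_probe"}

def validate_slb_config(config):
    """Reject a configuration with missing or empty required fields
    before it is pushed downstream.
    """
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"incomplete SLB config, missing: {sorted(missing)}")
    return True

# A complete config passes validation; an incomplete one is rejected up front.
assert validate_slb_config(
    {"vip_ranges": ["10.0.0.0/24"],
     "forwarding_rules": [{"vip": "10.0.0.4", "backend": "scale-unit-1"}],
     "health_probe": {"port": 80}}
)
```

Failing fast at push time converts a customer-visible outage into a rejected change.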
Provide feedback: Please help us improve the Azure customer communications experience by taking our survey https://survey.microsoft.com/230176
Summary of impact: Between 19:38 and 21:05 UTC on 09 Nov 2016, a subset of customers using Visual Studio Team Services \ Build in multiple regions may have experienced longer than usual build times.
Preliminary root cause: Engineers identified a recent deployment which caused latency when creating the Virtual Machines that process Build requests, resulting in slower than usual Build processing times.
Mitigation: The deployment was rolled back which mitigated the issue.
Next steps: Engineers will continue to investigate the underlying root cause of this issue and develop a solution to prevent recurrences.
Summary of impact: Between 22:42 UTC on 27 Oct 2016 and 00:01 UTC on 28 Oct 2016, a very limited subset of customers using Azure services in the North Central US region may have experienced failures or timeouts for outbound connections from their Azure services. Connectivity from within Azure resources to other Azure resources in the same region was not affected.
Preliminary root cause: A deployment operation created an unexpected race condition which resulted in incomplete programming in a portion of the software load balancing (SLB) service.
Mitigation: Engineers rolled back the deployment, which restored the SLB service.
Next steps: Engineers will examine the deployment code to understand why it caused the unexpected race condition.
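The deployment race itself is internal to the Azure platform, but the general class of bug is a non-atomic check-then-set on shared state during concurrent programming steps. A generic Python sketch of the standard fix, using a lock to make each programming step atomic (the RouteTable class is purely illustrative, not an Azure component):

```python
import threading

class RouteTable:
    """Illustrative shared state programmed by concurrent workers."""

    def __init__(self):
        self.routes = {}
        self._lock = threading.Lock()

    def program(self, name, nexthop):
        # Holding the lock makes the check-then-set step atomic, so
        # interleaved workers cannot leave the entry half-programmed
        # or overwrite each other's completed work.
        with self._lock:
            if name not in self.routes:
                self.routes[name] = nexthop

table = RouteTable()
threads = [
    threading.Thread(target=table.program, args=(f"vip-{i % 4}", f"hop-{i}"))
    for i in range(16)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(table.routes))  # 4 distinct entries, each programmed exactly once
```

Without the lock, two workers could both observe the entry as absent and race on the write, which is the kind of incomplete programming described in the preliminary root cause.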
Summary of impact: Between 14:10 and 14:34 UTC, and subsequently between 15:35 and 15:52 UTC on 26 Oct 2016, a subset of customers with services hosted in East Asia experienced degraded performance, latency or time-outs when accessing their resources located in the East Asia region. New service creation may also have failed during this time. Some customer Virtual Machines (VM) were also rebooted. This was due to loss of network connectivity between a subset of network devices in East Asia. The impact to customers was mitigated by engineers performing a failover to take the incorrectly configured device out of service, which allowed network traffic to flow through to other devices.
Customer impact: Customers in East Asia would have experienced either partial or complete loss of connectivity to the impacted resources, and due to the loss of connectivity between Virtual Machines and storage resources, VMs would have been shut down to prevent data loss/corruption. Upon restoration of connectivity, the impacted VMs were automatically restarted.
•A subset of customers may have experienced longer recovery times for their VMs to safely start back up.
•RemoteApp users may not have been able to access their applications.
•WebApps customers may have observed HTTP 500 errors.
•Customers using Event Hubs may have experienced "unable to connect to service" errors.
•Stream Analytics users may have seen HTTP 500 errors and would not have been able to create new deployments for Stream Analytics.
•Customers using IoT services may not have been able to send telemetry data.
•Customers using Site Recovery may have been unable to register their Virtual Machine Manager servers.
Root cause and mitigation: This incident occurred during a planned regional expansion maintenance. This maintenance work includes applying configuration to devices that are to become part of the Azure network fabric. Due to a sequencing error in the plan of this maintenance, a network connection was enabled prior to it being properly configured to carry customer traffic. The premature enablement caused a large aggregate route to be announced, which prevented reachability to a subset of internal destinations. Engineers disabled the device to restore reachability.
Next steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
1. Modifying network route aggregation policy.
2. Improving the change management methodology and validation for high risk changes.
3. Improving testing and modeling of regional scale-out changes.
Provide feedback: Please help us improve the Azure customer communications experience by taking our survey https://survey.microsoft.com/207400
Summary of impact: Between 09:00 and 12:32 UTC on 20 Oct 2016, customers using Microsoft Azure Portal (portal.azure.com) may have seen the 'Resize', 'Diagnostics', 'Load Balancer' and 'Availability sets' buttons greyed out for Classic Virtual Machines. Deployed resources were not impacted. The options were still available through PowerShell and the Classic Portal (manage.windowsazure.com).
Preliminary root cause: A recent deployment was causing issues for specific functions in the Azure Portal.
Mitigation: Engineers rolled back the deployment to mitigate the issue.
Next steps: Engineers will continue to investigate the underlying root cause of this issue and develop a solution to prevent recurrences.