
Azure status history

November 2016

30.11

Traffic Manager impacting Multiple Services

Summary of impact: Between 19:45 and 20:05 UTC on 30 Nov 2016, customers and end-users located in the Asia-Pacific geographic region may have experienced failures in resolving DNS for both Azure customer-hosted services and Azure platform services.

Preliminary root cause: Whilst engineers identified a recent maintenance activity as the preliminary root cause, a full investigation is currently underway.

Next steps: Engineers are continuing to investigate and a detailed Root Cause Analysis will be published within approximately 72 hours.

26.11

Virtual Machines - Central US

Summary of impact: Between 05:15 and 05:53 on 26 Nov 2016, a subset of customers using Virtual Machines in Central US may have experienced connection failures when trying to access Virtual Machines, as well as reduced Storage availability, in this region. Queues, Tables, Blobs, Files, and Virtual Machines with VHDs backed by the impacted storage scale unit were unavailable for the duration of the impact.

Preliminary root cause: Engineers identified an unhealthy storage component that impacted availability.

Mitigation: The impacted systems were self-healed by the Azure platform, and engineers monitored metrics to ensure stability.

Next steps: Investigate the underlying cause and create mitigation steps to prevent future occurrences.

21.11

Microsoft Azure Portal – Errors using Azure Resource Manager to create Virtual Machines

Summary of impact: Between 16:15 UTC and 22:20 UTC on 21 Nov 2016, customers attempting to use Azure Resource Manager (ARM) to create Virtual Machines (VMs) from the Microsoft Azure portal may have received errors and been unable to create ARM VMs. Customers who attempted to provision new Virtual Machine resources through ARM may have been successful using PowerShell, the Azure Command-Line Interface, or REST APIs. Azure Engineering investigated this incident and identified an issue with recent changes to the underlying code. Engineers deployed a hotfix which resolved the issue and ensured that Virtual Machine deployment processes returned to a healthy state.

Customer impact: Customers attempting to use Azure Resource Manager (ARM) to create Virtual Machines (VMs) from the Microsoft Azure portal may have received errors and been unable to create ARM VMs. Customers who attempted to provision new Virtual Machine resources through ARM may have been successful using PowerShell, the Azure Command-Line Interface, or REST APIs. Customers would still have been able to use ARM within the Microsoft Azure portal to deploy other ARM-enabled resources. Some customers may have continued to experience issues after the mitigation of this incident; these were resolved by using a private browsing session to access the Microsoft Azure portal, which cleared the browser cache.

Workaround: Creating ARM VMs by using PowerShell, the Azure Command-Line Interface, or REST APIs could be used as a workaround during this incident, as in the sketch below.
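For illustration only, here is a minimal AzureRM PowerShell sketch of the kind of ARM VM deployment that remained available outside the portal during this incident. All resource names, sizes, image values, and the pre-existing network interface are hypothetical placeholders, not details taken from the incident report.

```powershell
# Minimal sketch: create an ARM VM with the AzureRM PowerShell module (circa 2016),
# bypassing the affected portal blade. All names and values below are placeholders.
Login-AzureRmAccount

$rg  = "example-rg"
$loc = "westus"
New-AzureRmResourceGroup -Name $rg -Location $loc

# Assumes a network interface named "example-nic" already exists in the resource group.
$nic  = Get-AzureRmNetworkInterface -ResourceGroupName $rg -Name "example-nic"
$cred = Get-Credential   # local administrator credentials for the new VM

$vm = New-AzureRmVMConfig -VMName "example-vm" -VMSize "Standard_DS1_v2"
$vm = Set-AzureRmVMOperatingSystem -VM $vm -Windows -ComputerName "example-vm" -Credential $cred
$vm = Set-AzureRmVMSourceImage -VM $vm -PublisherName "MicrosoftWindowsServer" `
        -Offer "WindowsServer" -Skus "2012-R2-Datacenter" -Version "latest"
$vm = Add-AzureRmVMNetworkInterface -VM $vm -Id $nic.Id

New-AzureRmVM -ResourceGroupName $rg -Location $loc -VM $vm
```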

Root cause and mitigation: A recent change to the underlying code had an unintended side effect that caused a failure while validating the VM location at creation time. This was not detected during the testing phase because of an issue with the testing framework, which did not catch this error scenario. We will review the testing framework so that it can catch this sort of failure in the future.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Fix the validation failure of locations while creating ARM VMs – Completed.
2. Review and improve the testing framework to help ensure this sort of failure is detected in the future.
3. Improve deployment methods for portal changes so that a hotfix can be applied much faster.
4. Improve telemetry by adding more alerting around failures in key scenarios.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

16.11

Multiple Services - East US

Summary of impact: Between 15:20 and 16:57 UTC on 16 Nov 2016, a subset of customers utilizing resources in the East US region may have experienced failures connecting to those resources. Alerts were received indicating connectivity failures to portions of the East US region. Engineers performing maintenance were able to correlate those alerts with a premium storage scale unit that was undergoing a normal maintenance operation to route traffic to a redundant Software Load Balancer (SLB) instance. After routing the traffic, network connectivity failed, resulting in dependent Virtual Machines (VMs) shutting down and dependent Azure SQL Databases (DBs) being unavailable. Engineers identified the issue impacting the SLB, updated the configuration, and restored network connectivity to the premium storage scale unit. Dependent VMs, SQL DBs, and services recovered once connectivity was restored, and the impact was mitigated.

Customer impact: Customers in the East US region may have experienced timeouts or errors connecting to resources hosted in a portion of that region. VMs with system or data disks backed by impacted premium storage-hosted virtual hard drives (VHDs) would have shut down, by design, after 2 minutes of I/O failure and restarted after connectivity was restored.

- Azure SQL DBs may have been unavailable for some customers.
- App Services (Web Apps) which depend on Azure SQL DB may have been unable to connect to or query databases.
- Azure Search services may have experienced degraded performance and unavailability.
- Storage read operations would have been possible from the secondary cluster (see the sketch below); however, no workarounds for VMs or write operations were possible during this time.
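As a hypothetical illustration of that read-only fallback, the sketch below pulls a blob through an RA-GRS secondary endpoint. The account, container, and blob names are placeholders, and this simple form assumes the container allows anonymous read access; authenticated reads would instead use the storage SDK or PowerShell module with its context pointed at the -secondary endpoint.

```powershell
# Minimal sketch: read a blob from the RA-GRS secondary endpoint while the
# primary cluster is unreachable. All names below are placeholders, and this
# form assumes the container permits anonymous (public) read access.
Invoke-WebRequest `
    -Uri "https://examplestorage-secondary.blob.core.windows.net/examplecontainer/example.txt" `
    -OutFile "C:\temp\example.txt"
```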

Root cause and mitigation: The Software Load Balancer (SLB) service for the premium storage scale unit depends on a configuration service for proper initialization. A prior incident, which impacted only the configuration service, left an incomplete configuration in place for this SLB instance. As part of a normal maintenance operation performed to shift traffic to a redundant SLB service, it was found that the redundant service was not able to properly initialize from the configuration service due to the incomplete configuration. This caused the initialization, and the traffic that should have shifted to this SLB, to fail. Specifically, the SLB services were unable to program forwarding routes correctly, which resulted in the Virtual IPs served by this SLB instance becoming unreachable. The services were recovered by fixing the configuration service data and triggering a configuration reload for the impacted SLB services, shifting traffic to the redundant SLB service.

Next steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case this includes (but is not limited to):

- An improved configuration deployment mechanism is being rolled out, enabling additional validation before configuration is pushed to an SLB.
- Improvements to diagnosis and recovery to reduce impact duration.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

9.11

Visual Studio Team Services \ Build

Summary of impact: Between 19:38 and 21:05 UTC on 09 Nov 2016, a subset of customers using Visual Studio Team Services \ Build in multiple regions may have experienced longer than usual build times.

Preliminary root cause: Engineers identified a recent deployment that caused latency in creating the Virtual Machines that process Build requests, resulting in slower than usual Build processing times.

Mitigation: The deployment was rolled back, which mitigated the issue.

Next steps: Continue to investigate the underlying root cause of this issue and develop a solution to prevent recurrences.

October 2016

28.10

Virtual Machines - North Central US

Summary of impact: Between 22:42 UTC on 27 Oct 2016 and 00:01 UTC on 28 Oct 2016, a very limited subset of customers using Azure services in the North Central US region may have experienced failures or timeouts for outbound connections from their Azure services. Connectivity from Azure resources to other Azure resources in the same region was not affected.

Preliminary root cause: A deployment operation created an unexpected race condition which resulted in incomplete programming in a portion of the software load balancing (SLB) service.

Mitigation: Engineers rolled back the deployment, which restored the SLB service.

Next steps: Engineers will examine the deployment code to understand why it caused the unexpected race condition.

26.10

Network Infrastructure - Multiple Impacted Services - East Asia

Summary of impact: Between 14:10 and 14:34 UTC, and subsequently between 15:35 and 15:52 UTC on 26 Oct 2016, a subset of customers with services hosted in East Asia experienced degraded performance, latency or time-outs when accessing their resources located in the East Asia region. New service creation may also have failed during this time. Some customer Virtual Machines (VM) were also rebooted. This was due to loss of network connectivity between a subset of network devices in East Asia. The impact to customers was mitigated by engineers performing a failover to take the incorrectly configured device out of service, which allowed network traffic to flow through to other devices.

Customer impact: Customers in East Asia would have experienced either partial or complete loss of connectivity to the impacted resources, and due to the loss of connectivity between Virtual Machines and storage resources, VMs would have been shut down to prevent data loss/corruption. Upon restoration of connectivity, the impacted VMs were automatically restarted.

• A subset of customers may have experienced longer recovery times for their VMs to safely start back up.
• Remote App users may not have been able to access their applications.
• Web Apps customers may have observed HTTP 500 errors.
• Customers using Event Hub may have experienced "unable to connect to service" errors.
• Stream Analytics users may have seen HTTP 500 errors and may have been unable to create new deployments for Stream Analytics.
• Customers using IoT services may not have been able to send telemetry data.
• Customers using Site Recovery may have been unable to register their Virtual Machine Manager servers.
• Azure SQL Database customers may have seen login failures.

Root cause and mitigation: This incident occurred during planned regional expansion maintenance. This maintenance work included applying configuration to devices that are to become part of the Azure network fabric. Due to a sequencing error in the plan for this maintenance, a network connection was enabled prior to being properly configured to carry customer traffic. The premature enablement caused a large aggregate route to be announced, which prevented reachability to a subset of internal destinations. Engineers disabled the device to restore reachability.

Next steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):

1. Modifying network route aggregation policy.
2. Improving the change management methodology and validation for high risk changes.
3. Improving testing and modeling of regional scale-out changes.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

20.10

Microsoft Azure Portal - Multiple Regions

Summary of impact: Between 09:00 and 12:32 UTC on 20 Oct 2016, customers using the Microsoft Azure Portal (portal.azure.com) may have seen the 'Resize', 'Diagnostics', 'Load Balancer' and 'Availability sets' buttons greyed out for Classic Virtual Machines. Deployed resources were not impacted. The options were still available through PowerShell (see the sketch below) and the Classic Portal (manage.windowsazure.com).
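As a hypothetical illustration of that PowerShell path, here is a minimal sketch of resizing a Classic VM with the Azure Service Management module. The subscription, cloud service, VM, and size names are placeholders, not details taken from the incident.

```powershell
# Minimal sketch: resize a Classic (ASM) Virtual Machine from PowerShell while the
# portal button is unavailable. All names and the target size are placeholders.
Add-AzureAccount
Select-AzureSubscription -SubscriptionName "Example Subscription"

Get-AzureVM -ServiceName "example-cloud-service" -Name "example-vm" |
    Set-AzureVMSize -InstanceSize "Standard_D2" |
    Update-AzureVM
```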

Preliminary root cause: A recent deployment was causing issues for specific functions in the Azure Portal.

Mitigation: Engineers rolled back the deployment to mitigate the issue.

Next steps: Engineers will continue to investigate the underlying root cause of this issue and develop a solution to prevent recurrences.

19.10

Network connectivity issues in Japan East

Summary of impact: Between 07:35 and 07:44 UTC on 19 Oct 2016, a subset of customers and end-users would have experienced failures or timeouts when attempting to connect to Azure resources in the Japan East region. Engineers performing maintenance, while inspecting post-maintenance signals, received platform health alerts indicating an unexpected traffic change in the region. As a result, they reverted the update, fully restoring normal traffic patterns in the region. Customers and end-users attempting to connect to Azure resources hosted in the Japan East region from IP addresses in the range 64.0.0.1 to 127.255.255.254 would have been unsuccessful. During the 9 minutes of impact, Azure resources inside the region remained healthy, and impacted customers and end-users would have observed a return to normal operations immediately after engineering reverted the update. Initial analysis identified that connectivity to a subset of Virtual Machines in the region had been impacted. It was later determined that the impact was not limited to Virtual Machines or any specific Azure service.

Preliminary root cause: This incident occurred during regional expansion maintenance. This planned maintenance work included applying configuration to devices that are to become part of the Azure network fabric. An unanticipated interaction between a configuration deployed to a device and how the network device CLI (Command Line Interface) interpreted that configuration caused network traffic from the above-mentioned impacted ranges to fail to reach Azure services. This was caused by a large aggregate route that had been added to this region, which prevented reachability to a subset of internet destinations.

Mitigation: Engineers rolled back the configuration to restore reachability.

Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case this includes (but is not limited to):

1. Impact telemetry review, to improve the timeliness and accuracy of impact assessment and notification.
2. Review and improve the method and automation of device configuration.
3. Implement additional validation of proposed configuration versus device-interpreted configuration, to catch issues before they are committed.

18.10

Visual Studio Team Services - Multiple Regions

Summary of impact: Between 17:52 UTC and 18:03 UTC on 18 Oct 2016, customers using Visual Studio Team Services in multiple regions experienced difficulties connecting to resources hosted in those regions.

Preliminary root cause: Engineers discovered a software error in a recent deployment.

Mitigation: Engineers rolled back the deployment, returning the service to a healthy state. More information can be found here: