Azure status history

July 2017

7/21

RCA - Network Infrastructure - UK South

Summary of impact: Between July 20, 2017 21:41 UTC and July 21, 2017 01:40 UTC, a subset of customers may have encountered connectivity failures for their resources deployed in the UK South region. Customers would have experienced errors or timeouts while accessing their resources. Upon investigation, the Azure Load Balancing team found that the data plane for one of the instances of the Azure Load Balancing service in the UK South region was down. A single instance of the Azure Load Balancing service has multiple data plane instances; all of them went down in quick succession and failed repeatedly while trying to self-recover. The team immediately began working on the mitigation to fail over from the offending Azure Load Balancing instance to another instance of the service. This failover was delayed because the VIP address of the Azure authentication service used to secure access to Azure production services in that region was also served by the Azure Load Balancing instance that went down. The engineering teams resolved the access issue and then recovered the impacted Azure Load Balancing instance by failing the impacted customers over to another instance of the service. Dependent services recovered gradually once the underlying load balancing instance was recovered. Full recovery of all affected services was confirmed by 01:40 UTC on 21 July 2017.

Workaround: Customers who had deployed their services across multiple regions could have failed over out of the UK South region.

Root cause and mitigation: The issue occurred when one of the instances of the Azure Load Balancing service went down in the UK South region. The root cause was a bug in the Azure Load Balancing service, exposed by a specific combination of configurations on this load balancing instance together with a deployment specification that caused the data plane of the service to crash. Each instance of the Azure Load Balancing service has multiple data plane instances; however, due to this bug, the crash cascaded through multiple data plane instances. The issue was mitigated by failing over from the affected load balancing instance to another load balancing instance. The software bug was not detected in deployments to prior regions because it only manifested under specific combinations of configuration in the Azure Load Balancing service. The configuration combination that exposed the bug was addressed as part of recovering the affected Azure Load Balancing service instance.
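
To make the recovery path concrete, the sketch below illustrates the general failover pattern described above: detect that every data plane of a load balancing instance is down, then move traffic to a healthy instance. The class and method names are hypothetical illustrations for this sketch, not the actual Azure Load Balancing control plane.

```python
# Minimal sketch of data-plane health detection and instance failover.
# All names here are hypothetical; this is not the Azure control plane API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataPlane:
    healthy: bool = True

@dataclass
class LoadBalancerInstance:
    name: str
    data_planes: List[DataPlane] = field(default_factory=list)

    def is_down(self) -> bool:
        # The incident above: every data plane instance failed in quick succession.
        return all(not dp.healthy for dp in self.data_planes)

def fail_over(current: LoadBalancerInstance,
              candidates: List[LoadBalancerInstance]) -> LoadBalancerInstance:
    """Pick a healthy instance to take over traffic from a failed one."""
    if not current.is_down():
        return current
    for candidate in candidates:
        if not candidate.is_down():
            return candidate
    raise RuntimeError("no healthy load-balancing instance available")

# Example: one instance with all data planes down fails over to a healthy standby.
failed = LoadBalancerInstance("lb-instance-1", [DataPlane(False), DataPlane(False)])
standby = LoadBalancerInstance("lb-instance-2", [DataPlane(True), DataPlane(True)])
print(fail_over(failed, [standby]).name)  # -> lb-instance-2
```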

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, we will:
1. Roll out a fix for the bug which caused the Azure Load Balancing data plane to crash. In the interim, a temporary mitigation has been applied to prevent this bug from resurfacing in any other region.
2. Improve test coverage for the specific combination of configuration that exposed the bug.
3. Address operational issues for Azure Authentication services break-glass scenarios.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

7/20

App Service/Web apps - West Europe

Summary of impact: Between 12:00 and 15:09 UTC on 20 Jul 2017, a subset of customers using App Service \ Web Apps in West Europe may have experienced timeouts or high latency when accessing Web App deployments hosted in this region.

Preliminary root cause: Engineers determined that a spike in traffic was preventing requests from completing.

Mitigation: Engineers confirmed that the issue was self-healed by the Azure platform.

Next steps: Engineers will review testing procedures to prevent future reoccurrences.

7/19

RCA - SQL Database - West Europe

Summary of impact: Between 07:44 and 10:30 UTC on 19 Jul 2017, customers using SQL Database in West Europe who have Auditing turned on may have experienced issues accessing services. New connections to existing databases in this region may have resulted in an error or timeout, and existing connections may have been terminated. Customers using Table Auditing (secure connection string) may not have had any auditing, and may have experienced query latencies and intermittent disconnects. Engineering initially understood this issue to be between 04:00 and 10:48 UTC; however, upon further investigation it was determined that the impact window was narrower, starting at 07:44 and ending at 10:30 UTC.

Workaround: During this time, customers could have switched to Blob Auditing for better auditing and threat detection. This would also have improved latency and performance.

Root cause and mitigation: On 19 July 2017 at 07:44 UTC, a scheduled storage key rotation took place on the SQL Table Auditing component in West Europe. The key rotation process was meant to be transparent to customers; however, due to a configuration error, an incorrect key was applied. This resulted in latencies and intermittent disconnects for customers using table auditing. Shortly after the rotation started, the monitoring system triggered alerts, and engineers worked on mitigating the incident by re-configuring the component with the correct key. Full recovery occurred at approximately 10:30 UTC. As part of the investigation, gaps were detected in the component's ability to handle such situations, as well as in its ability to recover from them more quickly. In addition, the rotation process was stopped worldwide until a more robust process is put in place.
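
As an illustration of what a more robust rotation step could look like, the sketch below validates a candidate storage key before applying it, so a mis-configured key is rejected rather than rolled out. The helper names and the validation probe are assumptions for illustration, not the auditing component's real configuration API.

```python
# Hypothetical sketch: reject an invalid key instead of applying it.
class KeyValidationError(Exception):
    pass

def validate_key(storage_account: str, candidate_key: str) -> bool:
    """Stand-in probe: a real system would attempt an authenticated storage operation."""
    return candidate_key.startswith("valid-")

def rotate_auditing_key(storage_account: str, current_key: str, new_key: str) -> str:
    if not validate_key(storage_account, new_key):
        # The caller keeps using current_key; auditing traffic is never broken.
        raise KeyValidationError(f"new key rejected for {storage_account}")
    # Apply only after the candidate key has been proven to work.
    return new_key

# Example: an incorrect key is caught before it reaches the component.
try:
    rotate_auditing_key("auditstore-westeurope", "valid-old", "wrong-key")
except KeyValidationError as err:
    print("rotation aborted:", err)
```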

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1) Improve the key rotation process to ensure no impact.
2) Improve component functionality when storage access is broken.
3) Improve the re-configuration method for faster mitigation.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

7/19

Service Map - East US

Summary of impact: Between 15:41 and 23:40 UTC on 18 Jul 2017, customers using Service Map in East US may not have been able to see data for their list of virtual machines when selecting the Service Map tile.

Preliminary root cause: Engineers are still investigating the underlying cause; however, increased load was observed on backend systems.

Mitigation: Engineers re-routed traffic from the affected cluster to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

7/15

RCA - Virtual Machines - East US 2

Summary of impact: Between 03:45 and 16:08 UTC on 15 Jul 2017, a subset of customers using Virtual Machines, and additional services that leverage the same Storage resources in East US 2, may have experienced difficulties connecting to resources hosted in this region. A very limited subset of customers experienced an extended recovery period as engineers worked to recover individual nodes that experienced residual impact beyond 16:08 UTC on 15 Jul 2017. Customers who experienced residual impact were notified directly in their Management Portal. The root cause was a single scale unit losing power after a power breaker tripped during maintenance. The scale unit was on single-source power during the maintenance.

Root cause and mitigation: A planned power maintenance in the datacenter was underway, which required power to be single-sourced to the scale unit for a portion of that maintenance. During that window, a circuit breaker along that single source tripped, which in turn resulted in the loss of that single-sourced feed until the primary feed was brought back up to mitigate the issue. Because the failure occurred mid-maintenance, safe power restore processes, which take much more time than automated self-healing methods, were performed to ensure that no impact to the data on the scale units occurred while moving back to the primary power feed. Following power restoration, manual health validations were performed on the storage tables of the scale unit to ensure the success of those safe restoration processes and ensure zero impact to data integrity.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Platform and our processes to help ensure such incidents do not occur in the future. In this case, we will:
1. Improve monitoring and alerting visibility when in maintenance modes [In Progress]
2. Replace the tripped breaker with a new breaker that has passed validation [Completed]
3. Provide the tripped breaker to the OEM for root cause analysis [Completed]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

7/12

RCA - Network Infrastructure - North Central US and Central US

Summary of impact: Between 06:26 and 06:46 UTC on 12 Jul 2017, a subset of customers with resources in North Central US and Central US may have experienced degraded performance, network drops, or timeouts when accessing their Azure resources hosted in these regions. This issue may have also impacted a subset of customer VPN Gateways. Engineers determined that an underlying network infrastructure issue caused this incident. During the impact window, a planned firmware update was in progress on WAN network devices in these regions.

Customer impact: Due to the outbound connectivity failure, all inter-region traffic destined for and sourced out of these regions, including traffic to the Internet and other Microsoft services, was impacted during this event. All intra-region traffic would have remained unaffected. A subset of customers may have experienced impact to the following services: App Service \ Web Apps, Azure Site Recovery, Azure Backup, HDInsight, Azure SQL Database, Azure Data Lake Store, Azure Cosmos DB, Visual Studio Team Services.

Root cause and mitigation: As part of a planned upgrade on the North Central US WAN, the maintenance engineer used an automation tool to load software on the redundant WAN device in the Central US region. During the maintenance, the engineer inadvertently selected a software deployment mode that immediately invoked the device upgrade after the software was uploaded. This resulted in a brief period where both devices were reloading causing a disruption to WAN routing between North Central US, Central US regions, and the Internet. The incident was mitigated when both devices reloaded normally. Azure VPN gateways use storage blob leases in multiple regions to maintain primary election between the two instances. Blob leases from North Central and Central US regions became unavailable when the connectivity was broken. This resulted in primary election failures for the gateways, sometimes in different regions. The affected gateways lost cross-premises and VNet-to-VNet connectivity. The gateways and VPN tunnels automatically recovered once the WAN connectivity was restored.
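
The first next-step item below amounts to a guard against reloading a WAN device while its redundant peer is out of service. The sketch below illustrates that guard in a simplified form; the device model and helper names are hypothetical, not the actual deployment tooling.

```python
# Hypothetical sketch: refuse an immediate reload when the redundant peer is down.
from dataclasses import dataclass

@dataclass
class WanDevice:
    name: str
    region: str
    in_service: bool = True

def safe_upgrade(device: WanDevice, redundant_peer: WanDevice,
                 immediate: bool = False) -> None:
    """Stage new software; block an immediate reload if the peer cannot absorb traffic."""
    if immediate and not redundant_peer.in_service:
        raise RuntimeError(
            f"refusing immediate upgrade of {device.name}: redundant peer "
            f"{redundant_peer.name} is not carrying traffic")
    if immediate:
        print(f"software loaded; reloading {device.name} now")
    else:
        print(f"software staged on {device.name}; reload deferred to a later maintenance step")

# Example: the incident scenario, where the redundant device was itself reloading.
ncus = WanDevice("wan-ncus-1", "North Central US", in_service=False)
cus = WanDevice("wan-cus-1", "Central US")
try:
    safe_upgrade(cus, ncus, immediate=True)
except RuntimeError as err:
    print("blocked:", err)
```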

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, we will:
- Conduct an immediate review of our deployment processes and automation software to prevent the reloading of redundant devices in different regions.
- Perform an analysis of other upgrade modes to ensure that similar issues are not present in other parts of the software.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

7/10

Visual Studio Team Services - Intermittent Authentication Issues

Summary of impact: Between approximately 15:20 and 20:40 UTC on 10 Jul 2017, a subset of customers may have experienced intermittent issues when attempting to authenticate to their Visual Studio Team Service accounts.

Preliminary root cause: Engineers determined that increased authentication requests to a secondary service were throttled, which prevented additional authentication requests from completing.

Mitigation: Engineers manually updated a backend service to allow throttled authentication requests to pass, and performed a refresh of a backend system to clear previously queued requests that had not completed.
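
Although the fix here was applied server-side, the usual client-side pattern for riding out this kind of throttling is to retry with jittered exponential backoff, sketched below. The authenticate() helper is a hypothetical stand-in, not the Visual Studio Team Services API.

```python
# Hypothetical sketch: retry a throttled call with jittered exponential backoff.
import random
import time

class ThrottledError(Exception):
    pass

def authenticate(token: str) -> str:
    # Stand-in for the real call: simulate the downstream service throttling ~30% of requests.
    if random.random() < 0.3:
        raise ThrottledError("429: too many authentication requests")
    return "session-for-" + token

def authenticate_with_backoff(token: str, max_attempts: int = 5) -> str:
    delay = 0.1
    for attempt in range(1, max_attempts + 1):
        try:
            return authenticate(token)
        except ThrottledError:
            if attempt == max_attempts:
                raise
            time.sleep(delay + random.uniform(0, delay))  # jittered backoff
            delay *= 2
    raise RuntimeError("unreachable")

print(authenticate_with_backoff("user-token"))
```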

Next steps: Engineers will further determine how to prevent these throttling issues from occurring in the future. The following blog will also be updated with further information about this issue:

7/7

RCA - Network Infrastructure - Southeast Asia

Summary of impact: Between 16:21 and 17:49 UTC on 07 Jul 2017, a subset of customers in Southeast Asia may have experienced degraded performance, network connection drops, or timeouts when accessing their Azure resources hosted in this region. Engineers have determined that an underlying network infrastructure issue caused this incident. An update administered by an automated system caused an error in the network configuration, which gradually reduced the available network capacity, resulting in congestion and packet loss. The incident was mitigated when the system update was stopped, and the network configuration was manually updated to restore normal network conditions.

Customers may have experienced impact to the following services:
App Service (Web Apps)
Application Insights
Azure Backup
Azure SQL DB
Azure Search
Azure Storage
Event Hubs
Log Analytics
HDInsight
Redis Cache
Stream Analytics
Virtual Machines

Root cause and mitigation: The servers in the impacted datacenter in Southeast Asia are connected via a datacenter network spine with multiple routers. The automated system updating the configuration of the routers contained a logic error that resulted in each updated router losing connectivity with the lower-level routers below the spine. Due to an unrelated and pre-existing error, the routers in the datacenter spine were incorrectly recorded as being in initial provisioning. This caused the built-in safety checks to be skipped, leading to routers continuing to be updated and losing connectivity. This gradually removed network capacity from the datacenter spine and impacted customer services in the region. Engineering teams responded to alerts and mitigated the incident by manually updating the network configuration to restore capacity.
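
The skipped safety check described above can be pictured as a capacity gate that must pass before another spine router is taken out for update, applied regardless of the router's recorded lifecycle state. The sketch below is a simplified illustration with hypothetical names, not the actual automation system.

```python
# Hypothetical sketch: enforce a spine capacity check before updating any router,
# even one mis-recorded as being in initial provisioning.
from dataclasses import dataclass
from typing import List

@dataclass
class SpineRouter:
    name: str
    state: str = "production"   # the incident: some routers were recorded as "provisioning"
    in_service: bool = True

def can_update(router: SpineRouter, spine: List[SpineRouter],
               min_in_service_fraction: float = 0.75) -> bool:
    # Count the capacity that would remain if this router were taken out now.
    remaining = sum(1 for r in spine if r.in_service and r is not router)
    return remaining / len(spine) >= min_in_service_fraction

# Example: with one router already drained, taking out a second is refused.
spine = [SpineRouter(f"spine-{i}") for i in range(4)]
spine[0].in_service = False
print(can_update(spine[1], spine))  # -> False
```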

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case we will:

- Conduct a global review of all routers to ensure all production routers are recorded as being in production.
- Update tooling and processes to ensure that router states are correctly recorded throughout the router's lifecycle.
- Update the automated configuration system to enforce safety checks on all updates, even on non-production routers.
- Repair the faulty logic for updating the configuration on the routers.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

7/4

Azure Active Directory - Germany Central and Germany Northeast

Summary of impact: Between approximately 16:15 and 18:55 UTC on 04 Jul 2017, a subset of customers using Azure Active Directory in Germany Central or Germany Northeast may have experienced difficulties when attempting to authenticate into resources which are dependent on Azure Active Directory.

Preliminary root cause: Engineers determined that instances of a backend service reached an operational threshold.

Mitigation: Engineers manually added additional resources to alleviate traffic to the backend service and allow healthy connections to resume.
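
The mitigation is a threshold-driven scale-out: when per-instance utilization crosses an operational threshold, capacity is added until it drops back under. The sketch below illustrates the arithmetic with assumed numbers and a hypothetical helper, not the actual Azure Active Directory backend.

```python
# Hypothetical sketch: compute how many instances bring per-instance load back
# under an operational threshold.
def instances_needed(current_instances: int, utilization: float,
                     threshold: float = 0.8) -> int:
    if utilization <= threshold:
        return current_instances
    total_load = utilization * current_instances
    needed = current_instances
    while total_load / needed > threshold:
        needed += 1
    return needed

# Example: 10 instances at 95% utilization -> scale out to 12 instances.
print(instances_needed(10, 0.95))
```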

Next steps: Engineers will continue to validate full recovery and understand the cause of the initial spike in traffic.

7/3

VPN Gateway - Intermittent connection failures

Summary of impact: Between 23:00 UTC on 30 Jun 2017 and 12:20 UTC on 04 Jul 2017, a subset of customers using VPN Gateway may have experienced intermittent failures when connecting to, or via, their VPN Gateway.

Preliminary root cause: Engineers determined that a recent deployment task impacted a backend service, and this in turn caused some requests to fail. Retries would likely have succeeded during this time, and a workaround of completing a double-reset of the gateway was also available. 

Mitigation: Engineers performed a change to the service configuration to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

June 2017

6/30

Log Analytics - East US

Summary of impact: Between 09:00 and 17:00 UTC on 30 Jun 2017, a subset of customers using Log Analytics in East US may have experienced log processing delays with OMS workspaces hosted in this region. Note that a limited subset of customers will still see latency of up to 2 hours as the backend system continues to process data; however, metrics will stabilize after processing is complete.

Preliminary root cause: Engineers determined that instances of a backend service responsible for processing requests became unhealthy, causing a delay in data ingestion.

Mitigation: Engineers manually reconfigured the backend service and validated that the mitigation took effect. Customers would have seen new log processing without delays due to the mitigation.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.