Azure status history

July 2017

7.19

SQL Database - West Europe

Summary of impact: Between 04:00 and 10:48 UTC on 19 Jul 2017, a subset of customers using SQL Database in West Europe may have experienced issues accessing services. New connections to existing databases in this region may have resulted in an error or timeout, and existing connections may have been terminated.
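
As general guidance rather than part of the incident report, the sketch below shows the kind of client-side retry logic that can ride out transient connection errors and timeouts like those described above. It assumes the pyodbc driver and an ordinary connection string; the retry counts and delays are illustrative only.

    import time
    import pyodbc  # assumed driver; any DB-API client with transient errors works similarly

    def connect_with_retry(conn_str, attempts=5, base_delay=2.0):
        """Open a SQL Database connection, retrying transient failures with backoff."""
        for attempt in range(1, attempts + 1):
            try:
                return pyodbc.connect(conn_str, timeout=30)
            except pyodbc.OperationalError:
                if attempt == attempts:
                    raise  # give up after the final attempt
                time.sleep(base_delay * 2 ** (attempt - 1))  # wait 2s, 4s, 8s, ...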

Preliminary root cause: At this stage engineers do not have a definitive root cause.

Mitigation: Engineers determined that the issue was self-healed by the Azure platform.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

7.19

Service Map - East US

Summary of impact: Between 15:41 and 23:40 UTC on 18 Jul 2017, customers using Service Map in East US may not have been able to see a list of virtual machines' data when selecting the service map tile.

Preliminary root cause: Engineers are still investigating the underlying cause; however, increased load was observed on backend systems.

Mitigation: Engineers re-routed traffic from the affected cluster to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

7.15

RCA - Virtual Machines - East US 2

Summary of impact: Between 03:45 and 16:08 UTC on 15 July 2017, a subset of customers using Virtual Machines, as well as additional services that leverage the same Storage resources in East US 2, may have experienced difficulties connecting to resources hosted in this region. A very limited subset of customers experienced an extended recovery period as engineers worked to recover individual nodes that experienced residual impact beyond 16:08 UTC on 15 July 2017. Customers who experienced residual impact were notified directly in their Management Portal. The root cause was a single scale unit losing power after a power breaker tripped during maintenance; the scale unit was on single-source power during the maintenance.

Root cause and mitigation: A planned power maintenance was underway in the datacenter, which required power to the scale unit to be single-sourced for a portion of that maintenance. During that window, a circuit breaker along the single-sourced feed tripped, resulting in loss of that feed until the primary feed was brought back up to mitigate the issue. Because the failure occurred mid-maintenance, safe power-restore processes, which take much more time than automated self-healing methods, were used to ensure no impact to the data on the scale unit while moving back to the primary power feed. Following power restoration, manual health validations were performed on the storage tables of the scale unit to confirm the success of those safe restoration processes and to ensure zero impact to data integrity.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Platform and our processes to help ensure such incidents do not occur in the future. In this case, we will:
1. Improve monitoring and alerting visibility when in maintenance modes [In Progress]
2. Replace the tripped breaker with a new breaker that has passed validation [Completed]
3. Provide the tripped breaker to the OEM for root cause analysis [Completed]

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

7.12

RCA - Network Infrastructure - North Central US and Central US

Summary of impact: Between 06:26 and 06:46 UTC on 12 Jul 2017, a subset of customers with resources in North Central US and Central US may have experienced degraded performance, network drops, or timeouts when accessing their Azure resources hosted in these regions. This issue may have also impacted a subset of customer VPN Gateways. Engineers have determined that an underlying network infrastructure issue caused this incident. During the impact window, a planned firmware update was in progress on WAN network devices in these regions.

Customer impact: Due to the outbound connectivity failure, all inter-region traffic destined for and sourced out of these regions, including traffic to the Internet and other Microsoft services, was impacted during this event. All intra-region traffic would have remained unaffected. A subset of customers may have experienced impact for the following services: App Service \ Web Apps, Azure Site Recovery, Azure Backup, HDInsight, Azure SQL Database, Azure Data Lake Store, Azure Cosmos DB, Visual Studio Team Services.

Root cause and mitigation: As part of a planned upgrade on the North Central US WAN, the maintenance engineer used an automation tool to load software on the redundant WAN device in the Central US region. During the maintenance, the engineer inadvertently selected a software deployment mode that immediately invoked the device upgrade after the software was uploaded. This resulted in a brief period during which both devices were reloading, causing a disruption to WAN routing between the North Central US and Central US regions and the Internet. The incident was mitigated when both devices reloaded normally. Azure VPN gateways use storage blob leases in multiple regions to maintain primary election between their two instances. Blob leases from the North Central US and Central US regions became unavailable when connectivity was broken. This resulted in primary election failures for the gateways, sometimes in different regions. The affected gateways lost cross-premises and VNet-to-VNet connectivity. The gateways and VPN tunnels automatically recovered once WAN connectivity was restored.
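
The gateway-internal election mechanism is not public; purely as an illustration of blob-lease-based primary election, the sketch below uses the azure-storage-blob SDK with hypothetical container and blob names. When the storage account is unreachable, as it was while inter-region connectivity was broken, the acquire call fails and the instance cannot confirm the primary role.

    from azure.storage.blob import BlobClient

    LEASE_SECONDS = 60  # blob leases run 15-60 seconds, or -1 for infinite

    def try_become_primary(conn_str: str) -> bool:
        """Attempt to take the primary role by acquiring a lease on a well-known blob.

        The container and blob names are hypothetical and the blob is assumed to
        already exist; this is not the actual gateway implementation.
        """
        blob = BlobClient.from_connection_string(
            conn_str, container_name="election", blob_name="primary-lock")
        try:
            blob.acquire_lease(lease_duration=LEASE_SECONDS)  # raises if the lease is held elsewhere
            return True
        except Exception:
            return False  # lease held by the other instance, or storage unreachable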

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, we will:
- Conduct an immediate review of our deployment processes and automation software to prevent the reloading of redundant devices in different regions.
- Perform an analysis of other upgrade modes to ensure that similar issues are not present in other parts of the software.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

7.10

Visual Studio Team Services - Intermittent Authentication Issues

Summary of impact: Between approximately 15:20 and 20:40 UTC on 10 Jul 2017, a subset of customers may have experienced intermittent issues when attempting to authenticate to their Visual Studio Team Services accounts.

Preliminary root cause: Engineers determined that increased authentication requests to a secondary service were throttled, which prevented additional authentication requests from completing.
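
One common way to implement the kind of throttling described above is a token bucket; the sketch below is illustrative only and is not the actual Visual Studio Team Services backend mechanism, whose limits are not public.

    import time

    class TokenBucket:
        """Minimal token-bucket throttle: requests beyond `rate` per second are rejected."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate          # tokens replenished per second
            self.capacity = capacity  # maximum burst size
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False  # throttled: the caller must retry later or fail the request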

Mitigation: Engineers manually updated a backend service to allow throttled authentication requests to pass and performed a refresh of a backend system to mitigate previously queued requests that had not completed.

Next steps: Engineers will investigate further to determine how to prevent these throttling issues from occurring in the future. The following blog will also be updated with further information about this issue:

7.7

RCA - Network Infrastructure - Southeast Asia

Summary of impact: Between 16:21 and 17:49 UTC on 07 Jul 2017, a subset of customers in Southeast Asia may have experienced degraded performance, network connection drops, or timeouts when accessing their Azure resources hosted in this region. Engineers have determined that an underlying network infrastructure issue caused this incident. An update administered by an automated system caused an error in the network configuration, which gradually reduced the available network capacity, resulting in congestion and packet loss. The incident was mitigated when the system update was stopped and the network configuration was manually updated to restore normal network conditions.

Customers may have experienced impact to the following services:
App Service (Web Apps)
Application Insights
Azure Backup
Azure SQL DB
Azure Search
Azure Storage
Event Hubs
Log Analytics
HDInsight
Redis Cache
Stream Analytics
Virtual Machines

Root cause and mitigation: The servers in the impacted datacenter in Southeast Asia are connected via a datacenter network spine with multiple routers. Automated systems updating the configuration of the routers contained a logic error that resulted in each updated router losing connectivity with the lower-level routers below the spine. Due to an unrelated and pre-existing error, the routers in the datacenter spine were incorrectly recorded as being in initial provisioning.
This caused the built-in safety checks to be skipped, leading to routers continuing to be updated and losing connectivity. This gradually removed network capacity from the datacenter spine, impacting customer services in the region. Engineering teams responded to alerts and mitigated the incident by manually updating the network configuration to restore capacity.
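
As an illustration only (the field names, thresholds, and logic below are hypothetical, not Azure's actual automation tooling), a pre-update safety gate of the kind described above could be enforced for every router regardless of its recorded lifecycle state, so that an inventory error cannot cause the check to be skipped:

    MIN_HEALTHY_FRACTION = 0.75  # never take spine capacity below this fraction

    def can_update(router: dict, spine_routers: list) -> bool:
        """Return True only if it is safe to push a configuration update to this router."""
        # Enforce the check for every router, whatever its recorded lifecycle state,
        # so an inventory error cannot cause the safety gate to be skipped.
        healthy_peers = [r for r in spine_routers if r["reachable"] and r is not router]
        if len(healthy_peers) / max(len(spine_routers), 1) < MIN_HEALTHY_FRACTION:
            return False  # updating now would remove too much capacity from the spine
        if router["state"] != "production" and router["carries_traffic"]:
            return False  # inventory disagrees with reality; stop and flag for review
        return True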

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case we will:

- Conduct a global review of all routers to ensure all production routers are recorded as being in production.
- Update tooling and processes to ensure that router states are correctly recorded throughout each router's lifecycle.
- Update the automated configuration system to enforce safety checks on all updates, even on non-production routers.
- Repair the faulty logic for updating the configuration on the routers.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

7.4

Azure Active Directory - Germany Central and Germany Northeast

Summary of impact: Between approximately 16:15 and 18:55 UTC on 04 Jul 2017, a subset of customers using Azure Active Directory in Germany Central or Germany Northeast may have experienced difficulties when attempting to authenticate into resources which are dependent on Azure Active Directory.

Preliminary root cause: Engineers determined that instances of a backend service reached an operational threshold.

Mitigation: Engineers manually added additional resources to alleviate the traffic to the backend service and allow healthy connections to resume.

Next steps: Engineers will continue to validate full recovery and understand the cause of the initial spike in traffic.

7.3

VPN Gateway - Intermittent connection failures

Summary of impact: Between 23:00 UTC on 30 Jun 2017 and 12:20 UTC on 04 Jul 2017, a subset of customers using VPN Gateway may have experienced intermittent failures when connecting to, or via, their VPN Gateway.

Preliminary root cause: Engineers determined that a recent deployment task impacted a backend service, and this in turn caused some requests to fail. Retries would likely have succeeded during this time, and a workaround of completing a double-reset of the gateway was also available. 
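
Purely as an illustration of that workaround (the resource names are placeholders, and the wait time is a rough allowance for the first reset to finish), the double reset can be driven from the Azure CLI, for example via a small Python wrapper:

    import subprocess
    import time

    def double_reset_gateway(name: str, resource_group: str, wait_seconds: int = 300):
        """Reset a VPN gateway twice so that both gateway instances are restarted."""
        cmd = ["az", "network", "vnet-gateway", "reset",
               "--name", name, "--resource-group", resource_group]
        subprocess.run(cmd, check=True)  # first reset restarts the active instance
        time.sleep(wait_seconds)         # allow the first reset to complete
        subprocess.run(cmd, check=True)  # second reset restarts the other instance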

Mitigation: Engineers performed a change to the service configuration to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

June 2017

6.30

Log Analytics - East US

Summary of impact: Between 09:00 and 17:00 UTC on 30 Jun 2017, a subset of customers using Log Analytics in East US may have experienced log processing delays with OMS workspaces hosted in this region. Note that a limited subset of customers will still see latency of up to 2 hours as the backend system continues to process data; however, metrics will stabilize after processing is complete.

Preliminary root cause: Engineers determined that instances of a backend service responsible for processing requests became unhealthy, causing a delay in data ingestion.

Mitigation: Engineers manually reconfigured the backend service and validated that the mitigation took effect. Following the mitigation, new logs would have been processed without delays.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

6.29

Virtual Machines and HDInsight - West Europe

Summary of impact: Between approximately 08:00 and 17:00 UTC on 29 Jun 2017, a subset of customers using Virtual Machines in West Europe may have received error notifications when provisioning Dv2-series Virtual Machines in this region. Starting Virtual Machines from a "Stopped (Deallocated)" state may have also returned errors. The availability of existing Virtual Machines was not impacted. Additionally, a subset of customers using HDInsight in West Europe may have also received intermittent deployment failure notifications when creating new HDInsight clusters in this region.

Preliminary root cause: Engineers have determined that a subset of scale units had reached an operational threshold and a backend networking device was found to be in an unhealthy state.

Mitigation: Engineers engaged in multiple workstreams to address these issues and added additional resources to the scale units to mitigate the impact to Virtual Machine and HDInsight customers.

Next steps: Engineers are engaged and continuing to apply other long-term mitigation steps, including addressing the unhealthy networking device and increasing resources for the region, to fully resolve this issue.

6.27

Media Services \ Streaming - Possible streaming performance issues

Summary of impact: Between approximately 14:19 and 22:20 UTC on 27 Jun 2017, a subset of customers using Media Services \ Streaming in West US and East US may have experienced degraded performance when streaming live and on-demand media content. Channel operations may also have experienced latency or failures. Media Services Encoding was not impacted by this issue.

Preliminary root cause: Engineers determined that some instances of a backend service had reached an operational threshold, which prevented operations from completing.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.