Azure status history

June 2017

6/23

App Service \ Web Apps - West US 2

Summary of impact: Between 06:12 and 08:16 UTC on 23 Jun 2017, a subset of customers using App Service \ Web Apps in West US 2 may have experienced difficulties connecting to resources hosted in this region.

Preliminary root cause: Engineers identified a recent deployment task as the potential root cause.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue.

Next steps: Engineers will review deployment procedures to prevent future occurrences.

6/17

Visual Studio Application Insights - Data Access Issues

Summary of impact: Between 22:59 and 23:59 UTC on 16 Jun 2017, customers using Application Insights in East US, North Europe, South Central US and West Europe may have experienced Data Access Issues in the Azure portal and the Application Analytics portal. The following data types were affected: Availability, Customer Event, Dependency, Exception, Metric, Page Load, Page View, Performance Counter, Request, Trace. Please refer to the Application Insights Status Blog for more information.

Preliminary root cause: Engineers identified a backend configuration error as the potential root cause.

Mitigation: Engineers manually reconfigured the backend service to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

6/11

RCA - Network Infrastructure Impacting Multiple Services - Australia East

Summary of impact: Between 1:48 and 5:15 UTC on June 11, 2017, a subset of customers may have encountered reachability failures for some of their resources in Australia East. Starting at 1:48 UTC, outbound network connectivity in the region was interrupted for a subset of customers, which caused DNS lookups and Storage connections to fail; as a result, some customer Virtual Machines shut down and restarted. Engineers investigated and found that a deployment of a new instance of the Azure Load Balancer service was underway in the region. At 3:40 UTC, the new deployment was disabled, after which customer impact for networking services was mitigated. A subset of impacted services experienced a delayed recovery, and engineers confirmed all services were fully mitigated at 5:15 UTC. Deployment activities were paused until a full RCA could be completed and the necessary repairs were made.

Customer impact: Due to the outbound connectivity failure, impacted services that depend on resources internal and/or external to Azure would have seen reachability failures for the duration of this incident. Customer Virtual Machines would have shut down as designed after losing connectivity to their corresponding Storage services. A subset of customers would have experienced the following impact for affected services:
Redis Cache - unable to connect to Redis caches
Azure Search - search service unavailable
Azure Media Services - media streaming failures
App Services - availability close to 0%
Backup - backup management operation failures
Azure Stream Analytics - cannot create jobs
Azure Site Recovery - protection and failover failures
Visual Studio Team Services - connection failures
Azure Key Vault - connection failures

Root cause and mitigation: A deployment of configuration settings for the Azure Load Balancer service caused an outage when a component of the new deployment began advertising address ranges already advertised by an existing instance of the Azure Load Balancer service. The new deployment was a shadow environment specifically configured to be inactive (and hence, not to advertise any address ranges) in terms of customer traffic. Because the route advertisement from this new service was more specific than the existing route advertisement, traffic was diverted to the new inactive deployment. When this occurred, outbound connectivity was interrupted for a subset of services in the region. The event was triggered by a software bug within the new deployment. The new deployment was passively ingesting state from the existing deployment and validating it. A configuration setting was in place to prevent this ingested state from going beyond validation and resulting in route advertisement. Due to the software bug, this configuration setting failed. The software bug was not detected in deployments to prior regions because it only manifested under specific combinations of the service's own address-range configuration. Australia East was the first region in which this combination occurred.
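
The diversion described above follows from standard longest-prefix-match route selection: when two advertisements cover the same destination, routers prefer the more specific prefix. The following is a minimal Python sketch of that selection rule using hypothetical address ranges and next-hop names; it illustrates the mechanism only and is not Azure's routing implementation.

```python
# Minimal sketch of longest-prefix-match route selection (illustration only,
# not Azure's implementation). Prefixes and next-hop names are hypothetical.
import ipaddress

# Hypothetical routing table: the /24 advertised by the new (inactive)
# deployment is more specific than the /16 advertised by the existing one.
routes = {
    ipaddress.ip_network("203.0.113.0/24"): "new-inactive-deployment",
    ipaddress.ip_network("203.0.0.0/16"): "existing-load-balancer",
}

def select_next_hop(destination: str) -> str:
    """Return the next hop of the matching prefix with the longest mask."""
    address = ipaddress.ip_address(destination)
    matching = [prefix for prefix in routes if address in prefix]
    most_specific = max(matching, key=lambda prefix: prefix.prefixlen)
    return routes[most_specific]

# An address covered by both prefixes is routed to the more specific one,
# which in this incident was the inactive shadow deployment.
print(select_next_hop("203.0.113.10"))  # -> new-inactive-deployment
```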

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, we will:
1. Repair the software bug to ensure address ranges marked as DO NOT advertise are NOT advertised.
2. Implement additional automatic rollback capabilities of deployments based on health probe for data plane impacting deployments (control plane impacting deployments are already automatically rolled back).
3. Ensure all Azure Load Balancer deployments pass address-range health checks before reaching production.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

6/10

Log Analytics - West Europe

Summary of impact: Between 22:50 UTC on 09 Jun 2017 and 01:20 UTC on 10 Jun 2017, customers using Log Analytics in West Europe may have experienced difficulties when trying to log in to the OMS Portal or when connecting to resources hosted in this region. The OMS Solutions within West Europe workspaces may have failed to properly load and display data. This includes OMS Solutions for Service Map, Insight & Analytics, Security & Compliance, Protection & Recovery, and Automation & Control offerings for OMS.

Preliminary root cause: Engineers determined that some instances of a backend service responsible for processing service management requests had reached an operational threshold, preventing requests from completing.

Mitigation: Engineers scaled out backend instances to increase service bandwidth and mitigate the issue.

6/7

Automation - East US 2

Summary of impact: Between 11:00 and 16:15 UTC on 07 Jun 2017, a limited subset of customers using Automation in East US 2 may have experienced intermittent issues viewing accounts, schedules, or assets, or starting jobs in the Azure Portal. Clients using APIs, such as PowerShell or the SDKs, would also have been impacted. Jobs that were already submitted would have seen delayed execution and would eventually have been processed; these start operations did not need to be resubmitted. This issue could have impacted up to 3% of customers in the region.

Preliminary root cause: Engineers identified a query that consumed CPU resources on a single backend database, preventing other queries from completing successfully.

Mitigation: Engineers performed a failover to a secondary database to provide fresh resources and optimized the configuration of the offending query.

Next steps: Engineers will continue to optimize query configurations and investigate the underlying cause to prevent future occurrences.

6/3

Root Cause Analysis - ExpressRoute \ ExpressRoute Circuits Amsterdam

Summary of impact: Between 21:03 UTC on 01 Jun 2017 and 22:04 UTC on 02 Jun 2017, ExpressRoute customers connecting to Azure services through ExpressRoute public peering in Amsterdam may have experienced a loss in network connectivity. Customers with redundant connectivity configurations may have experienced a partial loss in connectivity. This was caused by the size of an ExpressRoute router configuration, which overloaded router resources. Once the issue was detected, the configuration was partitioned across multiple devices to mitigate the issue. During this time, customers could have failed over to a second ExpressRoute circuit in the same geography or failed back to the internet.

Root cause and mitigation: A subset of ExpressRoute edge routers in Amsterdam were overloaded due to the size of their configuration. The source NAT on the public peering was overloaded, which resulted in a loss of connectivity. ExpressRoute routers are redundant in an active/active configuration. A device failover restored connectivity for some time during this period; upon failover, the second device also became overloaded and experienced impact.

Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case this includes (but is not limited to):
1. [Ongoing] Prevent configuration overload on devices.
2. [Ongoing] Reinforce the existing recommendation that all customers deploy ExpressRoute in an active-active configuration.
3. [Ongoing] Enhance router resource monitoring to alert before resource limits are reached.
4. [Coming] Support ExpressRoute public peering prefixes on MSFT peering, which does not use device NAT.
5. [Coming] Add the ability to re-balance load automatically by directing new connections to new routers in the same location.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

6/1

Issue with accessing Azure services through Mozilla Firefox

Summary of impact: From approximately 20:18 UTC on 28 May 2017 to 03:00 UTC on 1 June 2017, customers using Mozilla Firefox may have been unable to access Azure services, including the Azure Management Portal, Web Apps, Azure Data Lake Analytics, Azure Data Lake Store, Visual Studio Team Services, Azure Service Fabric, Service Bus, and Storage. As a workaround, Firefox users had to use an alternative browser such as Internet Explorer, Safari, Edge, or Chrome.

Root cause and mitigation: This incident was caused by a bug in the service that verifies the status of Microsoft services’ certificates via the OCSP protocol. In a rare combination of circumstances, this service generated OCSP responses whose validity period exceeded that of the certificate that signed the responses. This issue was detected and fixed; however, the previously generated responses were cached on Azure servers. These cached responses were served to clients via a feature called OCSP stapling that is commonly used to improve user experience. Firefox clients interpreted these OCSP responses as invalid and displayed an authentication failure. Other clients succeeded because they retried by directly calling the OCSP service, which had been fixed by then.
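
To make the failure mode concrete, the condition the problematic responses violated can be sketched as a simple timestamp comparison. The sketch below uses hypothetical values and illustrates the rule Firefox effectively enforces; it is not code from the OCSP service or from the browser.

```python
# Minimal sketch of the consistency rule the problematic responses violated
# (illustration only, with hypothetical timestamps; not code from the OCSP
# service or from Firefox).
from datetime import datetime, timezone

def ocsp_response_consistent(response_next_update: datetime,
                             signer_not_after: datetime) -> bool:
    """A stapled OCSP response should not claim validity beyond the expiry
    of the certificate that signed it."""
    return response_next_update <= signer_not_after

# Hypothetical values: a response whose validity window outlives its signer.
signer_not_after = datetime(2017, 6, 15, tzinfo=timezone.utc)
response_next_update = datetime(2017, 6, 20, tzinfo=timezone.utc)

print(ocsp_response_consistent(response_next_update, signer_not_after))
# -> False: Firefox treated such stapled responses as invalid, while other
#    browsers retried the (already fixed) OCSP responder directly.
```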

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): 1. Fix the bug in the OCSP service that resulted in OCSP responses with validity longer than the certificate that signs the responses. 2. Add validation to servers that implement OCSP stapling, to ensure that invalid OCSP responses are not passed through to clients.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

May 2017

5/31

Service Management Operations Failures - Multiple Regions

Summary of impact: Between 18:35 and 19:54 UTC on 31 May 2017, customers with resources containing permissions inherited from security groups experienced Service Management operation failures in North Central US, Central US, West US, West US 2, East US, East US 2, Canada East, and Canada Central. Customers could have experienced errors when viewing resources in the Azure Portal.

Preliminary root cause: At this stage engineers do not have a definitive root cause.

Mitigation: Engineers manually increased the number of backend processes to mitigate the incident.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

5/31

App Service \ Web Apps - West Europe

Summary of impact: Between 22:50 and 23:30 UTC on 30 May 2017, a subset of customers using App Service \ Web Apps in West Europe may have received intermittent HTTP 5xx errors, timeouts or have experienced high latency when accessing Web Apps deployments hosted in this region.

Preliminary root cause: Engineers determined that the issue was caused by an underlying Storage issue.

Mitigation: The issue was self-healed by the Azure platform.

Next steps: Engineers will continue to monitor the health of Web Apps in the region and look to establish the cause of the Storage issue.

5/29

IoT Hub - Error message when deploying

Summary of impact: Between 00:00 UTC on 26 May 2017 and 16:30 UTC on 29 May 2017, a subset of customers using Azure IoT Hub may have received the error message "Cannot read property 'value' of undefined or null reference" when trying to deploy IoT Hub resources. This issue only affected new subscriptions attempting to deploy their first IoT Hub - existing IoT Hubs were not affected. Customers could deploy IoT Hubs using Azure Resource Manager (ARM) templates as a workaround.
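
For illustration, the ARM-template workaround can be sketched as follows. This is a minimal sketch: the resource type and apiVersion are assumptions reflecting the IoT Hub schema of that era, and the hub name, region, and SKU are placeholders; it is not an officially published template.

```python
# Minimal sketch of the ARM-template workaround (assumptions: the resource
# type "Microsoft.Devices/IotHubs" and apiVersion "2016-02-03" reflect the
# IoT Hub schema of that era; the hub name, region, and SKU are placeholders).
import json

iot_hub_template = {
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            "type": "Microsoft.Devices/IotHubs",
            "apiVersion": "2016-02-03",
            "name": "example-iot-hub",        # placeholder hub name
            "location": "westeurope",         # placeholder region
            "sku": {"name": "S1", "tier": "Standard", "capacity": 1},
            "properties": {},
        }
    ],
}

# Save the template so it can be deployed as a resource group deployment
# with the Azure CLI or PowerShell.
with open("iothub-template.json", "w") as template_file:
    json.dump(iot_hub_template, template_file, indent=2)
```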

Preliminary root cause: Engineers determined that a recent deployment introduced a new software task that was not properly validating new subscriptions.

Mitigation: Engineers deployed a platform hotfix to bypass this software task and mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

April 2017

4/11

IoT Suite - Failures Provisioning New Solutions - Germany

Summary of impact: Between 07:15 UTC on 08 Apr 2017 and 04:00 UTC on 11 Apr 2017, customers using Azure IoT Suite may have been unable to provision solutions. As a workaround, engineers recommended deploying from an MSBuild prompt using the referenced code. Existing resources were not impacted.

Preliminary root cause: Engineers identified a recent change to backend systems as the preliminary root cause.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.