Azure status history

June 2017

27-6

Media Services \ Streaming - Possible streaming performance issues

Summary of impact: Between approximately 14:19 and 22:20 UTC on 27 Jun 2017, a subset of customers using Media Services \ Streaming in West US and East US may have experienced degraded performance when streaming live and on-demand media content. Channel operations may also have experienced latency or failures. Media Services Encoding was not impacted by this issue.

Preliminary root cause: Engineers determined that some instances of a backend service had reached an operational threshold, which prevented operations from completing.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

26-6

Network Infrastructure - Korea Central

Summary of impact: Between 01:12 and 02:22 UTC, and again between 10:40 and 11:17 UTC, on 26 Jun 2017, a subset of customers in Korea Central may have experienced degraded performance, network drops, or time-outs when accessing their Azure resources hosted in this region. Engineers determined that this was caused by an underlying Network Infrastructure event. As a result, customers may also have experienced issues connecting to dependent services, including Storage, Key Vault, and Cosmos DB.

Preliminary root cause: Engineers determined that a network device became unhealthy.

Mitigation: Initially, engineers manually re-directed traffic away from the unhealthy network device to mitigate the issue. Subsequently, a backend service was manually restarted and patched; engineers are continuing to monitor system health.

Next steps: Engineers will continue to investigate to establish the full root cause.

24-6

RCA - Networking issue with impact to additional services

Summary of impact: Between 09:03 and 11:14 UTC on 24 Jun 2017, a subset of customers may have experienced intermittent connectivity failures when accessing Azure resources across multiple Azure regions. The failure emerged from West US, but impacted global internet routing. Traffic between Azure sites was not impacted during this event.

Customer impact: Customers would have experienced packet loss and latency to Internet locations globally. Communications between Microsoft Azure sites were not impacted during this event.

Root cause and mitigation: Microsoft Azure’s software failed to immediately detect and mitigate the intermittent packet loss and slow connectivity affecting Azure customers in multiple regions. Azure monitoring software eventually detected the condition, and Azure Engineers manually shut down the Border Gateway Protocol (BGP) sessions and physical connectivity to a Microsoft Azure partner to resolve the situation. An investigation showed that the issue was caused by misconfigured routing tables shared by that partner. We determined that, during maintenance at an Internet peering point, the partner introduced a misconfiguration on their network that redirected multiple Internet routes for their transit internet service providers (ISPs). Microsoft Azure maintains direct connections with those ISPs and would not ordinarily use this partner to reach them. The partner advertised a routing table that resulted in their routes becoming preferred over our direct connections to several ISPs. The Internet connections to the misconfigured partner then became congested and experienced intermittent packet loss.
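For illustration only, the sketch below (Python; the prefixes, session names, and preference values are invented, and this simplified model is not Azure's routing software) shows the shape of the mitigation: once the BGP session to the partner is shut down, the routes learned over it are withdrawn and the best path reverts to the direct ISP connection.

```python
# Hypothetical, simplified model of BGP route preference and withdrawal.
# Prefixes, session names, and preference values are invented for illustration.

routes = [
    # (prefix, learned_from, preference) -- higher preference wins
    ("198.51.100.0/24", "direct-isp-a", 100),
    ("198.51.100.0/24", "partner-peering", 200),  # leaked route, currently preferred
]

def best_route(prefix, table):
    candidates = [r for r in table if r[0] == prefix]
    # For the same prefix, the route with the highest preference is selected.
    return max(candidates, key=lambda r: r[2]) if candidates else None

def shutdown_session(session, table):
    # Shutting down a BGP session withdraws every route learned over it.
    return [r for r in table if r[1] != session]

print(best_route("198.51.100.0/24", routes)[1])  # -> partner-peering
after = shutdown_session("partner-peering", routes)
print(best_route("198.51.100.0/24", after)[1])   # -> direct-isp-a
```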

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case: 1. Microsoft Azure Engineers deployed a software patch immediately following the incident, over the weekend of 24-25 June, to prevent a reoccurrence of the issue (complete). 2. Azure Engineering will provide a final software fix for this issue (in progress).

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

23-6

App Service \ Web Apps - West US 2

Summary of impact: Between 06:12 and 08:16 UTC on 23 Jun 2017, a subset of customers using App Service \ Web Apps in West US 2 may have experienced difficulties connecting to resources hosted in this region.

Preliminary root cause: Engineers identified a recent deployment task as the potential root cause.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue.

Next steps: Engineers will review deployment procedures to prevent future occurrences.

17-6

Visual Studio Application Insights - Data Access Issues

Summary of impact: Between 22:59 and 23:59 UTC on 16 Jun 2017, customers using Application Insights in East US, North Europe, South Central US and West Europe may have experienced data access issues in the Azure and Application Analytics portals. The following data types were affected: Availability, Customer Event, Dependency, Exception, Metric, Page Load, Page View, Performance Counter, Request, Trace. Please refer to the Application Insights Status Blog for more information.

Preliminary root cause: Engineers identified a backend configuration error as the potential root cause.

Mitigation: Engineers manually reconfigured the backend service to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

11-6

RCA - Network Infrastructure Impacting Multiple Services - Australia East

Summary of impact: Between 1:48 and 5:15 UTC on June 11, 2017, a subset of customers may have encountered reachability failures for some of their resources in Australia East. Starting at 1:48 UTC, outbound network connectivity in the region was interrupted for a subset of customers, which caused DNS lookups and Storage connections to fail; as a result, some customer Virtual Machines shut down and restarted. Engineers investigated and found that a deployment of a new instance of the Azure Load Balancer service was underway in the region. At 3:40 UTC, the new deployment was disabled, after which customer impact for networking services was mitigated. A subset of impacted services experienced a delayed recovery, and engineers confirmed all services were fully mitigated at 5:15 UTC. Deployment activities were paused until a full RCA could be completed and the necessary repairs made.

Customer impact: Due to the outbound connectivity failure, impacted services that depend on resources internal and/or external to Azure would have seen reachability failures for the duration of this incident. Customer Virtual Machines would have shut down as designed after losing connectivity to their corresponding storage services. A subset of customers would have experienced the following impact for affected services:
Redis Cache - unable to connect to Redis caches
Azure Search - search service unavailable
Azure Media Services - media streaming failures
App Services - availability close to 0%
Backup - backup management operation failures
Azure Stream Analytics - cannot create jobs
Azure Site Recovery - protection and failover failures
Visual Studio Team Services - connection failures
Azure Key Vault - connection failures

Root cause and mitigation: A deployment of configuration settings for the Azure Load Balancer service caused an outage when a component of the new deployment began advertising address ranges already advertised by an existing instance of the Azure Load Balancer service. The new deployment was a shadow environment specifically configured to be inactive (and hence, not to advertise any address ranges) in terms of customer traffic. Because the route advertisement from this new service was more specific than the existing route advertisement, traffic was diverted to the new, inactive deployment. When this occurred, outbound connectivity was interrupted for a subset of services in the region. The event was triggered by a software bug within the new deployment. The new deployment was passively ingesting state from the existing deployment and validating it. A configuration setting was in place to prevent this ingested state from going beyond validation and resulting in route advertisements. Due to the software bug, this configuration setting failed. The software bug was not detected in deployments to prior regions because it only manifested under specific combinations of the service's own address-range configuration; Australia East was the first region in which this combination occurred.
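The diversion described above follows from ordinary longest-prefix-match forwarding. The sketch below is a hypothetical Python illustration (the prefixes and labels are invented and are not actual Azure address ranges): when the shadow deployment advertises a more specific prefix, that prefix is selected over the broader prefix advertised by the active instance.

```python
# Hypothetical illustration of longest-prefix-match selection; prefixes and
# labels are invented and do not reflect actual Azure address ranges.
import ipaddress

advertisements = {
    ipaddress.ip_network("203.0.0.0/16"): "active load balancer",
    ipaddress.ip_network("203.0.113.0/24"): "shadow deployment",  # more specific
}

def selected_advertisement(destination: str) -> str:
    dest = ipaddress.ip_address(destination)
    matches = [net for net in advertisements if dest in net]
    # The most specific matching prefix (largest prefix length) wins, so
    # traffic for this destination is diverted to the inactive shadow deployment.
    return advertisements[max(matches, key=lambda net: net.prefixlen)]

print(selected_advertisement("203.0.113.10"))  # -> "shadow deployment"
```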

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, we will:
1. Repair the software bug to ensure address ranges marked as DO NOT advertise are NOT advertised.
2. Implement additional automatic rollback capabilities of deployments based on health probe for data plane impacting deployments (control plane impacting deployments are already automatically rolled back).
3. Ensure all Azure Load Balancer deployments pass address-range health checks before reaching production.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

10-6

Log Analytics - West Europe

Summary of impact: Between 22:50 UTC on 09 Jun 2017 and 01:20 UTC on 10 Jun 2017, customers using Log Analytics in West Europe may have experienced difficulties when trying to log in to the OMS Portal or when connecting to resources hosted in this region. OMS Solutions within West Europe workspaces may have failed to properly load and display data. This includes OMS Solutions for the Service Map, Insight & Analytics, Security & Compliance, Protection & Recovery, and Automation & Control offerings for OMS.

Preliminary root cause: Engineers determined that some instances of a backend service responsible for processing service management requests had reached an operational threshold, preventing requests from completing.

Mitigation: Engineers scaled out backend instances to increase service bandwidth and mitigate the issue.

7-6

Automation - East US 2

Summary of impact: Between 11:00 and 16:15 UTC on 07 Jun 2017, a limited subset of customers using Automation in East US 2 may have experienced intermittent issues viewing accounts, schedules, or assets, or starting jobs in the Azure Portal. Any clients using APIs, such as PowerShell or the SDKs, would also have been impacted. Jobs that were already submitted would have seen delayed execution but would eventually have been processed; these start operations did not need to be resubmitted. This issue could have impacted up to 3% of customers in the region.

Preliminary root cause: Engineers identified a query that consumed excessive CPU resources on a single backend database, preventing other queries from completing successfully.
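As background on how a CPU-heavy statement like this can be located, the following is a minimal sketch (Python with pyodbc; the connection string is a placeholder, the backend is assumed to be SQL Server, and this is not Azure's internal tooling) that lists the top CPU-consuming statements reported by the sys.dm_exec_query_stats DMV.

```python
# Minimal sketch: list the top CPU-consuming statements on a SQL Server
# database. The connection string below is a placeholder, not a real server.
import pyodbc

QUERY = """
SELECT TOP 5
    qs.total_worker_time / qs.execution_count AS avg_cpu_microseconds,
    qs.execution_count,
    SUBSTRING(st.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_cpu_microseconds DESC;
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=example-server;DATABASE=example-db;UID=example;PWD=example"
)
for row in conn.cursor().execute(QUERY):
    print(row.avg_cpu_microseconds, row.execution_count, row.query_text)
```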

Mitigation: Engineers failed over to a secondary database to restore healthy resource levels and optimized the configuration of the offending query.

Next steps: Engineers will continue to optimize query configurations and investigate the underlying cause to prevent future occurrences.

3-6

Root Cause Analysis - ExpressRoute \ ExpressRoute Circuits Amsterdam

Summary of impact: Between 21:03 UTC on 01 Jun 2017 and 22:04 UTC on 02 Jun 2017, ExpressRoute customers connecting to Azure services through ExpressRoute public peering in Amsterdam may have experienced a loss in network connectivity. Customers with redundant connectivity configurations may have experienced a partial loss in connectivity. This was caused by an ExpressRoute router configuration whose size overloaded router resources. Once the issue was detected, the configuration was partitioned across multiple devices to mitigate the issue. During this time, customers could have failed over to a second ExpressRoute circuit in the same geography or failed back to the internet.

Root cause and mitigation: A subset of ExpressRoute edge routers in Amsterdam were overloaded due to the size of their configuration. The source NAT on the public peering was overloaded, resulting in a loss of connectivity. ExpressRoute routers are redundant in an active/active configuration, and a device failover restored connectivity for a period during the incident. Upon failover, however, the second device also became overloaded and experienced impact.

Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to): 1. [Ongoing] Prevent configuration overload on devices. 2. [Ongoing] Reinforce the existing recommendation that all customers deploy ExpressRoute with an active-active configuration. 3. [Ongoing] Enhance router resource monitoring to alert before resource limits are reached (see the sketch below). 4. [Coming] Support ExpressRoute public peering prefixes on Microsoft peering, which does not use device NAT. 5. [Coming] Add the ability to re-balance load automatically by directing new connections to new routers in the same location.
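As an illustration of item 3, here is a minimal, hypothetical sketch (Python; the thresholds, capacity figure, and metric source are assumptions, not ExpressRoute internals) of alerting on source-NAT translation-table utilization before the table is exhausted rather than after connectivity is lost.

```python
# Hypothetical resource-threshold check for a router's source-NAT table.
# Thresholds and capacity are invented for illustration.
WARN_AT = 0.80  # assumed warning threshold (fraction of capacity)
PAGE_AT = 0.95  # assumed paging threshold

def evaluate_nat_utilization(active_translations: int, table_capacity: int) -> str:
    utilization = active_translations / table_capacity
    if utilization >= PAGE_AT:
        return "page: NAT table nearly exhausted, rebalance load now"
    if utilization >= WARN_AT:
        return "warn: NAT table utilization high, plan rebalancing"
    return "ok"

# Example: 950,000 active translations against an assumed 1,000,000-entry table.
print(evaluate_nat_utilization(950_000, 1_000_000))  # -> page: ...
```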

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

1-6

Issue with accessing Azure services through Mozilla Firefox

Summary of impact: From approximately 20:18 UTC on 28 May 2017 to 03:00 UTC on 1 June 2017, customers using Mozilla Firefox may have been unable to access Azure services, including the Azure Management Portal, Web Apps, Azure Data Lake Analytics, Azure Data Lake Store, Visual Studio Team Services, Azure Service Fabric, Service Bus, and Storage. As a workaround, Firefox users had to use an alternative browser such as Internet Explorer, Safari, Edge, or Chrome.

Root cause and mitigation: This incident was caused by a bug in the service that verifies the status of Microsoft services’ certificates via the OCSP protocol. Under a rare combination of circumstances, this service generated OCSP responses whose validity period exceeded that of the certificate that signed the responses. The issue was detected and fixed; however, the previously generated responses remained cached on Azure servers. These cached responses were served to clients via OCSP stapling, a feature commonly used to improve the user experience. Firefox clients interpreted these OCSP responses as invalid and displayed an authentication failure. Other clients succeeded because they retried by calling the OCSP service directly, which had been fixed by then.
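For illustration, the sketch below (Python with the cryptography package; the function name and inputs are illustrative and this is not the actual server-side implementation) shows the kind of check described in the next steps: a stapled OCSP response is rejected unless its validity window falls within that of the certificate that signed it.

```python
# Illustrative check: only staple an OCSP response whose validity window
# sits inside the validity of the certificate that signed the response.
from cryptography import x509
from cryptography.x509 import ocsp

def staple_is_acceptable(ocsp_response_der: bytes, signer_cert_pem: bytes) -> bool:
    resp = ocsp.load_der_ocsp_response(ocsp_response_der)
    if resp.response_status != ocsp.OCSPResponseStatus.SUCCESSFUL:
        return False
    signer = x509.load_pem_x509_certificate(signer_cert_pem)
    # Reject responses that start before, or remain valid after, the signer cert.
    if resp.this_update < signer.not_valid_before:
        return False
    if resp.next_update is None or resp.next_update > signer.not_valid_after:
        return False
    return True
```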

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): 1. Fix the bug in the OCSP service that resulted in OCSP responses with validity longer than the certificate that signs the responses. 2. Add validation to servers that implement OCSP stapling, to ensure that invalid OCSP responses are not passed through to clients.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

April 2017

11-4

IoT Suite - Failures Provisioning New Solutions - Germany

Summary of impact: Between 07:15 on 08 Apr 2017 and 04:00 UTC on 11 Apr 2017, customers using Azure IoT Suite may have been unable to provision new solutions. Engineers recommended deploying from an MSBuild prompt using the code at . Existing resources were not impacted.

Preliminary root cause: Engineers identified a recent change to backend systems as the preliminary root cause.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.