Azure status history

December 2018

12/12

Azure Analysis Services - West Central US - Mitigated

Summary of impact: Between 21:55 UTC on 12 Dec 2018 and 09:55 UTC on 13 Dec 2018, a subset of customers using Azure Analysis Services in West Central US may have experienced issues accessing existing servers, provisioning new servers, resuming new servers, or performing SKU changes for active servers. 

Preliminary root cause: Engineers determined that a recent update deployment task caused a back-end Service Fabric instance to become unhealthy, which prevented requests to Azure Analysis Services servers from completing.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue. 

Next steps: Engineers will review deployment procedures to prevent future occurrences. To stay informed on any issues, maintenance events, or advisories, create service health alerts and you will be notified via your preferred communication channel(s): email, SMS, webhook, etc.

12/12

Log Analytics - East US

Summary of impact: Between 08:00 and 15:50 UTC on 12 Dec 2018, customers using Log Analytics in East US may have experienced delays in metrics data ingestion.

Preliminary root cause: Engineers determined that several service-dependent web roles responsible for processing data became unhealthy, causing a data ingestion backlog.

Mitigation: Engineers manually rerouted data traffic to backup roles to mitigate the issue. 

Next steps: Engineers will continue to investigate to establish the full root cause for why the service instances became unhealthy. To stay informed on any issues, maintenance events, or advisories, create service health alerts and you will be notified via your preferred communication channel(s): email, SMS, webhook, etc.

12/4

RCA - Networking - South Central US

Summary of impact: Between 02:48 UTC and 04:52 UTC on 04 Dec 2018, a subset of customers in South Central US may have experienced degraded performance, network drops, or timeouts when accessing Azure resources hosted in this region. Applications and resources that retried connections or requests may have succeeded due to multiple redundant network devices and routes available within Azure datacenters.
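The note that retried requests often succeeded is the key client-side takeaway here. As a generic, hedged illustration (not part of this notice; the endpoint and retry budget are placeholder assumptions), a retry wrapper with exponential backoff and jitter might look like this:

```python
import random
import time
import urllib.request

def get_with_retries(url, attempts=5, base_delay=0.5, timeout=10):
    """Fetch a URL, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
            if attempt == attempts - 1:
                raise  # retry budget exhausted; surface the last error
            # Sleep base_delay * 2^attempt plus jitter so many clients do not
            # retry in synchronized bursts against an already-degraded path.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Hypothetical usage: get_with_retries("https://example.blob.core.windows.net/container/blob")
```

Jitter spreads retries out so a brief network drop does not turn into a synchronized retry storm against the remaining healthy paths.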

Root cause: Two network devices in the South Central US region received an incorrect configuration during an automated process to update their configuration and firmware. As a result of the incorrect configuration, these routers were unable to hold all of the forwarding information they were expected to carry and dropped traffic to some destinations. The Azure network fabric failed to automatically remove these devices from service, so they continued to drop a small percentage of the network traffic passing through them.

For the duration of the incident, approximately 7% of the available network links in and out of the impacted datacenter were partially impacted.

The configuration deployed to the network devices caused a problem as it contained one setting incompatible with the devices in the South Central US Region. The deployment process failed to detect that the setting should not be applied to the devices in that region.

Mitigation: The impacted devices were identified and manually removed from service by Azure engineers, and the network automatically recovered by utilizing alternate network devices. The automated process for updating configuration and firmware detected that the devices had become unhealthy after the update and ceased updating any additional devices.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

- Additional monitoring to detect repeats of this issue (complete)
- Improvement in the system for validating the gold configuration file generated for each type of network device (in planning)
- Determine whether this class of conditions can be safely mitigated by automated configuration rollback, and if so add rollback to the error handling used by the configuration and firmware update service (in planning)

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

November 2018

11/29

Azure Data Lake Store/Data Lake Analytics - North Europe

Summary of impact: Between 08:00 and 13:39 UTC on 29 Nov 2018, a subset of customers using Azure Data Lake Store and/or Data Lake Analytics may have experienced difficulties accessing resources hosted in this region. Data ingress or egress operations may have also timed-out or failed. Azure Data Lake Analytics customers may have experienced job failures. In addition, customers using other services with dependencies upon Azure Data Lake Store resources in this region - such as Databricks, Data Catalog, HDInsight and Data Factory - may also have experienced downstream impact. 

Preliminary root cause: A scheduled power maintenance in a datacenter in North Europe resulted in a small number of hardware instances becoming unhealthy. As a result, Data Lake Store resources hosted on the impacted hardware became temporarily unavailable to customers.

Mitigation: The power maintenance task was cancelled and the impacted hardware was restarted. This returned the affected services hosted on this hardware to a healthy state, mitigating the impact for customers.

Next steps: Engineers will continue to investigate to establish the full root cause of why the power maintenance task failed and prevent future occurrences.

11/28

Storage - West US 2

Summary of impact: Between 04:20 and 12:30 UTC on 28 Nov 2018, a subset of Storage customers in West US 2 may have experienced difficulties connecting to resources hosted in this region. Customers using resources dependent on Storage may have also seen impact.

Preliminary root cause: Engineers determined that a recent deployment task introduced an incorrect backend authentication setting. As a result, some resources attempting to connect to storage endpoints in the region experienced failures.

Mitigation: Engineers developed an update designed to refresh the incorrect authentication setting. A staged deployment of this update was then applied to the impacted storage nodes to mitigate the issue. In a small number of cases, where application of the fix was not possible, impacted nodes were removed from active rotation for further examination.

Next steps: Engineers will review deployment procedures to prevent future occurrences and a full root cause analysis will be completed. To stay informed on any issues, maintenance events, or advisories, create service health alerts and you will be notified via your preferred communication channel(s): email, SMS, webhook, etc.

11/27

RCA - Multi-Factor Authentication

Summary of impact:
At 14:20 UTC on November 27th, the Azure Multi-Factor Authentication (MFA) service experienced an outage that impacted cloud-based MFA customers worldwide. Service was partially restored at 14:40 UTC and fully restored at 17:39 UTC. Sign-in scenarios requiring MFA via SMS, voice call, phone app, or OATH token were blocked during that time. Password-based and Windows Hello-based logins were not impacted, nor were valid unexpired MFA sessions.

From 14:20 to 14:40 UTC, any user required to perform MFA using cloud-based Azure MFA was unable to complete the MFA process and so could not sign in. These users were shown a browser page or client app that contained an error ID.

From 14:40 to 17:39 UTC, the problem was resolved for some users but persisted for the majority, who continued to experience difficulties authenticating via MFA. Affected users would appear to begin the MFA authentication but never receive a code from the service.

This affected all Azure MFA methods (e.g. SMS, Phone, or Authenticator) and occurred whether MFA was triggered by Conditional Access policy or per-user MFA. Conditional access policies not requiring MFA were not impacted. Windows Hello authentication was not impacted.

Root cause and mitigation:
This outage was caused by a Domain Name System (DNS) failure which made the MFA service temporarily undiscoverable, and by a subsequent traffic surge resulting from the restoration of DNS service. Microsoft detected the DNS outage when it began at 14:20 UTC, as engineers were notified by monitoring alerts.
We sincerely apologize to our customers whose business depends on Azure Multi-Factor Authentication and were impacted by these two recent MFA incidents. Immediate and medium-term remediation steps are being taken to improve performance and significantly reduce the likelihood of future occurrence to customers across Azure, O365 and Dynamics.

As described above, there were two stages to the outage, related but with separate root causes.

  • The first root cause was an operational error that caused an entry to expire in the DNS system used internally in the MFA service. This expiration occurred at 14:20 UTC, and in turn caused our MFA front-end servers to be unable to communicate with the MFA back-end.
  • Once the DNS outage was resolved at 14:40 UTC, the resultant traffic patterns that had built up from the aforementioned issue caused contention and exhaustion of a resource in the MFA back-end that took an extended time to identify and mitigate. This second root cause was a previously unknown bug in the same component as the MFA incident that occurred on 19 Nov 2018. This bug would cause the servers to freeze as they were processing the backlogged traffic (a generic throttling sketch follows this list).
  • To prevent this bug from causing servers to freeze while a sustainable mitigation was being applied, engineers recycled servers.
  • Engineering teams continued to add capacity to the MFA service to assist in alleviating the backlog.
  • Draining the resultant traffic back to normal levels took until 17:39 UTC at which point the incident was mitigated.
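The MFA service's actual throttling system is internal to Microsoft and is not described in this notice. Purely as an illustrative sketch of the general technique of draining a backlog at a bounded rate, a token-bucket limiter (all class names, rates, and capacities below are assumptions) could look like this:

```python
import time
from collections import deque

class TokenBucket:
    """Simple token bucket: allows at most `rate` operations per second on
    average, with bursts of up to `capacity` operations."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def drain_backlog(backlog, handle, rate=50, capacity=100):
    """Process a queued backlog without overwhelming the downstream service."""
    bucket = TokenBucket(rate, capacity)
    pending = deque(backlog)
    while pending:
        if bucket.acquire():
            handle(pending.popleft())
        else:
            time.sleep(1.0 / rate)  # back off briefly until tokens refill
```

Bounding the drain rate keeps a recovering back end from being pushed straight back into resource exhaustion by the accumulated traffic.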


Next steps:
We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Implementing the code fix to the MFA backend service to stop servers from freezing when processing a high rate of backlogged requests. [COMPLETE]
  • During the last incident we increased capacity in the Europe region; capacity is now being scaled out to all regions for MFA. [COMPLETE]
  • Deploy improved throttling management system to manage traffic spikes. [COMPLETE]
  • All DNS changes will be moved to an automated system [IN PROGRESS]


Provide feedback:
Please help us improve the Azure customer communications experience by taking our survey.

    11/26

    RCA - Virtual Network - UK South and UK West

    Summary of impact: Between 11:19 and 16:56 UTC on 26 Nov 2018, a subset of customers using Virtual Networks in UK South and UK West may have experienced difficulties connecting to resources hosted in these regions. Some customers using Global VNET peering or replication between these regions may have experienced latency or connectivity issues. This issue may have also impacted connections to other Azure services.

    Root cause and mitigation: A 16-minute hardware failure on a WAN router caused a large amount of traffic to fail over from the primary path out of the region to the backup path, causing increased congestion on the backup path. Once the hardware failure was resolved, a firmware issue with our WAN traffic engineering solution prevented traffic from returning to the primary path, causing the congestion issue to persist. Engineers manually rerouted network traffic back to the primary path to reduce the congestion and mitigate the issue.

    Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

    • Fix the firmware issue that prevented traffic from rerouting back to the primary path (completed)
    • Create alerts to detect similar issues with our traffic engineering solution in the future (completed)
    • Assess possible design changes to the WAN Traffic Engineering solution to maximize performance in extreme congestion scenarios (in progress)
    • Create alerts and monitoring process to detect large traffic flows that can be throttled to reduce congestion in this type of situation (completed)
    • Increase capacity in the region to avoid the potential for congestion (completed)

    Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

    11/19

    RCA - Multi-Factor Authentication

    Summary of impact:
    Between 04:39 UTC and 18:38 UTC on 19 November 2018, Microsoft Azure AD Multi-Factor Authentication (MFA) services experienced an outage. Users of Azure Active Directory authentication services (including users of Office 365, Azure, Dynamics, and other services that use Azure Active Directory for authentication) were unable to log in if MFA was required as determined by their organization's policy. The event was mitigated on Monday, 19 November 2018 at 18:38 UTC. Engineers then kept the event open, confirmed through extensive monitoring that the root causes identified were correct, incorporated immediate telemetry and process changes, and closed the incident on Wednesday, 21 November 2018 at 03:00 UTC.

    Root cause:
    There were three independent root causes discovered. In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes, which extended the mitigation time.

    The first two root causes were identified as issues on the MFA frontend server, both introduced in a roll-out of a code update that began in some datacenters (DCs) on Tuesday, 13 November 2018 and completed in all DCs by Friday, 16 November 2018. The issues were later determined to be activated once a certain traffic threshold was exceeded which occurred for the first time early Monday (UTC) in the Azure West Europe (EU) DCs. Morning peak traffic characteristics in the West EU DCs were the first to cross the threshold that triggered the bug. The third root cause was not introduced in this rollout and was found as part of the investigation into this event.

    • The first root cause manifested as a latency issue in the MFA frontend’s communication to its cache services. This issue began under high load once a certain traffic threshold was reached. Once the MFA services experienced this first issue, they became more likely to trigger the second root cause.
    • The second root cause was a race condition in processing responses from the MFA backend server that led to recycles of the MFA frontend server processes, which could trigger additional latency and the third root cause (below) on the MFA backend.
    • The third identified root cause was a previously undetected issue in the backend MFA server that was triggered by the second root cause. This issue caused an accumulation of processes on the MFA backend, leading to resource exhaustion on the backend, at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring.

    Mitigation:
    There were three main phases of this event:

    Phase 1: Impact to EMEA and APAC customers - 04:39 UTC to 07:50 UTC on 19 Nov 2018:
    To enhance reliability and performance, caching services are used throughout Azure Active Directory. The MFA team recently deployed a change to more effectively manage connections to the caching services. Unfortunately, under heavy load this change introduced additional latency and a race condition in the new connection management code. This caused the MFA service to slow down processing of requests, initially impacting the West EU DCs (which serve APAC and EMEA traffic). During this time, multiple mitigations were applied, including changes to the traffic patterns in the EU DCs, disablement of auto-mitigation systems to reduce traffic volumes, and eventually rerouting traffic to the East US DC. Our expectation was that a healthy cache service in the East US DC would mitigate the latency issues and allow the engineers to focus on other mitigations in the West EU DCs. However, the additional traffic to the East US DC caused the MFA frontend servers to experience the same issue as West EU, and eventually requests started to time out. Engineers therefore rerouted traffic back to the West EU DCs and continued with the investigation.
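    The MFA connection-management code is not public, so the following is only a generic sketch of the kind of bounded cache connection pooling described above; `ConnectionPool`, `make_connection`, and the pool size are illustrative assumptions. Handing out connections through a thread-safe queue with a bounded wait avoids both unsynchronized check-then-acquire races and unbounded latency when the pool is exhausted:

```python
import queue

class ConnectionPool:
    """Bounded pool of cache connections. A thread-safe queue hands out
    connections atomically, so two callers cannot race on the same slot."""
    def __init__(self, make_connection, size=16):
        self._slots = queue.Queue(maxsize=size)
        for _ in range(size):
            self._slots.put(make_connection())

    def acquire(self, timeout=1.0):
        # Block for at most `timeout` seconds; failing fast keeps per-request
        # latency bounded when the pool is exhausted under heavy load.
        try:
            return self._slots.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no cache connection available within timeout")

    def release(self, conn):
        self._slots.put(conn)
```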

    Phase 2: Broad customer impact - 07:50 UTC to 18:38 UTC on 19 Nov 2018:
    A previously undetected issue in the Azure MFA backend, triggered by the race condition in the front end, caused an accumulation of processes. Azure MFA backend resource limits were exhausted, preventing the delivery of MFA messages to customers. During this time, the West EU DCs were still experiencing timeouts in serving requests and, in the absence of signals/telemetry to indicate other issues, the engineering team's continued focus was on mitigating the latency issue in the MFA frontend servers. In order to restore the health of these datacenters, engineers rolled back the recent deployment, added capacity, increased throttling limits, recycled MFA cache servers and frontend servers, and applied a hotfix to the frontend servers to bypass the cache. This mitigated the latency issue, but customers (inclusive of US Gov and China) were still reporting issues with MFA, so engineers shifted their focus to looking for root causes other than the MFA frontend latency issue.
    After investigating and identifying issues in the MFA backend servers, engineers cycled the MFA backend servers to fully restore service health. The initial diagnosis of these issues was difficult because the various events impacting the service were overlapping and did not manifest as separate issues. This was made more acute by the gaps in telemetry that would have identified the backend server issue. Once these issues were determined and fully mitigated across all DCs, the team continued to monitor events and customer reported issues for the following 48 hours.

    Phase 3:  Post recovery - RCA, Monitoring and analysis of customer reported issues - 18:38 UTC on 19 Nov 2018 to 03:00 UTC on 21 Nov 2018:
    Engineers kept the incident open for a period of approximately 48 hours to monitor and fully investigate any further customer reported cases and confirm that the issues were fully mitigated. We also wanted to increase our confidence that the root causes identified were, in fact, the source of the failures. On Wednesday, 21 November 2018 at 03:00 UTC, the incident was closed. 

    Next steps:
    We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

    • Review our update deployment procedures to better identify similar issues during our development and testing cycles (completion by Dec 2018)
    • Review the monitoring services to identify ways to reduce detection time and quickly restore service (completion by Dec 2018)
    • Review our containment process to avoid propagating an issue to other datacenters (completion by Jan 2019)
    • Update communications process to the Service Health Dashboard and monitoring tools to detect publishing issues immediately during incidents (completed) 

    We always encourage customers to stay informed on any issues, maintenance events, or advisories by creating service health alerts and configuring notifications via their preferred communication channel(s): email, SMS, webhook, etc. In this incident, communications were not promptly sent to the Service Health blade in the management portal for all impacted customers. This was an error from the Azure team, for which we apologize.

    Provide feedback:
    Please help us improve the Azure customer communications experience by taking our survey.

    11/14

    Azure DevTest Labs - Mitigated

    Summary of impact: Between 20:29 and 22:28 UTC on 14 Nov 2018, a subset of customers using Azure DevTest Labs may have received failure notifications when attempting to access their Labs via the Azure Portal.

    Preliminary root cause: Engineers determined that a recent deployment task contained an update which caused calls to an internal API to fail.

    Mitigation: Engineers rolled back the recent deployment task to mitigate the issue. 

    Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

    11/13

    App Service and Function Apps

    Summary of impact: Between 11:10 and 12:25 UTC on 13 Nov 2018, a subset of customers using App Service and/or Function Apps may have received intermittent HTTP 500-level response codes, or experienced timeouts or high latency, when accessing App Service (Web, Mobile and API Apps) deployments hosted in the affected regions. Impacted customers may have also seen issues with their Azure App Service scaling settings.

    Preliminary root cause: Engineers determined that unhealthy instances of a backend application caused a subset of servers to become unstable, preventing requests from completing.

    Mitigation: Engineers performed a hotfix to patch these servers, in order to mitigate the issue. 

    Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

    11/11

    Storage - West US

    Summary of impact: Between 07:12 and 17:19 UTC on 11 Nov 2018, a subset of customers using Storage in West US may have experienced difficulties reading from a subset of blob storage accounts hosted in this region.

    Root cause: Engineers received monitoring alerts for degraded storage accessibility. Upon investigation, they determined that a single storage scale unit was unreachable from the internet. The issue started when a route configuration change at one of the regional internet service providers caused traffic to be re-routed incorrectly and dropped. Traffic inside the Microsoft network for the storage scale unit was not affected by this issue.

    Mitigation: Engineers worked with the internet service provider to correct the route configuration and removed the incorrect route advertisement. 

    Next steps: Microsoft monitors all of its route advertisements on the internet to validate the origins of each route. Since this incident, Microsoft has hardened the checks for its routes on the internet. Microsoft is also working with large service providers to ensure they do not accept Microsoft routes from any other service provider.

    We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. 

    11/9

    App Service and Azure Functions

    Summary of impact: Between 22:48 UTC on 08 Nov 2018 and 13:30 UTC on 09 Nov 2018, a subset of customers using App Services may have experienced errors when accessing the Azure Functions blade, or experienced issues when accessing App settings in the Azure portal.

    Preliminary root cause: Engineers identified a recent deployment to a specific regional instance of the Functions portal as the root cause. 

    Mitigation: Engineers deployed a platform hotfix to mitigate the issue. 

    Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

    11/2

    Azure Portal Timeouts and Latency

    Summary of impact: Between 23:05 UTC on 02 Nov 2018 and 00:21 UTC on 03 Nov 2018, some customers may have experienced high latency or timeouts when viewing resources or loading blades through the Azure Portal.

    Preliminary root cause: Engineers determined that a recent deployment task introduced an updated DNS record that caused the backend service hosting portal blades to become unhealthy, preventing requests from completing.

    Mitigation: Engineers performed a configuration change to revert the impacting update.

    Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

    October 2018

    10/27

    Virtual Machines/VM Scale Sets - West US 2

    Summary of impact: Between 02:50 and 10:37 UTC on 27 Oct 2018, a subset of customers using Virtual Machines and/or Virtual Machine Scale Sets in West US 2 may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in this region. 

    Preliminary root cause: Engineers determined that some instances of a backend service responsible for interservice communication became unhealthy, which in turn caused requests between Storage and dependent resources to fail.

    Mitigation: Engineers performed a change to the backend service to achieve mitigation. Platform telemetry was then observed and service team engineers from affected services confirmed all requests completed successfully. 

    Next steps: Engineers will continue to investigate to establish the full root cause to prevent future occurrences.

    10/24

    RCA - Networking in West US

    Summary of impact: Between 22:40 UTC on 24 Oct 2018 and 00:03 UTC on 25 Oct 2018, a subset of customers may have experienced degraded network performance and/or difficulties connecting to resources in the West US region. 

    Root cause: A network device connecting a datacenter in the West US region experienced a fault during routine fiber maintenance. Azure Networking lost a subset of capacity between the affected data center and other facilities in the West US region.  The failed network device also began silently dropping a portion of the flows that traversed it. 

    Mitigation: The incident was mitigated by rebalancing traffic across the remaining links.  The incident was resolved via restoration of the fiber and the optical system.

    Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. Steps specific to this incident include:

    Evaluate faster link level bidirectional failure detection [in progress]
    Evaluate escalated timeline for higher capacity links for this data center [in progress]
    Expand existing black hole detection scenarios [in progress]
    Review process and validations after fiber plant maintenance [in progress]

    Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

    10/17

    Latency between North Europe and North America

    Summary of impact: Between 14:30 and 16:35 UTC on 17 Oct 2018, a subset of customers with traffic going between the North Europe datacenter and North America regions may have intermittently experienced degraded performance, and in some rare instances network drops or timeouts, when accessing Azure resources.

    Preliminary root cause and mitigation: A fiber cut was determined to be the underlying root cause, and engineers balanced traffic to minimize any customer impact. The affected fiber was primarily responsible for replication traffic, so customer impact would have been minimal.

    Next steps: Engineers are engaged as the fiber repair is underway. Once the site is stable and thorough testing has been completed, engineers will route traffic back through the existing paths.

    10/16

    RCA - Azure Service Availability - France Central

    Summary of impact: Between 13:57 and 16:45 UTC on 16 Oct 2018, a subset of customers may have experienced difficulties connecting to resources in the France Central region. This impact was related to a number of Storage and Compute resources that experienced an unexpected power-down during this time.

    Root cause and mitigation: A corrective maintenance activity related to the fire alarm system was taking place in one of the isolated zones (Colos) in the datacenter at the start of this event. While working on the system, the maintenance engineer inadvertently triggered the fire alarm for this Colo, which in normal circumstances would have alerted staff, but would not have caused any impact to running services in the Colo. However in this instance a separate legacy fire suppression system, with an automated power shutdown process, was also triggered by the fire alarm. The activation of this system triggered a general power shutdown process, and a number of storage and compute resources in this single Colo became unavailable as a result.

    This legacy system was configured, pursuant to previous French insurance requirements, to automatically shut down power in the event of a fire suppression activation being triggered, and also to prevent backup power supplies from kicking in. This system was not supposed to be active, or able to trigger the shutdown of the power systems in this Colo.

    The engineers onsite performed a manual restoration of power in the Colo, and then brought the impacted Storage and Compute resources back online in a structured manner to ensure full data resilience and complete service restoration. Engineers actively monitored the restoration process, and full service restoration was confirmed at 16:45 UTC, although most services would have recovered before this time.

    Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. Steps specifically related to this incident include:

    1. A complete review of the maintenance processes for fire systems in this region to eliminate false alarm triggers during these processes.
    2. A full audit of all datacenters in this region to identify any legacy fire suppression systems which may still be active or in existence.
    3. The removal of the links between fire suppression systems and power supply control systems, removing the bridge that allows an alert on one system to impact another.

    Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

    10/13

    Storage - East US

    Summary of impact: Between 01:00 and 06:50 UTC on 13th Oct 2018, a subset of customers in the East US region may have experienced difficulties connecting to resources hosted in this region. This would include Storage accounts and Virtual Machines. Some customers would have seen improvement starting at around 03:30 UTC, with the vast majority of impact mitigated by 06:00 UTC. Full Storage recovery occurred by 06:50 UTC. Limited impact may have been observed for a small number of customers using Azure Log Analytics, App Services, Logic Apps, HDInsight, Azure Site Recovery, Application Insights, Azure Data Factory, Azure Automation, PostgreSQL, MySQL and Azure Backup in East US.

    Root cause and mitigation: At 00:10 UTC, a single busy Storage cluster in the US East region experienced a series of storage software role failures due to abnormally high resource utilization. Azure Storage has redundancy designed in, and normally roles will recover without customer impact. Unexpectedly, a small number of the failed roles did not recover automatically. This increased the load on the remaining servers in the cluster, causing further increased resource use and more failures. Customer impact began at around 01:00 UTC. Eventually the number of failed roles reached a critical point where the storage cluster was no longer able to sustain most customer traffic. Engineers made software configuration changes on the affected cluster to reduce resource utilization, and took recovery actions on Storage role instances that were failing to recover automatically. 
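    The storage software's internal resource limits are not public; the sketch below only illustrates the general load-shedding idea behind the first repair item in the next steps (reject new work once utilization passes a threshold so that failures do not cascade). The threshold value, the utilization probe, and the function names are assumptions, not Azure values:

```python
import os

CPU_SHED_THRESHOLD = 0.85  # illustrative threshold, not an Azure value

def current_utilization():
    """Rough utilization proxy: 1-minute load average per core (Unix-only)."""
    load_1min, _, _ = os.getloadavg()
    return load_1min / (os.cpu_count() or 1)

def handle_request(request, process):
    """Admission control: shed new work when the node is already saturated,
    so in-flight requests can finish instead of every request failing together."""
    if current_utilization() > CPU_SHED_THRESHOLD:
        return {"status": 503, "body": "server busy, please retry with backoff"}
    return {"status": 200, "body": process(request)}
```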

    Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform to help ensure such incidents do not occur in the future. In this case it includes, but is not limited to: 

    1. A software fix has been made to better limit resource use in scenarios where several storage roles fail.  This will be rolled out following our normal safe deployment practices.

    2. Configuration changes have been made to other similar storage clusters in the fleet, prior to the code fix above being deployed.

    3. Investigation into the failed automatic storage role recoveries is ongoing.

    4. Additional alerting will be added to alert the engineering team to situations where roles are not recovering as expected.

    Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

    10/11

    Azure Portal and DevOps Accessibility

    Summary of impact: Between 15:59 UTC and 18:35 UTC on 11 Oct 2018, a subset of customers may have experienced intermittent issues when attempting to sign in to the Azure Portal and DevOps services.

    Preliminary root cause: Engineers determined that a CDN endpoint responsible for processing authentication requests became unreachable.

    Mitigation: Engineers responded to alerts and began deploying new CDN endpoints for authentication requests. The new endpoints were first deployed to staging environments to validate the mitigation before any production rollout. During validation, engineers became aware that some customers may not have whitelisted the new endpoints, so they began troubleshooting this issue before deploying to production. During the investigation, it was validated that the existing endpoints had become accessible and stable again. Once engineers validated the stability of the existing CDN endpoints and tested them, they routed traffic back to the original CDN endpoints.

    Next steps: Engineering teams are closely monitoring the existing CDN endpoints for any unusual activity. They will work to understand why the CDN endpoint became unreachable and why redundant mechanisms were not invoked automatically.

    10/11

    App Service - East Asia

    Summary of impact: Between 09:32 and 09:58 UTC on 11 Oct 2018, a subset of customers using App Service in East Asia may have received HTTP 500-level response codes and experienced timeouts or high latency when accessing App Service (Web, Mobile and API Apps) deployments hosted in this region. 

    Root cause: The Microsoft Azure team identified that the issue was due to a dependency that the Web App service takes on Azure Storage. Application site content hosted within the Web App platform is backed by Azure Storage in a durable manner. The application accesses the site contents as file shares (this model maximizes application compatibility). Additionally, the Web App platform employs a feature to reduce the impact to customers when the platform detects latency issues within the hosted primary storage volume. This is accomplished by making use of read-only (R/O) replicas (standby volumes) hosted on Azure Storage. Web App engineers collaborated with the Storage team and discovered a previously unknown issue caused by regional load balancing of the storage cluster that hosted the application's primary and secondary R/O replicas.
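    The Web App platform's replica mechanism is internal, so the following is only a rough sketch of the read-path fallback described above: read site content from the primary volume, and fall back to the read-only standby replica if the primary does not respond within a latency budget. The paths, latency budget, and function names are placeholder assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

PRIMARY_LATENCY_BUDGET = 2.0  # seconds; illustrative, not a platform value

def read_file(path):
    with open(path, "rb") as f:
        return f.read()

def read_site_content(relative_path, primary_root, readonly_root):
    """Read from the primary volume, falling back to the read-only replica
    if the primary does not answer within the latency budget."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(read_file, f"{primary_root}/{relative_path}")
    try:
        return future.result(timeout=PRIMARY_LATENCY_BUDGET)
    except (FutureTimeout, OSError):
        # Primary is slow or unavailable; serve from the standby replica.
        return read_file(f"{readonly_root}/{relative_path}")
    finally:
        pool.shutdown(wait=False)  # do not block on a hung primary read
```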

    Mitigation: This issue was detected and automatically resolved by the Azure platform's self-healing process. 

    Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform to help ensure such incidents do not occur in the future. In this case it includes, but is not limited to:

    • Fine tuning the self-heal mechanism to pro-actively identify changes due to load balancing and heal faster. [in progress]

    10/8

    Availability Issues with Application Insights Portal

    Summary of impact: Between 23:20 UTC on 08 Oct 2018 and 00:20 UTC on 09 Oct 2018, a subset of customers using Application Insights may have experienced issues connecting to their Application Insights resources. Application Insights portal blades may not have loaded for some customers.

    Preliminary root cause: Engineers identified a configuration change that caused a backend service to become unhealthy, impacting the availability of the Application Insights portal for some customers.

    Mitigation: Engineers failed over to a healthy backend service to mitigate impact for customers.

    Next steps: Engineers will continue to investigate the full root cause to prevent future occurrences.

    10/8

    RCA - Service Bus - East US and West US

    Summary of impact: Between 21:00 UTC on 08 Oct 2018 and 01:00 UTC on 10 Oct 2018, a subset of customers using Service Bus in East US 2 and West US may have experienced intermittent timeouts on runtime and management operations. Retries may have been successful for some customers. 

    Root cause and mitigation: Service Bus and Event Hubs recently added support for Service Endpoints to support VNet scenarios. For this support to work, a VM Extension which provides service tunneling for Azure Services was added to the Service Bus and Event Hubs gateway VMs. The Service Bus and Event Hub gateway VMs also are configured with Direct Server Return (DSR) endpoints.

    Recently, the service tunneling extension was upgraded to v4.1. Under certain conditions, the way the service which configures DSR and the service tunneling extension interact can cause the Service Bus and Event Hubs gateway to lose the ability to receive load-balanced traffic. In this incident, that condition was triggered when the Service Bus VMs received the monthly OS patch. Once a VM was impacted, load-balanced traffic was not recovered until the VMs were manually patched to fix the misconfiguration.

    Next steps: We sincerely apologize for the impact to our affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

    1. Fix - Modify Azure Guest Agent to pick a predictable loopback adapter to configure direct server return
    2. Detect - Increase priority of alerting from outside monitoring

    10/8

    Azure DevOps

    Summary of impact: Between 11:40 and 13:30 UTC on 08 Oct 2018, a subset of Azure DevOps customers may have experienced error messages, latency, and/or authentication issues.

    In addition, a small subset of Azure Portal customers may have experienced issues logging into the Azure portal.

    Preliminary root cause: Engineers determined that a network device experienced a hardware fault and that network traffic was not automatically rerouted.

    Mitigation: Engineers took the faulty network device out of rotation and rerouted network traffic to mitigate the issue.

    10/4

    Azure DevOps

    Summary of impact: Between 17:47 and 18:40 UTC on 04 Oct 2018, users with Azure DevOps organizations in South Central US may have experienced errors while using the service.

    Preliminary root cause: Engineers determined that this was caused by a networking event that impacted communication between one Azure DevOps scale unit in South Central US and the rest of the world.

    Mitigation: After the network event self-recovered, Azure DevOps engineers performed a manual action to reset the web servers, which expedited recovery from the network incident.

    Next steps: Azure DevOps and Networking teams will continue to investigate to establish the full root cause and prevent future occurrences.

    10/3

    Azure DevOps

    Summary of impact: Between 16:45 and 18:15 UTC on 03 Oct 2018, a subset of customers using Azure DevOps may have been unable to access DevOps resources.

    Preliminary root cause: Engineers determined that a network device experienced a hardware fault, causing a DevOps backend service to become unavailable.

    Mitigation: Engineers took the faulty network device out of rotation and rerouted network traffic to mitigate the issue.

    Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

    10/2

    ExpressRoute Circuit - Australia Southeast

    Summary of impact: Between 01:00 UTC and 04:45 UTC on 02 Oct 2018, a subset of customers experienced intermittent packet loss in the Melbourne ExpressRoute location for the Australia Southeast region. During the period of degradation, customers may have experienced intermittent packet loss when using an ExpressRoute circuit hosted on the Microsoft Enterprise Edge device (MSEE) experiencing this issue. Access to a Virtual Network and its corresponding resources using a Virtual Network Gateway, as well as services hosted on public and Microsoft peering, were also affected when connecting using the ExpressRoute service.

    Root cause and mitigation: ExpressRoute, as with other Azure services, is designed and implemented in a highly available manner and architecture. ExpressRoute has multiple redundancies implemented in both software and hardware to ensure maximum uptime for customers. It was identified that an upstream physical port experienced a partial failure, causing the connection between the ExpressRoute routers and the Microsoft global network to intermittently drop packets. Due to the partial failure, the physical port was not removed from rotation successfully, causing impact to customers consuming services.
    Automatic detection mechanisms are in place to detect failure modes in the platform. The detection mechanisms triggered an internal alert and engineers engaged to begin troubleshooting. An unrelated issue that had previously arisen made this issue less immediately apparent, causing a slight delay in mitigation. When engineers pinpointed the issue, the port was removed from rotation and services were immediately restored.

    Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case, it includes (but is not limited to):

    • Add additional monitoring parameters in our automated detection platform to better classify and alert on intermittent failures [In Progress].
    • Modify existing processes to localize on faults and mitigate faster. Add automation where applicable [In Progress].

    September 2018

    9/30

    Multiple Services - Korea South

    Summary of impact: Between 13:50 and 19:10 UTC on 30 Sep 2018, a subset of Storage customers in Korea South may have experienced difficulties connecting to resources hosted in this region. Some services with dependencies on Storage in the region were also impacted. 

    Preliminary root cause: Engineers determined that a single storage scale unit experienced impact to a limited subset of its storage resources, and engineers are still investigating the underlying cause. 

    Mitigation: Engineers manually restarted the impacted Storage nodes to mitigate the issue. 

    Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

    9/26

    RCA - Multiple Services - Southeast Asia

    Summary of impact: Between 09:52 and 12:28 UTC on 26 Sep 2018, a subset of customers in Southeast Asia may have experienced latency or difficulties connecting to Virtual Machines and/or Cloud Service resources hosted in this region. Customers using a number of related services including: App Services, Redis Cache, Application Insights, Stream Analytics, and API Management may have also experienced latency or difficulties connecting to these services in the region.

    Workaround: This incident impacted a single scale unit in the Southeast Asia region.  Customers could have reduced the impact of this incident by using VM Scale Sets, or by deploying in more than one region and using Azure Traffic Manager to direct requests to a healthy deployment. 
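    Traffic Manager handles the DNS-based routing half of this workaround on the service side. As a hedged sketch of the client-side half, an application can also keep an ordered list of regional endpoints and fail over when the preferred region is unreachable; the hostnames below are placeholders, not real deployments:

```python
import urllib.request

# Placeholder regional deployments of the same application.
ENDPOINTS = [
    "https://myapp-southeastasia.example.net",
    "https://myapp-eastasia.example.net",
]

def call_with_regional_failover(path, timeout=5):
    """Try each regional endpoint in order; return the first successful response."""
    last_error = None
    for base in ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:
            last_error = exc  # region unhealthy or unreachable; try the next one
    raise RuntimeError("all regional endpoints failed") from last_error
```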

    Root cause and mitigation: In Azure's network architecture, each rack of servers in a data center row connects to 8 network switches located in the middle of the row. This provides 8-way network redundancy at the row level, and makes it very unlikely that failures of the row-level switches will impact customer traffic.

    In this incident, there was a failure of the RS-232 serial line aggregator that provides connectivity to the console ports of all 8 row-level switches. The flapping behavior of the console ports triggered a bug in the console line driver of the Operating System kernel running on the 8 row-level network switches, and caused a correlated failure of all 8 row-level switches. The failure of all 8 row-level switches led to loss of network connectivity to all servers in the racks located in that row. That, in turn, caused impact to all services hosted on the servers and caused the reboot of most VMs running on the servers.

    Engineers mitigated this incident by restarting the affected row-level switches and recovering impacted downstream services. 

    Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. We are taking the following steps to prevent a recurrence of this type of issue:

    1. Repair the tty communications subsystem in the operating system running on the switch to be resilient to console line state changes [In Progress]
    2. Add code to the switch to recover itself even if the tty communications subsystem fails [In Design]
    3. Add alerting to detect switches failing in this manner [Completed]

    Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.


    9/19

    Log Analytics - Latency, Timeouts and Service Management Failures

    Summary of impact:  Between 10:54 and 18:45 UTC on 19 Sep 2018, a subset of customers using Log Analytics and/or other downstream services may have experienced latency, timeouts or service management failures.

    Impacted services and experience: 

    Log Analytics - Difficulties accessing data, high latency and timeouts when getting workspace information, running Log Analytics queries, and other operations related to Log Analytics workspaces.
    Service Map - Ingestion delays and latency.
    Automation - Difficulties accessing the Update management, Change tracking, Inventory, and Linked workspace blades.
    Network Performance Monitor - Incomplete data in tests configured.

    Preliminary root cause: A backend Log Analytics service experienced a large number of requests which caused failures in several other dependent services.

    Mitigation: Engineers applied a backend scaling adjustment to mitigate the issue.

    Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

    9/15

    Virtual Machines - Metrics Unavailable

    Summary of impact: Between 06:30 UTC on 15 Sep 2018 and 22:35 UTC on 17 Sep 2018, a subset of customers may have experienced difficulties viewing Virtual Machines and/or other compute resource metrics. Metric graphs including CPU (average), Network (total), Disk Bytes (total) and Disk operations/sec (average) may have appeared empty within the Management Portal. Auto-scaling operations were potentially impacted as well. 

    Root cause: Scheduled maintenance was being completed on a gateway component within the metric ingestion pipeline. During this time one instance was taken down, leaving another instance up to service traffic. When this transition occurred, it was not properly handled by the publishing agent; as a result, it failed to continue publishing metrics for a subset of customers' Virtual Machines. In parallel, the agent uses a VIP to communicate with the ingestion gateway at the frontend. When the VIP was failed over, the agent did not properly detect the failover and re-establish connectivity to the functioning instance during the maintenance period. The gateway did not use reserved IPs, and received a new endpoint upon maintenance completion. These two triggers caused the agent to reach an invalid state from which it was not able to recover without manual mitigation.

    Mitigation: A change was made to the agent which forced it to recycle and re-establish a proper connection to the gateway component.  Once this occurred, metric publication began flowing as expected.
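    The metrics publishing agent itself is internal to Azure; the sketch below only illustrates, under placeholder names, the kind of failure recovery described in step 3 of the next steps: drop a stale connection after a publish failure and re-resolve the gateway's name on reconnect so that a new endpoint is picked up.

```python
import socket

class MetricsPublisher:
    """Generic publisher that drops and rebuilds its connection after a publish
    failure, re-resolving the gateway host name each time it reconnects."""
    def __init__(self, host, port, timeout=5):
        self.host, self.port, self.timeout = host, port, timeout
        self.sock = None

    def _connect(self):
        # create_connection resolves the host name on every call, so a gateway
        # that received a new address after maintenance is found on reconnect.
        self.sock = socket.create_connection((self.host, self.port), self.timeout)

    def publish(self, payload: bytes):
        if self.sock is None:
            self._connect()
        try:
            self.sock.sendall(payload)
        except OSError:
            # Connection is stale (for example, the VIP failed over); discard it
            # so the next publish attempt reconnects instead of failing forever.
            self.sock.close()
            self.sock = None
            raise  # caller can buffer the metric and retry

# Hypothetical usage: publisher = MetricsPublisher("metrics-gateway.example.net", 443)
```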

    Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.  Steps specific to this incident include:

    1. Further localization of the publication pipeline to minimize the potential for broad impact when future maintenance is performed. [In progress]
    2. Key monitoring gaps in the pipeline have been identified, so that issues of this nature can be detected and resolved more rapidly, without impact, in the future. [In progress]
    3. Failure recovery has been improved in the client to avoid long periods of publication to a non-healthy gateway. [In progress]

     
    9/15

    Metrics Missing - Germany

    Summary of impact: Between 06:30 UTC on 15 Sep 2018 and 22:35 UTC on 17 Sep 2018, a subset of customers may have experienced difficulties viewing Virtual Machines and/or other compute resource metrics. Metric graphs including CPU (average), Network (total), Disk Bytes (total) and Disk operations/sec (average) may have appeared empty within the Management Portal. Auto-scaling operations were potentially impacted as well. 

    Root cause: Scheduled maintenance was being completed on a gateway component within the metric ingestion pipeline. During this time one instance was taken down, leaving another instance up to service traffic. When this transition occurred, it was not properly handled by the publishing agent; as a result, it failed to continue publishing metrics for a subset of customers' Virtual Machines. In parallel, the agent uses a VIP to communicate with the ingestion gateway at the frontend. When the VIP was failed over, the agent did not properly detect the failover and re-establish connectivity to the functioning instance during the maintenance period. The gateway did not use reserved IPs and received a new endpoint upon maintenance completion. These two triggers caused the agent to reach an invalid state from which it was not able to recover without manual mitigation.

    Mitigation: A change was made to the agent which forced it to recycle and re-establish a proper connection to the gateway component. Once this occurred, metric publication began flowing as expected. 

    Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.  Steps specific to this incident include:

    1. Further localization of the publication pipeline to minimize the potential for broad impact when future maintenance is performed. [In progress]
    2. Key monitoring gaps in the pipeline have been identified, so that issues of this nature can be detected and resolved more rapidly, without impact, in the future. [In progress]
    3. Failure recovery has been improved in the client to avoid long periods of publication to a non-healthy gateway. [In progress]