Azure status history

March 2017

25.3

Intermittent Authentication Failures due to Underlying Azure Active Directory Issue

Summary of impact: Between 21:23 on 24 Mar 2017 and 00:35 UTC on 25 Mar 2017, a subset of customers using Azure Active Directory to authenticate to their Azure resources may have experienced failures. This included authentications using the Management portal, PowerShell, command line interfaces, and other authentication providers. Engineers confirmed that impacted services included Power BI Embedded, Visual Studio Team Services, Log Analytics, Azure Data Lake Analytics, Azure Data Lake Store, Azure Data Catalog, Application Insights, Stream Analytics, Key Vault, and Azure Automation.

Preliminary root cause: Engineers do not yet have a definitive root cause and are continuing to investigate the issue.

Mitigation: Engineers redistributed authentication traffic to mitigate the issue.

Next steps: Engineers are continuing to monitor and will conduct a full Root Cause Analysis.

23.3

RCA – Data Lake Analytics and Data Lake Store – East US 2

Summary of impact:

Between approximately 15:02 UTC and 17:25 UTC on March 22, a subset of customers experienced intermittent failures when accessing Azure Data Lake Store (ADLS) in East US 2. This may have caused Azure Data Lake Analytics (ADLA) job and/or ADLS request failures. The incident was caused by a misconfiguration that prevented one of the ADLS microservices from serving customer requests.

Azure Monitoring detected the event and an alert was triggered. Azure engineers engaged immediately and mitigated the issue by reverting the configuration that triggered the incident.

Customer impact: A subset of customers in East US 2 had intermittent failures when accessing their Azure Data Lake Service during the timeframe mentioned above.

Workaround:

Retry the failed operation. If a customer used the Azure Data Lake Store Java SDK 2.1.4 to access the service, the SDK would have retried automatically and some of the requests may have succeeded, with higher latencies.
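
As a loose illustration of the "retry the failed operation" workaround (not the ADLS SDK's own retry policy, which SDK 2.1.4 applies automatically), the Java sketch below wraps an arbitrary operation in a retry loop with exponential backoff. The attempt count, delay values, and exception handling are placeholder assumptions.

```java
import java.util.concurrent.Callable;

/** Minimal retry-with-exponential-backoff helper; values are illustrative only. */
public final class RetryHelper {

    public static <T> T withRetries(Callable<T> operation, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMillis = 500;               // initial backoff (assumed value)
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();      // e.g. an ADLS read/write issued through the SDK
            } catch (Exception e) {           // a real client would retry only transient errors
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                    delayMillis *= 2;         // exponential backoff between attempts
                }
            }
        }
        throw lastFailure;                    // all attempts failed
    }
}
```

A caller would pass its ADLS request as the `Callable`, for example a directory listing or file read issued through the SDK client, together with a small maximum attempt count.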

Root cause and mitigation:

A misconfiguration in one of the microservices that ADLS depends on caused ADLS microservice instances to stop serving requests after they, or the instances they resided on, were restarted (such restarts are part of the regular maintenance process). As more instances were restarted, fewer instances of the affected ADLS microservice remained to serve requests, which resulted in high latencies and, consequently, request failures. This led to intermittent Azure Data Lake Analytics job and ADLS request failures; jobs submitted during the incident are expected to have been queued and run after the incident, which would have delayed their start. Azure engineers identified and reverted the configuration at fault. Once the configuration fix was propagated as an expedited change and the services were restarted, the affected service instances started serving requests again, which mitigated the issue.


Was the issue detected?
The issue was detected by our telemetry. An alert was raised, which prompted Azure engineers to investigate and mitigate the Azure Data Lake Store issue.


To achieve the quickest possible notification of Service Events to our customers, the Azure infrastructure has a framework that automates the stream from alert to Service Health Dashboard and/or Azure Portal notifications. Unfortunately, this class of alert does not yet contain the correlation needed for that automation. We did surface this outage via the Resource Health feature for customers' ADLA and ADLS account(s) in this region. We will continue to implement notification automation, as well as ensure that manual communications protocols are followed as quickly as possible. In this incident, the issue was announced on the Service Health Dashboard manually.

Next steps:

We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

• Fix the misconfiguration. [Completed]
• Improve testing by adding more validations for this sort of configuration error to prevent future occurrences. [Completed]
• Separate the defined configuration container so that it limits the impact of future occurrences to smaller slices of the service. [In progress]
• Improve the format of the configuration file to make the code review more efficient in the future. [In progress]
• Add alerts deeper in the stack so that there are earlier/redundant indications of a similar problem, enabling faster debugging. [In progress]
• Rearchitect the initialization phase of the affected service to reduce the dependency on the service that was originally impacted. [Long term]

Provide feedback:

Please help us improve the Azure customer communications experience by taking our survey.

23.3

Azure Active Directory

Summary of impact: Between 22:00 UTC on 22 Mar 2017 and 00:30 UTC on 23 Mar 2017, Azure Active Directory encountered an issue affecting service-to-service authentication, which impacted multiple Azure services. Azure Resource Manager, Logic Apps, Azure Monitor, Azure Data Lake Analytics, and Azure Data Lake Store customers may have experienced intermittent service management issues as downstream impact. Azure Active Directory customers themselves would not have seen impact.

Preliminary root cause: Engineers identified a recent configuration change as the potential root cause.

Mitigation: Engineers rolled back the configuration change to mitigate the issue.

22.3

Increased latency accessing Azure resources

Summary of impact: Between 23:23 on 21 Mar 2017 and 01:35 UTC on 22 Mar 2017, a subset of customers may have experienced increased latency or network timeouts while attempting to access Azure resources with traffic passing through the East Coast. 

Preliminary root cause: Engineers have identified the root cause as infrastructure impacted by a facilities issue at a third-party regional routing site located in the Southeast US.

Mitigation: Engineers executed standard procedures to redirect traffic and isolate the impacted facility, restoring expected routing availability.

21.3

Microsoft Accounts Experiencing Login Failures

Summary of impact: Between 17:30 and 18:55 UTC on 21 Mar 2017, a subset of Azure customers may have experienced intermittent login failures while authenticating with their Microsoft Accounts. This would have impacted customers' ability to authenticate to the Azure management portal, PowerShell, or other workflows requiring Microsoft Account authentication. Customers authenticating with Azure Active Directory or organizational accounts were unaffected. Customers who already had an active session authenticated with a Microsoft Account before the impact started would also have been unaffected.

Preliminary root cause: Engineers are investigating the full root cause.

Mitigation: Engineers deployed a fix to mitigate the issue. Users may need to exit or refresh their browsers in order to successfully sign in.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

21.3

Storage and Virtual Machines - West Europe

Summary of impact: Between 12:15 and 14:28 UTC on 21 Mar 2017, a subset of customers using Storage and Virtual Machines in West Europe may have experienced latency when accessing their resources in this region. Requests experiencing latency would likely have eventually succeeded.

Preliminary root cause: Engineers are continuing to investigate the underlying cause at this time.

Mitigation: The issue was self-healed by the Azure platform.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

18.3

Log Analytics - East US

Summary of impact: Between 02:47 and 13:15 UTC on 18 Mar 2017, a subset of customers using Log Analytics in East US may have received intermittent failure notifications - such as internal server errors, or a tile or blade not loading - when performing log search operations.

Preliminary root cause: Engineers determined that a backend service responsible for processing service management requests had reached an operational threshold, preventing requests from completing.

Mitigation: Engineers cleared out the service management request queues and restarted the backend service to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

16.3

RCA - Storage Availability in East US

Summary of impact: Beginning at 22:19 UTC Mar 15 2017, due to a power event, a subset of customers using Storage in the East US region may have experienced errors and timeouts while accessing storage accounts or resources dependent upon the impacted Storage scale unit. As part of standard monitoring, Azure engineering received alerts for availability drops for a single East US Storage scale unit. Additionally, data center facility teams received power supply failure alerts which were impacting a limited portion of the East US region. Facility teams engaged electrical engineers, who were able to isolate the area of the incident and restore power to critical infrastructure and systems. Power was restored using safe power recovery procedures, one rack at a time, to maintain data integrity. Infrastructure services started recovery around 00:42 UTC Mar 16 2017. 25% of impacted racks had been recovered by 02:53 UTC Mar 16 2017. Software Load Balancing (SLB) services were able to establish a quorum at 05:03 UTC Mar 16 2017. At that point, approximately 90% of impacted racks had been powered on successfully and recovered. Storage and all storage-dependent services recovered successfully by 08:32 UTC Mar 16 2017. The Azure team notified customers who experienced residual impact to Virtual Machines after mitigation to assist with recovery.

Customer impact: A subset of customers using Storage in the East US region may have experienced errors and timeouts while accessing their storage account in a single Storage scale unit. Virtual Machines with VHDs hosted in this scale unit shut down as expected during this incident and had to be restarted at recovery. Customers may have also experienced the following:

- Azure SQL Database: approx. 1.5% of customers in the East US region may have seen failures while accessing SQL Database.
- Azure Redis Cache: approx. 5% of the caches in this region experienced availability loss.
- Event Hub: approx. 1.1% of customers in the East US region experienced intermittent unavailability.
- Service Bus: this incident affected the Premium SKU of the Service Bus messaging service. 0.8% of Service Bus premium messaging resources (queues, topics) in the East US region were intermittently unavailable.
- Azure Search: approx. 9% of customers in the East US region experienced unavailability. We are working on making Azure Search services resilient to this sort of incident so that they can continue serving requests without interruption in the future.
- Azure Site Recovery: approx. 1% of customers in the East US region found that their Site Recovery jobs were stuck in a restarting state and eventually failed. Azure Site Recovery engineering started these jobs manually after the incident mitigation.
- Azure Backup: backup operations would have failed during the incident; after the mitigation, the next backup cycle for affected Virtual Machine(s) will start automatically at the scheduled time.

Workaround: Virtual Machines using Managed Disks in an Availability Set would have maintained availability during this incident. For further information, including an overview of Managed Disks and guidance on how to migrate to Managed Disks, please see the Azure documentation.

- Azure Redis Cache: although caches are region-sensitive for latency and throughput, pointing applications to a Redis Cache in another region could have provided business continuity.
- Azure SQL Database: customers who had SQL Database configured with active geo-replication could have reduced downtime by performing a failover to the geo-secondary. This would have caused a loss of less than 5 seconds of transactions. Another workaround is to perform a geo-restore, with a loss of less than 5 minutes of transactions. Please see the Azure SQL Database documentation for more information on these capabilities; a rough client-side fallback sketch follows this list.
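
As a loose illustration of these client-side workarounds (not an official sample), the Java sketch below tries a primary endpoint first and falls back to a cache in another region or to a geo-secondary database server. All host names, ports, and credentials are hypothetical placeholders, and for SQL Database the geo-secondary only becomes writable after the service-side failover or geo-restore described above has been performed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import redis.clients.jedis.Jedis;

/** Illustrative client-side fallback logic; endpoints and credentials are placeholders. */
public final class RegionalFallback {

    // Hypothetical cache endpoints in two regions.
    private static final String PRIMARY_CACHE   = "mycache-eastus.redis.cache.windows.net";
    private static final String SECONDARY_CACHE = "mycache-westus.redis.cache.windows.net";

    /** Try the primary cache; if it is unreachable, point the application at the secondary. */
    static Jedis connectToCache() {
        try {
            Jedis primary = new Jedis(PRIMARY_CACHE, 6380);  // production code would also enable SSL and auth
            primary.ping();                                  // verify the endpoint responds
            return primary;
        } catch (Exception e) {
            return new Jedis(SECONDARY_CACHE, 6380);         // fall back to the cache in another region
        }
    }

    // Hypothetical logical server names; the geo-secondary only accepts writes after a
    // service-side failover (or geo-restore) has been performed as described above.
    private static final String PRIMARY_SQL =
        "jdbc:sqlserver://myserver-eastus.database.windows.net:1433;databaseName=mydb;user=appuser;password=...;encrypt=true";
    private static final String SECONDARY_SQL =
        "jdbc:sqlserver://myserver-westus.database.windows.net:1433;databaseName=mydb;user=appuser;password=...;encrypt=true";

    /** Connect to the primary database; on failure, connect to the geo-secondary. */
    static Connection connectToDatabase() throws SQLException {
        try {
            return DriverManager.getConnection(PRIMARY_SQL);
        } catch (SQLException e) {
            return DriverManager.getConnection(SECONDARY_SQL);
        }
    }
}
```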

Root cause and mitigation: Initial investigation revealed that one of the redundant upstream remote power panels for this storage scale unit experienced a main breaker trip. This was followed by a cascading power interruption as load transferred to the remaining sources, resulting in power loss to the scale unit, including all server racks and the network rack. Data center electricians restored power to the affected infrastructure. A thorough health check was completed after power was restored, and any suspect or failed components were replaced and isolated. Suspect and failed components are being sent for analysis.

Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):

- The failed rack power distribution units are being sent off for analysis. Root cause analysis continues with site operations, facility engineers, and equipment manufacturers.
- To further mitigate the risk of recurrence, site operations teams are evacuating the servers to perform a deep Root Cause Analysis to understand the issue.
- Review the Azure services impacted by this incident to help them tolerate this sort of incident with minimal disruption, by maintaining service resources across multiple scale units or implementing a geo-redundancy strategy.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

16.3

RCA - Storage provisioning impacting multiple services

Summary of impact: Between 21:42 UTC on 15 Mar 2017 and 00:01 UTC on 16 Mar 2017, customers may have experienced errors or failures in creating, managing, or deleting storage accounts. Alerts were received and engineering engaged to investigate the failing operations. Investigation pointed to service operations that were dependent on storage account management operations, which were intermittently failing. Engineers identified an issue in Storage Resource Provider (SRP) services and began work to remediate the issue. A mitigation was implemented and full recovery of the SRP services was confirmed as service management operations returned to expected availability levels. Note: Azure tables, blobs, disks, files, and queues in existing storage accounts maintained the expected read/write availability SLA during this incident.

- Virtual Machine or Cloud Service customers may have experienced failures when attempting to provision resources.
- Storage customers may have been unable to provision new Storage resources or perform service management operations on existing resources.
- Azure Search customers may have been unable to create, scale, or delete services.
- Azure Monitor customers may have been unable to turn on diagnostic settings for resources.
- Azure Site Recovery customers may have experienced replication failures.
- API Management 'service activation' operations may have failed.
- Azure Batch customers may have been unable to provision new resources, but all existing Azure Batch pools would have scheduled tasks as normal.
- Event Hub customers may have experienced failures in managing the 'Archive' feature.
- Visual Studio Team Services Build customers may have experienced failures and delays in build processing.

Root cause and mitigation: Storage Resource Provider (SRP) services handle storage account management operations for all regions (create/update/delete/list). These services were affected by a code defect, the result of a race condition, which caused degradation in the master component of the service and resulted in intermittent failures for requests to the SRP services. A hotfix was put in place to mitigate the code defect, which allowed the SRP services to recover to expected availability levels. There was no impact to Standard or Premium Storage service availability.
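
The report does not include the affected code, but as a generic, purely illustrative sketch of this class of defect (not the actual SRP implementation), the Java snippet below shows an unsynchronized lazy initialization that two threads can race on, alongside a thread-safe variant that publishes the shared state atomically.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Generic illustration of a lazy-initialization race condition (not the actual SRP code). */
final class ConfigCache {

    private ExpensiveConfig config;                      // shared, unsynchronized state

    /** Racy version: two threads can both see null and build (and partially publish) the config. */
    ExpensiveConfig getRacy() {
        if (config == null) {                            // check...
            config = ExpensiveConfig.load();             // ...then act: not atomic, so a race is possible
        }
        return config;
    }

    private final AtomicReference<ExpensiveConfig> safeConfig = new AtomicReference<>();

    /** Safer version: publication happens atomically, so at most one value ever wins. */
    ExpensiveConfig getSafe() {
        ExpensiveConfig current = safeConfig.get();
        if (current == null) {
            safeConfig.compareAndSet(null, ExpensiveConfig.load());
            current = safeConfig.get();                  // re-read whichever value won the race
        }
        return current;
    }

    /** Placeholder for whatever expensive shared state the real service builds. */
    static final class ExpensiveConfig {
        static ExpensiveConfig load() { return new ExpensiveConfig(); }
    }
}
```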

Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):

- Implement self-healing mitigations so that the Storage Resource Provider components can automatically recover even in the event of such failures recurring.
- Storage Resource Provider resiliency improvements are in progress, including regional resiliency to limit future impact scope.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

15.3

Azure Media Services and Media Services \ Streaming - Multiple Regions

Summary of impact: Between 00:30 and 05:27 UTC on 15 Mar 2017, a subset of customers using Media Services \ Streaming and Media Services may have experienced issues authenticating to or creating new Media Services accounts, and the streaming of encrypted/protected media may have been degraded. Channel operations may also have experienced latency or failures.

Preliminary root cause: A back end Storage service responsible for processing requests had reached an operational threshold, preventing requests from completing.

Mitigation: Engineers decreased the load on the Storage service by reallocating internal resources to other storage accounts.

Next steps: Engineering will implement automation so that this task does not require manual intervention. In addition, engineering teams will examine the back end Storage dependency and implement changes to reduce the chances of future customer impact.