Virtual Machine Scale Sets
Web and Mobile
Data and Storage
Managed Cache Service
SQL Data Warehouse
SQL Server Stretch Database
Bing Autosuggest API
Face API
Language Understanding Intelligent Service
Recommendations API
Bing Search APIs
Bing Speech API
Bing Spell Check API
Text Analytics API
Web Language Model API
Data Lake Analytics
Power BI Embedded
Internet of Things
Azure IoT Hub
Media and CDN
Live and On-Demand Streaming
Identity and Access Management
Azure Active Directory
Enterprise State Roaming
Azure Active Directory Domain Services
Azure Active Directory B2C
Access Control Service
Visual Studio Team Services
Build and Deployment/Build (XAML)
Visual Studio Application Insights
Azure DevTest Labs
Microsoft Azure Classic Portal
Azure Resource Manager
Microsoft Azure Portal
The Australia regions are available only to customers with a business presence in Australia or New Zealand.
The India regions are available only to volume licensing customers and partners with a local enrollment in India. The India regions will open to direct online Azure subscriptions in 2016.
Summary of impact: Between as early as 22:12 UTC on 21 Jul 2016 and 00:25 UTC on 29 Jul 2016, a subset of customers using Visual Studio Team Services \ Build & Deployment/Build (XAML) in multiple regions may have experienced NuGet Publisher and NuGet Installer build task failures. For a subset of customers using a particular configuration option in a build, package downloads would have succeeded, but the packages may not have been delivered to the expected location. In addition, a very limited subset of customers may not have been able to download a particular NuGet package due to an error reading "Package contains multiple nuspec files."
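For context on the "Package contains multiple nuspec files" error above: a .nupkg is a ZIP archive that is expected to contain exactly one .nuspec manifest. The following is a minimal, hypothetical Python sketch (not an official NuGet tool) for checking a local package; the package path is an assumption.

```python
# Minimal sketch (not an official NuGet tool): count the .nuspec manifests inside
# a .nupkg, which is a ZIP archive. More than one manifest would trigger the
# "Package contains multiple nuspec files." error. The path below is hypothetical.
import zipfile

def count_nuspec_entries(nupkg_path: str) -> int:
    """Return the number of .nuspec manifests found inside a .nupkg archive."""
    with zipfile.ZipFile(nupkg_path) as pkg:
        return sum(1 for name in pkg.namelist() if name.lower().endswith(".nuspec"))

if __name__ == "__main__":
    path = "MyLibrary.1.0.0.nupkg"  # hypothetical package path
    count = count_nuspec_entries(path)
    print(f"{path}: {count} .nuspec file(s)")
    if count > 1:
        print("This package would be rejected with 'Package contains multiple nuspec files.'")
```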
Preliminary root cause: Engineers identified a software error in a recent deployment as a preliminary root cause.
Mitigation: Engineers rolled back the deployment in order to mitigate the issue.
Next steps: We will further investigate the preliminary root cause for this issue to ensure that there are no further recurrences.
Summary of impact: Between 20:14 UTC on 24 Jul 2016 and 03:49 UTC on 25 Jul 2016, customers using SQL Database and SQL Data Warehouse in Australia Southeast may have experienced control plane operation failures. Create Database, Create Server, Database Restore, and Server Restore operations would have failed for customers during this time. Additionally, updating the SLO (service level objective) to a higher edition (such as updating Standard to Premium) would also have resulted in a failure. Resume Data Warehouse was also impacted as a result of this incident. Connections and queries to existing Databases and Data Warehouses were unaffected during this time.
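For reference, the SLO change mentioned above can be requested with the documented ALTER DATABASE ... MODIFY statement against the logical server. Below is a minimal sketch that issues it from Python via pyodbc; the server, database, credentials, and driver name are placeholders, not values from this incident.

```python
# Hypothetical sketch: request a service level objective (SLO) upgrade for an
# Azure SQL database (e.g., Standard -> Premium). All connection details below
# are placeholders.
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"      # placeholder driver name
    "SERVER=myserver.database.windows.net;"        # placeholder logical server
    "DATABASE=master;"                             # run ALTER DATABASE from master
    "UID=admin_user;PWD=placeholder_password"
)

# ALTER DATABASE cannot run inside a user transaction, so enable autocommit.
conn = pyodbc.connect(conn_str, autocommit=True)
try:
    conn.execute(
        "ALTER DATABASE [mydb] MODIFY (EDITION = 'Premium', SERVICE_OBJECTIVE = 'P1');"
    )
    # The statement returns quickly; the scale operation completes asynchronously.
finally:
    conn.close()
```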
Preliminary root cause: A recent deployment was consuming a larger than expected amount of resources while rolling out to scale units in the region. The amount of resource utilization resulted in the observed control plane failures.
Mitigation: Azure Engineers manually throttled the deployment to roll out incrementally throughout the region to decrease resource utilization and allow control plane operations to complete as expected.
Next steps: Engineers will perform further root cause analysis of this issue and will review the update process that contributed to this incident in order to prevent recurrences of this scenario.
Summary of impact: Between as early as 20:25 UTC on 21 Jul 2016 and 20:42 UTC on 22 Jul 2016, a subset of customers using Visual Studio Team Services \ Build & Deployment/Build (XAML) in South Central US may have experienced issues while trying to trigger a build. Errors may have included “Your Account has no Build or Release minutes. Retry your job later.”
Preliminary root cause: Engineers uncovered a software error in a configuration setting related to the purchase of Build minutes.
Mitigation: Engineers deployed a hotfix to mitigate the issue and have confirmed that the system is in a healthy state.
Next steps: Continue to investigate the underlying root cause to prevent any recurrences. More information may be available at the VSTS blog: http://aka.ms/vstsblog
Summary of impact: Between 17:20 and 17:46 UTC on 21 Jul 2016, a subset of customers using Storage in West Europe may have experienced timeouts while accessing their Storage resources. Additionally, some customers may have experienced failures or unexpected reboots of their Virtual Machines. While a majority of Virtual Machine customers have experienced full recovery, we have identified a very limited subset of customers who may be experiencing continued impact. These customers will receive further communications through their Management Portal (https://portal.azure.com).
Preliminary root cause: Engineers have identified that a recent deployment unexpectedly caused a backend software configuration issue.
Mitigation: Engineers have stopped the deployment, which brought the system back to a healthy state.
Next action: Investigate the underlying root cause of this issue and develop a solution to prevent recurrences.
Summary of impact: Between 02:29 UTC and 15:15 UTC on 21 Jul 2016, customers using Visual Studio Team Services (VSTS) in multiple regions may have encountered error messages when attempting to access their VSTS account via the Classic Azure Portal (https://manage.windowsazure.com). Affected customers may have been unable to perform Azure Active Directory (AAD) operations.
Preliminary root cause: Engineers have identified a backend configuration issue.
Mitigation: Engineers manually updated a registry entry to allow AAD operations to succeed from inside the management portal.
Next steps: Engineers will investigate the underlying root cause to understand how the misconfiguration occurred.
Engineers have validated recovery for all the downstream Azure services impacted by the SQL incident in East US. Between 21:30 UTC on 20 Jul 2016 and 01:26 UTC on 21 Jul 2016, a limited subset of customers using Service Bus, Media Services, App Services \ Web Apps, Visual Studio Team Services, Dev Test Labs and Mobile Services may have experienced issues connecting to their services. For more details, please refer to the SQL incident on the Azure Status History page.
Summary of impact: Starting at 21:30 UTC on 20 Jul 2016, 3% of Azure SQL DB databases in the East US region became unavailable. Existing connections were terminated and new connections to these databases were refused. This persisted until 01:25 UTC on 2016-07-21.
Background: Azure SQL DB capacity in a region is divided into units called rings. Each ring consists of about 200 VMs belonging to a single Azure Hosted Service and forming one Azure Service Fabric cluster. Service Fabric provides leader election and a directory of the location of the primary replica for each database, with each database represented by a Service Fabric application. These capabilities require all VMs in the cluster to be communicating with each other. Clusters are as resilient as possible to network partitions, brownouts and other communication failure scenarios, but when sufficiently severe communication issues occur, the cluster will "fail", meaning it can no longer provide leader election and location services, making all databases unavailable. Service Fabric clusters are designed to automatically re-form, a process that usually takes about 45 minutes.
Alerts regarding login failures were received starting at 21:36 UTC. By ~22:00 UTC engineers determined that one ring of capacity was completely unavailable due to the Service Fabric cluster being down. By ~23:00 UTC engineers determined that the cluster was not re-forming automatically as expected, and an extended outage was declared. Engineers divided work into 3 streams:
1. Communicate intermediate recovery options to customers, and assist any customers using geo-restore or geo-replication to fail over.
2. Restore all databases from the last backup (taken every 5 minutes) to another ring in the same region.
3. Reseed the ring and bring all databases back to availability.
Workstream #1 issued the extended data plane outage communication at 23:15, stating that 1) availability would be restored and 2) customers could geo-restore or geo-failover (if using geo-replication) to regain availability sooner, at the risk of divergence because the restore point or failover point would not include all committed transactions on the original databases. Workstream #2 started at 23:00 and was ready to start restoring databases at 23:40; because ring reformation had started by that point, this workstream was abandoned. Workstream #3 determined that the VMs in the ring were able to communicate with each other, but an error in Azure Service Fabric ring formation was preventing the ring from re-forming. An in-flight OS update to VMs in that ring was paused, and the ring was manually re-formed by temporarily shutting down a number of VMs, allowing the ring to re-form, and then gradually adding VMs back. This allowed the ring to re-form at 00:30 UTC on 2016-07-21. Ring reformation and startup of all services (databases) took approximately 45 minutes, during which time the login success rate gradually climbed to 100%. At 01:25 UTC the incident was determined mitigated.
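To make the background above more concrete, here is a simplified, hypothetical sketch (not Service Fabric's actual protocol). It only illustrates the statement that the ring's leader election and placement lookups require the VMs to keep communicating with each other, so a severe communication failure takes every database on the ring offline until the cluster re-forms.

```python
# Simplified illustration only (not Service Fabric's actual algorithm): treat the
# ring as healthy while all of its VMs still form one communicating group, and
# "failed" (no leader election / placement lookups, all databases unavailable)
# once communication failures split them apart.
from itertools import combinations

def ring_can_serve(vms, reachable_links):
    """True if every VM can still reach every other VM (one connected group)."""
    parent = {v: v for v in vms}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in reachable_links:
        parent[find(a)] = find(b)

    roots = {find(v) for v in vms}
    return len(roots) == 1

vms = list(range(10))                          # toy ring of 10 VMs (real rings: ~200)
healthy = list(combinations(vms, 2))           # every VM reaches every other VM
print(ring_can_serve(vms, healthy))            # True  -> databases stay available

# A severe communication failure splits the ring into two isolated halves:
split = [(a, b) for a, b in healthy if (a < 5) == (b < 5)]
print(ring_can_serve(vms, split))              # False -> all databases unavailable
```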
Customer impact: Customers and dependent services saw connections terminated and new connections refused to the impacted databases.
Workaround: Azure SQL DB takes log backups every 5 minutes, which are stored in Azure Storage and geo-replicated to a secondary region. A customer can restore a database to any point in time in the last 14-30 days (depending on edition) from these backups. The restore takes from 10 minutes to 12 hours depending on the size of the database and the performance tier selected. Customers performing a restore could regain read availability of the database and could allow writes to it, although doing so would cause up to 5 minutes of data divergence after the original database became available again. Azure SQL DB also offers additional read-replicas, which can be in the same region or in other regions, with less than 5 seconds of lag between the primary and the secondary replica. A secondary read-replica can take over as primary at any point, a process called failover, which takes about 30 seconds. Customers with this configuration could have had uninterrupted read availability of the database and could have allowed writes, with up to 5 seconds of data divergence after the original database became available again. See https://azure.microsoft.com/en-us/documentation/articles/sql-database-geo-replication-overview/ and https://azure.microsoft.com/en-us/documentation/articles/sql-database-recovery-using-backups/. When Azure SQL DB communicates an extended outage notification, it is recommended that customers restore, or fail over if using read-replicas, at least to the point of regaining read availability.
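As a rough illustration of the restore workaround described above, the sketch below submits a point-in-time restore request through the Azure Resource Manager REST API, creating a new database from the automatic backups. The subscription, resource group, server and database names, region, restore point, and bearer token are all placeholders, and the property names and api-version should be verified against current Azure documentation before use.

```python
# Hypothetical sketch of a point-in-time restore request for an Azure SQL database
# via the Azure Resource Manager REST API. All identifiers below are placeholders,
# and property names / api-version should be checked against current documentation.
import requests

SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"   # placeholder subscription id
RESOURCE_GROUP = "my-resource-group"                    # placeholder
SERVER = "myserver"                                     # placeholder logical server
SOURCE_DB = "mydb"                                      # database to restore
TARGET_DB = "mydb-restored"                             # new database to create
TOKEN = "<Azure AD bearer token for https://management.azure.com/>"

source_id = (
    f"/subscriptions/{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.Sql/servers/{SERVER}/databases/{SOURCE_DB}"
)
url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.Sql/servers/{SERVER}"
    f"/databases/{TARGET_DB}?api-version=2014-04-01"
)
body = {
    "location": "eastus",
    "properties": {
        "createMode": "PointInTimeRestore",           # restore from automatic backups
        "sourceDatabaseId": source_id,
        "restorePointInTime": "2016-07-20T21:25:00Z"  # a point just before the impact
    },
}

resp = requests.put(url, json=body, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
print("Restore accepted; the operation completes asynchronously:", resp.status_code)
```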
Root cause and mitigation: The failure was caused by an error in the communication protocol between VMs belonging to the Service Fabric cluster. This error had been previously identified, and a fix exists in the already released Service Fabric 5.0 version. At the time of the incident, Azure SQL DB was in the process of validating that release prior to upgrading to it. The automatic cluster reformation failure was caused by a race condition in the reformation algorithm. The chance of encountering this race condition is proportional to the number of VMs in the cluster; by reducing the number of VMs, the cluster was able to re-form. This error had also been previously identified, and the fix exists in the Service Fabric 5.0 version.
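As a small numerical illustration of the "proportional to the number of VMs" observation: under a simplified model (an assumption for illustration, not taken from the incident analysis) in which each VM independently hits the race with a small probability p during reformation, the chance that at least one VM hits it is 1 - (1 - p)^N, roughly N*p when N*p is small, so temporarily shrinking the cluster sharply reduces the risk.

```python
# Simplified model (illustration only): if each of N VMs independently hits the
# reformation race condition with a small probability p, then the chance that at
# least one VM hits it is 1 - (1 - p)**N, roughly N * p when N * p is small.
# Reducing N (as engineers did by temporarily shutting down VMs) lowers the risk.
p = 0.001  # hypothetical per-VM chance of hitting the race during one reformation

for n in (200, 100, 20):
    at_least_one = 1 - (1 - p) ** n
    print(f"N={n:>3}: P(race somewhere) = {at_least_one:.3f}  (linear approx {n * p:.3f})")
```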
Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
1) Deploy the fix for the error leading to ring failure, included in Service Fabric 5.0 and above. This is currently deploying to all Azure SQL DB capacity worldwide; all Azure infrastructure upgrades to Azure SQL DB capacity have been paused until this is complete.
2) Deploy the fix for the error preventing ring reformation, also in Service Fabric 5.0 and above. This is part of the same deployment as #1.
3) Improve diagnosis speed for ring reformation failure. The target is to be able to detect this within 15 minutes.
4) Improve guidelines for manual ring reformation so it can occur more rapidly. The target is to be able to perform this within 15 minutes.
5) For Azure SQL databases whose durable persistent form is in Azure Storage (currently the Basic and Standard performance tiers), implement an automated mechanism to create new Service Fabric applications in other rings in the same region. The fully transactionally consistent database files were available in Azure Storage throughout the incident, but no mechanism currently exists to attach those files to capacity in other rings in the same region. This would allow mitigation of these types of incidents, and of similar compute and network failures localized to a single Azure Hosted Service, to occur in minutes instead of hours.
Summary of impact: Between 15:15 and 20:10 UTC on 20 Jul 2016, a subset of customers in North Central US may have experienced increased latency and potential timeout errors while using their Azure Web Apps services.
Preliminary root cause: Engineers identified an infrastructure issue that was causing slowdowns in HTTP traffic.
Mitigation: Engineers applied mitigation measures for the infrastructure issue to reduce the HTTP latencies to acceptable levels.
Next steps: Continue to investigate the underlying root cause of this issue and develop a solution to prevent recurrences.
Summary of impact: Beginning at or around 14:29 UTC on 18 Jul, 2016, customers hosted in the Azure West US region may have seen intermittent networking failures and latency connecting to resources hosted in that region. Upon detection, mitigating steps were taken to isolate the impacted networking infrastructure, which was completed by 15:20 UTC.
Customer impact: Intermittent networking failures and latency connecting to the resources hosted in West US. Network traffic internal to the Azure region should not have been impacted, but connectivity in and out of the region may have been intermittent.
Root cause and mitigation: A Microsoft network backbone router experienced a failure of one of its route processor cards. There are two redundant cards in the device, but during the recovery from the failed card, which should have caused only a brief period of reconvergence, the router experienced a software issue with communications between the redundant route processor and its internal switching fabric. The error condition on the router resulted in some dropped traffic in the West US region. Microsoft Engineers removed the router from the network to mitigate the issue, replaced the failed hardware on the device to fully mitigate the issue, and placed the router back into rotation.
Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to): 1. Repair the failed hardware on the device and return the device to production service - Completed. 2. Work with the equipment manufacturer to create a software fix to prevent this condition from recurring. 3. Release the remediation fix to all devices worldwide.
Summary of impact: Between 14:30 and 21:33 UTC on 13 Jul 2016, a subset of customers using Visual Studio Team Services with Microsoft Accounts in multiple regions may have experienced issues creating Team Foundation version control projects.
Preliminary root cause: At this stage we do not have a definitive root cause.
Mitigation: Engineers have disabled a feature flag to mitigate this incident.
Next steps: Continue to investigate the underlying root cause and develop a solution to prevent recurrences. More information may be available at the VSTS blog: http://aka.ms/vstsblog