The Windows Azure Malfunction This Weekend

Posted on March 17, 2009
Tag: Announcements

First things first: we’re sorry. As a result of a malfunction in Windows Azure, many participants in our Community Technology Preview (CTP) experienced degraded service or downtime. Windows Azure storage was unaffected.

In the rest of this post, I’d like to explain what went wrong, who was affected, and what corrections we’re making.

What Happened?

During a routine operating system upgrade on Friday, March 13th, the deployment service within Windows Azure began to slow down due to networking issues. This caused a large number of servers to time out and fail.

Once these servers failed, our monitoring system alerted the team. At the same time, the Fabric Controller automatically initiated steps to recover affected applications by moving them to different servers. The Fabric Controller is designed to be very cautious about taking broad recovery steps, so it began recovering only a few applications at a time. Because this serial process was taking much too long, we decided to pursue a parallel update process, which successfully restored all applications.
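To give a rough sense of why recovering a few applications at a time can take so long, here is a minimal back-of-the-envelope sketch in Python. It is not the Fabric Controller's actual algorithm. The function name `recovery_time`, the batch sizes, and the per-round duration are all hypothetical values chosen for illustration.

```python
import math


def recovery_time(num_apps: int, batch_size: int, minutes_per_round: float) -> float:
    """Estimate total minutes to recover num_apps applications.

    Toy model (an assumption, not the real system): recovery proceeds in
    rounds, each round recovers up to batch_size applications at once,
    and every round takes the same fixed time.
    """
    if num_apps < 0 or batch_size <= 0:
        raise ValueError("num_apps must be >= 0 and batch_size must be > 0")
    rounds = math.ceil(num_apps / batch_size)
    return rounds * minutes_per_round


# Hypothetical numbers: 1000 affected applications, 2 minutes per round.
cautious = recovery_time(1000, batch_size=5, minutes_per_round=2.0)
broad = recovery_time(1000, batch_size=250, minutes_per_round=2.0)
print(f"a few at a time: {cautious} minutes, in parallel: {broad} minutes")
```

Under these made-up numbers the cautious approach needs 200 rounds and the broad one needs 4, which is the kind of gap that made switching to a parallel process worthwhile.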

What Was Affected?

Any application running only a single instance went down when its server went down. Very few applications running multiple instances went down, although some were degraded due to one instance being down.
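The difference between "down" and "degraded" can be made concrete with a small Python sketch. This only illustrates the reasoning above and is not code from Windows Azure. The names `Status`, `app_status`, and the server labels are hypothetical.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"


def app_status(instance_servers: list[str], failed_servers: set[str]) -> Status:
    """Classify an application by how many of its instances survived.

    instance_servers holds one entry per running instance, naming the
    server that hosts it. An application with no instances counts as down.
    """
    surviving = [s for s in instance_servers if s not in failed_servers]
    if not surviving:
        return Status.DOWN
    if len(surviving) < len(instance_servers):
        return Status.DEGRADED
    return Status.HEALTHY


failed = {"server-a"}  # hypothetical server that timed out during the upgrade
print(app_status(["server-a"], failed))              # single instance
print(app_status(["server-a", "server-b"], failed))  # two instances, one lost
print(app_status(["server-b", "server-c"], failed))  # untouched application
```

A single-instance application has no surviving copy once its server fails, while a two-instance application spread across servers keeps serving traffic at reduced capacity.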

In addition, management tasks in the web portal appeared unavailable for many applications because the Fabric Controller was backed up with work during the serialized recovery process.

How Will We Prevent This in the Future?

We have learned a lot from this experience. We are addressing the network issues and we will be refining and tuning our recovery algorithm to ensure that it can handle malfunctions quickly and gracefully.

For continued availability during upgrades, we recommend that application owners deploy their application with multiple instances of each role. We’ll make two the default in our project templates and samples. We will not count the second instance against quota limits, so CTP participants can feel comfortable running two instances of each application role.
