Today a number of applications running in Windows Azure were unreachable for a period of time. The earliest occurrence took place at 1:15pm PDT today (April 2nd). Our team restored affected applications throughout the day, and all applications were accessible by 9:00pm PDT. The storage service was unaffected. We communicated throughout the afternoon on this thread on the Windows Azure forum.
One of the responsibilities of the Fabric Controller is to monitor the health and status of servers and applications in the cloud and automatically restore functionality when applications are unhealthy. To do this, it regularly communicates with servers to request their status. This afternoon, a number of servers stopped responding to the Fabric Controller. The Fabric Controller immediately marked those servers as malfunctioning, stopped routing traffic to them, and started migrating the affected application instances to new servers. A bug in the Fabric Controller prevented it from moving those instances as quickly as it was designed to. Any application with all of its instances on affected servers was unreachable during this period.
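To make the failure mode concrete, here is a minimal sketch of the behavior described above: poll servers, take the non-responding ones out of rotation, and move their instances elsewhere. It is a toy model for illustration only, not the Fabric Controller's actual implementation. The `Cluster` class and the names `poll`, `migrate`, and `unreachable_apps` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Cluster:
    """Toy model of a health-monitoring controller (hypothetical, not Azure's code)."""
    # server name -> True if it answered the most recent status request
    responding: Dict[str, bool]
    # application name -> servers hosting its instances
    placements: Dict[str, List[str]]
    # servers marked as malfunctioning; no traffic is routed to them
    out_of_rotation: Set[str] = field(default_factory=set)

    def poll(self) -> List[str]:
        """Mark every non-responding server as malfunctioning and return the new ones."""
        failed = sorted(
            s for s, ok in self.responding.items()
            if not ok and s not in self.out_of_rotation
        )
        self.out_of_rotation.update(failed)
        return failed

    def unreachable_apps(self) -> List[str]:
        """Applications whose every instance sits on an out-of-rotation server."""
        return sorted(
            app for app, servers in self.placements.items()
            if servers and all(s in self.out_of_rotation for s in servers)
        )

    def migrate(self) -> None:
        """Move instances off malfunctioning servers onto healthy ones."""
        healthy = sorted(
            s for s, ok in self.responding.items()
            if ok and s not in self.out_of_rotation
        )
        if not healthy:
            return  # nowhere to move instances to
        for servers in self.placements.values():
            for i, s in enumerate(servers):
                if s in self.out_of_rotation:
                    servers[i] = healthy[i % len(healthy)]


cluster = Cluster(
    responding={"s1": False, "s2": False, "s3": True},
    placements={"app_a": ["s1", "s2"], "app_b": ["s2", "s3"]},
)
cluster.poll()
print(cluster.unreachable_apps())  # apps with no reachable instance before migration
cluster.migrate()
print(cluster.unreachable_apps())  # after migration completes
```

In this sketch, `app_a` is reported unreachable between `poll()` and `migrate()` because both of its instances are on servers that stopped responding, while `app_b` stays reachable through its instance on `s3`. The longer the migration step takes, the longer an application like `app_a` remains unreachable.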
What Are We Doing in the Future?
In addition to fixing the particular bug mentioned above, we’re continuing our ongoing work to improve our detection and response algorithms in the Fabric Controller, including giving the Fabric Controller a more granular view of the health of applications.
For those who are wondering: although the underlying cause of the March 13th malfunction was very different, the same detection and response improvements we’re making in the Fabric Controller apply to it as well.