Auto-Healing Windows Azure Web Sites


How many times have you been woken up in the middle of the night for an issue that was resolved simply by restarting your web site? Wouldn’t it be nice to detect certain conditions and recover automatically?

With recent updates to Windows Azure Web Sites (WAWS), we have tried to address these questions. There are some new enhancements to the “Always ON” feature, and with these enhancements comes the ability to automatically recycle the worker process hosting your web application. We call this the “Auto Healing” feature, and here is how it works:

You simply define the triggers in the root web.config file of your web site and configure the actions to be performed when these triggers are hit. At a high level, your configuration section will have the following structure:

[Image: structure of the auto-healing configuration section in web.config]

NOTE: Just like “Always ON”, this feature is ONLY available with Standard instances.

Let us break down the available options by scenario.

(A detailed explanation of all supported elements and attributes is at the end of the post.)

Scenario 1 – “Recycling based on Request count”

Consider a scenario where you need to recycle your application automatically after it has served X number of requests in Y amount of time. You know that it just doesn’t scale well after a huge influx of requests in a short amount of time. You want to detect this condition, recycle the worker process automatically, and log an event.

You simply edit the root web.config file for your application with the following sample configuration. (If you have an existing web.config file, please copy the <monitoring> section under the existing <system.webServer> section.)

[Image: sample web.config for recycling based on request count]

The above configuration will recycle the worker process once it has served 1000 requests in 10 minutes. It will also log an event in the eventlog.xml file (found in the Logfiles folder of your web root directory). Having an event logged helps you track occurrences of auto healing and provides important forensic information for troubleshooting or root cause analysis.

When the first request comes in, we start the timeInterval clock and begin counting occurrences. If the count reaches the configured maximum before the timeInterval expires, we take an action. If the time interval expires first, we reset both the timer and the count. The effect of this is that, given the above configuration, something like this could happen:

00:00:00 – First request arrives

00:09:59 – 998 requests are served

00:10:00 – Timer expires and is reset to 0

00:10:01 – 999 requests are served

In this scenario, we did not have 1000 requests occur in either the first or second timeInterval window, so no action is taken.
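
The windowing behaviour described above can be modelled in a few lines of Python. This is only an illustrative sketch, not the actual WAWS implementation: the class name `FixedWindowTrigger` and the helper `simulate` are hypothetical, the trigger is assumed to fire when the count reaches the configured value, and an expired window is assumed to restart at the next request.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FixedWindowTrigger:
    """Illustrative model: a window that opens at the first event and resets on expiry."""
    count: int                          # occurrences needed to fire, e.g. 1000
    interval: float                     # window length in seconds, e.g. 600
    window_start: Optional[float] = None
    seen: int = 0

    def record(self, now: float) -> bool:
        """Register one event at time `now` (seconds); return True if the trigger fires."""
        # Open a window on the first event, or a new one once the old window has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.seen = 0
        self.seen += 1
        if self.seen >= self.count:
            # Assumption: state is cleared after the action is taken.
            self.window_start = None
            self.seen = 0
            return True
        return False


def simulate(timestamps: List[float], count: int = 1000, interval: float = 600.0) -> List[float]:
    """Return the timestamps at which the trigger fired."""
    trigger = FixedWindowTrigger(count=count, interval=interval)
    return [t for t in timestamps if trigger.record(t)]


# Timeline from the post: 998 requests before 00:09:59, then 999 more just after 00:10:00.
first_window = [i * (599.0 / 998) for i in range(998)]
second_window = [600.0 + i * 0.001 for i in range(999)]
no_action = simulate(first_window + second_window)

# By contrast, 1000 requests inside a single window do fire the trigger.
burst = simulate([i * 0.5 for i in range(1000)])

print(no_action)  # no window ever holds 1000 requests
print(burst)      # fires on the 1000th request
```

Under these assumptions, 1997 requests spread across the window boundary trigger nothing, while 1000 requests inside one window trigger on the last of them.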

NOTE: If you have multiple instances of your web site, it will only restart the worker process for the instance that has hit this trigger and not all instances.

Example of an event logged in the eventlog.xml file:

[Image: sample eventlog.xml entry]

Scenario 2 – “Recycling based on slow requests”

Consider a scenario where the performance of your application starts degrading and several pages start taking longer to render. You would like to detect this situation and recycle the worker process automatically.

You simply edit the root web.config file for your application with the following sample configuration. (If you have an existing web.config file, please copy the <monitoring> section under the existing <system.webServer> section.)

[Image: sample web.config for recycling based on slow requests]

The above configuration will recycle the worker process when it detects that 20 requests have taken more than 45 seconds to execute in the last 2 minutes. It is important to note that the slowRequests trigger is evaluated at the end of each request's execution, which makes it equally important to set timeInterval to a higher value than timeTaken.
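
To see why timeInterval should be larger than timeTaken, here is a small Python sketch. It is a simplified model rather than the real trigger: the function `slow_request_trigger_fires` is hypothetical, it assumes slow requests are counted with the same first-event window logic described in Scenario 1, and the load (one client sending back-to-back 50-second requests) is made up for illustration.

```python
from typing import List, Optional, Tuple


def slow_request_trigger_fires(
    completions: List[Tuple[float, float]],
    time_taken: float,
    count: int,
    interval: float,
) -> Optional[float]:
    """Return the completion time at which the trigger would fire, or None.

    completions: (finish_time, duration) pairs in seconds, ordered by finish_time.
    """
    window_start: Optional[float] = None
    slow_seen = 0
    for finish, duration in completions:
        if duration <= time_taken:
            continue  # only requests slower than time_taken are counted
        # The request is evaluated when it finishes, not when it starts.
        if window_start is None or finish - window_start >= interval:
            window_start, slow_seen = finish, 0
        slow_seen += 1
        if slow_seen >= count:
            return finish
    return None


# Hypothetical load: ten sequential requests, each taking 50 s (slower than 45 s).
sequential = [(50.0 * (i + 1), 50.0) for i in range(10)]

# Window shorter than timeTaken: slow completions are at least 50 s apart,
# so a 30 s window never contains two of them.
short_window = slow_request_trigger_fires(sequential, time_taken=45.0, count=2, interval=30.0)

# Window longer than timeTaken: the second slow completion lands inside the window.
long_window = slow_request_trigger_fires(sequential, time_taken=45.0, count=2, interval=120.0)

print(short_window, long_window)
```

In this model the short window never accumulates two slow requests, whereas the longer window fires as soon as the second slow request completes.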

NOTE: If you have multiple instances of your web site, it will only restart the worker process for the instance that has hit this trigger and not all instances.

Example of an event logged in the eventlog.xml file:

[Image: sample eventlog.xml entry]

Scenario 3 – “Logging an event (or recycling) based on HTTP status code(s)”

Consider a scenario where you would like to be notified when your web site starts returning specific HTTP status codes, sub-status codes, or win32 status codes. You could choose to recycle or simply log an event in the eventlog.xml file (found inside the Logfiles folder of your web site's content root).

You simply edit the root web.config file for your application with the following sample configuration:

[Image: sample web.config for logging an event based on HTTP status codes]

The above configuration will log an event in the eventlog.xml file when it detects that 10 requests resulted in an HTTP status code of 500 with a sub-status code of 100 in the last 30 seconds.

NOTE: If you have multiple instances of your web site, it will only log an event for the instance that has hit this trigger and not all instances. Optionally, you can choose to recycle instead of just logging an event. Recycling logs an event by default.

Example of an event logged in the eventlog.xml file:

[Image: sample eventlog.xml entry]

Scenario 4 – “Taking custom actions (or recycling/logging) based on memory limit”

Consider a scenario where you are troubleshooting a memory leak in your web site and would like to perform a custom action, such as generating memory dumps, sending an email notification, or generating memory dumps and then recycling the process.

You simply edit the root web.config file for your application with the following sample configuration:

[Image: sample web.config for running a custom action based on memory limit]

The above configuration will execute a custom action that runs procdump.exe and generates mini memory dumps when it detects that the worker process has reached 800MB of private bytes.

Auto healing will not trigger on certain HTTP error codes that come from http.sys (the kernel driver), where the request never makes it into the worker process pipeline. Some examples of such status codes are: 304, 302, 400 (many 400s but not all), 503 etc.

NOTE: If you have multiple instances of your web site, it will only generate memory dumps for the instance that has hit this trigger and not all instances. Optionally, you can choose to run a custom action that sends an email, for example. Also note that procdump.exe is not available by default in the root of your web site (d:\home); it is something you will have to xcopy deploy with your web site.

Example of an event logged in the eventlog.xml file for an action type of recycle:

[Image: sample eventlog.xml entry for a recycle action]

Finally, if you would like to configure a trigger on a specific page/URL, you can use our FREB module and follow the steps outlined in the following blog post:

http://thenextdoorgeek.com/post/Windows-Azure-Web-Sites-(WAWS)-Collecting-dumps-of-the-worker-process-(w3wpexe)-automatically-whenever-a-request-takes-a-long-time

This approach incurs a 5-10% performance hit and requires you to enable FREB.

NOTE: The above approach is effective only under Standard mode, since we automatically disable FREB after 1 hour in Shared and Free modes.


The following is the list of supported configuration settings and their meanings:

[Images: reference for the supported configuration elements and attributes]