Smart failure anomaly detection

Published on April 11, 2016

Program Manager

We recently added an automatic alert that will tell you if there’s a sudden disruption or degradation in your web app’s performance. If there’s an abnormal rise in the rate of failed requests, we’ll let you know within minutes, so you can investigate while most users are still unaware of the problem.

Best of all, you don’t need to configure anything, provided your app is set up with Visual Studio Application Insights and is sending a certain minimum volume of request telemetry. It works for both .NET and Java web apps, whether hosted in the cloud or on your own servers. It automatically learns the normal pattern of failure rates for your app and raises the alert on an abnormal rise.

Diagnostics drill-in

The email alert doesn’t just warn you of the problem. It carries valuable diagnostic information, highlighting characteristics that are common across the failures, whether that’s the response code, the operation name, the application version, or other properties. It also includes a related exception, trace, and dependency call when these are relevant to the problem. From the links in the email, you can click straight through to the specific failed requests in the Application Insights portal, and from there to the dependency failure, exception, call stack, or other related telemetry.

NRT Proactive Diagnostics drill-in example
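
To make the idea of "characteristics that are common across the failures" concrete, here is a minimal sketch in Python. It is not how Application Insights computes its highlights; the function name `common_failure_properties`, the `min_share` cutoff, and the sample records are all assumptions made up for illustration. It simply counts how often each property value appears among the failed requests and keeps the values shared by most of them.

```python
from collections import Counter


def common_failure_properties(failed_requests, min_share=0.8):
    """Return (property, value, share) for values shared by most failures.

    Hypothetical helper for illustration only; `min_share` is an assumed
    cutoff, not a documented Application Insights setting.
    """
    if not failed_requests:
        return []

    # Count every (property, value) pair seen across the failed requests.
    counts = Counter()
    for request in failed_requests:
        for prop, value in request.items():
            counts[(prop, value)] += 1

    total = len(failed_requests)
    common = [
        (prop, value, count / total)
        for (prop, value), count in counts.items()
        if count / total >= min_share
    ]
    # Most widely shared values first, then alphabetical by property name.
    return sorted(common, key=lambda item: (-item[2], item[0]))


# Made-up sample telemetry: four failed requests.
failed = [
    {"response_code": "500", "operation": "GET /cart", "app_version": "2.1"},
    {"response_code": "500", "operation": "GET /cart", "app_version": "2.0"},
    {"response_code": "500", "operation": "POST /order", "app_version": "2.1"},
    {"response_code": "500", "operation": "GET /cart", "app_version": "2.1"},
]

highlights = common_failure_properties(failed, min_share=0.75)
for prop, value, share in highlights:
    print(f"{prop} = {value}: {share:.0%} of failures")
```

In this sample, every failure has response code 500, and three of the four share the operation `GET /cart` and app version 2.1, so those are the values a reader would want to look at first.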

What's the benefit of these alerts?

There are two great advantages of these alerts: automatic adaptation to the behavior of your app and appropriate diagnostic information.

As you probably know, you have always been able to set alerts on a chosen threshold of any metric. The drawback is that it can be difficult to determine the appropriate threshold for each metric. It takes time to become familiar with the normal behavior of your system, and during this period you learn what abnormal behavior looks like. You gradually find a threshold that enables detection without too many false alarms. Even then, there is no single ideal threshold: the failure rate may vary under load, some requests are more failure-prone than others, and so on. By contrast, the new Proactive Diagnostics alert does that learning for you, and raises the alarm when there’s a rise that is unexpected in the light of other factors.
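
The difference between a fixed threshold and a learned baseline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the detection method used by Application Insights is not described here, so the sketch swaps in a simple "mean plus a few standard deviations" rule. The function names, the 5% threshold, the `sigmas` and `min_points` parameters, and the sample failure rates are all hypothetical.

```python
from statistics import mean, stdev


def static_alert(failure_rate, threshold=0.05):
    """Classic metric alert: fire whenever the rate exceeds a fixed threshold."""
    return failure_rate > threshold


def adaptive_alert(history, current, sigmas=3.0, min_points=5):
    """Fire when `current` is far above what this app's own history suggests.

    Assumed rule for illustration: baseline mean plus `sigmas` standard
    deviations. This is not the actual Application Insights algorithm.
    """
    if len(history) < min_points:
        return False  # not enough data to know what "normal" looks like
    baseline = mean(history)
    spread = max(stdev(history), 1e-6)  # avoid a zero-width band
    return current > baseline + sigmas * spread


# Made-up failure rates for two apps with different normal behavior.
noisy_history = [0.06, 0.07, 0.05, 0.08, 0.06, 0.07]
quiet_history = [0.001, 0.002, 0.001, 0.002, 0.001, 0.002]

# A noisy app at 8% is within its usual range, yet the fixed rule fires.
noisy_static = static_alert(0.08)
noisy_adaptive = adaptive_alert(noisy_history, 0.08)

# A quiet app jumping to 3% is a real change, yet the fixed rule stays silent.
quiet_static = static_alert(0.03)
quiet_adaptive = adaptive_alert(quiet_history, 0.03)

print(noisy_static, noisy_adaptive, quiet_static, quiet_adaptive)
```

The same fixed threshold produces a false alarm for the first app and misses a genuine rise for the second, while the baseline rule adapts to each app's history. A production detector would also need to account for load and other factors, which this sketch ignores.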

Once a detection is made and you are aware of an issue, you still need to have more information when triaging it. What is the scale of the problem and its urgency? How many users are affected? Some of the information might be available in your dashboards, but often you have to perform some analysis on telemetry to get a sufficient view.

Diagnosing the problem can be difficult, because it might be caused by a bug in the code, by configuration, or by storage or other external services (databases, REST services) that the app uses. In the Proactive Diagnostics alert, we collect information about the anomaly to highlight what is likely to be the cause.

The smart failure anomaly detection alert helps you detect service disruption or degradation within minutes, and provides supporting information that simplifies and speeds up diagnosis of the root cause.

Please share your ideas for new or improved features on the Application Insights UserVoice page and submit your questions to the Application Insights Forum.