This blog post describes how you can minimize MTTD (mean time to detect): the time it takes to detect an issue, triage it, assess its impact, and finally diagnose it using Application Insights. Along the way, a new feature, Performance Buckets, will be introduced. Performance Buckets make the detect-triage-diagnose process more efficient.
Application Insights is a service that allows developers to monitor the performance, availability, and usage of their applications. It becomes a vital part of the application lifecycle. Today we will be looking into the Detect-Triage-Diagnose process for scenarios relating to web application performance.
Detect: Poor server response time
The Service Overview blade in Application Insights is designed to give you an overview of your application status. One of the key charts on this blade is the server’s average response time. You can pin your favorite applications to the start page or create a favorites view for metrics you want to watch every day. This will provide quick access to your application telemetry.
As the team running the Application Insights service, we meet every morning to review the health of key services. We have key dashboards to review, and we react to any anomaly we notice. Response time is one of the metrics we track. Your team might have a similar process.
The daily sync-up allows us to notice anomalies. However, to decrease the mean time to detect a performance-related issue, it is recommended to set up an alert on server response time. Application Insights has an alerting feature to ensure you are aware when your average server response time behaves in an undesired way.
Set an alert
Let’s get an alert when the Average Response time is too large.
To be notified by email of unusual values of any metric, add an alert. You can choose either to send the email to the account administrators or to specific email addresses. To add an alert rule, click ‘Alert rules’ –> ‘Add Alert’.
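Conceptually, an alert rule of this kind is just a threshold check over a window of the metric. A minimal sketch in Python of the idea (the threshold and sample values here are illustrative, not Application Insights defaults):

```python
def should_alert(response_times_ms, threshold_ms=1000.0):
    """Fire an alert when the average response time over the
    window exceeds the threshold."""
    if not response_times_ms:
        return False
    average = sum(response_times_ms) / len(response_times_ms)
    return average > threshold_ms

# A window of mostly fast requests stays quiet...
print(should_alert([120, 250, 310]))      # False
# ...but a window dominated by slow requests fires the alert.
print(should_alert([900, 1500, 2200]))    # True
```

The alerting service evaluates a rule like this continuously against incoming telemetry, so you find out about a regression without waiting for the next morning's dashboard review.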
Triage: Understand issue severity
When you receive an alert, you want to know how severe the problem is. Does it affect all users and every page, or just some of them? Does it depend on which browser they’re using, or where they are? Application Insights lets you segment metrics by properties such as URL, browser, and city, so you can see correlations. To see average response times broken down by a property, you can add a chart to your server responses blade.
After choosing the specific property to group by, you need to scroll down to Server and select Server response time.
You should now have the average of server response time by selected property. In this example I chose City.
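Behind this chart is a simple idea: average the response times of all requests sharing each value of the chosen property. A hypothetical sketch (the field names and sample data are invented for illustration):

```python
from collections import defaultdict

def average_by_property(requests, prop):
    """Compute average response time (ms) grouped by a request property."""
    totals = defaultdict(lambda: [0.0, 0])  # property value -> [sum, count]
    for req in requests:
        entry = totals[req[prop]]
        entry[0] += req["duration_ms"]
        entry[1] += 1
    return {value: total / count for value, (total, count) in totals.items()}

requests = [
    {"city": "Redmond", "duration_ms": 320},
    {"city": "Redmond", "duration_ms": 480},
    {"city": "Dublin",  "duration_ms": 95},
]
print(average_by_property(requests, "city"))
# {'Redmond': 400.0, 'Dublin': 95.0}
```

A property whose group average is far above the others is a strong candidate for further investigation.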
After finding the property that affects performance most, you can filter all requests by this property.
One goal of the Triage stage is to find a specific page request that is above the desired server response time in order to determine the root cause. After finding this individual request you are ready to Diagnose.
Previously, finding an individual request meant searching through the list of all individual server requests to find one with an unsatisfactory response time. A linear search is not the most efficient method, and it can be especially tedious when, as is common, thousands of individual requests have been made. Because of this we are introducing a new feature: Performance Buckets.
With Performance Buckets you can see how many server responses fall within a certain window (bucket) of response time. Simply add a new Grid chart as shown above, but group by Request Performance.
This not only gives you a nice overview of the distribution of response times, but also conveniently sorts requests into buckets. With this feature, you have direct access to requests that exceed your desired response time (no need to scroll through the thousands of requests with fast response times).
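The bucketing itself amounts to sorting each request's duration into a fixed set of time windows. A rough Python sketch with illustrative bucket boundaries (the actual boundaries used by the feature may differ):

```python
from collections import Counter

def bucket_for(duration_ms, boundaries=(250, 500, 1000, 3000)):
    """Return the performance bucket label for a response time."""
    lower = 0
    for upper in boundaries:
        if duration_ms < upper:
            return f"{lower}ms-{upper}ms"
        lower = upper
    return f">{boundaries[-1]}ms"

durations = [120, 310, 470, 900, 4200]
counts = Counter(bucket_for(d) for d in durations)
print(counts)
```

Instead of scanning every request, you can jump straight to whichever bucket holds the slow ones.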
You can also filter within a Performance Bucket. After selecting a bucket you will see all requests made within that response time window. If you want to see only the requests with a specific property, select Filters. The picture below demonstrates the process of viewing requests within 250-500ms server response time and adding the additional filter of only requests made from the city of Redmond.
To complete the filtering process, click Update. The picture below shows the list of requests that fall within the 250ms-500ms range and were made from the city of Redmond.
Grouping and filtering by Performance Buckets allows you to understand what the average performance for fast and slow pages is. You can then investigate the different possibilities. Has the number of fast pages decreased while the number of cache misses increased? Has the number of calls to slow pages increased? Or are more customers running slow reporting pages? If you are aiming to fix a server response time problem, Performance Buckets expedite the process of locating requests with slow response times. Now that you have found what to diagnose (the request with an undesired server response time), you are ready to move into the last phase: Diagnose.
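Combined, the bucket selection and the property filter behave like a two-stage filter over the request list. A hypothetical sketch (field names and sample data are invented for illustration):

```python
def in_bucket(req, low_ms, high_ms):
    """True if the request's response time falls in [low_ms, high_ms)."""
    return low_ms <= req["duration_ms"] < high_ms

requests = [
    {"city": "Redmond", "duration_ms": 320},
    {"city": "Redmond", "duration_ms": 700},
    {"city": "Dublin",  "duration_ms": 430},
]

# Requests in the 250ms-500ms bucket that were made from Redmond.
matches = [r for r in requests
           if in_bucket(r, 250, 500) and r["city"] == "Redmond"]
print(matches)  # [{'city': 'Redmond', 'duration_ms': 320}]
```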
Diagnose: Why the response time is slow
Now that you have located the instance where your server response time was slow, you can begin to investigate why. Knowing the exact reason for poor server response time is important so you can determine what to fix. You can use different telemetry to pinpoint the problem area.
Sometimes a new SQL query or HTTP request/service call can affect your average server response time. To identify which query is causing this, dependency monitoring can be used. Dependency monitoring counts calls to external services (databases, REST services, etc.) and logs the success or failure of each call and how long it takes to respond. If you can see that a request took 4.2 seconds to service, and that 4 seconds of it were spent in the warehouse server, then you know where the problem is.
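The essence of dependency monitoring is timing each outbound call and recording its duration and outcome. A minimal hand-rolled sketch of that idea in Python (the Application Insights SDKs collect this automatically; `call_warehouse` is a hypothetical external call):

```python
import time

dependency_log = []

def track_dependency(name, call):
    """Time an external call and record its duration and outcome."""
    start = time.perf_counter()
    try:
        result = call()
        success = True
        return result
    except Exception:
        success = False
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        dependency_log.append(
            {"name": name, "duration_ms": duration_ms, "success": success})

def call_warehouse():
    # Stand-in for a slow external service call.
    time.sleep(0.05)
    return "ok"

track_dependency("warehouse", call_warehouse)
print(dependency_log[0]["name"], dependency_log[0]["success"])
```

With records like these, comparing the request's total duration against the sum of its dependency durations immediately shows whether the time went to your code or to an external service.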
Another possibility for an increase in server response time is an increase in cache misses. You can use log trace events or metrics to determine if you are experiencing a higher number of cache misses than before. Log trace events are particularly valuable both to record key points in a process and to trace problems at an internal interface. They are also helpful when attempting to discover whether the problem is occurring on the back end or the front end.
The role of diagnostics is to identify whether this is a code defect that needs to be fixed, tested, and released ASAP; a resource issue that could be addressed by scaling up; or a problem with an external service.
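A cache-miss counter is straightforward to emit alongside your other telemetry. A sketch of the idea using a simple in-process cache (the names here are illustrative, not an Application Insights API; in practice you would report the counters as custom metrics or trace events):

```python
cache = {}
metrics = {"cache_hits": 0, "cache_misses": 0}

def get_with_tracing(key, compute):
    """Look up key in the cache, counting hits and misses."""
    if key in cache:
        metrics["cache_hits"] += 1
        return cache[key]
    metrics["cache_misses"] += 1      # a spike here explains slow responses
    cache[key] = compute()
    return cache[key]

get_with_tracing("report:42", lambda: "rendered report")  # miss
get_with_tracing("report:42", lambda: "rendered report")  # hit
print(metrics)  # {'cache_hits': 1, 'cache_misses': 1}
```

Charting the miss counter next to server response time makes the correlation (or lack of one) obvious.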
There’s more about diagnosis in Victor Mushkatin’s blog.
Thanks to Performance Buckets, the move from triage to diagnosis is much smoother. The Performance Buckets view provides a convenient overview of the distribution of your server response times. After you detect, triage, and diagnose, you can fix the issue, and you can go into the fixing process with confidence that you are working on the actual problem.