Businesses run software and software runs businesses. Code can make or break your bank or bookstore. If it goes down or performs badly, you stand to lose not just during the outage but forever after, as customers move elsewhere. If users are having problems with your web site, you need to know immediately, and you need to pin down and fix the problem before most customers have noticed.
This isn’t just about crude availability. Yes, it’s very bad if your site goes down, and typical SLAs demand 99.999% availability. But it’s also about responsiveness. One study shows a degradation of 1s would cost Amazon $1.6B, while 0.4s would lose Google 8M queries. Such sites use all kinds of tactics to keep their perceived response times low: data center redundancy, content delivery networks, async rendering, and more.
To keep a metric healthy you’ve got to measure it all the time. The key to this is telemetry: data that is collected about the performance of your app. Analysis tools show you all kinds of charts about how your app is performing and what people are doing with it, and let you drill into specific incidents. And you can set alerts that get you out of bed if your app throws too many exceptions, runs slow, or vanishes altogether.
Monitoring scenarios for typical SCRUD (Search, Create, Read, Update, Delete) applications come in two main types: availability and performance.
Availability affects public Service Level Agreements (SLAs) related to end user impact. “I just paid and your system went down – what have you done with my money?” or “I need to get these tickets, but your system is down.”
- Is it up? If customers can’t access my service, I want to panic immediately.
- Reliability: Is any data lost in the pipeline?
Performance usually affects Operational Level Agreements (OLAs). For example, the data access layer team may guarantee a response in under 1 second for 99% of queries.
- Responsiveness: How long is it before the user can actually use the page (Search|Read operations)?
- Latency: How long does the round trip take, from the user’s action until the data is processed and stored (Create|Update|Delete)?
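To make these targets concrete, here is a minimal sketch of the arithmetic behind them. The function names and the sample durations are hypothetical, chosen only for illustration. An availability target of 99.999% works out to roughly 5.3 minutes of allowed downtime per year, and the "under 1 second for 99% of queries" guarantee becomes a simple fraction check.

```python
def downtime_budget_minutes(availability_pct, period_days=365.0):
    """Minutes of downtime allowed in the period at the given availability target."""
    return (1.0 - availability_pct / 100.0) * period_days * 24 * 60


def meets_latency_target(durations_s, threshold_s=1.0, required_fraction=0.99):
    """True if at least required_fraction of the durations are under threshold_s."""
    if not durations_s:
        return True  # nothing measured, so nothing violated
    within = sum(1 for d in durations_s if d < threshold_s)
    return within / len(durations_s) >= required_fraction


# Hypothetical sample: 99 fast queries and one slow one.
sample = [0.2] * 99 + [3.0]
budget = downtime_budget_minutes(99.999)   # about 5.26 minutes per year
on_target = meets_latency_target(sample)   # exactly 99% are under 1 second
```

In practice the durations would come from your own request or query telemetry, and the period would match whatever window the agreement is written against.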
So how do you use the telemetry? Let’s talk in terms of a pipeline of three processes: Detect, Triage, Diagnose.
Problem detection is an art. It’s difficult to automate entirely, but a typical goal is that 80% of all Live Site Incidents (LSIs) should be detected automatically. The other 20% are the sort that you find yourself by analyzing data, or that come to light through customer feedback. Automatic detection progresses as the product and the organization mature:
- Site availability: A ping test sends requests to your site at regular intervals from points around the globe, measuring response times, the return code, and maybe some content. Three failed tests in a row suggest that something serious is happening.
- Resource availability: A more elaborate web test simulates the main user stories, going through the motions of, for example, choosing an item and almost buying it. These tests check that your back end services are running. But on the other hand, you don’t want to mess with your real data, so they can’t test everything.
- User availability: The response times and failure rates of real user requests can be measured with the help of instrumentation in the server and client applications. This gives a much better indication of the actual user experience. Still, you have to examine the results carefully. If 1% of all HTTP requests fail, that might not sound too bad, until you discover it’s the ‘click here to pay’ button that’s failing.
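The "three failed tests in a row" rule from the site availability level can be sketched as follows. This is an illustration only: `ProbeResult`, `probe_failed`, `first_alert_index`, and the 5 second cutoff for a too-slow response are assumptions made for the example, not the behavior of any particular monitoring product. The probes themselves (the HTTP calls from around the globe) are left out, so the function works on results that have already been collected.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    status_code: int    # HTTP return code observed by the ping test
    duration_s: float   # measured response time in seconds


def probe_failed(result, max_duration_s=5.0):
    """A probe counts as failed on an error status or a too-slow response."""
    return result.status_code >= 400 or result.duration_s > max_duration_s


def first_alert_index(results, threshold=3, max_duration_s=5.0):
    """Index of the probe that completes `threshold` consecutive failures, or None."""
    streak = 0
    for i, result in enumerate(results):
        # Any successful probe resets the streak.
        streak = streak + 1 if probe_failed(result, max_duration_s) else 0
        if streak >= threshold:
            return i
    return None


# Hypothetical history: one isolated failure, then three failures in a row.
history = [
    ProbeResult(200, 0.3),
    ProbeResult(500, 0.2),
    ProbeResult(200, 0.4),
    ProbeResult(503, 0.1),
    ProbeResult(200, 9.0),   # succeeded, but far too slow
    ProbeResult(500, 0.2),
]
alert_at = first_alert_index(history)
```

Requiring consecutive failures keeps a single network blip from waking anyone up, at the cost of detecting a real outage a couple of test intervals later.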
But 80% automation is a goal: it isn’t where you start. You start with manual exploration of your telemetry, becoming familiar with the normal behavior of your system: how its response times vary under load, how frequently exceptions occur, which requests are the slowest or most failure-prone. During this period, you’ll learn what abnormal behavior looks like. When Live Site Incidents occur, you’ll be able to analyze changed metrics and so understand how to automate recognition of the pattern and detect that type of problem in the future.
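One way to start that manual exploration is to summarize request telemetry per operation rather than for the site as a whole. The sketch below assumes each request record is a hypothetical `(operation_name, duration_s, succeeded)` tuple and uses a simple nearest-rank percentile; your telemetry store will have its own schema and query language.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def baseline_by_operation(requests):
    """Summarize count, failure rate and 95th percentile duration per operation."""
    grouped = defaultdict(list)
    for name, duration_s, succeeded in requests:
        grouped[name].append((duration_s, succeeded))
    summary = {}
    for name, rows in grouped.items():
        failures = sum(1 for _, ok in rows if not ok)
        summary[name] = {
            "count": len(rows),
            "failure_rate": failures / len(rows),
            "p95_s": nearest_rank_percentile([d for d, _ in rows], 95),
        }
    return summary


# Hypothetical traffic: 1% of all requests fail, all of them on the payment call.
requests = [("GET /home", 0.2, True)] * 98 + [
    ("POST /pay", 1.5, False),
    ("POST /pay", 1.0, True),
]
baseline = baseline_by_operation(requests)
overall_failure_rate = sum(1 for _, _, ok in requests if not ok) / len(requests)
```

Here the overall failure rate is only 1%, but the per-operation view shows that half of the payment requests are failing, which is exactly the kind of pattern an aggregate number hides.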
LSIs can be anything from a tile being slow to load to a complete outage. The first thing to do is find out how bad things are. If no one can see the home page, that’s a disaster. If some users aren’t seeing “recently viewed items”, that’s a bug. The scale of the problem determines the resources devoted to fixing it, and early triage has huge implications for operational costs. So we need to know how many customers can’t complete all the primary business scenarios. Automation at the user availability level provides the instrumentation required to count the number of users affected. A severity score can be assigned initially by the automation that detects the LSI, and subsequently reviewed by the DevOps team that investigates. Customer impact can be determined by counting the number of unique users associated with failed requests. If your system serves users who log in, you can analyze the logs to find who was affected by an issue and maintain a tight communication loop with them, without escalating the problem to the entire user population. Many users value good communication about issues more than having a service free of bugs.
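As a rough sketch of that triage step, the code below counts the unique users behind failed requests and derives an initial severity from the share of users blocked in a primary scenario. Everything specific here is an assumption for illustration: the `(user_id, scenario, succeeded)` record shape, the scenario names, and the severity thresholds (50% and 5%) would all be set by your own team.

```python
from collections import defaultdict

# Hypothetical list of the scenarios the business cannot do without.
PRIMARY_SCENARIOS = {"view_home", "place_order"}


def affected_users_by_scenario(requests):
    """Map each scenario to the set of users with at least one failed request."""
    affected = defaultdict(set)
    for user_id, scenario, succeeded in requests:
        if not succeeded:
            affected[scenario].add(user_id)
    return dict(affected)


def initial_severity(requests, primary=PRIMARY_SCENARIOS):
    """Return (severity, blocked_users); severity 1 is worst, 4 is least urgent."""
    requests = list(requests)
    all_users = {user_id for user_id, _, _ in requests}
    affected = affected_users_by_scenario(requests)
    blocked = set().union(*(affected.get(s, set()) for s in primary))
    share = len(blocked) / len(all_users) if all_users else 0.0
    if share >= 0.5:        # assumed threshold: most users are blocked
        severity = 1
    elif share >= 0.05:     # assumed threshold: a noticeable minority
        severity = 2
    elif blocked:
        severity = 3
    else:
        severity = 4        # only secondary scenarios (or nothing) failing
    return severity, blocked


# Hypothetical log: ten users, one cannot order, one misses a secondary tile.
log = [(f"u{i}", "view_home", True) for i in range(10)]
log += [("u1", "place_order", False), ("u1", "place_order", False)]
log += [("u2", "recently_viewed", False)]
severity, blocked = initial_severity(log)
```

The returned set of blocked users is also the list of people to contact directly, which supports the tight communication loop without alarming everyone else.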
Now the hunt for the cause begins. Diagnosis isn’t quite the same as bug-fixing. Not all problems are caused by code defects. Some are issues with configuration, or with storage or other resources, and some are problems in other services that you use. And you need to know whose problem it is. The front end developer claims to see orders going onto the queue, and the worker developer says the orders that come through are being processed successfully. But still, the customers can’t place orders, so there must be a problem somewhere. Precise telemetry at the communication boundaries between components is critical, because it identifies who should continue the investigation. The following kinds of telemetry are particularly effective for locating the problem:
- Request monitoring counts requests and any failure responses, and measures response times. If response times rise suddenly when the frequency of requests goes up, you might suspect a memory or other resource problem.
- Capacity monitoring uses performance counters to measure memory, I/O rates, and CPU, providing a direct view of resource usage.
- Dependency monitoring counts calls to external services (databases, REST services, and so on) and logs the success or failure of each call and how long it takes to respond. If you can see that a request to you took 4.2 seconds to service, and that 4s of that was in the warehouse server, then you know where the problem is.
- Log traces and events are particularly valuable, both to record key points in a process and to trace problems at an internal interface. If a user can’t place an order, is this a front end or a back end problem?
- Deployment records help correlate system updates with the sudden onset of problems.
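The dependency example above (a 4.2 second request with 4 seconds spent in the warehouse server) can be sketched as a small attribution function. The record shape `(target, duration_s, succeeded)` and the function name are hypothetical, and the sketch assumes the dependency calls ran one after another; with parallel calls the durations would overlap and could not simply be summed.

```python
def slowest_dependency(request_duration_s, dependency_calls):
    """Return (target, total_s, share_of_request) for the costliest dependency.

    Assumes sequential calls, so per-target durations can be added up.
    Returns None when the request made no dependency calls.
    """
    if request_duration_s <= 0:
        raise ValueError("request duration must be positive")
    totals = {}
    for target, duration_s, _succeeded in dependency_calls:
        totals[target] = totals.get(target, 0.0) + duration_s
    if not totals:
        return None
    target = max(totals, key=totals.get)
    return target, totals[target], totals[target] / request_duration_s


# Hypothetical calls recorded while servicing one 4.2 second request.
calls = [
    ("warehouse", 4.0, True),
    ("sql", 0.05, True),
    ("sql", 0.05, True),
]
culprit = slowest_dependency(4.2, calls)
```

With these numbers the warehouse call accounts for about 95% of the request time, so the investigation moves to whoever owns that service.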
The role of diagnostics is to identify whether this is a code defect that needs to be fixed, tested and released ASAP; or a resource issue that could be addressed by scaling up; or a problem with an external service.
Operating a live, successful service is an exciting and challenging task. It requires process discipline and tooling to eliminate inefficiencies and improve agility. As organizations adopt continuous deployments, A/B testing and other agile practices, there is a clear need for a solution that brings together Application Lifecycle Management and Application Performance Management disciplines into a continuous DevOps process. Microsoft has an offering that helps you integrate APM telemetry from the very beginning of the dev cycle, starting in the IDE (such as Visual Studio) for .NET/Java/Node.JS/Python/Ruby/PHP/ObjectiveC, through into the Azure platform (Azure Web Sites, VM, Mobile and Cloud Services) and to Monitoring (by Application Insights).
Further reading:
An excellent blog by Netflix on monitoring a distributed system and a blog by Brian Harry on monitoring QoS.