In Telemetry Basics and Troubleshooting we introduced the basic principles of monitoring application health by looking at fundamental metrics, information sources, tools, and scripts that the Windows Azure platform provides. We showed how you can use these to troubleshoot a simple solution deployed on Windows Azure (a few compute node instances and a single Windows Azure SQL Database instance). In this post we expand on that entry and cover the application instrumentation aspects of the telemetry system implemented in the Cloud Service Fundamentals in Windows Azure code project. In the detailed wiki entry that accompanies this blog, we show how you can use the CSF instrumentation framework, which integrates with Windows Azure Diagnostics (WAD), to provide a consistent instrumentation experience for your application. The techniques implemented in the CSF application have been proven on large-scale Azure deployments.
The best source of information about your applications is the applications themselves. Good tools and a robust telemetry system make acquiring that information easier, but if you don't instrument your application in the first place you cannot get at it at all. And if you don't instrument consistently across all of your application components, you are unlikely to achieve operational efficiency once you begin scaling in production; troubleshooting becomes far more complex than individuals, or even teams, can tackle in real time. Consistent, application-wide instrumentation, together with a telemetry system that can consume it, is the only way to extract the information you need to keep your application running well, at scale, with relative efficiency and ease.
CSF provides a number of components that you can use to quickly instrument your application and build an effective telemetry system:
- A data access layer that implements retry logic and provides sensible retry policies designed for scale (see the data access sketch after this list).
- A logging framework built on top of NLog.
- Custom configuration for WAD designed to support scaling.
- A data pipeline that collects and moves this information into a queryable telemetry system.
- A sample set of operational telemetry reports you can use to monitor your application.
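
To give a flavor of the first item, here is a minimal sketch of retry-aware data access. It assumes the Transient Fault Handling Application Block ("Topaz"); the retry settings, connection string handling, and query are illustrative placeholders rather than the actual CSF defaults, and the exact namespaces depend on the version of the block you reference.

```csharp
using System;
using System.Data.SqlClient;
using Microsoft.Practices.EnterpriseLibrary.TransientFaultHandling;

public static class ResilientSql
{
    // Retry up to 5 times with exponential back-off between 1 and 30 seconds.
    // These values are illustrative; tune them for your own workload.
    private static readonly RetryPolicy SqlRetryPolicy =
        new RetryPolicy<SqlDatabaseTransientErrorDetectionStrategy>(
            new ExponentialBackoff(5, TimeSpan.FromSeconds(1),
                                   TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(2)));

    public static int GetCustomerCount(string connectionString)
    {
        // ExecuteAction re-runs the delegate whenever a transient SQL Database
        // error (throttling, dropped connection) is detected by the strategy.
        return SqlRetryPolicy.ExecuteAction(() =>
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand("SELECT COUNT(*) FROM Customers", connection))
            {
                connection.Open();
                return (int)command.ExecuteScalar();
            }
        });
    }
}
```

The CSF data access layer wraps this idea into a reusable component, so application code does not have to repeat the policy wiring for every call.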
By adopting these practices and using the components and configuration we have provided, you can help your system scale and gain the insight to target your development effort more precisely and improve your operational efficiency, which ultimately makes your customers happier while consuming fewer resources. This allows you to provide a high-quality user experience and identify upcoming problems before your users do. A corresponding wiki article, "Telemetry: Application Instrumentation", goes into greater depth.
It's very easy to read this and still feel too busy growing your user base and deploying new code features to act on it.
Distrust this feeling. Many, many companies have had a hot product or service that at some point could not scale and suffered one or more extended outages. Users have little loyalty to an unreliable system; they may simply move elsewhere, perhaps to the upstart chasing at your heels and ready to capture your market.
Of course, some of you may already have built your own application instrumentation framework and implemented many of these best practices. For that reason, we have provided the CSF application in its entirety, including all the telemetry components, as source code on the MSDN Code Gallery. Here are some of the key things to remember as you instrument your application:
- Create separate channels for chunky (high-volume, high-latency, granular data) and chatty (low-volume, low-latency, high-value data) telemetry.
- Use standard Windows Azure Diagnostics sources, such as performance counters and traces, for chatty information.
- Log all API calls to external services with context, destination, method, timing information (latency), and result (success/failure/retries). Use the chunky logging channel to avoid overwhelming the telemetry system with instrumentation information. (A combined sketch of call timing and logging follows this list.)
- Log the full exception details, but do not use exception.ToString().
- Data written to table storage (performance counters, event logs, trace events) is written into a temporal partition that is 60 seconds wide. Attempting to write too much data (too many point sources, too low a collection interval) can overwhelm this partition. Ensure that error spikes do not trigger a high-volume insert attempt into table storage, as this might trigger a throttling event.
- Collect database and other service response times using the stopwatch approach, as shown in the timing sketch after this list.
- Use common logging libraries, such as the Enterprise Application Framework Library, log4net, or NLog, to implement bulk logging to local files. Use a custom data source in the diagnostic monitor configuration to copy this information periodically to blob storage (see the WAD configuration sketch after this list).
- Do not publish live site data and telemetry into the same storage account. Use a dedicated storage account for diagnostics.
- Choose an appropriate collection interval (5 to 15 minutes) to reduce the amount of data that must be transferred and analyzed; for example, "PT5M".
- Ensure that logging configuration can be modified at run-time without forcing instance resets. Also verify that the configuration is sufficiently granular to enable logging for specific aspects of the system, such as database, cache, or other services.
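
To make the timing and exception-logging items above concrete, here is a minimal sketch that wraps an external call in a Stopwatch and logs the outcome through NLog. The class, logger names, message format, and CallExternalService helper are hypothetical illustrations, not the CSF framework's actual API, and it assumes a recent NLog version where Logger.Error accepts the exception as the first argument.

```csharp
using System;
using System.Diagnostics;
using NLog;

public class PaymentGatewayClient
{
    private static readonly Logger Logger = LogManager.GetCurrentClassLogger();

    public bool ChargeCustomer(string customerId, decimal amount)
    {
        // Time every call to the external service so latency can be analyzed later.
        var stopwatch = Stopwatch.StartNew();
        try
        {
            bool succeeded = CallExternalService(customerId, amount); // hypothetical call
            stopwatch.Stop();

            // Log context, destination, timing, and result on the chunky channel.
            Logger.Info("Call=ChargeCustomer Destination=PaymentGateway CustomerId={0} DurationMs={1} Result={2}",
                        customerId, stopwatch.ElapsedMilliseconds, succeeded ? "Success" : "Failure");
            return succeeded;
        }
        catch (Exception ex)
        {
            stopwatch.Stop();

            // Pass the exception object itself so the full details (type, message,
            // stack trace, inner exceptions) are captured by the logging target,
            // rather than flattening them with exception.ToString().
            Logger.Error(ex, "Call=ChargeCustomer Destination=PaymentGateway CustomerId={0} DurationMs={1} Result=Exception",
                         customerId, stopwatch.ElapsedMilliseconds);
            throw;
        }
    }

    private bool CallExternalService(string customerId, decimal amount)
    {
        // Placeholder for the real external service call.
        return true;
    }
}
```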
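
The collection interval, custom data source, and dedicated storage account items can all be expressed in a diagnostic monitor configuration along these lines. This sketch assumes the classic WAD 1.0 API called from a role's OnStart; the container name, local log path, and quota are assumptions for illustration, and the CSF project ships its own, more complete configuration.

```csharp
using System;
using Microsoft.WindowsAzure.Diagnostics;

public static class DiagnosticsSetup
{
    public static void Configure()
    {
        var config = DiagnosticMonitor.GetDefaultInitialConfiguration();

        // Chatty channel: a standard performance counter, transferred every 5 minutes ("PT5M").
        config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
        {
            CounterSpecifier = @"\Processor(_Total)\% Processor Time",
            SampleRate = TimeSpan.FromSeconds(30)
        });
        config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

        // Chunky channel: NLog writes bulk log files locally; WAD periodically
        // copies that directory to a blob container instead of table storage.
        config.Directories.DataSources.Add(new DirectoryConfiguration
        {
            Container = "application-logs",   // hypothetical container name
            Path = @"C:\Resources\Logs",      // hypothetical local log path
            DirectoryQuotaInMB = 512
        });
        config.Directories.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

        // Point the diagnostics connection string at a storage account dedicated
        // to telemetry, separate from the account that holds live site data.
        DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);
    }
}
```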
Thank you for taking the time to read this blog post. To learn more about how to implement the CSF instrumentation components in your application, see the corresponding wiki article, "Telemetry: Application Instrumentation", which goes into greater depth. In the next article in this series we will explore the data pipeline that we have implemented to provide a comprehensive view of the overall CSF application and its performance characteristics, including how we capture this information in a relational operational store and provide you with an overall view across the Azure platform.