Cloud Service Fundamentals: Telemetry basics and troubleshooting
Editor’s Note: This post comes from Silvano Coriani from the Azure CAT Team.
In the Building Blocks of Great Cloud Applications blog post, we introduced the Azure CAT team's series of blog posts and tech articles describing the Cloud Service Fundamentals in Windows Azure code project posted on MSDN Code Gallery. The first component we are addressing in this series is Telemetry. It was one of the first reusable components we built while working on Windows Azure customer projects of all sizes. Indeed, someone once said: “Trying to manage a complex cloud solution without a proper telemetry infrastructure in place is like trying to walk across a busy highway with blind eyes and deaf ears”. You have little to no idea where issues can come from, and no chance of making a smart move without getting into trouble. With an adequate level of monitoring and diagnostic information on the status of your application components over time, by contrast, you will be able to make educated decisions on things like cost and efficiency analysis, capacity planning, and operational excellence. This blog post also has a corresponding wiki article that goes deeper into Telemetry basics and troubleshooting.
Managing a system at any scale in the cloud in fact requires a different approach to performance monitoring and application health in order to support operational efforts. Using existing tools and techniques can be challenging due to the highly abstracted nature of a cloud platform. In addition, if your solution is required to scale, the amount of information generated by hundreds of web/worker roles, database shards, and additional services creates the risk of being flooded by data that is of low statistical significance, uncorrelated, and delayed. Providing an end-to-end experience around operational insights will help customers meet their SLAs with their users while reducing management costs and enabling more informed decisions on present and future resource consumption and deployment. This can only be achieved by considering all the different layers involved, from the infrastructure (e.g. resource usage like CPU, I/O, memory, etc.) to the application itself (database response times, exceptions, etc.) up to business activities and KPIs.
The ability to process, correlate, and consume this information will benefit both operations teams (maintaining service health, analyzing resource consumption, managing support calls) and development teams (troubleshooting, planning for new releases, etc.).
The telemetry solution itself has to be designed to scale, executing data acquisition and transformation activities across multiple role instances and storing data in multiple raw data SQL Azure repositories. To facilitate the reporting and analytics components, though, the aggregated data will reside in a centralized database that serves as the main data source for both pre-defined and custom reports and dashboards, as shown in the simplified architectural diagram that accompanies this post.
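Separately from the diagram, the acquire-then-aggregate flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's actual implementation: the `Sample` shape, the `bucket_start` and `aggregate` helpers, the five-minute window, and the use of in-memory lists to stand in for the raw data repositories are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sample:
    """One raw telemetry reading (hypothetical shape, for illustration only)."""
    instance: str        # role instance that produced the reading
    metric: str          # e.g. "cpu_percent"
    timestamp: datetime  # naive UTC timestamp of the reading
    value: float


def bucket_start(ts: datetime, width: timedelta) -> datetime:
    """Round a timestamp down to the start of its aggregation window."""
    epoch = datetime(1970, 1, 1)
    return epoch + ((ts - epoch) // width) * width


def aggregate(raw_stores, width=timedelta(minutes=5)):
    """Merge samples from every raw store into per-metric, per-window summaries.

    Each element of raw_stores stands in for one raw data repository;
    the returned dict stands in for the centralized reporting database.
    """
    groups = defaultdict(list)
    for store in raw_stores:
        for sample in store:
            key = (sample.metric, bucket_start(sample.timestamp, width))
            groups[key].append(sample.value)
    return {
        key: {"count": len(vals), "avg": sum(vals) / len(vals), "max": max(vals)}
        for key, vals in groups.items()
    }


if __name__ == "__main__":
    t0 = datetime(2013, 1, 1, 12, 0)
    store_a = [
        Sample("web_0", "cpu_percent", t0 + timedelta(minutes=1), 40.0),
        Sample("web_0", "cpu_percent", t0 + timedelta(minutes=6), 70.0),
    ]
    store_b = [Sample("web_1", "cpu_percent", t0 + timedelta(minutes=2), 60.0)]
    for key, summary in sorted(aggregate([store_a, store_b]).items()):
        print(key, summary)
```

Reports and dashboards would then read only the compact summaries, while the high-volume raw samples stay spread across the individual repositories.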
Because the topic is quite large, we decided to break it down into four blog posts and wiki articles, effectively creating a mini-series:
- Telemetry basics and troubleshooting
- Application instrumentation
- Data acquisition pipeline
- Reporting and analysis
The idea for this first article is to introduce the basic principles of a telemetry solution, starting with the definition of basic metrics and key indicators of application health. We also present in detail the various information sources that can feed an automated telemetry system, or that can be used manually in troubleshooting sessions when the application deployment is not very complex.
Features like Windows Azure Diagnostics (WAD), when appropriately configured, are a great starting point for collecting and aggregating most of this critical information. Unfortunately, some of these sources, Azure SQL Database for example, are currently not integrated with WAD and require a slightly different approach and different APIs to extract the information. Azure Storage Analytics is another good example of a service that requires a specific effort to collect and consolidate metrics.
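One common way to cope with sources that each expose data differently is to put a small adapter in front of each one, so that everything downstream sees a single record shape. The sketch below illustrates that idea under loud assumptions: the input dictionaries and their field names (`counter`, `time`, `sampled_at`, `cpu`) are invented stand-ins, not the real WAD, Azure SQL Database, or Storage Analytics schemas, and `Metric`, `ADAPTERS`, and `normalize` are hypothetical names.

```python
from datetime import datetime
from typing import Callable, Dict, List, NamedTuple


class Metric(NamedTuple):
    """Common record shape shared by every source (hypothetical)."""
    source: str
    name: str
    timestamp: datetime
    value: float


def from_diagnostics(row: dict) -> Metric:
    # Assumed input: {"counter": str, "time": ISO-8601 str, "value": number}
    return Metric("diagnostics", row["counter"],
                  datetime.fromisoformat(row["time"]), float(row["value"]))


def from_database(row: dict) -> Metric:
    # Assumed input: {"sampled_at": datetime, "cpu": number}
    return Metric("database", "cpu_percent", row["sampled_at"], float(row["cpu"]))


# One adapter per source; supporting a new source means adding one entry.
ADAPTERS: Dict[str, Callable[[dict], Metric]] = {
    "diagnostics": from_diagnostics,
    "database": from_database,
}


def normalize(source: str, rows: List[dict]) -> List[Metric]:
    """Convert source-specific rows into the common Metric shape."""
    adapter = ADAPTERS[source]
    return [adapter(row) for row in rows]


if __name__ == "__main__":
    rows = [{"counter": "cpu_percent", "time": "2013-01-01T12:00:00", "value": "42.5"}]
    print(normalize("diagnostics", rows))
    print(normalize("database", [{"sampled_at": datetime(2013, 1, 1, 12, 5), "cpu": 17}]))
```

Keeping the source-specific parsing inside the adapters means the correlation and reporting logic never has to know where a given metric came from.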
To read more about this topic, see the Telemetry Basics and Troubleshooting wiki article, where we focus on the analytical approach that can be used to correlate all these different data sources into a single view that describes the end-to-end health of the solution. In addition, to help in this journey, we present a number of tools (Microsoft and third-party) and scripts that can be used in practice during troubleshooting sessions.
That article will be a cornerstone for the next set of articles, which we will introduce in a future post. You will find the entire series at the Cloud Service Fundamentals TechNet Wiki landing page.