The data acquisition pipeline lies at the core of our Telemetry solution, which is part of the Cloud Service Fundamentals in Windows Azure. This pipeline is the subject of the third article in our wiki series. Its role in our solution is to extract information from various repositories and aggregate it into a single relational database. That database can then be used to correlate and analyze long-term trends on captured metrics, drill down into specific issues and incidents, and feed both pre-defined and ad-hoc reports and dashboards (but wait! That will be the topic of our next article). Questions might include, “What was going on in my application components when customers reported that performance degradation?” or, “What changed in my data tier when the application generated timeout errors between 12AM and 1PM UTC?” Answers to questions like these are exactly the kind of insights that we want to enable with our end-to-end telemetry solution.
Figure 1 – The data acquisition pipeline highlighted in the overall Telemetry solution architecture
From an architectural perspective, we can break down the design of our data acquisition pipeline into three main areas:
- A configurable scheduler engine that executes import and transformation tasks.
- An extensible set of tasks that query various information sources, apply some transformation logic, and push the results into a centralized repository (a minimal task contract is sketched after this list).
- A Windows Azure SQL Database instance that stores this data in a common relational schema, which can be used to execute analytical queries and extract meaningful information.
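To make the second of these areas a bit more concrete, each import or transformation task can be thought of as a small unit of work that the scheduler invokes on its configured interval. The following C# sketch is purely illustrative; the interface and member names (IImportTask and so on) are hypothetical and are not the actual types shipped in the Cloud Service Fundamentals code package.

```csharp
using System;

// Hypothetical contract for an import/transformation task driven by the scheduler.
// The shape and names are illustrative only.
public interface IImportTask
{
    // Logical name used to reference the task in the scheduler configuration file.
    string Name { get; }

    // Pull data from the source for the given time window, transform it,
    // and persist it into the central relational database.
    void Execute(DateTime windowStartUtc, DateTime windowEndUtc);
}
```

Keeping everything source-specific behind a contract like this is what makes the task set extensible: the scheduler only needs a task’s name, its execution frequency, and its connection strings.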
While implementing the scheduler engine, we decided to adopt a “pull” mechanism that queries the various data sources on a time-interval basis. There are a number of reasons to prefer this approach over the more complex “push” and “streaming” approaches. The main one is that the data is already stored in intermediate repositories such as Windows Azure Storage (for information generated by Windows Azure Diagnostics) or Windows Azure SQL Database internal data structures (for Dynamic Management Views). We did not have strict requirements for real-time, continuous analysis, and we also wanted to keep the implementation as simple as possible.

In the current release, our scheduler is implemented as a Windows Azure worker role that reads its configuration from an XML file hosted in a Windows Azure Storage blob container. This file defines the number and types of scheduled jobs to execute and their execution frequencies, plus all the configuration options and connection strings to the various data sources and destinations. It is possible to change the configuration file at any time, adding or removing tasks or changing a specific option, and the worker role will pick up the updated file and start executing accordingly. We ran a number of production deployments of this scheduler component, working with customers and partners to monitor their solutions, and found that a “Small” VM size for this worker role was enough to deal with hundreds of target compute instances and databases.
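As a rough sketch of this “pull” scheduler, the worker role below re-downloads the configuration blob on every cycle, reads the job definitions, and would then dispatch the corresponding tasks. The storage setting, container, blob, and XML element names are assumptions made for this example, and per-task scheduling bookkeeping and error handling are omitted.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;
using System.Xml.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.Storage;

public class SchedulerWorkerRole : RoleEntryPoint
{
    public override void Run()
    {
        // The setting name, container, and blob name below are assumptions.
        var account = CloudStorageAccount.Parse(
            RoleEnvironment.GetConfigurationSettingValue("TelemetryStorage"));
        var configBlob = account.CreateCloudBlobClient()
            .GetContainerReference("telemetry-config")
            .GetBlockBlobReference("scheduler-config.xml");

        while (true)
        {
            // Re-read the configuration on every cycle so that changes to the blob
            // are picked up without redeploying the worker role.
            XDocument config;
            using (var stream = new MemoryStream())
            {
                configBlob.DownloadToStream(stream);
                stream.Position = 0;
                config = XDocument.Load(stream);
            }

            foreach (var job in config.Descendants("Job"))
            {
                string taskType = (string)job.Attribute("type");
                int frequencyInMinutes = (int)job.Attribute("frequencyInMinutes");

                // Here the real scheduler would check when the task last ran and,
                // if its interval has elapsed, instantiate and execute it.
                Trace.TraceInformation("Job {0} scheduled every {1} minutes.",
                    taskType, frequencyInMinutes);
            }

            Thread.Sleep(TimeSpan.FromMinutes(1));
        }
    }
}
```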
We also implemented a number of import tasks that connect to two main families of diagnostic sources (where each family’s data lives before import is summarized after this list):
- Windows Azure Diagnostics (WAD) generated data
- Windows Azure SQL Database internal state exposed through DMVs
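For reference, here is where each family of data lives before an import task picks it up. The lookup class itself is just an illustration, but the table and container names are the standard ones used by Windows Azure Diagnostics.

```csharp
using System.Collections.Generic;

public static class DiagnosticSources
{
    // Standard Windows Azure Diagnostics tables read by the WAD import tasks.
    public static readonly IDictionary<string, string> WadTables =
        new Dictionary<string, string>
        {
            { "Performance counters",                      "WADPerformanceCountersTable" },
            { "Windows Event Logs",                        "WADWindowsEventLogsTable" },
            { "Trace Logs",                                "WADLogsTable" },
            { "File-based log references (e.g. IIS logs)", "WADDirectoriesTable" }
        };

    // The IIS log files themselves are copied by WAD into this blob container.
    public const string WadIisLogContainer = "wad-iis-logfiles";

    // The second family is not persisted in Windows Azure Storage at all: it is read
    // directly from DMVs (e.g. sys.dm_exec_query_stats) on each target database.
}
```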
Because WAD stores most of its information in Windows Azure Storage tables, many import tasks are very similar. They query the source tables using Windows Azure Storage Client Library v2.0 contexts for a given time window (start date to end date), apply some immediate filters (e.g. critical or error events), re-shape or transform data rows on the fly, and bulk copy the new information into the destination database. This category includes information such as Windows Azure compute node performance counters, Windows Event Logs, and Trace Logs. A more complex implementation involves the IIS logs that WAD collects across all web roles, which are stored in blob containers organized with a specific structure. For these, we first have to query a table that contains the references to all log files generated in a particular time window, and then download those IIS log files to the worker role’s local storage to analyze them and extract meaningful information, such as web page and API response times.
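As an example of the table-based pattern, a minimal sketch of a Trace Log import task might look like the following. It queries WADLogsTable for one time window (WAD prefixes the PartitionKey with “0” followed by the event tick count, so a time window becomes a simple key-range scan), keeps only the more severe events, and bulk copies the rows into a staging table. The destination table name and column layout are assumptions, and this is a simplified sketch rather than the actual Cloud Service Fundamentals implementation.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public static class TraceLogImportTask
{
    public static void Run(CloudStorageAccount wadAccount, string opsStatsDbConnectionString,
                           DateTime windowStartUtc, DateTime windowEndUtc)
    {
        var table = wadAccount.CreateCloudTableClient().GetTableReference("WADLogsTable");

        // WAD partition keys are "0" + the event tick count, which keeps rows in time order.
        string fromKey = "0" + windowStartUtc.Ticks.ToString("d19");
        string toKey = "0" + windowEndUtc.Ticks.ToString("d19");

        var query = new TableQuery<DynamicTableEntity>().Where(
            TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, fromKey),
                TableOperators.And,
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, toKey)));

        // Re-shape the entities into a DataTable that mirrors a hypothetical staging
        // table in the central database (the same column order is assumed).
        var rows = new DataTable();
        rows.Columns.Add("EventDateTime", typeof(DateTime));
        rows.Columns.Add("RoleInstance", typeof(string));
        rows.Columns.Add("Level", typeof(int));
        rows.Columns.Add("Message", typeof(string));

        foreach (DynamicTableEntity entity in table.ExecuteQuery(query))
        {
            int level = entity.Properties["Level"].Int32Value ?? 0;
            if (level > 3) continue; // skip informational and verbose events

            rows.Rows.Add(
                entity.Timestamp.UtcDateTime,
                entity.Properties["RoleInstance"].StringValue,
                level,
                entity.Properties["Message"].StringValue);
        }

        using (var connection = new SqlConnection(opsStatsDbConnectionString))
        using (var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "dbo.TraceLogs" })
        {
            connection.Open();
            bulkCopy.WriteToServer(rows);
        }
    }
}
```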
For SQL Database instances, we created a specific import task that queries publicly available DMVs across all target databases (using a parallel fan-out library) and extracts a variety of useful information, from query and request statistics to connection errors, missing indexes, and database size.
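A simplified version of that task is sketched below. A plain Parallel.ForEach stands in for the fan-out library, the list of target connection strings is supplied by the caller, and writing the results into the central database is omitted; the DMV query itself uses only views that are available in Windows Azure SQL Database.

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

public static class QueryStatsImportTask
{
    // Top resource-consuming statements, per target database.
    private const string QueryStatsSql = @"
        SELECT TOP (50)
               qs.execution_count,
               qs.total_worker_time,
               qs.total_elapsed_time,
               qs.total_logical_reads,
               SUBSTRING(st.text, 1, 4000) AS query_text
        FROM sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
        ORDER BY qs.total_worker_time DESC;";

    public static void Run(IEnumerable<string> targetDatabaseConnectionStrings)
    {
        // Fan out across the target databases in parallel.
        Parallel.ForEach(targetDatabaseConnectionStrings, connectionString =>
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(QueryStatsSql, connection))
            {
                connection.Open();
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Shape each row and add it to a batch destined for the
                        // central database (persistence omitted in this sketch).
                    }
                }
            }
        });
    }
}
```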
As a future improvement, we are considering adding information from Windows Azure Storage Analytics and Caching diagnostics.
All of this diagnostic information is then stored in a single relational schema (hosted in SQL Database as well) that uses a common, time-based key definition and is optimized for time-interval queries. The simple reason for this is that you will most likely use it to investigate a specific incident or to understand trends over time.
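In practice, this means that most queries against the central schema are parameterized by a time range. As a hedged illustration (the table and column names below are the same hypothetical ones used in the earlier sketches, not the actual OpsStatsDB schema), a consumer might pull the error-level trace events for an incident window like this:

```csharp
using System;
using System.Data.SqlClient;

public static class OpsStatsQueries
{
    // Lists error-level trace events captured during a specific incident window.
    public static void PrintErrorEvents(string opsStatsDbConnectionString,
                                        DateTime windowStartUtc, DateTime windowEndUtc)
    {
        const string sql = @"
            SELECT EventDateTime, RoleInstance, Message
            FROM dbo.TraceLogs
            WHERE EventDateTime >= @start AND EventDateTime < @end
              AND Level <= 2              -- 1 = critical, 2 = error
            ORDER BY EventDateTime;";

        using (var connection = new SqlConnection(opsStatsDbConnectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@start", windowStartUtc);
            command.Parameters.AddWithValue("@end", windowEndUtc);

            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0:u} [{1}] {2}",
                        reader.GetDateTime(0), reader.GetString(1), reader.GetString(2));
                }
            }
        }
    }
}
```

Because the key is time-based, a query like this touches only the slice of data that falls inside the requested window.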
As I mentioned before, in the next article we’re going to address how to query and extract interesting information from our OpsStatsDB, with a focus on the Table Valued Functions (TVFs) layer. We created these TVFs to simplify the implementation of ad-hoc reports using Excel. We’ll also look at the set of Windows Azure SQL Reporting Services reports and dashboards that are available together with the Cloud Service Fundamentals code package.
You will find the entire series at the Cloud Service Fundamentals TechNet Wiki landing page.