Root-Cause Analysis with In-Query Machine Learning in Application Insights Analytics

Posted on May 15, 2017

Principal Product Manager, Azure

There are now several language constructs in Application Insights Analytics that implement machine learning algorithms. Analytics is the powerful query language that lets you analyze usage and performance telemetry from your web app.

These operators let you apply ML directly within the query flow, without a detailed knowledge of the underlying techniques. In this article, we’ll see how you can perform efficient root-cause analysis with their help.

We have released two families of ML constructs:

  • Intelligent insights into query results
  • Time series analysis – regression and more

In this blog, we will tour them by example. Whether you are looking for ways to better work with your logs, or just an ML fan, I am sure you will enjoy reading about these new ML capabilities.

Tour 1: Intelligent Insights into Query Results

Focus on the essence of your query result with autocluster.

A basic step in root-cause analysis, and in other log analytics flows, is to understand a query's results and use this understanding to build the next query. When there are many rows and columns, this is far from trivial.

Wouldn’t it be nice to have a quick summary of query results, instead of running additional queries on each column just to learn that a natural grouping of results comes from the values in column #23?

Autocluster is your query buddy, which will automatically outline the results of your query.

  • What does autocluster do? Summarizes the result of a query into a small number of natural clusters. Each such cluster is a set of input rows that are similar to each other. Clusters may overlap.
  • What is autocluster used for? Cases where a small summary of query results is needed for quick understanding of data.

Example: the result of query | evaluate autocluster(), where query returns about 4,000 exceptions:

[Image: autocluster output over ~4,000 exceptions]

  • How does autocluster work? Advanced unsupervised learning finds clusters that are maximal in size, but minimal in the number of wildcards.
  • What is great about autocluster? You do not need to specify anything about the grouping! It is learned automatically, saving you a round of statistical querying to figure it out. In the example above, autocluster learned that the data has 3 dominant clusters; lines 4 and 5 cover all rows and almost all rows, respectively.
  • Warning: autocluster is addictive :)
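
For reference, here is a minimal sketch of a query like the one above, written against the Application Insights exceptions table; the time filter and projected columns are illustrative assumptions, not the exact query behind the screenshot:

    // Summarize the last day's exceptions into a few natural clusters.
    exceptions
    | where timestamp > ago(1d)
    | project type, method, outerMessage, cloud_RoleName
    | evaluate autocluster()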

Diagnose an incident with diffpatterns

Another core building block for root-cause analysis is to juxtapose two behaviors, and discover the differences between them. You can see, for instance, which patterns characterize your failed requests vs. successful requests, slow transactions vs. the rest, or validate a hypothesis regarding long response times, and so on.

diffpatterns gives you the power to diagnose by comparison, by surfacing machine-learned guesses on root causes, described in terms of data patterns.

  • What does diffpatterns do? Compares two datasets with the same schema - e.g. incident-time logs vs. normal-time logs.
  • What is diffpatterns used for? Cases where you want to diagnose or characterize by comparison - e.g. which patterns are suspiciously represented in incident-time logs.

For example: You investigate an incident and write a query that has a binary column 'isIncident', indicating whether a row belongs to the incident.

query | evaluate diffpatterns(split="isIncident") returns patterns common to both sets, and their percentage representation in each set:

[Image: diffpatterns output]

  • How does diffpatterns work? Unsupervised learning finds frequent patterns that are common to the two row sets, and then computes the differences in their respective occurrences.
  • What’s great about diffpatterns? It reduces the dependency of the RCA process on intuition or highly specific knowledge, by letting you quickly advance or dismiss investigation paths.
  • In one of the examples below, titled ‘Investigating Application Failures’ you can see how to combine diffpatterns and autocluster.
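
As a sketch, the investigation query might look like the following; the requests table, the incident window, and the projected columns are illustrative assumptions, and the split= form follows the syntax shown in this post:

    // Label each request as incident-time or not, then compare the two sets.
    requests
    | where timestamp > ago(1d)
    | extend isIncident = iff(timestamp between (datetime(2017-05-15 10:00:00) .. 15m), "true", "false")
    | project name, resultCode, success, cloud_RoleInstance, isIncident
    | evaluate diffpatterns(split="isIncident")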

Efficiently mine all patterns of your query results with basket

In the autocluster example above, we wanted to group a set of exceptions into a small number of high-level, distinct clusters that we can review quickly. Sometimes, however, you want all the patterns appearing in a query result (above a predefined size threshold), even if there are many. basket can help here.

  • What does basket do? Efficiently mines all frequent combinations of values in a query result.
  • What is basket used for? Cases where you want all patterns, even if many patterns exist.

Example: query | evaluate basket(), where query is the same as in the autocluster example, returns:

[Image: basket output]

  • How does basket work? It is based on the Apriori algorithm, one of the pearls of Machine Learning, which offers an efficient alternative to an exhaustive search over all frequent patterns of categorical features.
  • What is great about basket? Apart from efficient computation, it allows you to retrieve all frequent patterns without having to limit their size or state it explicitly in the query!
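
A minimal sketch, reusing the same kind of exception query as in the autocluster example (table and column names are illustrative):

    // Mine all frequent combinations of values across the selected columns.
    exceptions
    | where timestamp > ago(1d)
    | project type, method, outerMessage, cloud_RoleName
    | evaluate basket()    // a minimum frequency threshold can be passed, e.g. basket(0.01)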

Tour 2: Time-Series Analysis – Regression and more

Monitoring computing resources over time is an essential angle for diagnosis of problems. Systems produce a large number of time series and mining them efficiently will help you find the needle in the haystack. Exactly for this, we introduce time series as a new data type in Analytics, along with a set of language constructs, which enable you to do advanced analytics from the time perspective.

make-series is the basic operator that transforms tabular data into a new data type: an ordered series of numeric values. It is typically used for time series, and you can define the series axis and values with the full flexibility of the rest of the language.
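
For example, here is a sketch that turns raw request telemetry into one hourly count series per operation name (the requests table and its columns follow the Application Insights schema; the range and step are illustrative):

    // Build an hourly request-count series over the last week, one per operation.
    requests
    | where timestamp > ago(7d)
    | make-series requestCount = count() default = 0
        on timestamp from ago(7d) to now() step 1h
        by name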

Then, you can use shape detectors to expedite your investigations, again within the query flow, operating over a large number of series at once. Let’s tour them by example.

Search for anomalous performance counters with series_fit_line

Healthy systems are steady, predictable, and exhibit a regular pattern. When things go wrong, they appear as anomalies in the data. One of the ways to detect anomalous behavior is to learn the normal behavior from the data, and surface values that do not conform to it. series_fit_line will help you do that, particularly when you have many time series.

  • What does series_fit_line do? Finds the best straight line through your data, and tells how well it fits.
  • What is series_fit_line used for? Cases where you want to compute the trend of a series.

Example: You receive reports of service slowdowns in one of your deployments and would like to look at memory performance counters. However, even 10 series are hard to make sense of manually, and real systems have thousands. Here is what a tiny fraction of your data looks like. Where does root-cause analysis start?!

[Image: a set of memory performance counter time series]

You’d like to start with counters with anomalous points. series_fit_line will help you determine if a given series has anomalous values, by comparing the following two types of behaviors.

  1. Cases where the differences between the values produced by series_fit_line and the original values are similar to each other.
  2. Cases where there is a spike above the average difference, as illustrated below. Now you can set thresholds and retrieve counters with anomalous points, as a starting point for investigation.

[Image: a series with a spike above the average difference from the fitted line]

  • How does series_fit_line work? Linear regression is computed on the series values, which means that it finds the line of best fit to the data (together with a measure of the goodness of the fit).
  • What’s great about series_fit_line? It is useful in a variety of scenarios beyond anomaly detection: predicting future values assuming a trend, visualization, ranking series by their slopes, and more.
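
Putting the pieces together, a sketch of this kind of investigation might rank counter series by how poorly a straight line fits them; the performanceCounters table, the counter name, and the column names are assumptions based on a typical Application Insights schema:

    // Fit a line to each counter series and surface the worst-fitting ones.
    performanceCounters
    | where timestamp > ago(1d) and name == "Available Bytes"   // counter name column may differ in your schema
    | make-series avgValue = avg(value) default = 0
        on timestamp from ago(1d) to now() step 10m
        by cloud_RoleInstance
    | extend (rsquare, slope, variance, rvariance, interception, line_fit) = series_fit_line(avgValue)
    | top 5 by rsquare asc   // a low R-squared means the line explains the series poorly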

Investigate a memory leak with series_fit_2lines

Not all bad behaviors can be described as a deviation from a line. For example, a memory leak exhibits values that slowly increase over time, without any of them being a spike. series_fit_2lines takes another step forward in helping to detect such bad behaviors. It learns the best description of a series using two lines.

  • What does series_fit_2lines do? Fits two straight lines to your data: one line up to a certain point in time, and another one from that point on.
  • What is series_fit_2lines used for? Cases where you want to quantify a change from one trend to another.

Example: The blue line illustrates a typical behavior of a memory counter during a leak. At first, all is good. At some point, the counter starts going up slowly. series_fit_2lines finds the trend change automatically (red lines). Now you can investigate processes that started to run around the time of the change in trend, and hopefully find the leak.

[Image: memory counter during a leak (blue) with the two fitted trend lines (red)]

  • How does series_fit_2lines work? It runs linear regression on splits of the data into two windows, and finds the best combination.
  • What is great about series_fit_2lines? It’s a powerful ML-based trend-change detection tool that automates tedious manual work and leads you directly to the point of change. As in the previous example, you can use it to extract the most sharply changing series from a large number of series.
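
A sketch of how this might look for the memory-leak scenario; the counter filter and column names are illustrative, and the output columns follow the documented shapes of series_fit_line and series_fit_2lines:

    // A large improvement of the two-line fit over the single-line fit
    // suggests a trend change, e.g. the start of a leak.
    performanceCounters
    | where timestamp > ago(3d) and name contains "Private Bytes"
    | make-series memValue = avg(value) default = 0
        on timestamp from ago(3d) to now() step 30m
        by cloud_RoleInstance
    | extend (rsquare1, slope1, variance1, rvariance1, interception1, line_fit1) = series_fit_line(memValue)
    | extend (rsquare2, split_idx, variance2, rvariance2, line_fit2) = series_fit_2lines(memValue)
    | extend improvement = rsquare2 - rsquare1
    | top 5 by improvement desc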

The following constructs are not classical ML, but they are very useful for advanced analytics of time series:

Investigating a bug with series_periods

Benign website usage has a regular, periodic pattern. A break in this pattern may indicate a bug interfering with your users’ experience. In this example, we will see how you can use series_periods to narrow down on problematic pages.

  • What does series_periods do? Automatically finds regular periods in your series. For example, website usage typically has a periodic usage pattern.
  • What is series_periods used for? Cases where you want to know what periods compose your data, typically to know of cases where the pattern is broken.

Example: You are asked to investigate a customer complaint about broken links in your site, but you don’t know exactly which links. You can use series_periods to automatically extract the usage patterns of all your pages and see what the prevailing periods are: weekly, daily, and combinations of both. Then, look for pages that break this pattern. This may indicate that some of the links to them are broken.

The overall usage pattern of all the pages looks like this:

[Image: overall usage pattern across all pages]

series_periods() will automatically find that the dominant periods are 1 day and 7 days. The buggy pages will not have this pattern, so a starting point for the investigation is to use series_periods() to surface the pages that break it; those pages are your suspects. Here are the series of pages whose period is less than one day:

[Image: series of pages whose usage period is less than one day]

You can see that there used to be a period, but something happened, for example a broken link, that stopped users from reaching these pages.

  • How does series_periods work? It applies the Fourier transform to the data and cleans up noise with Machine Learning.
  • What is great about series_periods? You don’t need to write different queries for data with different periods! ML picks the parameters for you.
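
A sketch of such a query; in current versions of the language the detector is exposed as series_periods_detect, and the pageViews table, the two-week range, and the hourly bins below are illustrative assumptions:

    // Detect the two most dominant periods per page, measured in hourly bins.
    pageViews
    | where timestamp > ago(14d)
    | make-series viewCount = count() default = 0
        on timestamp from ago(14d) to now() step 1h
        by name
    | extend (periods, scores) = series_periods_detect(viewCount, 0.0, 14d / 1h, 2)
    // periods are returned in units of the bin size (hours here),
    // so a healthy page is expected to show values near 24 and 168.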

Advanced series processing with series_fir and series_iir

These constructs are useful for flexible processing of time series:

  • series_fir applies the Finite Impulse Response filter on values of a finite window from a series. Weights are given to each of the elements in the window, and the result is a weighted sum of the values. Popular examples for series_fir are rolling average over a fixed time window, and computing the derivative of a series.

In the next section, you will find hands-on examples for series_fir.
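
As a quick illustration, here is a sketch of a 5-bin rolling average over an hourly request-count series (a continuation of the make-series example earlier; table and column names are illustrative):

    // Smooth each hourly series with a 5-point moving average.
    requests
    | where timestamp > ago(7d)
    | make-series requestCount = count() default = 0
        on timestamp from ago(7d) to now() step 1h
        by name
    | extend smoothed = series_fir(requestCount, repeat(1, 5))
    // repeat(1, 5) builds the filter [1,1,1,1,1]; with positive coefficients
    // series_fir normalizes them by default, so this yields a rolling average.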

  • series_iir applies the Infinite Impulse Response filter, which is an advanced recursive filter, useful for cases where series_fir would require too large a time window or a window of unknown size. The result of the filter is a linear function of the window values and of past output values of the filter. A popular example is computing a cumulative sum, where the filter recursively adds the current value to the sum computed so far.

In the next section, you will find hands-on examples for series_iir.
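
And here is a sketch of the cumulative-sum case mentioned above, using the standard IIR coefficients for a running sum (the table and series are again illustrative):

    // Cumulative sum: y[i] = x[i] + y[i-1], expressed as an IIR filter.
    requests
    | where timestamp > ago(1d)
    | make-series requestCount = count() default = 0
        on timestamp from ago(1d) to now() step 1h
    | extend cumulativeCount = series_iir(requestCount, dynamic([1]), dynamic([1, -1]))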

Working Examples

Put your hands on some ready-made examples:

  • Investigating Application Failures (uses: autocluster, diffpatterns)
  • Advanced Detection on Time Series with Shape Detectors (uses: make-series, series_fit_line)
  • Analyzing Performance Degradation (uses: make-series, series_fit_line)
  • Analyzing Concurrency of a Service (uses: series_iir)
  • Detection of Service Disruptions (uses: make-series, series_fit_line, series_fit_2lines)
  • Code Flow Performance (uses: make-series, series_fir)
  • Application Performance Monitoring (uses: make-series, series_periods)
  • Analyzing Usage Metrics (uses: make-series, series_iir)

Summary

We presented a set of Machine Learning and other advanced constructs which empower Analytics users with tools to quickly investigate their data, and extract non-trivial insights from it. The constructs are native to the query language, and are useful for both interactive and automated scenarios.

Machine Learning is revolutionizing information technology, domain by domain. Log analytics is no exception. As logs pile up, exciting opportunities to unlock insights from them arise. Machine Learning is a major player in the game of saving you time by automating tedious tasks, telling you the essence of data, or surprising you with intelligent guesses.

Interested in learning more? Visit Analytics Language Reference.

Thank you for reading this blog!