Root-Cause Analysis with In-Query Machine Learning in Application Insights Analytics

Posted on May 15, 2017

Principal Product Manager, Azure

There are now several language constructs in Application Insights Analytics that implement machine learning algorithms. Analytics is the powerful query language that lets you analyze usage and performance telemetry from your web app.

These operators let you apply Machine Learning (ML) directly within the query flow, without detailed knowledge of the underlying techniques. In this article, we’ll see how you can perform efficient root-cause analysis with their help.

We have released two families of ML constructs:

  • Intelligent insights into query results
  • Time series analysis – regression and more

In this blog, we will tour them by example. Whether you are looking for ways to better work with your logs, or just an ML fan, I am sure you will enjoy reading about these new ML capabilities.

Tour 1: Intelligent insights into query results

Focus on the essence of your query result with autocluster.

A basic step in root-cause analysis, and other log analytics flows, is to understand query results and use this understanding to build the next query. When there are many rows and columns this is far from trivial.

Wouldn’t it be nice to have a quick summary of query results, instead of running additional queries on each column just to discover that the natural grouping of results comes from the values in column #23?

Autocluster is your query buddy, which will automatically outline the results of your query.

  • What does autocluster do? Summarizes the result of a query into a small number of natural clusters. Each cluster is a set of input rows that are similar to each other. Clusters may overlap.
  • What is autocluster used for? Cases where a small summary of query results is needed for a quick understanding of data.

Example: The result of query | evaluate autocluster(), where query returns about 4,000 exceptions:

[Image: autocluster output summarizing the exceptions into clusters]

  • How does autocluster work? Advanced unsupervised learning finds clusters that are maximal in size, but minimal in the number of wildcards.
  • What is great about autocluster? You do not need to specify anything about the grouping! It is learned automatically, saving you a round of statistical querying to discover it. In the example above, we use autocluster to learn that the data has 3 dominant clusters; lines 4 and 5 of the output cover all rows and almost all rows, respectively.
  • Warning: autocluster is addictive :)

Diagnose an incident with diffpatterns

Another core building block for root-cause analysis is to juxtapose two behaviors, and discover the differences between them. You can see, for instance, which patterns characterize your failed requests vs. successful requests, slow transactions vs. the rest, validate a hypothesis regarding long response times, and so on.

diffpatterns gives you the power to diagnose by comparison. This is done by surfacing machine-learned guesses on root causes, described in terms of data patterns.

  • What does diffpatterns do? Compares two datasets with the same schema, for example, incident-time logs vs. normal-time logs.
  • What is diffpatterns used for? Cases where you want to diagnose or characterize by comparison. For example, which patterns are suspiciously represented in incident-time logs.

Example: You investigate an incident and write a query that has a binary column ’isIncident’, indicating whether a row belongs to an incident.

query | evaluate diffpatterns(split="isIncident") returns patterns common to both sets, and their percentage representation in each:

[Image: diffpatterns output comparing incident and non-incident rows]

  • How does diffpatterns work? Unsupervised learning finds frequent patterns that are common to the two row sets, and then the differences in their respective occurrence rates are computed.
  • What’s great about diffpatterns? It reduces dependency of the RCA process on intuition or highly-specific knowledge, by allowing quick progress in investigation paths, or dismissal of paths.
  • In the example below titled, “Investigating Application Failures,” you can see how to combine diffpatterns and autocluster.
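To make the idea of diagnosing by comparison concrete, here is a minimal Python sketch. It is not how diffpatterns is implemented: it simply counts every combination of up to two column values in each set and reports the percentage gap. The function names, the min_diff threshold, and the sample log rows are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def pattern_shares(rows, max_len=2):
    """Percentage of rows containing each pattern (a tuple of (column, value) pairs)."""
    counts = Counter()
    for row in rows:
        items = sorted(row.items())
        for size in range(1, max_len + 1):
            counts.update(combinations(items, size))
    return {p: 100.0 * c / len(rows) for p, c in counts.items()}

def diff_patterns(rows, split, min_diff=5.0):
    """Compare pattern shares between rows where `split` is true and rows where it is false."""
    set_a = [{k: v for k, v in r.items() if k != split} for r in rows if r[split]]
    set_b = [{k: v for k, v in r.items() if k != split} for r in rows if not r[split]]
    share_a, share_b = pattern_shares(set_a), pattern_shares(set_b)
    results = []
    for pattern in set(share_a) | set(share_b):
        a, b = share_a.get(pattern, 0.0), share_b.get(pattern, 0.0)
        if abs(a - b) >= min_diff:
            results.append((pattern, a, b, a - b))
    # Largest gaps first; the string key only makes the order of ties stable.
    return sorted(results, key=lambda r: (-abs(r[3]), str(r[0])))

# Hypothetical log rows: 4 from the incident window, 4 from normal time.
logs = (
    [{"region": "west", "version": "2.1", "isIncident": True}] * 3
    + [{"region": "east", "version": "2.1", "isIncident": True}]
    + [{"region": "west", "version": "2.0", "isIncident": False}] * 2
    + [{"region": "east", "version": "2.0", "isIncident": False}]
    + [{"region": "east", "version": "2.1", "isIncident": False}]
)

for pattern, in_incident, in_normal, diff in diff_patterns(logs, "isIncident"):
    print(pattern, in_incident, in_normal, diff)
```

In this toy data, version 2.1 appears in every incident row but in only a quarter of the normal rows, so it surfaces near the top as a suspect.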

Efficiently mine all patterns of your query results with basket

In the above example for autocluster, we wanted to group a set of exceptions into a small number of high-level, divergent clusters that we can look at quickly. However, if we want to further process the patterns appearing in a query result, we may need all of them (above a predefined size threshold), even if there are many. basket can help here.

  • What does basket do? Efficiently mines all frequent combinations of values in a query result.
  • What is basket used for? Cases where you want all patterns, even if many patterns exist.

Example: query | evaluate basket(), where query is the same as in the example for autocluster, returns the following:

[Image: basket output listing all frequent patterns in the exceptions]

  • How does basket work? Based on the Apriori Algorithm, one of the pearls of Machine Learning, which is a learning alternative to exhaustive search of all frequent patterns of categorical features.
  • What is great about basket? Apart from efficient computation, it allows you to retrieve all frequent patterns without having to limit their size or state it explicitly in the query!
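For readers who have not met Apriori before, the following Python sketch shows the core idea: a combination of values can only be frequent if all of its smaller sub-combinations are frequent, so candidates are grown level by level and pruned early. This is a textbook-style illustration, not the basket implementation; the function name, the min_support parameter, and the sample rows are hypothetical.

```python
from itertools import combinations

def apriori(rows, min_support):
    """Return {frozenset of (column, value) pairs: row count} for all frequent itemsets."""
    transactions = [frozenset(r.items()) for r in rows]

    # Level 1: count single (column, value) items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    size = 2
    while current:
        # Join frequent itemsets of the previous level, keeping only candidates
        # whose every (size - 1)-subset is itself frequent (the Apriori pruning rule).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in current for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        size += 1
    return frequent

# Hypothetical exception rows.
exceptions = [
    {"type": "Timeout", "role": "web", "region": "west"},
    {"type": "Timeout", "role": "web", "region": "west"},
    {"type": "Timeout", "role": "web", "region": "east"},
    {"type": "NullRef", "role": "worker", "region": "east"},
    {"type": "Timeout", "role": "worker", "region": "east"},
]

patterns = apriori(exceptions, min_support=2)
for itemset, count in sorted(patterns.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(count, sorted(itemset))
```

Note that the caller never states a maximum pattern size: the loop stops on its own once no larger frequent combination exists.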

Tour 2: Time-Series Analysis – Regression and more

Monitoring computing resources over time is an essential angle for diagnosis of problems. Systems produce a large number of time series and mining them efficiently will help you find the needle in the haystack. For this reason, we introduce time series as a new data type in Analytics, along with a set of language constructs which enable you to do advanced analytics from the time perspective.

make-series is the basic operator, which transforms tabular data into a new data type: an ordered series of numeric values. It is typically used for time series, and you can define the series axis and values with the full flexibility of the rest of the language.

Then, you can use shape detectors to expedite your investigation processes, again within the querying flow, being able to operate over a large number of series. Let’s tour them by example.

Search for anomalous performance counters with series_fit_line

Healthy systems are steady, predictable, and exhibit a regular pattern. When things go wrong, they appear as anomalies in the data. One of the ways to detect anomalous behavior is to learn the normal behavior from the data, and surface values that do not conform to it. series_fit_line will help you do that, particularly when you have many time series.

  • What does series_fit_line do? Finds the best straight line through your data, and tells how well it fits.
  • What is series_fit_line used for? Cases where you want to compute the trend of a series.

Example: You receive reports of service slowdowns in one of your deployments, and would like to look at memory performance counters. However, even 10 series are impossible to make sense of manually, and real systems have thousands. Here is what a tiny fraction of your data looks like. Where does root-cause analysis start?

[Image: a small sample of memory performance counter series]

You’d like to start with counters that have anomalous points. series_fit_line helps you determine whether a given series has anomalous values, by distinguishing between the following two types of behavior.

  1. Cases where the differences between the values produced by series_fit_line and the original values are similar to each other.
  2. Cases where there is a spike above the average difference, as illustrated below. Now you can set thresholds and retrieve the counters with anomalous points, as a starting point for the investigation.

[Image: a counter series with its fitted line and a spike above the average difference]

  • How does series_fit_line work? Linear regression is computed on the series values, which means that it finds the line of best fit to the data, together with a measure of the goodness of the fit.
  • What’s great about series_fit_line? It is useful in a variety of scenarios in addition to anomaly detection, including prediction of future values assuming a trend, visualization, ranking of series according to their slopes, and more.
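The regression step described above can be sketched in a few lines of Python. This is an ordinary least-squares fit with R-squared as the goodness-of-fit measure, followed by the spike check from the example; it is an illustration only, and the names fit_line and spikes, the factor threshold, and the counter values are hypothetical.

```python
def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns slope, intercept, R^2, fitted values."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * x for x in range(n)]
    ss_res = sum((y - f) ** 2 for y, f in zip(values, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in values)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r_squared, fitted

def spikes(values, factor=3.0):
    """Indices whose distance from the fitted line exceeds `factor` times the average distance."""
    fitted = fit_line(values)[3]
    residuals = [abs(y - f) for y, f in zip(values, fitted)]
    mean_residual = sum(residuals) / len(residuals)
    return [i for i, r in enumerate(residuals) if mean_residual > 0 and r > factor * mean_residual]

# Hypothetical memory counter with one anomalous point at index 4.
counter = [10, 10, 10, 10, 50, 10, 10, 10, 10, 10]
print(spikes(counter))

slope, intercept, r_squared, _ = fit_line([10, 11, 12, 13, 14, 15])
print(slope, intercept, r_squared)
```

Running the same check over thousands of counters and keeping only those with a non-empty result gives a short list to start the investigation from.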

Investigate a memory leak with series_fit_2lines

Not all bad behaviors can be described as a deviation from a line. For example, a memory leak exhibits values which slowly increase over time, without any of them being a spike. series_fit_2lines takes another step forward in helping to detect such bad behaviors. It learns the best description of a series using two lines.

  • What does series_fit_2lines do? Fits two straight lines to your data: one line up to a certain point in time, and another one from that point on.
  • What is series_fit_2lines used for? Cases where you want to quantify a change from one trend to another.

Example: The blue line illustrates the typical behavior of a memory counter during a leak. At first, all is good. At some point, the counter starts going up slowly. series_fit_2lines finds the trend change automatically, as shown by the red lines. Now you can investigate processes that started to run around the time of the change in trend, and hopefully find the leak.

[Image: memory counter (blue) with the two fitted lines (red) marking the trend change]

  • How does series_fit_2lines work? Runs linear regression on splits of the data into two windows, and finds the best combination.
  • What is great about series_fit_2lines? It’s a powerful ML-based trend change detection tool that automates tedious manual work and leads you directly to the point of change. As in the previous example, you can use it to extract the series with the largest changes from a large number of series.
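The split-and-regress idea can be sketched in Python as follows. The sketch tries every split point, fits a least-squares line to each side, and keeps the split with the smallest total squared error. It is a brute-force illustration rather than the series_fit_2lines implementation; the function names, the min_points parameter, and the counter values are hypothetical.

```python
def line_sse(values, start):
    """Fit a least-squares line to values at x = start, start + 1, ...; return (slope, squared error)."""
    n = len(values)
    xs = [start + i for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, values))
    return slope, sse

def fit_two_lines(values, min_points=2):
    """Return (split index, left slope, right slope) minimizing the total squared error."""
    best = None
    for split in range(min_points, len(values) - min_points + 1):
        left_slope, left_sse = line_sse(values[:split], 0)
        right_slope, right_sse = line_sse(values[split:], split)
        total = left_sse + right_sse
        if best is None or total < best[0]:
            best = (total, split, left_slope, right_slope)
    return best[1], best[2], best[3]

# Hypothetical memory counter: flat at first, then climbing steadily.
memory = [100, 100, 100, 100, 100, 100, 105, 110, 115, 120, 125, 130]
split, slope_before, slope_after = fit_two_lines(memory)
print(split, slope_before, slope_after)
```

The difference between the two slopes is a natural score for ranking many series by how sharply their trend changed.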

The following examples are not classical ML, but very useful in advanced analytics of time series:

Investigating a bug with series_periods_detect and series_periods_validate

Healthy services often have periodic, predictable patterns. A break in such a pattern may indicate a bug interfering with your users’ experience. In this example, we will see how you can use series_periods_detect and series_periods_validate to narrow in on a problem.

  • What does series_periods_detect do? Automatically finds regular periods in your series. For example, website usage typically has daily and/or weekly periodic patterns.
  • What is series_periods_detect used for? Cases where you want to know what periods compose your data, typically to monitor for broken patterns.

Example: You are asked to investigate a customer complaint about slow service response for a large service operating on many server clusters, but you don’t know exactly where the problem is coming from.

You can use series_periods_detect to automatically extract the general pattern of the data, and see what the prevailing periods are (weekly, daily, or combinations). Then, look for segments, in our case a specific cluster/server, which break this pattern. This may well indicate that the problem is coming from this anomalous segment.

The overall traffic pattern of all the pages looks like this:

[Image: overall traffic pattern of all the pages]

 

For signals that you know have a weekly or daily (or other) period, you can use series_periods_validate to validate this.

The problematic cluster/server will not have this pattern, so a starting point for this investigation is to use series_periods_detect to find the segments that break this pattern. These are your suspects.

Here are examples of segments which have a period of less than one day:

[Image: segments with a period of less than one day]

You can even see that there used to be a period, but something happened, for example a bug, which caused a problem and may have influenced the users.

  • How does series_periods_detect work? By applying the Fast Fourier Transform to select candidate periods from the frequency domain, then filtering out false candidates using autocorrelation.
  • What is great about series_periods_detect? You don’t need to write different queries for data with different periods! ML picks the parameters for you.
  • If you know in advance, based on domain knowledge, the periods that you are looking for, you can use series_periods_validate to score the given periods and reduce influence of noise.
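The two-stage approach described above (pick a candidate from the frequency domain, then confirm it with autocorrelation) can be sketched in Python. To stay dependency-free, this sketch uses a naive discrete Fourier transform, which yields the same coefficients an FFT would, only more slowly, and it keeps just the single strongest candidate. The function name, the min_corr threshold, and the sample series are hypothetical; this is not the series_periods_detect implementation.

```python
import cmath

def detect_period(values, min_corr=0.5):
    """Return the dominant period (in samples), or None if no convincing period is found."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]

    # Stage 1: find the frequency bin with the most power.
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * cmath.pi * k * t / n) for t, c in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k == 0:
        return None  # constant series: nothing to detect

    period = round(n / best_k)
    if period >= n:
        return None  # a "period" as long as the series is just a trend

    # Stage 2: confirm the candidate with autocorrelation at that lag.
    variance = sum(c * c for c in centered)
    corr = sum(centered[t] * centered[t + period] for t in range(n - period)) / variance
    return period if corr >= min_corr else None

# Hypothetical traffic counts that repeat every 4 samples.
traffic = [10, 20, 40, 20] * 6
print(detect_period(traffic))

# A steadily growing series has no repeating pattern.
print(detect_period(list(range(24))))
```

A segment for which this check returns no period, while the overall traffic has a clear one, would be a suspect in the investigation described above.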

Detect anomalously long request durations with series_outliers

Request duration is one of the key metrics of a service’s health. In this example, we use series_outliers to automatically identify values that are anomalously high relative to other values in the series, and thus may be indicative of a problem.

  • What does series_outliers do? Automatically finds anomalous values in your series, high or low. For example, healthy request duration should be stable, and outliers may indicate a problem.
  • What is series_outliers used for? Cases where you want to automatically retrieve points that are anomalous relative to the series they come from, possibly from a large number of series.

Example: You would like to create a dashboard in which segments corresponding to long request duration are highlighted. You can use series_outliers to automatically score anomalous values and use the score to decide what to show on the dashboard. In the example below, you can see the anomalousness score (green line) for every data point (blue line).

[Image: request duration (blue) with its anomaly score (green)]

Advanced series processing with series_fir, series_iir and interpolation functions

These constructs are useful for flexible processing of time series:

  • series_fir applies a Finite Impulse Response filter on the values of a finite window from a series. Weights are given to each of the elements in the window, and the result is a weighted sum of the values. Popular uses of series_fir are computing a rolling average over a fixed time window, and computing the derivative of a series. In the next section, you will find hands-on examples for series_fir.
  • series_iir applies an Infinite Impulse Response filter, which is a recursive filter, useful for cases where using series_fir would require too large a time window or a window of unknown size. The result of the filter is a combination of the window values and of past values from the filter. A popular example is computing the cumulative sum of series elements, where the filter recursively adds the current value to the sum computed so far. In the next section, you will find hands-on examples for series_iir.
  • series_fill_linear(), series_fill_const(), series_fill_forward(), and series_fill_backward() are interpolation functions. Interpolation functions are used to fill missing values, which are typical of real data, where data might not exist (e.g., low traffic), or might be lost, or corrupted and cleansed out. With these functions you can fill missing data points as a linear function of the existing data around the missing window, with a constant, or by copying values forward or backward.
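The filters and fill functions above are easiest to grasp with small numbers. The Python sketch below shows the general techniques under simple assumptions: missing history at the start of a series counts as zero, None marks a missing value, and gaps at the edges are left unfilled by the linear fill. The function names and signatures are hypothetical and do not mirror the Analytics functions exactly.

```python
def fir(values, weights):
    """Finite impulse response: out[i] = sum of weights[j] * values[i - j] over the window."""
    return [
        sum(w * values[i - j] for j, w in enumerate(weights) if i - j >= 0)
        for i in range(len(values))
    ]

def iir(values, b, a):
    """Recursive filter: a[0] * y[i] = sum(b[j] * x[i - j]) - sum(a[j] * y[i - j] for j >= 1)."""
    out = []
    for i in range(len(values)):
        acc = sum(bj * values[i - j] for j, bj in enumerate(b) if i - j >= 0)
        acc -= sum(a[j] * out[i - j] for j in range(1, len(a)) if i - j >= 0)
        out.append(acc / a[0])
    return out

def fill_forward(values):
    """Replace each missing value (None) with the last known value before it."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def fill_linear(values):
    """Fill gaps between known values along a straight line; edge gaps stay None."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for lo, hi in zip(known, known[1:]):
        step = (values[hi] - values[lo]) / (hi - lo)
        for i in range(lo + 1, hi):
            out[i] = values[lo] + step * (i - lo)
    return out

series = [1, 2, 3, 4]
print(fir(series, [0.5, 0.5]))    # rolling average over a window of 2
print(fir(series, [1, -1]))       # difference between consecutive values
print(iir(series, [1], [1, -1]))  # cumulative sum: y[i] = x[i] + y[i - 1]

gappy = [1, None, None, 4, None]
print(fill_forward(gappy))
print(fill_linear(gappy))
```

The cumulative sum shows why a recursive filter is handy: expressing it as a weighted window would need a window as long as the series itself.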

Working examples

Get your hands on some ready-made examples:

Example                                                  Operators used
Investigating Application Failures                       autocluster, diffpatterns
Advanced Detection on Time Series with Shape Detectors   make-series, series_fit_line
Analyzing Performance Degradation                        make-series, series_fit_line
Analyzing Concurrency of a Service                       series_iir
Detection of Service Disruptions                         make-series, series_fit_line, series_fit_2lines
Code Flow Performance                                    make-series, series_fir
Application Performance Monitoring                       make-series, series_periods
Analyzing Usage Metrics                                  make-series, series_iir

Summary

We presented a set of Machine Learning and other advanced constructs which empower Analytics users with tools to quickly investigate their data, and extract non-trivial insights from it. The constructs are native to the query language, and are useful for both interactive and automated scenarios.

Machine Learning is revolutionizing information technology, domain by domain. Log analytics is no exception. As logs pile up, exciting opportunities to unlock insights from them arise. Machine Learning is a major player in the game of saving you time by automating tedious tasks, telling you the essence of data, or surprising you with intelligent guesses.

Interested in learning more? Visit the Analytics Language Reference.

Thank you for reading this blog!