略過導覽

AI helps troubleshoot an intermittent SQL Database performance issue in one day

在 九月 11, 2018 上貼文

Program Manager, R&D Azure Data

In this blogpost, you will learn how Azure SQL Database intelligent performance feature Intelligent Insights has successfully helped a customer troubleshoot a hard to find 6-month intermittent database performance issue in a single day only. You will find out how Intelligent Insights helps an ISV operate 60,000 databases by identifying related performance issues across their database fleet. You will also learn how Intelligent Insights helped an enterprise seamlessly identify a hard to troubleshoot performance degradation issue on a large-scale 35TB database fleet.

Azure SQL Database, the most intelligent cloud database, is empowering small and medium size business, and large enterprises to focus on writing awesome applications while entrusting Azure to autonomously take care of running, scaling, and maintain a peak performance with a minimum of human interaction, or advanced technical skill set required.

Intelligent Insights is a new disruptive intelligent performance technology leveraging the power of artificial intelligence (AI) to continuously monitor and troubleshoot Azure SQL Database performance issues with a pinpoint accuracy and at a large scale simply not possible before. Performance troubleshooting models for the solution where fine-tuned and advanced based on the learning from a massive workload pool of several million Azure SQL Databases.

In the period since its public preview debut in Sep. 2017, we have witnessed some remarkable customer success stories with Intelligent Insights that I would like to share with you.

Troubleshooting hard-to-find intermittent SQL Database performance issues

NewOrbit is an ISV based in the United Kingdom building and running software on Azure for startups and enterprises, including a system for background checking of employees and tenants. Thousands of customers check more than 400,000 people each year and the system does that through integrating information from various services. NewOrbit runs entirely on Azure and use about 200 SQL Databases. NewOrbit has a flat organizational structure consisting of developers and customer account managers only. They do not employ DevOps team or DBAs as they rely on SQL Database built-in intelligence to automatically tune and troubleshoot database performance for them as NewOrbit CTO Frans Lytzen is saying “I have Azure for that”.

NewOrbit experienced an intermittent performance issue lasting for about 6 months that resulted in getting an increased amount of timeouts for existing systems running in production. Their first reaction was to upgrade to a higher pricing tier on Azure with more capacity, however this did not help and it seemed very strange to them. As they’ve seen more than enough spare DTUs on their Azure subscription, NewOrbit understood right away that dialing up the capacity will not make the issue go away.

NewOrbit has decided to try Intelligent Insights. As soon as they fired up the solution, it showed that there seems to exist a memory pressure on their databases. This was not immediately obvious and NewOrbit understood that memory pressure is not the cause but most likely an effect of another underlying problem.

image_000

Intelligent Insights has within a day of being turned on pointed out there are several queries suspected as a root cause for the memory pressure. NewOrbit has promptly fixed and deployed new queries to production witnessing an immediate decrease in memory pressure the following day.

“On an earlier occasion, we had a performance issue lasting for about 6 months. Before Intelligent Insights we have not had a way of figuring out where do we even start troubleshooting. Intelligent Insights gave us a list of things to do. What Intelligent Insight does is that it enables us to pinpoint where the problem is and to get a fix deployed within 24hrs.”

Frans Lytzen, CTO, NewOrbit.

As NewOrbit grows as a service, their databases and queries are getting bigger and the complexity grows. They have now implemented a continuous improvement program relying on Intelligent Insights to regularly review and optimize database queries for the best performance of their applications.

Identifying performance issues in the sea of 60,000 SQL Databases

SnelStart is a Dutch ISV developing and running SaaS service for small and large companies providing financial online services such are bookkeeping, invoicing and accounting. SnelStart relies on Azure to deliver their service to about 64,000 customers. The underlying infrastructure employs 60,000+ SQL Databases, mainly in elastic pools. Maintaining a massive amount of databases at scale and quickly reacting to customer issues, especially in cases of services degradation or an outage is an imperative for SnelStart.

To maintain such a large amount of databases typically involves having a large DBA and DevOps team. Once a customer calls on the phone complaining that the service is slow or unavailable, SnelStart team needs to quickly identify and act on resolving the issue. The company had a number of cases where it would find out about a performance or unavailability issues from their customers first. Hoping to improve their response time, SnelStart has decided to use Intelligent Insights as a performance-monitoring tool.

SnelStart uses the solution daily to help identify elastic pools hitting its CPU limits. The built-in intelligence helps the company identify top consuming databases in overutilized elastic pools, providing suggestions of other elastic pools with sufficient capacity where the hot databases could be moved. SnelStart as their DevOps strategy has created separate elastic pools on each of their logical servers where they move high consuming databases. This allows the company to research slow running queries, and other database problems with the help of Azure SQL Analytics until repaired. Once performance of hot databases has improved, SnelStart typically returns them back to the original elastic pools.

image_002

With the help of Intelligent Insights, SnelStart was also, in one of the instances, able to quickly identify application queries causing database locking, much before their developers could manually troubleshoot the root cause of the issue. Hotfix for the issue was deployed within minutes, and before customers were impacted at scale. Better yet, there were no customer phone calls received regarding the service performance.

This was an entirely new type of capability for the company, as the service monitoring has shifted from a reactive to proactive relying on SQL Database built-in intelligent performance.

“Intelligent Insights proactively finds a database performance problem in a more efficient way and much faster than humans. With it we can proactively help customers until we have a fix for the problem.”

Bauke Stil, Application Manager, SnelStart.

SnelStart was impressed with the efficiency of detecting database performance issues automatically and at such scale in the sea of 60,000 SQL Databases. As all of their customer databases are having the same structure, the company is also using the solution to identify and resolve common performance issues across their entire database fleet. Once an issue is identified on one database, a hotfix is deployed to all databases immediately benefiting all SnelStart customers.

Finding a locking issue on a large-scale 35TB SQL Database fleet

Microsoft TFS provides an online service supporting our developer community and customers. It is a complex system handling trillion lines of code across the customer base. The service provides sophisticated reporting, builds, labs, tests, ticketing, release automation management and project management – amongst others. Providing a 24/7/365 service globally with a top performance is imperative for TFS. Once you click to check-in your code, or if you are perhaps creating a TFS ticket, you expect a prompt response from the service. TFS massive infrastructure runs on about 900 SQL Databases with combined 35TB of data stored.

TFS has previously used its own monitoring and alerting solution capable of observing slow queries and sustained high CPU periods. On one of the occasions, the existing tools identified a performance issue providing only a shallow analysis that there exists a heavy locking on a database, but nothing else. Further troubleshooting has indicated the locking was table scoped and related to lock partitioning. This meant that the affected slow query visible in the stats is typically not the root cause of locking, but only a surface effect of some other blocking query causing the locking. The workload in such cases keeps on piling up, so upgrading to a larger resource pool could only be a short-lived solution. The only way out of this was to find the blocking query and to fix it.

Troubleshooting in cases such as this one is typically very difficult, as it requires considerable DBA skills and it is very time consuming. TFS turned to Intelligent Insights to help with performance troubleshooting automation. In this particular case, the system has identified application related increase in the workload pile up, and has provided a list the affected and blocking queries.

image_003

Analysis of the blocking query has identified that this was a maintenance query scheduled periodically to remove unused attachments. Further analyzing the code, it was determined that the size of the deletion batch was too large causing the issue. New query with reduced batch size was deployed promptly, having an immediate effect in resolving the heavy locking and gradually relieving the workload pressure.

Maintaining TFS demanding infrastructure requires some of the best Microsoft engineers on the job. We have surely met the bar with one of the best SQL performance troubleshooters at Microsoft, Remi Lemarchand, Principal Software Engineer at TFS, hearing him say:

“Intelligent Insights does a great job of finding database performance problems. I find it always right on the spot!”

Remi Lemarchand, Principal Software Engineer, Microsoft TFS.

Remi concludes that he prefers using Intelligent Insights first in troubleshooting database performance issues before moving onto other tools due to the level of depth it provides, and considerable reduction of manual DBA troubleshooting time required. Since applying the fix to the blocking query, the database is purring along nicely!

Summary

Marriage of AI and Azure SQL Database has resulted in disruptive new capabilities of intelligent performance not possible before. Small and medium size business can now do more with Azure on a large scale with a smaller crew and at a considerably lower cost compared to running infrastructure on their own. Large companies and enterprises can have less headaches and relieve pressure from their DBAs and DevOps as SQL Database intelligent performance can help them identify operational issues on tens of thousands of databases in a single day, compared to months in some cases.

To help you start using SQL Database Intelligent Insights for troubleshooting performance issues, see Setup Intelligent Insights with Log Analytics.

For related SQL Database intelligent performance products, see Monitor Azure SQL Databases with Azure SQL Analytics, and Automatic tuning in Azure SQL Database.

Please let us know how do you envision SQL Database intelligent performance helping you?