{"id":169,"date":"2021-06-30T00:00:00","date_gmt":"2021-06-30T00:00:00","guid":{"rendered":"https:\/\/azure.microsoft.com\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf"},"modified":"2025-06-25T04:59:34","modified_gmt":"2025-06-25T11:59:34","slug":"advancing-safe-deployment-with-aiops-introducing-gandalf","status":"publish","type":"post","link":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/","title":{"rendered":"Advancing safe deployment with AIOps\u2014introducing Gandalf"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"has-large-font-size wp-block-paragraph\"><em>Changes to Azure services and the Azure platform itself are both inevitable and beneficial, to ensure continuous delivery of updates, new features, and security enhancements. However, change is also a primary cause of service regressions that can contribute towards reliability issues\u2014for hyperscale cloud providers, indeed for any IT service provider. As such, it is critical to catch any such problems as early as possible during the development and deployment rollout, to minimize any impact on the customer experience. 
As part of our ongoing <a href=\"https:\/\/www.aka.ms\/AdvancingReliability\" target=\"_blank\" rel=\"noopener\">Advancing Reliability<\/a> blog series, today I\u2019ve asked Principal Program Manager <strong>Jian Zhang<\/strong> from our AIOps team to introduce how we\u2019re increasingly leveraging machine learning to de-risk these changes, ultimately to improve the reliability of Azure.<\/em>\u2014Mark Russinovich, CTO, Azure<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<p class=\"wp-block-paragraph\"><em>This post includes contributions from Principal Data Scientists Ken Hsieh and Ze Li, Principal Data Scientist Manager Yingnong Dang, and Partner Group Software Engineering Manager Murali Chintalapati.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In our earlier blog post, \u201c<a href=\"https:\/\/azure.microsoft.com\/blog\/advancing-safe-deployment-practices\/\" target=\"_blank\" rel=\"noopener\">Advancing safe deployment practices<\/a>,\u201d Cristina del Amo Casado described how we release both code and configuration changes to production across the <a href=\"https:\/\/azure.microsoft.com\/en-us\/\" target=\"_blank\" rel=\"noopener\">Azure<\/a> platform. 
The process delivers changes progressively, in phases that incorporate enough bake time to detect, at a small scale, most regressions missed during testing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The continuous monitoring of health metrics is a fundamental part of this process, and this is where AIOps plays a critical role: anomaly detection triggers alerts and automates corrective actions such as stopping the deployment or initiating rollbacks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the post that follows, we introduce how AI and machine learning are used to empower DevOps engineers, monitor the Azure deployment process at scale, detect issues early, and make rollout or rollback decisions based on impact scope and severity.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"why-aiops-for-safe-deployment\">Why AIOps for safe deployment<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">As defined by <a href=\"https:\/\/www.gartner.com\/en\/documents\/3892967\/market-guide-for-aiops-platforms\" target=\"_blank\" rel=\"noopener\">Gartner<\/a>, AIOps enhances IT operations through insights that combine big data, machine learning, and visualization to automate IT operations processes, including event correlation, anomaly detection, and causality determination. In our earlier post, &#8220;<a href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-azure-service-quality-with-artificial-intelligence-aiops\/\" target=\"_blank\" rel=\"noopener\">Advancing Azure service quality with artificial intelligence: AIOps<\/a>,&#8221; we shared our vision and some of the ways in which we are already using AIOps in practice, including around safe deployment. 
AIOps is well suited to catching failures during deployment rollout, particularly because of the complexities of cross-service dependencies, the scale of cloud services, and the variety of customer scenarios supported.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Phased rollouts and enriched health signals are used to facilitate monitoring and decision making in the deployment process, but the volume of signals and level of complexity involved in deployment decision making exceed what any human could reasonably process, across thousands of ever-evolving service components, spanning more than 200 datacenters in more than 60 regions. Some latent issues won\u2019t manifest for several days after their deployment, and global issues that span different clusters but manifest only minutely in any individual cluster are hard to detect with just a local watchdog. While loose coupling allows most service components to be deployed independently, their deployments could have intricate impacts on each other. 
For example, a simple change in an upstream service could potentially impact a downstream service if it breaks the contract of API calls between the two services.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">These challenges call for automated monitoring, anomaly detection, and rollout impact assessment solutions to facilitate deployment decisions at velocity.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/47cf1def-15ef-41dd-b9bb-69eb6ed10a44.webp\" alt=\"Gandalf safe deployment \u2013 including pre-qualification test, safe deployment policy, local watchdog, and \u201cGandalf,\u201d the global and intelligent watchdog.\" title=\"gandalf img1\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Figure 1: Gandalf safe deployment<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"gandalf-safe-deployment-service-an-aiops-solution\">Gandalf safe deployment service: An AIOps solution<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Rising to the challenge described above, the Azure Compute Insights team developed the \u201cGandalf\u201d safe deployment service\u2014an end-to-end, continuous monitoring system for safe deployment. We consider this part of the Gandalf AIOps solution suite, which includes a few other intelligent monitoring services. The code name Gandalf was inspired by the wizard in The Lord of the Rings. As shown in Figure 1, it serves as a global watchdog that makes intelligent deployment decisions based on the signals collected. 
It works in tandem with local watchdogs, safe deployment policies, and pre-qualification tests, all to ensure deployment safety and velocity.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As illustrated in Figure 2, the Gandalf system monitors rich and representative signals from Azure, performs anomaly detection and correlation, then derives insights to support deployment decision making and actions.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/7c1aa8d8-2f29-4752-bc86-b2eeecdd0a59.webp\" alt=\"Gandalf system overview \u2013 showing data sources, the detection\/correlation\/decision engine, result orchestration, consumers, and the deployment engine.\" title=\"Figure2\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Figure 2: Gandalf system overview<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"data-sources\">Data sources<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Gandalf monitors signals across performance, failures, and events as described below. It pre-processes the data to structure them around a unified data schema to support downstream data analytics. It also leverages a few other analytics services within Azure for health signals, including our Virtual Machine failure categorization service and near real-time failure attribution processing service. 
Signal registration with Gandalf is required when any new service components are onboarded, to ensure complete coverage.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Performance data: <\/strong>Gandalf monitors performance counters, CPU usage, memory usage, and more, all for a high-level view of performance and resource consumption patterns of hosted services.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Failure signals:<\/strong> Gandalf monitors both the hosting environment of customers\u2019 virtual machines (data plane) and tenant-level services (control plane). For the data plane, it monitors failure signals such as OS crashes, node faults, and reboots to evaluate the health of the VMs\u2019 hosting environment. At the same time, it monitors control plane failure signals, such as API call failures, to evaluate the health of tenant-level services.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Update events:<\/strong> In addition to the telemetry data it collects, Gandalf also keeps its finger on the pulse of deployment events, which report deployment progress and issues.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"detection-correlation-and-decision\">Detection, correlation, and decision<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Gandalf evaluates the impact scope of the deployment\u2014for example, the number of impacted nodes, clusters, and customers\u2014to make a go\/no-go decision using decision criteria that are trained dynamically. 
To balance speed and coverage, Gandalf utilizes an architecture with both streaming and batch analysis engines.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/bac9753f-e782-4ece-987b-214cb2b012c9.webp\" alt=\"Gandalf correlation process (identifying which rollouts are suspicious) and decision process (assessing the customer impacts of the blamed components\/failures).\" title=\"gandalf img3\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Figure 3: Gandalf Anomaly Detection and Correlation Model<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Figure 3 shows an overview of the Gandalf Machine Learning (ML) model. It consists of two parts: an anomaly detection and correlation process (to identify suspicious deployments) and a decision process (to evaluate customer impact).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"anomaly-detection-and-correlation-process\">Anomaly detection and correlation process<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To ensure precise detection, Gandalf derives fault signatures from input signals, each of which can be used to uniquely identify a failure. It then detects anomalies based on the occurrence of each fault signature.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In large-scale cloud systems like Azure, simple threshold-based detection is not practical, both because of the dynamic nature of the systems and workloads hosted and because of the sheer volume of fault signatures. Gandalf applies machine learning techniques to automatically estimate baseline settings from historical data and can adapt those settings through training as needed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When Gandalf detects an anomaly, it correlates the observed failure with deployment events and evaluates its impact scope. 
This helps to filter out failures caused by non-deployment reasons such as random firmware issues.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Since multiple system components are often deployed concurrently, a vote-veto mechanism is used to establish the relationship between the faults and the rollout components. In addition, temporal and spatial correlations are used to identify the components at fault. Fault age, which measures the time between rollout and detection of fault signature, is considered to allow more focus on new rollouts than old ones since newly observed faults are less likely to be triggered by the old rollout.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this way, Gandalf can detect an anomaly that would lead to potential regressions in the customer experience early in the process\u2014before it generates widespread customer impact. For more detail, refer to our published paper \u201c<a href=\"https:\/\/www.usenix.org\/conference\/nsdi20\/presentation\/li\" target=\"_blank\" rel=\"noopener\">Gandalf: An Intelligent, end-to-end analytics service for safe deployment in large-scale cloud infrastructure<\/a>.\u201d<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"decision-process\">Decision process<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, Gandalf evaluates the impact scope of the deployment such as the number of impacted clusters\/nodes\/customers, and ultimately makes a &#8220;go\/no-go&#8221; decision. It\u2019s worth mentioning that Gandalf is designed to allow developers to customize signals\u2019 weight assignment based on their experience. In this way, it can incorporate domain knowledge from human experts to complement its machine learning solutions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"result-orchestration\">Result orchestration<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To balance speed and coverage, Gandalf utilizes both streaming and batch processing of incoming signals. 
Streaming processing consumes data from Azure Data Explorer (also known as Kusto), a cloud data store that supports fast analytics. It processes fault signals that occur within one hour before and after each deployment on each node, and runs lightweight analysis algorithms for rapid response.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Batch processing consumes data from Cosmos, a Hadoop-like file system that supports extremely large volumes of data. It\u2019s used to analyze faults over a larger time window (generally a 30-day period) with advanced algorithms.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Both streaming and batch processing are performed incrementally at five-minute intervals. In general, Gandalf\u2019s incoming telemetry signals are both streamed into Kusto and stored in Cosmos hourly or daily. Although both pipelines draw on the same data source, they can occasionally produce inconsistent results. This is by design, since batch processing makes more informed decisions and covers latent issues that the fast streaming process cannot detect.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"deployment-experience-transformation\">Deployment experience transformation<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The Gandalf system is now well integrated into our DevOps workflow within Azure and has been widely adopted for deployment health monitoring across the entire fleet. 
It not only helps to stop bad rollouts as quickly as possible but has also transformed the engineers\u2019 and release managers\u2019 experience in deploying software changes: from looking for scattered evidence to using a single source of truth, and from ad-hoc diagnoses to interactive troubleshooting. In the process, many of the engineers who interact with Gandalf have changed their opinion of it, evolving from skeptics to advocates.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In many Azure services, Gandalf has become a default baseline for all release validations, and it\u2019s exciting to hear how much our on-call engineers trust Gandalf.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"summary\">Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In this post, we have introduced the <a href=\"https:\/\/www.usenix.org\/conference\/nsdi20\/presentation\/li\" target=\"_blank\" rel=\"noopener\">Gandalf<\/a> safe deployment service, an intelligent, end-to-end analytics service for the safe deployment of Azure services. Through state-of-the-art anomaly detection, spatial and temporal correlation, and result orchestration, the Gandalf safe deployment service enables DevOps engineers to make go\/no-go decisions accurately, and with the velocity needed by hyperscale cloud platforms like Azure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We will continue to invest in applying AI- and machine learning-based technologies to improve cloud service management, ultimately to continue improving the customer experience. Look for us to share more about our AIOps solutions, including pre-production analytics to further help us push quality to the left.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The continuous monitoring of health metrics is a fundamental part of this process, and this is where AIOps plays a critical role. 
In the post that follows, we introduce how AI\u00a0and machine learning\u00a0are used to empower DevOps engineers,\u00a0monitor the Azure deployment process at scale, detect issues early, and make rollout or rollback decisions based on impact scope and severity.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ms_queue_id":[],"ep_exclude_from_search":false,"_classifai_error":"","_classifai_text_to_speech_error":"","_alt_title":"","footnotes":"","msx_community_cta_settings":[]},"categories":[1474],"tags":[],"audience":[3057,3055,3056],"content-type":[1481],"product":[1522],"tech-community":[],"topic":[],"coauthors":[21],"class_list":["post-169","post","type-post","status-publish","format-standard","hentry","category-analytics","audience-data-professionals","audience-developers","audience-it-implementors","content-type-thought-leadership","product-azure-data-explorer","review-flag-1680286581-295","review-flag-1-1680286581-825","review-flag-2-1680286581-601","review-flag-3-1680286581-173","review-flag-artif-1680286586-345","review-flag-gartn-1680286585-730","review-flag-machi-1680286585-314","review-flag-ml-1680286585-776","review-flag-new-1680286579-546","review-flag-partn-1680286579-901","review-flag-vm-1680286585-143"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.2 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Advancing safe deployment with AIOps\u2014introducing Gandalf | Microsoft Azure Blog<\/title>\n<meta name=\"description\" content=\"The continuous monitoring of health metrics is a fundamental part of this process, and this is where AIOps plays a critical role. 
In the post that follows, we introduce how AI\u00a0and machine learning\u00a0are used to empower DevOps engineers,\u00a0monitor the Azure deployment process at scale, detect issues early, and make rollout or rollback decisions based on impact scope and severity.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Advancing safe deployment with AIOps\u2014introducing Gandalf | Microsoft Azure Blog\" \/>\n<meta property=\"og:description\" content=\"The continuous monitoring of health metrics is a fundamental part of this process, and this is where AIOps plays a critical role. In the post that follows, we introduce how AI\u00a0and machine learning\u00a0are used to empower DevOps engineers,\u00a0monitor the Azure deployment process at scale, detect issues early, and make rollout or rollback decisions based on impact scope and severity.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/\" \/>\n<meta property=\"og:site_name\" content=\"Microsoft Azure Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/microsoftazure\" \/>\n<meta property=\"article:published_time\" content=\"2021-06-30T00:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-06-25T11:59:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/47cf1def-15ef-41dd-b9bb-69eb6ed10a44.webp\" \/>\n<meta name=\"author\" content=\"Mark Russinovich\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta 
name=\"twitter:creator\" content=\"@azure\" \/>\n<meta name=\"twitter:site\" content=\"@azure\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Mark Russinovich\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/\"},\"author\":[{\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/author\/mark-russinovich\/\",\"@type\":\"Person\",\"@name\":\"Mark Russinovich\"}],\"headline\":\"Advancing safe deployment with AIOps\u2014introducing Gandalf\",\"datePublished\":\"2021-06-30T00:00:00+00:00\",\"dateModified\":\"2025-06-25T11:59:34+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/\"},\"wordCount\":1696,\"publisher\":{\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/47cf1def-15ef-41dd-b9bb-69eb6ed10a44.webp\",\"articleSection\":[\"Analytics\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/\",\"url\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/\",\"name\":\"Advancing safe deployment with AIOps\u2014introducing Gandalf | Microsoft Azure 
Blog\",\"isPartOf\":{\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/47cf1def-15ef-41dd-b9bb-69eb6ed10a44.webp\",\"datePublished\":\"2021-06-30T00:00:00+00:00\",\"dateModified\":\"2025-06-25T11:59:34+00:00\",\"description\":\"The continuous monitoring of health metrics is a fundamental part of this process, and this is where AIOps plays a critical role. In the post that follows, we introduce how AI\u00a0and machine learning\u00a0are used to empower DevOps engineers,\u00a0monitor the Azure deployment process at scale, detect issues early, and make rollout or rollback decisions based on impact scope and 
severity.\",\"breadcrumb\":{\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#primaryimage\",\"url\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/47cf1def-15ef-41dd-b9bb-69eb6ed10a44.webp\",\"contentUrl\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/47cf1def-15ef-41dd-b9bb-69eb6ed10a44.webp\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Blog home\",\"item\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Analytics\",\"item\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/category\/analytics\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Advancing safe deployment with AIOps\u2014introducing Gandalf\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/#website\",\"url\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/\",\"name\":\"Microsoft Azure Blog\",\"description\":\"Get the latest Azure news, updates, and announcements from the Azure blog. 
From product updates to hot topics, hear from the Azure experts.\",\"publisher\":{\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/#organization\",\"name\":\"Microsoft Azure Blog\",\"url\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2024\/06\/microsoft_logo.webp\",\"contentUrl\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2024\/06\/microsoft_logo.webp\",\"width\":512,\"height\":512,\"caption\":\"Microsoft Azure 
Blog\"},\"image\":{\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/microsoftazure\",\"https:\/\/x.com\/azure\",\"https:\/\/www.instagram.com\/microsoftdeveloper\/\",\"https:\/\/www.linkedin.com\/company\/16188386\",\"https:\/\/www.youtube.com\/user\/windowsazure\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/#\/schema\/person\/c702e5edd662b328b49b7e1180cab117\",\"name\":\"shakir\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/secure.gravatar.com\/avatar\/9342c7c05bb16548741bc5cd3a3e3b7ee0c8e746844ad2cc582db5beb5514c6f?s=96&d=mm&r=g7664e653ea371ce16eaf75e9fa8952c4\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/9342c7c05bb16548741bc5cd3a3e3b7ee0c8e746844ad2cc582db5beb5514c6f?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/9342c7c05bb16548741bc5cd3a3e3b7ee0c8e746844ad2cc582db5beb5514c6f?s=96&d=mm&r=g\",\"caption\":\"shakir\"},\"sameAs\":[\"https:\/\/azure.microsoft.com\"],\"url\":\"https:\/\/azure.microsoft.com\/en-us\/blog\/author\/shakir\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Advancing safe deployment with AIOps\u2014introducing Gandalf | Microsoft Azure Blog","description":"The continuous monitoring of health metrics is a fundamental part of this process, and this is where AIOps plays a critical role. 
In the post that follows, we introduce how AI\u00a0and machine learning\u00a0are used to empower DevOps engineers,\u00a0monitor the Azure deployment process at scale, detect issues early, and make rollout or rollback decisions based on impact scope and severity.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/","og_locale":"en_US","og_type":"article","og_title":"Advancing safe deployment with AIOps\u2014introducing Gandalf | Microsoft Azure Blog","og_description":"The continuous monitoring of health metrics is a fundamental part of this process, and this is where AIOps plays a critical role. In the post that follows, we introduce how AI\u00a0and machine learning\u00a0are used to empower DevOps engineers,\u00a0monitor the Azure deployment process at scale, detect issues early, and make rollout or rollback decisions based on impact scope and severity.","og_url":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/","og_site_name":"Microsoft Azure Blog","article_publisher":"https:\/\/www.facebook.com\/microsoftazure","article_published_time":"2021-06-30T00:00:00+00:00","article_modified_time":"2025-06-25T11:59:34+00:00","og_image":[{"url":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/47cf1def-15ef-41dd-b9bb-69eb6ed10a44.webp","type":"","width":"","height":""}],"author":"Mark Russinovich","twitter_card":"summary_large_image","twitter_creator":"@azure","twitter_site":"@azure","twitter_misc":{"Written by":"Mark Russinovich","Est. 
reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#article","isPartOf":{"@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/"},"author":[{"@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/author\/mark-russinovich\/","@type":"Person","@name":"Mark Russinovich"}],"headline":"Advancing safe deployment with AIOps\u2014introducing Gandalf","datePublished":"2021-06-30T00:00:00+00:00","dateModified":"2025-06-25T11:59:34+00:00","mainEntityOfPage":{"@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/"},"wordCount":1696,"publisher":{"@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/#organization"},"image":{"@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#primaryimage"},"thumbnailUrl":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/47cf1def-15ef-41dd-b9bb-69eb6ed10a44.webp","articleSection":["Analytics"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/","url":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/","name":"Advancing safe deployment with AIOps\u2014introducing Gandalf | Microsoft Azure 
Blog","isPartOf":{"@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#primaryimage"},"image":{"@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#primaryimage"},"thumbnailUrl":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/47cf1def-15ef-41dd-b9bb-69eb6ed10a44.webp","datePublished":"2021-06-30T00:00:00+00:00","dateModified":"2025-06-25T11:59:34+00:00","description":"The continuous monitoring of health metrics is a fundamental part of this process, and this is where AIOps plays a critical role. In the post that follows, we introduce how AI\u00a0and machine learning\u00a0are used to empower DevOps engineers,\u00a0monitor the Azure deployment process at scale, detect issues early, and make rollout or rollback decisions based on impact scope and severity.","breadcrumb":{"@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#primaryimage","url":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/47cf1def-15ef-41dd-b9bb-69eb6ed10a44.webp","contentUrl":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2021\/06\/47cf1def-15ef-41dd-b9bb-69eb6ed10a44.webp"},{"@type":"BreadcrumbList","@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/advancing-safe-deployment-with-aiops-introducing-gandalf\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog 
home","item":"https:\/\/azure.microsoft.com\/en-us\/blog\/"},{"@type":"ListItem","position":2,"name":"Analytics","item":"https:\/\/azure.microsoft.com\/en-us\/blog\/category\/analytics\/"},{"@type":"ListItem","position":3,"name":"Advancing safe deployment with AIOps\u2014introducing Gandalf"}]},{"@type":"WebSite","@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/#website","url":"https:\/\/azure.microsoft.com\/en-us\/blog\/","name":"Microsoft Azure Blog","description":"Get the latest Azure news, updates, and announcements from the Azure blog. From product updates to hot topics, hear from the Azure experts.","publisher":{"@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/azure.microsoft.com\/en-us\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/#organization","name":"Microsoft Azure Blog","url":"https:\/\/azure.microsoft.com\/en-us\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2024\/06\/microsoft_logo.webp","contentUrl":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-content\/uploads\/2024\/06\/microsoft_logo.webp","width":512,"height":512,"caption":"Microsoft Azure 
Blog"},"image":{"@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/microsoftazure","https:\/\/x.com\/azure","https:\/\/www.instagram.com\/microsoftdeveloper\/","https:\/\/www.linkedin.com\/company\/16188386","https:\/\/www.youtube.com\/user\/windowsazure"]},{"@type":"Person","@id":"https:\/\/azure.microsoft.com\/en-us\/blog\/#\/schema\/person\/c702e5edd662b328b49b7e1180cab117","name":"shakir","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/9342c7c05bb16548741bc5cd3a3e3b7ee0c8e746844ad2cc582db5beb5514c6f?s=96&d=mm&r=g7664e653ea371ce16eaf75e9fa8952c4","url":"https:\/\/secure.gravatar.com\/avatar\/9342c7c05bb16548741bc5cd3a3e3b7ee0c8e746844ad2cc582db5beb5514c6f?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/9342c7c05bb16548741bc5cd3a3e3b7ee0c8e746844ad2cc582db5beb5514c6f?s=96&d=mm&r=g","caption":"shakir"},"sameAs":["https:\/\/azure.microsoft.com"],"url":"https:\/\/azure.microsoft.com\/en-us\/blog\/author\/shakir\/"}]}},"msxcm_display_generated_audio":false,"msxcm_animated_featured_image":null,"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Microsoft Azure 
Blog","distributor_original_site_url":"https:\/\/azure.microsoft.com\/en-us\/blog","push-errors":false,"_links":{"self":[{"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/posts\/169","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/comments?post=169"}],"version-history":[{"count":1,"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/posts\/169\/revisions"}],"predecessor-version":[{"id":43306,"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/posts\/169\/revisions\/43306"}],"wp:attachment":[{"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/media?parent=169"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/categories?post=169"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/tags?post=169"},{"taxonomy":"audience","embeddable":true,"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/audience?post=169"},{"taxonomy":"content-type","embeddable":true,"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/content-type?post=169"},{"taxonomy":"product","embeddable":true,"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/product?post=169"},{"taxonomy":"tech-community","embeddable":true,"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/tech-community?post=169"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/topic?post=169"},{"taxonomy":"author","embeddable":true,"href":"h
ttps:\/\/azure.microsoft.com\/en-us\/blog\/wp-json\/wp\/v2\/coauthors?post=169"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}