Advancing resiliency threat modeling for large distributed systems

“All service engineering teams in Azure are already familiar with postmortems as a tool for better understanding what went wrong, how it went wrong, and the customer impact of the related outage. An important part of our postmortem process is for all relevant teams to create repair items aimed at preventing the same type of outage from happening again. While reactive approaches can yield results, we wanted to shift this work to the left, way left: we wanted to get better at uncovering risks before they had an impact on our customers, not after. For today’s post in our Advancing Reliability blog series, I’ve asked Richie Costleigh, Principal Program Manager from our Azure problem management team, to share insights into our journey as we work towards advancing our postmortem and resiliency threat modeling processes.” - Mark Russinovich, CTO, Azure

At a high level, our goal is to avoid self-inflicted and otherwise avoidable outages, but even more immediately, our goal is to reduce the likelihood of impact to our customers as much as possible. In this context, an outage is any incident in which our services fail to meet the expectations of (or negatively impact the workloads of) our customers, both internal and external. To avoid that, we needed to improve how we uncover risks before they impact customer workloads on Azure. But Azure itself is a large, complex distributed system. How should we approach resiliency threat modeling if our organization offers thousands of solutions, composed of hundreds of “services,” each with a team of five to 50 engineers, distributed across multiple parts of the organization, and each with their own processes, tools, priorities, and goals? How do we scale our resiliency threat modeling process out, and reason across all these individual risk assessments? Addressing these challenges took some major changes that joined our reactive approach with a more proactive one.

Starting our journey

We started the shift left with a premortem pilot program. We looked back at past outages and developed a questionnaire that helped not only to start discussions but also to give them structure. Next, we selected several services of varying purpose and architecture. Each time we sat down with a team, we learned something new, got better at identifying risks, incorporated the team’s feedback on the process, then tried again with the next team. Eventually, we started to identify the right questions to ask, as well as the other elements we needed to make this process productive and impactful. Some of these elements already existed; others needed to be created to support a centralized approach to resiliency threat modeling. Many that existed also required changes or integration into an overall solution that met our goals. What follows is a high-level overview of our approach and the elements we discovered were necessary to keep improving in this space.

We created a culture of continuous and fearless risk documentation

We realized that there would need to be a large up-front investment to find out where our risks were. Thorough premortems take time—the valuable and “in high demand” engineer kind of time, the “we are working within a tight timeline to deliver customer value and can’t stop to do this” kind of engineer time. Sometimes we needed to help them understand that, even though their service only had one outage in two years, there are dozens of other services in the dependency chains of our solutions that have also only had one outage in the past two years. The point is that, from our customers’ perspective, they have seen more than “just” two outages.
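The arithmetic behind this point can be sketched roughly as follows. This is an illustration only: the figure of 24 services is a hypothetical stand-in for “dozens,” and the model assumes that outages are independent across services and that every outage in the dependency chain is visible to the customer, neither of which is claimed in this post.

```python
import math


def expected_outages(num_services: int, outages_per_year: float, years: float) -> float:
    """Expected number of outages a customer sees across a dependency chain.

    Assumes every service has the same outage rate and that each outage
    is customer-visible (a simplifying assumption for illustration).
    """
    return num_services * outages_per_year * years


def prob_at_least_one_outage(num_services: int, outages_per_year: float, years: float) -> float:
    """Probability of at least one outage, modeling outages as independent
    Poisson events (an assumption, not a measured property of any service)."""
    return 1.0 - math.exp(-num_services * outages_per_year * years)


# One outage in two years is a rate of 0.5 outages per service per year.
RATE = 0.5
SERVICES_IN_CHAIN = 24  # hypothetical value standing in for "dozens"

single_service = expected_outages(1, RATE, 2.0)
whole_chain = expected_outages(SERVICES_IN_CHAIN, RATE, 2.0)
```

Under these assumptions a single service accounts for one expected outage over two years, while the chain as a whole accounts for 24, which is why a per-service view understates what customers experience.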

We should be skeptical that low risk is something we can safely ignore. We must embark on a healthy, searching, and fearless inventory of the risks that could lead to outages. We needed a process that not only supports detailed discussions of lingering issues and an assessment of the risks without fear of reprisal, but also informs investments in common solutions and mitigations that can be leveraged broadly across multiple services in a large-scale organization consisting of hundreds or thousands of services.

Our goal is a continuous search for risks, and to have “living” risk threat models instead of static models that are only updated every X months. Once the up-front investment to find that initial set of risks landed, we wanted to keep things up to date with a well-defined process. The key at first was not to be deterred by how large the initial investment “moat” between us and our goals was. The real benefit we see now is that risk collection is built into our culture. This enables risks to be collected from many diverse sources and built into our organizational processes.

So, how best to start the process of identifying risks?

We used both reactive and proactive approaches to uncovering risks

We leveraged our postmortem analysis program

While looking for risks, it makes sense to look at what has happened in the past and search for clues to what can happen again in the future. A solid postmortem analysis program was key in this regard. We already had a team that analyzed postmortems, looking for common themes and surfacing them for deeper analysis and to inform investments. Their analyses not only helped us route teams to quality programs or initiatives that could help address the risk, but also highlighted the need for other investments we did not know we needed until we saw how prevalent the risk was. The risk categories they identified from postmortem analysis became a subset of our focus areas when we looked for risks that had not happened yet. It is worth noting here that postmortem analysis is only as good as the postmortem itself.

We leveraged our postmortem quality review program

The quality of a postmortem determines its usefulness, so we invested in a Postmortem Quality Review Program. Guidance was published, training was made available, and a large pool of reviewers rated each “high impact” outage postmortem after it was written. Postmortems that had low ratings or needed more clarity were sent back to the authors. High impact postmortems are reviewed weekly in a meeting that includes engineers from other teams and senior leaders, both of whom ask questions and give feedback on the right action plans. Having this program vastly increased our ability to learn from and act on postmortem data.

However, we knew that we could not limit our search for risks to the past.

We got better at premortems to be more proactive

Some may have heard the term “premortem,” which is a similar process to a Failure Mode Analysis (FMA).

According to the Harvard Business Review (September 2007, Issue 1), “A premortem is the hypothetical opposite of a postmortem. A postmortem in a medical setting allows health professionals and the family to learn what caused a patient’s death. Everyone benefits except, of course, the patient. A premortem in a business setting comes at the beginning of a project rather than the end so that the project can be improved rather than autopsied. Unlike a typical critiquing session, in which project team members are asked what might go wrong, the premortem operates on the assumption that the “patient” has died, and so asks what did go wrong. The team members’ task is to generate plausible reasons for the project’s failure.”

In the context of Azure, the goal of a premortem is to predict what could go wrong, identify the potential risks, and mitigate or remove them before the trigger event results in an outage. According to “How to Catch a Black Swan: Measuring the Benefits of the Premortem Technique for Risk Identification,” premortem techniques can identify more quality risks and propose more quality changes, which makes the premortem a more effective risk management technique than others.

Conducting a premortem is not a terribly difficult endeavor, but you must be sure to involve the right set of people. For a particular customer solution, we gathered the most knowledgeable engineers for that solution and, even better, brought in newer engineers so they could learn. Next, we had them brainstorm as many reasons as possible why they might be woken up at 4:00 AM because their customers are experiencing an outage that they must mitigate. We like to start with a hypothetical question: “Imagine you were away on vacation for two weeks and came back hearing that your service had an outage. Before you find out what the contributing factors were for that outage, attempt to list as many things as possible you think could be the likely causes.” Each of those was captured as a risk, along with the triggers that would cause the outage, the impact on customers, and the likelihood of it happening.
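A risk captured this way could be represented with a small record like the one below. This is a minimal sketch, not the schema used inside Azure: the `Risk` and `Level` names, the three-point scale, and the impact-times-likelihood `priority` score are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Level(IntEnum):
    """Hypothetical three-point scale for impact and likelihood."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Risk:
    """One premortem finding: what could go wrong, what sets it off,
    how badly customers are hurt, and how likely it is."""
    service: str
    description: str
    triggers: List[str]
    customer_impact: Level
    likelihood: Level

    def priority(self) -> int:
        # Assumed scoring rule: impact multiplied by likelihood.
        return int(self.customer_impact) * int(self.likelihood)


def rank(risks: List[Risk]) -> List[Risk]:
    """Return risks ordered from highest to lowest priority."""
    return sorted(risks, key=lambda r: r.priority(), reverse=True)


# Hypothetical premortem output for two made-up services.
findings = [
    Risk("svc-a", "TLS certificate expires unnoticed",
         ["certificate reaches expiry date"], Level.HIGH, Level.LOW),
    Risk("svc-b", "Single instance of the config store",
         ["zone failure", "host reboot"], Level.HIGH, Level.HIGH),
]
ordered = rank(findings)
```

Keeping triggers as a list reflects that one risk can be set off by several different events; any real scoring scheme would need to be agreed on by the teams using it.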

After we identified the risks, we needed to have an action plan to address them.

We created “Risk Threat Models”

If premortem is the searching and fearless inventory of risks, the risk threat model is what combines that with the action plan. If your team is doing failure mode analysis or regular risk reviews, this effort should be straightforward. The risk threat model takes your list of risks and builds in what you expect to do to reduce customer pain.

After we identified the risks, we needed to have a common understanding of the right fixes so that those patterns could be used everywhere there was a similar risk. If those fixes were long-term, we asked ourselves, “What can we do in the meantime to reduce the likelihood or impact to our customers?” Here are some of the questions we asked:

What telemetry exists to recognize the risk when it is triggered?
What if this happens tomorrow? What guardrails or processes are in place until the final fix is completed?
How long does it take to do this interim mitigation and is it automated? Can it be automated? Why not?
Has the mitigation been tested? When was the last time you tested it?
What have you done to ensure you have added the most resiliency possible, for instance, if your critical dependencies have an outage? Have you worked with that dependency to ensure you are following their guidance and using their latest libraries?
Are there any places where you can ask for an increase in COGS to be more resilient?
What mitigations are you not doing due to the cost or complexity or because you do not have the resources?

Risk Mitigation Plans were documented, aligned with common solutions, and tracked. Any teams that indicated the mitigations would need more developers or money to implement needed mitigations were provided a forum to ask for it in the context of resiliency and quality. Work items were created for all the tasks needed to make the service as resilient as possible and linked to the risks we documented so we could follow up on their completion.

But how do we know we are addressing the risk correctly? How do we know what the “right” mitigation strategies are? How could we reduce the amount of work it took, and prevent everyone from solving the same problem in a custom way?

We created a centralized repository of all risks across all services

Risk Threat Models are more effective when captured and analyzed centrally

If every team went off and analyzed their postmortems, conducted a premortem, created a risk threat model, and then documented everything separately, there is no doubt we would be able to reduce the number of outages we have. However, without a way for teams to share what they found with others, we would have missed opportunities to understand broader patterns in the risk data. Risks in one service were often risks in other services as well but, for one reason or another, they did not make it into the risk threat models of all the services that were at risk. We also would not have noticed that unfinished repairs or risks in one service are risks to other services. We would have missed the chance to document patterns that prevent a risk from appearing elsewhere when a new service is spun up. We would not have realized that the scope of a premortem should not be limited to individual service boundaries, but should instead span all the services that work together for a particular customer scenario. We would have missed opportunities to propose common mitigation strategies and invest in broad efforts to address risks at scale using the same mitigation plan. In short, we would have missed many opportunities to inform numerous investments across service boundaries.
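One thing a central repository makes possible is asking which open risks a service inherits from the services it depends on. The sketch below illustrates that idea under stated assumptions: the `inherited_risks` function, the shape of the dependency map, and the service names are all hypothetical and do not describe Azure’s internal tooling.

```python
from typing import Dict, List, Set


def inherited_risks(
    service: str,
    depends_on: Dict[str, List[str]],
    open_risks: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Collect open risks from every direct and transitive dependency.

    depends_on maps a service to the services it calls.
    open_risks maps a service to its documented, unmitigated risks.
    Returns a map from dependency name to that dependency's open risks.
    """
    seen: Set[str] = set()
    stack = list(depends_on.get(service, []))
    inherited: Dict[str, List[str]] = {}
    while stack:
        dep = stack.pop()
        if dep in seen or dep == service:
            continue  # already visited, or a cycle back to the start
        seen.add(dep)
        if open_risks.get(dep):
            inherited[dep] = list(open_risks[dep])
        stack.extend(depends_on.get(dep, []))
    return inherited


# Hypothetical data: "web" calls "auth" and "storage"; "auth" calls "storage".
deps = {"web": ["auth", "storage"], "auth": ["storage"], "storage": []}
risks = {"storage": ["single-region metadata store"], "auth": []}
web_inherits = inherited_risks("web", deps, risks)
```

A lookup like this only works when every team records its risks in the same place and format, which is the argument for capturing them centrally.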

We implemented a common way of categorizing risks

In some cases, we knew we needed to understand what “types” of risks we were finding. Were they Single Points of Failure? Insufficiently configured throttling? To that end, we created a hierarchical system of shorthand “tags” that were used to describe categories of issues on which we wanted to focus. These tags were used for analyzing postmortems to identify common patterns, as well as marking individual risks so that we could better look across risks in the same categories to identify the right action plans.
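A hierarchical tag scheme of this kind can be sketched with slash-separated paths that roll up to their parent categories. The tag names below (`spof/...`, `throttling/...`) and the `/` separator are hypothetical; the post does not describe the actual tag vocabulary or format.

```python
from collections import Counter
from typing import Iterable, List


def ancestors(tag: str) -> List[str]:
    """Return the tag itself plus every parent, for example
    "spof/database/primary" -> ["spof", "spof/database", "spof/database/primary"]."""
    parts = tag.split("/")
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


def rollup(tags: Iterable[str]) -> Counter:
    """Count each tag toward itself and toward all of its parent categories."""
    counts: Counter = Counter()
    for tag in tags:
        counts.update(ancestors(tag))
    return counts


# Hypothetical tags attached to risks and postmortems across several services.
observed = [
    "spof/database",
    "spof/zone",
    "throttling/misconfigured",
    "spof/database",
]
totals = rollup(observed)
```

Rolling counts up the hierarchy lets a reviewer see both the broad picture (how many single-point-of-failure risks exist overall) and the specific subcategory that dominates it.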

We had regular reviews of the Risk Threat Models

Having the completed Risk Threat Models enabled us to schedule reviews in front of senior leadership, architects, members of the dedicated cross-Azure Quality Team, and others. These meetings were more than just reviews, they provided an opportunity to come together as a diverse team to identify areas for which we needed common solutions, mitigations, and follow-up actions. Action items were collected, owners assigned, and risks were then linked with the action plans so we could follow up down the road to determine how teams progressed.

It all came together: time to take it to the next level and do this for hundreds of services!

In summary, it took more than just spinning up a program to identify and document risks. We needed to inspire, but also have the right processes in place to get the most out of that effort. It took coordination across many programs and the creation of many others. It took a lot of cross-service-team communication and commitment.

Accelerating the resiliency threat modeling program has already yielded many benefits for our critical Azure services, so we will be expanding this process to cover every service in Azure. To this end, we are continuously refining our process, documentation, and guidance, as well as leveraging past risk discussions to address new risks. Yes, this is a lot of work, there is no silver bullet, and we are still bringing more and more resources into this effort, but when it comes to reliability, we believe in “go big”!
