A central truth in cloud computing is that failure is inevitable. As systems scale, we expect nodes to fail ungracefully in random and unexpected ways, networks to experience sudden partitions, and messages to be dropped at any time.
Rather than fight this truth, we embrace it. We plan for failure and design our systems to be fault-tolerant, resilient, and self-stabilizing. But once we’ve finished designing and building, how do we verify that our beautiful, fault-tolerant systems actually react to failures as they should?
Functional tests can only do so much. Distributed systems are complex ecosystems of moving parts. Each component is subject to failure, and more than that, its interactions with other system components will also have their own unique failure modes. We can sit around and armchair-theorize all we like about how these components will respond to imagined failures, but finding every possible combination of failure is just not feasible. Even if we do manage to exhaustively account for every failure mode our system can encounter, it’s not sustainable or practical to re-verify system responses in this way every time we make a change to its behavior.
Chaos Engineering
Azure Search uses chaos engineering to solve this problem. The term, coined by Netflix in an excellent recent blog post, refers to the practice of building infrastructure that enables controlled, automated fault injection into a distributed system. To accomplish this, Netflix created the Simian Army, a collection of tools (dubbed “monkeys”) that inject failures into customer-facing services.
Inspired by the first member of the Netflix Simian Army, Chaos Monkey, we’ve created our own “Search Chaos Monkey” and set it loose to wreak havoc against a test environment.
The environment contains a search service that is continuously and randomly changing topology and state. Service calls are made against this service on a regular basis to verify that it is fully operational.
Even just setting up this target environment for the Search Chaos Monkey to play in has been incredibly helpful in smoking out issues with our provisioning and scaling workflows. When the Search Chaos Monkey is dormant, we expect the test service to operate smoothly. Any errors coming from it can therefore be assumed to be caused by bugs in existing workflows or false positives from the alerting system. We’ve caught several bugs this way before they had a chance to escape into production.
Quantifying Chaos
After the test service was stabilized, we unleashed the Search Chaos Monkey and gave it some tools of destruction to have fun with. It runs continuously, and at regular intervals it randomly chooses an operation to run in its test environment.
The set of possible operations the monkey can choose from depends on the level of chaos we’ve told it to cause in our test environment.
Low chaos refers to failures that our system can recover from gracefully with minimal or no interruption to service availability. Accordingly, while the Search Chaos Monkey is set to run only low chaos operations, any alerts raised from the test service are considered to be bugs.
Medium chaos failures can also be recovered from gracefully, but may result in degraded service performance or availability, raising low priority alerts to engineers on call.
High chaos failures are more catastrophic and will interrupt service availability. These will cause high priority alerts to be sent to on-call engineers and often require manual intervention to fix.
High chaos operations are important for ensuring that our system can fail in a graceful manner while maintaining the integrity of customer data. Along with medium chaos operations, they also function as negative tests that verify alerts are raised as expected, enabling engineers to respond to the problem.
All operations can also be run on demand with a manual call to the Search Chaos Monkey.
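To make this concrete, here is a minimal sketch of how such a scheduler could be put together. The names (SearchChaosMonkey, register, run_forever) and the choice of Python are purely illustrative rather than taken from our actual implementation; the point is only that each fault-injection operation is a callable registered under a chaos level, so the same operation can be scheduled continuously or triggered on demand.

```python
import random
import time
from enum import Enum

class ChaosLevel(Enum):
    LOW = 1      # graceful recovery, no alert expected
    MEDIUM = 2   # degraded performance or availability, low priority alert
    HIGH = 3     # interrupted availability, high priority alert

class SearchChaosMonkey:
    """Illustrative scheduler sketch, not the real implementation."""

    def __init__(self, max_level, interval_seconds):
        self.max_level = max_level                # the level of chaos we allow
        self.interval_seconds = interval_seconds
        self.operations = {level: [] for level in ChaosLevel}

    def register(self, level, operation):
        """Register a fault-injection callable under a chaos level."""
        self.operations[level].append(operation)

    def run_on_demand(self, operation):
        """Any operation can also be triggered manually with a single call."""
        operation()

    def run_forever(self):
        """At a regular interval, pick a random operation at or below the
        configured chaos level and run it against the test service."""
        while True:
            allowed = [op
                       for level in ChaosLevel
                       if level.value <= self.max_level.value
                       for op in self.operations[level]]
            if allowed:
                random.choice(allowed)()
            time.sleep(self.interval_seconds)
```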
Downgrading Failures from Exceptional to Expected
These levels of chaos provide a systematic and iterative path for an error to become incorporated into our infrastructure as a known and expected failure.
Any failure that we encounter or add to our system is assigned a level of chaos to measure how well our system reacts to it. The failure could be a theoretical one that we want to train our system to handle, a bug found by an existing chaos operation, or a repro of a failure experienced by a real search service. In any case, we will ideally have first automated the new failure so that it can be triggered by a single call to the Search Chaos Monkey.
In addition to the low, medium and high levels described above, the failure can be classified as causing an “extreme” level of chaos.
Extreme chaos operations are failures that cause ungraceful degradation of the service, result in data loss, or simply fail silently without raising alerts. Since we cannot predict what state the system will be in after running an extreme chaos failure, an operation with this designation is not given to the Search Chaos Monkey to run on a continuous basis until a fix is added to downgrade it to high chaos.
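In terms of the sketch above, extreme chaos would be a fourth classification level that the continuous loop simply refuses to schedule; again, the names are illustrative.

```python
from enum import Enum

class ChaosLevel(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    EXTREME = 4   # classified and automated, but never scheduled continuously

def is_schedulable(level: ChaosLevel, max_level: ChaosLevel) -> bool:
    # Extreme failures stay on-demand only until a fix downgrades them to high chaos.
    return level is not ChaosLevel.EXTREME and level.value <= max_level.value
```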
Driving the chaos level as low as possible is the goal for every failure we add to our system, extreme or otherwise.
If it’s an extreme chaos failure that puts our system on the floor, then it’s essential that we at least enable our infrastructure to preserve service integrity and customer data so that the failure can be downgraded to high chaos. Sometimes this means sacrificing service availability until a real fix can be found.
If it’s a high chaos failure that requires a human to mitigate, we’ll try to enable our infrastructure to auto-mitigate as soon as the error is encountered, removing the need to contact an engineer with a high priority alert.
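One way to picture that shift, with purely hypothetical handler names: the failure’s handler attempts a registered automated mitigation first, and only falls back to paging an engineer when no mitigation exists or the mitigation itself fails.

```python
def handle_failure(failure_name, auto_mitigations, page_engineer):
    """Sketch of downgrading a high chaos failure: try the automated
    mitigation first; only raise a high priority alert if that fails."""
    mitigation = auto_mitigations.get(failure_name)
    if mitigation is not None:
        try:
            mitigation()
            return            # mitigated automatically; nobody is woken up
        except Exception:
            pass              # fall through and page a human
    page_engineer(failure_name)
```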
We like it any time the Search Chaos Monkey breaks our system and finds a bug before customer services are affected, but we especially like it when the monkey uncovers high and extreme level chaos failures. It’s much more pleasant to fix a bug at leisure before it is released than it is to fix it in a bleary-eyed panic after receiving a middle-of-the-night phone call that a customer’s service is down.
Medium chaos errors are also welcome, if not as urgent. If the system is already capable of recovery, we can try to improve early detection so steps can be taken before engineers are notified that service availability is impacted. The less noise the engineers on call need to deal with, the more effective they can be at responding to real problems.
Automation is the key to this process of driving down chaos levels. Being able to trigger a specific failure with minimal effort allows us to loosely follow a test-driven development approach towards reduction of chaos levels. And once the automated failure is given to the Search Chaos Monkey, it can function as a regression test to ensure future changes to our system do not impact its ability to handle failure.
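Because every automated failure can be triggered with a single call, it doubles naturally as a regression test. Here is a sketch of what such a check might look like, reusing the illustrative monkey interface from earlier and assuming a hypothetical service test client with is_healthy() and pending_alerts() helpers:

```python
import time

def verify_low_chaos_recovery(monkey, service, operation, timeout_s=300):
    """Regression check for a failure that has been driven down to low chaos:
    trigger it with a single call, then confirm the service recovers on its
    own without raising any alerts."""
    monkey.run_on_demand(operation)        # the single call that injects the failure

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if service.pending_alerts():       # any alert means this is no longer low chaos
            raise AssertionError("alerts were raised during a low chaos operation")
        if service.is_healthy():
            return                         # recovered silently, as expected
        time.sleep(5)
    raise AssertionError(f"service did not recover within {timeout_s} seconds")
```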
Chaos Engineering in Action
To illustrate how this works, here’s a recent example of a failure that was driven from extreme chaos to low chaos using this model.
Extreme chaos: Initial discovery. A service emitted an unexpected low-priority alert. Upon further investigation, it turned out to be a downstream error signifying that at least one required background task was not running. Initial classification put this error at extreme chaos: it left the cluster in an unknown state and didn’t alert correctly.
High chaos: Mitigation. At this point, we were not able to automate the failure since we were not aware of the root cause. Instead, we worked to drive the failure down to a level of high chaos. We identified a manual fix that worked, but impacted availability for services without replicas. We tuned our alerting to the correct level so that the engineer on call could perform this manual fix any time the error occurred again. Two unlucky engineers were woken up to do so by high priority alerts before the failure was fixed.
Automation. Once we were sure that our customer services were safe, we focused our efforts on reproducing the error. The root cause turned out to be unexpected stalls in external calls, which were impacting unrelated components. Finding this required adding fault injection to introduce artificial latency in the component making the calls.
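The general shape of that kind of fault injection is a wrapper that stalls an external call some fraction of the time. The decorator below is a minimal sketch with made-up probability and delay values, not our actual tooling.

```python
import random
import time
from functools import wraps

def inject_latency(probability=0.1, delay_seconds=30.0):
    """Fault-injection decorator: with some probability, stall an external call
    long enough to expose components that cannot tolerate the stall."""
    def decorator(call):
        @wraps(call)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_seconds)   # simulate a stalled external dependency
            return call(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(probability=0.25, delay_seconds=10.0)
def fetch_external_state():
    ...   # stands in for the real external call being exercised
```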
Low chaos: Fix and verification. After the root cause was identified, the fix was straightforward. We decoupled the component experiencing latency in its calls from the rest of the system so that any stalls would only affect that component. Some redundancy was introduced into this component so that its operation was no longer impacted by latency or even isolated stalls, but only by prolonged and repeated stalls (a much rarer occurrence).
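One way to picture that fix: the external call runs on its own worker with a timeout, and only repeated consecutive stalls, not a single isolated one, are surfaced as a failure of the component. This is a minimal sketch with illustrative timeout and threshold values, not the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

class DecoupledCaller:
    """Isolate external calls so a stall cannot block the rest of the system;
    only prolonged, repeated stalls (a much rarer occurrence) count as failure."""

    def __init__(self, call, timeout_s=5.0, max_consecutive_stalls=3):
        self._call = call
        self._timeout_s = timeout_s
        self._max_stalls = max_consecutive_stalls
        self._stalls = 0
        self._pool = ThreadPoolExecutor(max_workers=1)

    def invoke(self, *args, **kwargs):
        future = self._pool.submit(self._call, *args, **kwargs)
        try:
            result = future.result(timeout=self._timeout_s)
            self._stalls = 0                # a successful call resets the stall count
            return result
        except TimeoutError:
            self._stalls += 1
            if self._stalls >= self._max_stalls:
                raise                       # repeated stalls surface as a real failure
            return None                     # isolated stall: degrade quietly
```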
We were able to use the automated chaos operation to prove that the original failure was now handled smoothly without any problems. The failure that used to wake up on-call engineers to perform a potentially availability-impacting fix could now be downgraded to a low chaos failure that our system could recover from with no noise at all.
At this point, the automated failure could be handed off to the Search Chaos Monkey to run regularly as a low chaos operation and continually verify our system’s ability to handle this once-serious error.
Chaos Engineering and the Cloud
At Azure Search, chaos engineering has proven to be a very useful model to follow when developing a reliable and fault-tolerant cloud service. Our Search Chaos Monkey has been instrumental in providing a deterministic framework for finding exceptional failures and driving them to resolution as low-impact errors with planned, automated solutions.