For information about how to configure Azure Storage Client Library (SCL) retry policies, SCL 2.0 – Implementing Retry Policies by Gaurav Mantri is excellent. But if you want practical guidance on what retry policy settings to use, that’s harder to find. This post offers some recommendations based on one Microsoft team’s actual experience using the SCL in high-load scenarios (for low-traffic scenarios, the default retry policies are fine).
ExponentialRetry vs. LinearRetry
For a batch process where you’re not concerned about keeping response times short, the ExponentialRetry class sounds like the easy call at first. You want to retry quickly to make sure that you clear a transient error as fast as possible but you don’t want to hammer the server, thereby causing more trouble for an already sick service. And the Azure Storage team continues to tweak the policy to make it more intelligent and provide the best overall performance.
However, consider the impact on your ability to track the quality of your connection to the Storage service. If you use ExponentialRetry with very long timeouts and many retries, you’ll avoid having to handle exceptions for most transient errors, but you won’t know if they’re happening frequently. You could track response times, but a slow response alone won’t tell you whether transient errors are the cause.
One solution is to use OperationContext.RequestResults, which contains results for every operation that was executed by the client library under the covers. OperationContext also provides end-to-end tracing that can be useful for diagnosing issues in a distributed system. If you want to be notified about retries you can use a new event called OperationContext.Retrying. Unfortunately, there is no documentation that shows examples of how to use OperationContext.
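As a starting point, here is a minimal sketch that uses both the Retrying event and RequestResults. It assumes a library version that exposes OperationContext.Retrying and the synchronous CloudBlockBlob.Delete overload with optional parameters. The names RetryDiagnostics and DeleteAndCountRequests are hypothetical, and the sketch has not been run against a live storage account.

```csharp
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class RetryDiagnostics
{
    // Deletes a blob and returns how many requests the library sent.
    // A return value of 1 means the first attempt succeeded with no retries.
    public static int DeleteAndCountRequests(CloudBlockBlob blob)
    {
        var context = new OperationContext();

        // Raised each time the retry policy decides to try again.
        context.Retrying += (sender, args) =>
            Console.WriteLine("Retrying after HTTP {0}",
                args.RequestInformation.HttpStatusCode);

        try
        {
            blob.Delete(operationContext: context);
        }
        finally
        {
            // One RequestResult is recorded per request sent, retries included,
            // so this log is written even when the operation finally fails.
            foreach (RequestResult result in context.RequestResults)
            {
                Console.WriteLine("{0:o} to {1:o}: HTTP {2}",
                    result.StartTime, result.EndTime, result.HttpStatusCode);
            }
        }

        return context.RequestResults.Count;
    }
}
```

Logging the request count per operation gives you a rough measure of how often transient errors occur, even when the retry policy hides them from the caller.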
Another option if you want more diagnostic information is to use the LinearRetry class with a relatively short retry interval and just a few retries so that it will fail fairly fast. Then you can catch the exceptions and implement your own backoff while still reporting the failure. Note that the backoff is really important if you want most requests to eventually succeed.
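The fail-fast approach can be sketched roughly as follows. This is an illustration, not the code we ran: FailFastRetry and its parameters are hypothetical names, the doubling backoff is just one reasonable choice, and the helper deliberately has no dependency on the SCL so that it can wrap any call. In practice you would pair it with something like a LinearRetry of a second or so and a few attempts on the request itself, and catch StorageException rather than Exception.

```csharp
using System;
using System.Threading;

public static class FailFastRetry
{
    // Runs the operation. On failure it reports the exception, waits, and
    // tries again. The wait doubles after each failed attempt.
    public static T Run<T>(
        Func<T> operation,
        int maxAttempts,
        TimeSpan initialDelay,
        Action<int, Exception> reportFailure,
        Action<TimeSpan> wait = null)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException("maxAttempts");
        }

        // The wait action is injectable so the helper can be tested without sleeping.
        wait = wait ?? (d => Thread.Sleep(d));
        TimeSpan delay = initialDelay;

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex) // assumption: narrow this to StorageException in real code
            {
                reportFailure(attempt, ex);   // every failure is visible to the caller
                if (attempt >= maxAttempts)
                {
                    throw;                    // out of attempts: rethrow the last error
                }
                wait(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

Because reportFailure is called on every failed attempt, you can count or log transient errors while the backoff still gives most requests a chance to succeed.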
The IRequestOptions interface also includes a MaximumExecutionTime property. This value limits the total time that can be spent on all retries. Depending on the type of operation that you are performing this may need to be very large, as big operations can take a while to fail. In high load conditions with requests for big operations, we found that values below 10 seconds resulted in a lot of failures. Setting MaximumExecutionTime to 60 seconds avoided exceptions. This works well for a background process; for customer-facing scenarios you’ll need to tune differently.
We found that the ServerTimeout and maximum number of retries values had less impact. We set them to five seconds and ten retries and that worked fine. Again, this is for a background process in which we care more about eventual success than fast response times. Also, this would not work in all scenarios: if your application is downloading 1 TB blobs, for example, 5 seconds wouldn't be long enough. Another option if you don't want timeouts is to set ServerTimeout to null. Starting in Storage Client Library 4.0, null will be the default value.
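Put together, the background-process settings described above might look like the sketch below. The 60-second, five-second and ten-retry values come from this post. The choice of ExponentialRetry and its 2-second backoff are assumptions made only to complete the example, and BackgroundJobOptions is a hypothetical name.

```csharp
using System;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.RetryPolicies;

public static class BackgroundJobOptions
{
    // Request options tuned for eventual success rather than fast response.
    public static BlobRequestOptions Create()
    {
        return new BlobRequestOptions
        {
            // Ten retries; the 2-second backoff is an assumed value.
            RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(2), 10),

            // Per-request server timeout; set to null to disable it.
            ServerTimeout = TimeSpan.FromSeconds(5),

            // Upper bound on the whole operation, including all retries.
            MaximumExecutionTime = TimeSpan.FromSeconds(60)
        };
    }
}
```

The options object can then be passed to an individual call, for example cloudBlob.Delete(options: BackgroundJobOptions.Create()).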
Avoid Unnecessary Work
For some operations the SCL API provides IfExists methods that you can use to avoid exceptions. For example:
    foreach (IListBlobItem blobItem in this.BlobList())
    {
        CloudBlockBlob cloudBlob = (CloudBlockBlob)blobItem;
        cloudBlob.DeleteIfExists();
    }
This looks like good defensive programming, and it is, but it is also added traffic and an additional chance to fail. In our stress testing it failed often. And it’s unnecessary if you know that the item exists. Changing the code to call Delete instead of DeleteIfExists made the operation perform much better and fail less often. So it’s best to use known information to reduce traffic and provide fewer chances to fail.
Even with a generous retry policy, sometimes errors will persist long enough for you to get an exception. The Azure Storage Client framework does a good job of making sure that these will be either StorageException or System.AggregateException.
Also, the retry policy classes do not retry on 4xx status codes. There are a few others as well (currently 306, 501 and 505). These codes represent situations that are not transient and that you need to deal with. Common examples are 404 (not found) and 409 (conflict). If you write a custom retry policy, make sure that you check for these situations.
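If you do write a custom policy, the status code check might look like the following sketch. RetryRules and IsRetryable are hypothetical names, and the list of codes simply mirrors the ones named above, so verify it against the library version you use. A custom IRetryPolicy implementation could call a helper like this from its ShouldRetry method before computing a retry interval.

```csharp
public static class RetryRules
{
    // Returns false for status codes that are not transient according to
    // the list in this post: every 4xx code, plus 306, 501 and 505.
    public static bool IsRetryable(int statusCode)
    {
        bool clientError = statusCode >= 400 && statusCode < 500;
        bool otherPermanent =
            statusCode == 306 || statusCode == 501 || statusCode == 505;

        return !(clientError || otherPermanent);
    }
}
```

With this check, a 404 or 409 surfaces immediately to your error handling instead of being retried, while a 500 or 503 remains eligible for retry.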
Wrapper Library Unnecessary
We started our experience with the Azure Storage Client planning to design a wrapper library that did retries the way we wanted them done. In the end that turned out to be unnecessary. We are still looking at writing libraries for our business logic to centralize our retry tuning and error handling, but they will be based on the Azure Storage Client code.
Thanks to Allen Prescott for doing the testing and providing the content for this post.