
Cloud Service Fundamentals – Introduction into Fault-Tolerant Data Access Layer


Editor’s Note: This article was written by Valery Mizonov from the AzureCAT team.

In our previous Telemetry – Application Instrumentation blog post we highlighted how the Cloud Service Fundamentals in Windows Azure code project addresses building a fault-tolerant data access layer that can sustain transient failures in the Windows Azure SQL Database service. We provided a few best practices for instrumenting your own applications and gave you some hints on increasing the reliability and resiliency of your solution.

In this post, we are going to drill down into what building a resilient data access layer means in real-life terms and how we approached this critical requirement in CSFundamentals.

Building distributed cloud-based applications and services can be challenging because of the pitfalls of contemporary cloud infrastructures, such as resource throttling, communication failures, and volatility in platform service behavior, whether “by design” or “by abnormality”. A natural solution to this class of problems is to detect potential transient failures and retry the failed operations, while keeping your finger on the pulse of the application-specific Service Level Agreements (SLAs).

Making intelligent decisions on whether to retry on failures involves building a knowledge base of known transient faults and their identifiers (error codes or exception types). This is a tedious and time-consuming effort. Fortunately, Windows Azure developers can bypass that effort altogether and focus on the things that matter most, because the knowledge and intelligence needed to handle transient faults is implemented by the Enterprise Library’s Transient Fault Handling Application Block, which makes it easy to increase the resilience of cloud-based services. The Transient Fault Handling Application Block consists of an easy-to-use API and a knowledge base containing the Windows Azure-related transient faults currently known.
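To make the idea of a transient fault knowledge base concrete, here is a minimal, language-neutral sketch written in Python. It is not the API of the Transient Fault Handling Application Block, which is a .NET library. The `DatabaseError` class and the `is_transient` function are hypothetical names introduced for illustration, and the error numbers are only a small subset of codes commonly treated as transient for Windows Azure SQL Database, not the full list the block maintains.

```python
# Illustrative subset of error numbers often treated as transient.
# Assumption: this is NOT the complete list kept by the application block.
TRANSIENT_SQL_ERROR_NUMBERS = {
    40501: "service is busy (throttling)",
    40613: "database is currently unavailable",
    40197: "service error while processing the request",
    10053: "connection aborted by software on the host",
    10054: "connection forcibly closed by the remote host",
    10060: "connection attempt timed out",
}


class DatabaseError(Exception):
    """Hypothetical error type carrying a numeric SQL error code."""

    def __init__(self, number, message=""):
        super().__init__(message or f"SQL error {number}")
        self.number = number


def is_transient(error):
    """Return True if the error matches a known transient fault."""
    if isinstance(error, TimeoutError):
        return True  # classified by exception type
    if isinstance(error, DatabaseError):
        return error.number in TRANSIENT_SQL_ERROR_NUMBERS  # classified by error code
    return False
```

With a classifier like this, a caller retries only when `is_transient(exc)` is true and lets every other failure (for example, a constraint violation) surface immediately.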

The Cloud Service Fundamentals solution extensively utilizes the Transient Fault Handling Application Block to guard its data access layer from throttling conditions and other transient errors that may occur in the Windows Azure SQL Database service.

When building the data access layer in CSFundamentals, we followed these important best practices from a transient fault handling perspective:

  • Use the latest release of the Transient Fault Handling Application Block from NuGet to take advantage of the most recent updates to the transient fault knowledge base and detection logic.
  • Determine the proper boundaries that may become exposed to transient failures and ensure these boundaries are idempotent should the application logic need to be retried upon encountering a transient fault.
  • Log all retry attempts and transient failures (including decoded throttling reasons) to enrich application telemetry with insights into these failures and facilitate troubleshooting.
  • Think in business SLA terms (not in specific retry counts and delay intervals) by enforcing time-bounded behavior of the retry logic that recognizes and appreciates the performance SLA aspect of a given end-to-end business transaction.
  • Manage the retry policies in configuration and avoid hard-coding retry strategy parameters to achieve a greater agility to support changes at runtime.
  • Apply retry logic to both SQL connections and commands, and not “either or”, keeping in mind that certain SQL commands cannot be safely retried unless these are guarded by transactions.
  • Expect long-lasting (potentially unrecoverable) transient failures and be mindful of convoy effects on recovery from them. Handle the retry pattern appropriately, for example, by exponentially increasing the delay interval between retry attempts.
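Several of the practices above (time-bounded retries, policy parameters kept out of the code path, logging of every attempt, and exponentially increasing delays) can be combined in one small helper. The following is a sketch in Python under stated assumptions, not the CSFundamentals implementation or the Transient Fault Handling Application Block API. `TransientError`, `RetryPolicy`, and `execute_with_retry` are hypothetical names, the default values are arbitrary, and the small random jitter is an addition intended to reduce convoy effects.

```python
import logging
import random
import time
from dataclasses import dataclass

log = logging.getLogger("retry")


class TransientError(Exception):
    """Hypothetical marker for faults already classified as transient."""


@dataclass(frozen=True)
class RetryPolicy:
    # In a real service these values would be loaded from configuration.
    time_budget_s: float = 30.0   # overall SLA budget for the operation
    initial_delay_s: float = 0.5  # delay before the first retry
    max_delay_s: float = 8.0      # cap on any single delay
    backoff_factor: float = 2.0   # multiplier applied after each retry


def execute_with_retry(operation, policy, sleep=time.sleep, clock=time.monotonic):
    """Run an idempotent operation, retrying transient faults within the time budget."""
    deadline = clock() + policy.time_budget_s
    delay = policy.initial_delay_s
    attempt = 1
    while True:
        try:
            return operation()
        except TransientError as exc:
            # Up to 10% random jitter so recovering clients do not retry in lockstep.
            wait = delay * (1.0 + random.uniform(0.0, 0.1))
            if clock() + wait > deadline:
                log.error("giving up after %d attempts: %s", attempt, exc)
                raise
            log.warning("attempt %d failed (%s); retrying in %.2fs", attempt, exc, wait)
            sleep(wait)
            delay = min(delay * policy.backoff_factor, policy.max_delay_s)
            attempt += 1
```

The helper is driven by a time budget rather than a retry count, so the caller states the SLA and the number of attempts follows from it. Exceptions other than `TransientError` propagate on the first failure, and only idempotent units of work should be passed in as `operation`.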

To read more about this topic, see the “Cloud Service Fundamentals Data Access Layer – Transient Fault Handling” wiki article and learn more about how we tolerate database service failures and how you can leverage similar techniques and best practices in building great cloud applications.

Thank you for taking the time to read this blog post. In the next article in this series we will explore the reporting functionality that we have implemented to project the telemetry from an instrumented application by using Windows Azure SQL Reporting service and its data visualization capabilities.