In the world of cloud database services, few things are more important to customers than having uninterrupted access to their data. In industries like online gaming and financial services that experience high transaction rates, even the smallest interruptions can potentially impact the end-user’s experience. Azure SQL Database is evergreen, meaning that it always has the latest version of the SQL Engine, but maintaining this evergreen state requires periodic updates to the service that can take the database offline for a second. For this reason, our engineering team is continuously working on innovative technology improvements that reduce workload interruption.
Figure 1 – This is what hot patching looks like under the covers. If you’re interested in the low-level details, see our technical blog post.
The SQL Engine we are running in Azure SQL Database is the very latest version of the same engine customers run on their own servers, except we manage and update it. To update SQL Server or the underlying infrastructure (i.e., Azure Service Fabric or the operating system), we must stop the SQL Server process. If that process hosts the primary database replica, we move the replica to another machine, requiring a failover.
During a failover, the database may be offline for a second and still meet our 99.995 percent SLA. However, failover of the primary replica impacts workload because it aborts in-flight queries and transactions. We built features such as resumable index (re)build and accelerated database recovery to address these situations, but not all running operations are automatically resumable. It may be expensive to restart complex queries or transactions that were aborted due to an upgrade. So even though failovers are quick, we want to avoid them.
SQL Server and the overall Azure platform invests significant engineering effort into platform availability and reliability. In SQL database, we have multiple replicas of every database. During upgrade, we ensure that hot standbys are available to take over immediately.
We’ve worked closely with the broader Azure and Service Fabric teams to minimize the number of failovers. When we first decide to fail over a database for upgrade, we apply updates to all components in the stack at the same time: OS, Service Fabric, and SQL Server. We have automatic scheduling that avoids deploying during an Azure region’s core business hours. Just before failover, we attempt to drain active transactions to avoid aborting them. We even utilize database workload patterns to perform failover at the best time for the workload.
Even with all that, we don’t get away from the fact that to update SQL Engine to a new version, we must restart the process and failover the database’s primary replica at least once. Or do we?
Hot patching and results
Hot patching is modifying in-memory code in a running process without restarting the process. In our case, it gives us the capability to modify C++ code in SQL Engine without restarting sqlservr.exe. Since we don’t restart, we don’t failover the primary replica and interrupt the workload. We don't even need to pause SQL Server activity while we patch. Hot patching is unnoticed by the user workload, other than the patch payload, of course!
Hot patching does not replace traditional, restarting upgrades – it complements them. Hot patching currently has limitations that make it unsuitable when there are a large number of changes, such as when a major new feature is introduced. But it is perfect for smaller, targeted changes. More than 80 percent of typical SQL bug fixes are hot patchable. Benefits of hot patching include:
- Reduced workload disruption - No restart means no database failover and no workload impact.
- Faster bug fixes - Previously, we weighed the urgency of a bug fix vs. impact on customer workloads from deploying it. Sometimes we would deem a bug fix not important enough for worldwide rollout because of the workload impact. With hot patching, we can now deploy bug fixes worldwide right away.
- Features available sooner - Even with the 500,000+ functional tests that we run several times per day and thorough testing of every new feature, sometimes we discover problems after a new feature has been made available to customers. In such cases, we may have to disable the feature or delay go-live until the next scheduled full upgrade. With hot patching, we can fix the problem and make the feature available sooner.
We did the first hot patch in production in 2018. Since then, we have hot patched millions of SQL Servers every month. Hot patching increases SQL Database ship velocity by 50 percent, while at the same time improving availability.
How hot patching works
For the technically interested, see our technical blog post for a detailed explanation of how hot patching works under the covers. Start reading at section three.
Closing words and next steps
With the capability in place, we are now working to improve the tooling and remove limitations to make more changes hot patchable with quick turnaround. For now, hot patching is only available in Azure SQL Database, but some day it may also come to SQL Server. Let us know via SQLDBArchitects@microsoft.com if you would be interested in that.
Please leave comments and questions below or contact us on the email above if you would like to see more in-depth coverage of cool technology we work on.