Azure Automation: Reliable, Fault-Tolerant Runbook Execution Using Checkpoints

As an Azure Automation runbook author, you want runbooks that execute reliably in the face of errors, exceptions, network issues, and crashes.  Azure Automation helps you with this.  It is built on Windows PowerShell Workflow, which supports checkpointing: the ability to persist the state of a workflow so that if it is interrupted, it can later be resumed at or near the interruption point.  Checkpointing is a powerful feature that you will want to leverage in your Automation runbooks.  Thoughtful use of checkpoints lets you create runbooks that reliably automate long-running processes, reliably access different networked systems, avoid repeating actions that are not idempotent or are expensive to repeat, and can be intentionally interrupted to include manual steps.

In this post I will talk about why, when, and how you should use checkpointing in your Automation runbooks.  There is existing information about checkpointing in PowerShell Workflow that you will want to brush up on to help with your understanding.

What is a Checkpoint?

A checkpoint is a snapshot of the current state of a runbook job, including the current values of variables, any output, and other serializable state information.  Each checkpoint gets saved to storage.  If a runbook is suspended, either intentionally or unintentionally, and then resumed, the workflow engine uses the data in the latest checkpoint to restore and resume the runbook.

Checkpointing in Azure Automation

In Azure Automation, when you persist a runbook job a checkpoint is created and then stored in the Azure Automation database.  Only the latest checkpoint for each job is stored in the database: each checkpoint replaces the previous.  If the runbook gets suspended and then resumed, the stored checkpoint will be used to restore and resume the runbook.

Unlike PowerShell Workflow, which stores checkpoints on the hard drive of the machine hosting the workflow session, Azure Automation stores checkpoints in the Azure Automation database.  Therefore, if the worker running your runbook crashes, the same worker (once restarted) or another worker can pick up the job and use the last checkpoint in the database to resume it.

Why Checkpoint?

Here are a few reasons to use checkpointing in your runbooks:

  • Assure that certain actions are not repeated
    • Checkpointing is useful for guaranteeing that non-repeatable actions (non-idempotent) are not repeated if a runbook crashes (suspends) and then resumes.  One example is to checkpoint a runbook right after creating a VM so that a duplicate VM would not be created if the runbook job were suspended and then resumed.
  • Protect long-running tasks
    • In the real world, errors happen. Long-running tasks with multiple steps are vulnerable to interruption due to network issues, machine reboots or crashes, timeouts, power outages, etc.  To avoid redoing expensive work, checkpoint the runbook at critical points, and assure that any runbook restarts do not redo that work.
  • Assure that long-running runbooks finish
    • Azure Automation has a feature called “fairshare”, where any runbook that runs for 30 minutes is unloaded to allow other runbooks to run.  Eventually, the unloaded runbook is reloaded and resumes execution from the last checkpoint taken in the runbook.  Thus, to guarantee that the runbook will eventually complete, you must place checkpoints so that no stretch of work between two checkpoints runs for 30 minutes or more.  (This forum post gives one example of the problem.)
  • Allow planned or manual interruptions
    • There are scenarios where you may want to intentionally suspend a running runbook.  Examples include suspending a runbook job in order to wait for approval to continue, or suspending a runbook job to wait for fixes to unexpected or planned system issues.

 

How to Add Checkpoints to a Runbook?

Checkpoint-Workflow Activity

The Checkpoint-Workflow activity (alias Persist) is a standard PowerShell Workflow activity and can be used in a runbook to create a checkpoint at a particular point.  The checkpoint is made at the point in the runbook where the Checkpoint-Workflow activity occurs.


Download-Updates
Reboot-VM
Checkpoint-Workflow
Email-Team
Checkpoint-Workflow

-PSPersist Activity Common Parameter

Whenever you call an activity, you can include the -PSPersist common workflow activity parameter.  Setting it to $True forces the creation of a checkpoint immediately after the activity completes.

…
Download-Updates
Reboot-VM -PSPersist $True
Email-Team -PSPersist $True
…

$PSPersistPreference Workflow Preference Variable

In a runbook, you can include the statement $PSPersistPreference = $True.  The effect of this is to cause a checkpoint to be taken after each activity which follows the preference statement.  If you set this preference at the start of the runbook, then a checkpoint will be made after each activity in the runbook.  You can turn off the automatic checkpointing by including the statement $PSPersistPreference = $False (which is the runbook default), after which activities will run without automatic checkpoints.

Note that for performance and strategic reasons, persisting after each activity may not be the best approach.  Each checkpoint requires processing to serialize the workflow state and store it in the database.  Also, there are scenarios (example later) where if the runbook is suspended you will want to repeat some activities. For these reasons this approach is not recommended.

…
Download-Updates
$PSPersistPreference = $True
Update-VM
Email-Team
$PSPersistPreference = $False
…

Suspend-Workflow Activity

When a runbook reaches the Suspend-Workflow activity, the runbook is checkpointed and then suspended immediately.  You would use this activity, for example, when the runbook needs to do some work and then wait for approval before continuing.  You “grant” that approval by resuming the runbook job.

…
Download-Updates
# Get permission to apply updates
Suspend-Workflow
# Continue if resumed
Reboot-VM -PSPersist $True
Email-Team -PSPersist $True
…

Where to Add Checkpoints

In general, it is best to be explicit about where you persist your workflow.  Rather than setting the $PSPersistPreference variable to get blanket checkpointing after each activity, it is typically better to be strategic and use the Checkpoint-Workflow or Suspend-Workflow activities, or the -PSPersist parameter, in those places in your workflow where persistence makes sense.  There are places where you definitely want to persist a workflow, and there are places where you definitely do not (examples below).  Also, keep in mind that persisting a workflow requires work from the system and will cost some workflow performance.

Best Practice:  You may want to add checkpoints in your workflow in these cases:

  • After any activity that you do not want to repeat (not idempotent).
  • After any long-running or expensive activity that you would not want to repeat due to cost.
  • Within any runbook that will run for longer than 30 minutes.  After 30 minutes the “fairshare” feature kicks in and temporarily unloads the runbook so that other runbooks can run.  Eventually, the system will reload the runbook and resume execution from the last checkpoint.  If you don’t add any checkpoints, then the runbook will resume from the beginning, and of course it will run into the fairshare limit again; this will happen over and over, and the runbook will never finish.
  • Before any activity that has a higher than normal probability of failing and suspending the workflow.  With a checkpoint just before it, the resumed workflow starts at that activity and repeats it, so the work gets done without redoing earlier steps.  Examples include activities that access remote systems and may be susceptible to network issues.
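
As a rough sketch of the fairshare case in the list above, assume a runbook that works through a long list of VMs.  Taking a checkpoint after each VM means that when fairshare unloads the job, it resumes at the next unprocessed VM instead of at the top of the list.  The runbook name, the $VMNames parameter, and the Update-VM activity are hypothetical placeholders and not part of any real module.

```powershell
workflow Update-AllVMs
{
    param([string[]] $VMNames)

    foreach ($name in $VMNames)
    {
        # Hypothetical long-running activity; stands in for your own work.
        Update-VM -Name $name

        # Checkpoint after each VM so that a fairshare unload (or a crash)
        # resumes with the next VM rather than restarting the whole list.
        Checkpoint-Workflow
    }
}
```

If a single pass through the loop is cheap, checkpointing every iteration may cost more than it saves; in that case you could checkpoint less often, as long as the work between two checkpoints stays well under 30 minutes.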

Best Practice:  You should not add checkpoints in these cases:

  • After work that you want to repeat if the workflow is suspended and resumed
  • After work that is idempotent and cheaper to repeat than creating a checkpoint
  • In InlineScript blocks (it is not allowed)

Illustrative Scenario: Update VM

  1. Download the latest patches from Windows Update
  2. Restart the VM to apply the patches
    • Checkpoint
  3. Email the team to report that updates were applied
    • Checkpoint

In this scenario, it is ok to repeat step 1 (since it is idempotent), but not steps 2 or 3. Thus, checkpoints are certainly needed after steps 2 and 3.  Automatically persisting after each activity would also work; however, adding a checkpoint after step 1 unnecessarily adds work to the system.

Illustrative Scenario: Notify Customers

  1. Get list of customers from database
  2. Email customers about new policy
    • Checkpoint
  3. Email management that customer email went out
    • Checkpoint

Sometimes you have groups of activities that you don’t want to repeat, but only if all activities in that group succeed.  In this scenario, steps 1 and 2 should always be run together, to assure that the list of customers retrieved is up to date when the email goes out.  That is why there is no checkpoint between steps 1 and 2, only after step 2.  Thus, if the runbook worker crashes before step 2 (sending the customer emails), when the runbook job resumes, we want it to start from step 1 again (retrieve customer list).  However, if there is a crash or suspension just before step 3, then we want to assure that step 2 is not repeated (don’t want to email the customers again).
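
The Notify Customers scenario might be laid out like the sketch below.  All three activities (Get-CustomerList, Send-CustomerEmail, Send-ManagementEmail) and the runbook name are hypothetical names made up for illustration; only Checkpoint-Workflow is a real workflow activity.

```powershell
workflow Send-PolicyNotice
{
    # Steps 1 and 2 form one unit: no checkpoint between them, so a resume
    # before the checkpoint below re-reads the customer list as well.
    $customers = Get-CustomerList                # hypothetical activity
    Send-CustomerEmail -Recipients $customers    # hypothetical activity

    # From here on, a resume will not email the customers again.
    Checkpoint-Workflow

    # Step 3
    Send-ManagementEmail                         # hypothetical activity
    Checkpoint-Workflow
}
```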

Best Practice:  It is important to note that you cannot add checkpoints within InlineScript blocks or functions in a workflow.  This is because the code in InlineScript blocks and functions runs as PowerShell script and not as PowerShell Workflow script.  Thus, in order to take advantage of workflow persistence, as a best practice you should split your runbook code into multiple modular activities to allow you to add checkpoints between activities, or if you need InlineScript then use multiple InlineScript blocks to allow checkpointing between them.
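
Here is a minimal sketch of that split, assuming a made-up runbook name and trivial placeholder work inside each block.  The checkpoint sits between the two InlineScript blocks, where the code is running as workflow again.

```powershell
workflow Invoke-TwoStageJob
{
    # Stage one runs as ordinary PowerShell script.
    # Checkpoint-Workflow is not allowed inside this block.
    $stageOne = InlineScript {
        "stage one result"
    }

    # Back in workflow code, so a checkpoint can be taken here.
    Checkpoint-Workflow

    # Stage two. $Using: passes the workflow variable into the block.
    InlineScript {
        $previous = $Using:stageOne
        Write-Output "Stage two received: $previous"
    }
}
```

If the job is interrupted during stage two, it resumes after the checkpoint with $stageOne restored, so stage one is not repeated.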

Suspending and Resuming Runbooks

Checkpointing and suspending/resuming runbooks go hand in hand.  You add checkpoints to a runbook so that if the runbook is suspended the runbook can be resumed from the latest checkpoint.

A runbook job in Azure Automation can be suspended in several ways:

  • Intentionally by the user in the Azure Automation portal UI
    • Using the Azure Automation portal UI you can select to suspend a running runbook job.
    • The job will be suspended at the next checkpoint.  If you have not authored any checkpoints into the runbook, then the runbook will continue running to the end, all the while showing a status of “Suspending”.
  • Intentionally by the user within a runbook using Suspend-Workflow
    • Include the Suspend-Workflow activity in a runbook.
    • The job will be checkpointed and then suspended at the place where Suspend-Workflow is called.
  • Intentionally by the user using the Suspend-AzureAutomationJob cmdlet
    • From a PowerShell script or workflow you can use the Suspend-AzureAutomationJob cmdlet to suspend a running Azure Automation runbook job.
    • The job will be suspended at the next checkpoint.  If you have not authored any checkpoints into the runbook, then the runbook will continue running to the end, all the while showing a status of “Suspending”.
  • Intentionally by the Azure Automation workflow engine when a runbook runs for longer than 30 minutes
    • When a running job runs for longer than 30 minutes the “fairshare” feature will kick in, and the runbook will be temporarily unloaded. The job status will be set as “Running, Waiting for Resources”.  Eventually, the runbook will be reloaded, and execution will start from the last checkpoint.
  • Unintentionally by the Azure Automation workflow engine after a runbook exception
    • When a running job throws an exception it will be unloaded from the runbook worker and its status will be set as “Suspended”.
  • Unintentionally due to a runbook worker crash
    • If a runbook worker crashes, the jobs that are running on that worker will terminate immediately.  The state of these jobs in the database will remain as “Running”. When the same or a replacement worker comes back online, the jobs will be picked up and continue from their last checkpoint.

A runbook job in Azure Automation can be resumed in several ways.  In all cases, the job will resume from the last checkpoint, or from the beginning if there is no checkpoint.

  • Manually in the Azure Automation portal UI
    • Using the Azure Automation portal UI you can select to resume a suspended job.
  • Using the Resume-AzureAutomationJob cmdlet
  • Automatically following a runbook worker crash
    • When the worker comes back online or when another worker is assigned as its replacement, the worker will look for jobs in the database that are assigned to it.  For any jobs that have a state of “Running” and which are not yet running on the worker, the worker will automatically resume them from their last checkpoint (this is the same scenario as the worker-crash item at the end of the suspending list above).
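
For the cmdlet route, a suspend-and-resume round trip might look like the sketch below.  The account name and job ID are placeholders, and the parameter names (-AutomationAccountName, -Id) reflect my understanding of the Azure Service Management module; confirm them with Get-Help against the module version you have installed.

```powershell
# Placeholder values: substitute your own account name and job ID.
$account = "MyAutomationAccount"
$jobId   = "00000000-0000-0000-0000-000000000000"

# Ask the job to suspend. It stops at its next checkpoint.
Suspend-AzureAutomationJob -AutomationAccountName $account -Id $jobId

# Inspect the job to see where it currently stands.
Get-AzureAutomationJob -AutomationAccountName $account -Id $jobId

# Later, resume the job from its last checkpoint.
Resume-AzureAutomationJob -AutomationAccountName $account -Id $jobId
```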

It is important to note that when a runbook is resumed, it can be resumed on a different worker than the one it was on before suspension.  Thus, any local state that the runbook expects to exist must be recreated after a resume.  For example, cmdlets like Connect-Azure, Add-AzureAccount, Select-AzureSubscription, or Set-AzureSubscription, which set state in local files, need to be called again after each checkpoint if the runbook connects to Azure at any point after that checkpoint.
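
A sketch of that reconnect pattern follows.  The credential asset name, subscription name, cloud service name, and VM name are all placeholders, and the example assumes the Azure Service Management cmdlets are available to the runbook.

```powershell
workflow Restart-MyVM
{
    # Connect to Azure (the asset and subscription names are placeholders).
    $cred = Get-AutomationPSCredential -Name "AzureCredential"
    Add-AzureAccount -Credential $cred
    Select-AzureSubscription -SubscriptionName "My Subscription"

    Restart-AzureVM -ServiceName "MyService" -Name "MyVM"

    # Do not restart the VM again if the job is suspended and resumed.
    Checkpoint-Workflow

    # A resumed job may land on a different worker, where the local
    # connection state does not exist, so connect again before the
    # next call to Azure.
    $cred = Get-AutomationPSCredential -Name "AzureCredential"
    Add-AzureAccount -Credential $cred
    Select-AzureSubscription -SubscriptionName "My Subscription"

    Get-AzureVM -ServiceName "MyService" -Name "MyVM"
}
```

On an uninterrupted run the second connect is redundant, but it is cheap compared with a resumed job failing because it has no Azure connection.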

Summary

As you can see, adding checkpoints to your runbooks is important if you want to take advantage of this key feature of PowerShell Workflow and create interruption-resilient runbooks.  Adding checkpoints is easy.  With a little forethought during runbook authoring, you can protect your long-running and expensive tasks from unexpected interruption and truly create robust, reliable runbooks.