Responding to Role Topology Changes

The Adoption Program Insights series describes the experiences of Microsoft Services consultants in the Windows Azure Technology Adoption Program, who help customers deploy solutions on the Windows Azure Platform. This post is by Tom Hollander.

In the past, if you had an application running in a web farm and you needed more capacity, you had to buy, install and configure additional physical machines, a process which could take months and potentially cost thousands of dollars. In contrast, if you deploy your application to Windows Azure, the same process involves a simple configuration change: within minutes you can have additional instances deployed, and you only pay incremental hourly charges while these instances are in use. For applications with variable or growing load, this is a tremendous advantage of the Windows Azure platform.

If your role instances have been designed to be stateless and independent, you generally won’t need to write any code to handle the times when your roles are scaled up or down (known in Azure as a topology change). Windows Azure handles the configuration of the environment, and as soon as any new instances are available (or old ones removed) the load balancer is reconfigured and your application continues to run as normal. However, in some advanced scenarios, you may need your instances to be aware of the overall context in which they are running, and they may need to perform certain tasks when the role topology changes.

This post will help you write applications that can respond to topology changes by describing how Windows Azure raises events and communicates information about the Role Environment during these changes. This guidance applies whether you’re scaling your application manually through the web portal, via the Service Management API or using automatic performance-based scaling.

Role Environment Methods and Events

There are five main places where you can write code to respond to environment changes. Two of these, OnStart and OnStop, are methods on the RoleEntryPoint class which you can override in your main role class (which is called WebRole or WorkerRole by default). The other three are events on the RoleEnvironment class which you can subscribe to: Changing, Changed and Stopping.

The purpose of these methods and events is pretty clear from their names:

  • OnStart gets called when the instance is first started.
  • Changing gets called when something about the role environment is about to change.
  • Changed gets called when something about the role environment has just been changed.
  • Stopping gets called when the instance is about to be stopped.
  • OnStop gets called when the instance is being stopped.

In all cases, there’s nothing your code can do to prevent the corresponding action from occurring, but you can respond to it in any way you wish. In the case of the Changing event, you can also choose whether the instance should be recycled to deal with the configuration change by setting e.Cancel = true.
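
The five hooks described above might be wired up roughly as follows. This is a minimal sketch, assuming a worker role whose class is named WorkerRole (the default) and a reference to the Microsoft.WindowsAzure.ServiceRuntime assembly. The handler names and the policy in the Changing handler (recycle only for configuration setting changes) are illustrative assumptions, not a recommendation.

```csharp
using System.Diagnostics;
using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Subscribe as early as possible so that later changes are not missed.
        RoleEnvironment.Changing += RoleEnvironmentChanging;
        RoleEnvironment.Changed += RoleEnvironmentChanged;
        RoleEnvironment.Stopping += RoleEnvironmentStopping;

        Trace.WriteLine("OnStart: instance is starting");
        return base.OnStart();
    }

    public override void OnStop()
    {
        Trace.WriteLine("OnStop: instance is being stopped");
        base.OnStop();
    }

    private void RoleEnvironmentChanging(object sender, RoleEnvironmentChangingEventArgs e)
    {
        // Illustrative policy (an assumption): recycle this instance when a
        // configuration setting changes, but stay running for topology changes.
        if (e.Changes.OfType<RoleEnvironmentConfigurationSettingChange>().Any())
        {
            e.Cancel = true;
        }
    }

    private void RoleEnvironmentChanged(object sender, RoleEnvironmentChangedEventArgs e)
    {
        Trace.WriteLine("Changed: the role environment has just been changed");
    }

    private void RoleEnvironmentStopping(object sender, RoleEnvironmentStoppingEventArgs e)
    {
        Trace.WriteLine("Stopping: instance is about to be stopped");
    }
}
```

Subscribing inside OnStart keeps all of the wiring in one place; a web role would follow the same pattern in its WebRole class.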

Why aren’t Changing and Changed firing in my application?

When I first started exploring this topic, I observed the following unusual behaviour in both the Windows Azure Compute Emulator (previously known as the Development Fabric) and in the cloud:

  • The Changing and Changed events did not fire on any instance when I made configuration changes.
  • RoleEnvironment.CurrentRoleInstance.Role.Instances.Count always returned 1, even when there were many instances in the role.

It turns out that this is the expected behaviour when a role has no internal endpoints defined, as documented in this MSDN article. So the solution is simply to define an internal endpoint in your ServiceDefinition.csdef file like this:

<Endpoints>
  <InternalEndpoint name="InternalEndpoint1" protocol="http" />
</Endpoints>
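
Once an internal endpoint is defined, each instance should be able to see its siblings and their internal addresses. The following is a rough sketch rather than a definitive implementation: the TopologyInfo class and its method name are hypothetical, and the endpoint name is assumed to match the name attribute in the service definition above.

```csharp
using System.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

// Hypothetical helper class, not part of the Azure SDK.
public static class TopologyInfo
{
    // Assumed to match the InternalEndpoint name in ServiceDefinition.csdef.
    private const string EndpointName = "InternalEndpoint1";

    public static void TraceInstances()
    {
        Role role = RoleEnvironment.CurrentRoleInstance.Role;
        Trace.WriteLine(string.Format("{0} has {1} instance(s)", role.Name, role.Instances.Count));

        foreach (RoleInstance instance in role.Instances)
        {
            RoleInstanceEndpoint endpoint;
            if (instance.InstanceEndpoints.TryGetValue(EndpointName, out endpoint))
            {
                // IPEndpoint holds the internal IP address and port of that instance.
                Trace.WriteLine(string.Format("{0} -> {1}", instance.Id, endpoint.IPEndpoint));
            }
        }
    }
}
```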

Which Events Fire Where and When?

Even though the names of the events seem pretty self-explanatory, the exact behaviour when scaling deployments up and down is not necessarily what you might expect. The following diagram shows which events fire in an example scenario containing a single role. Initially 2 instances are deployed; the deployment is then scaled to 4 instances, then back down to 3, and finally the deployment is stopped.

There are several interesting things to note from this diagram:

  1. The Changing and Changed events only fire for the instances that aren’t starting or stopping. If you’re adding instances, these events don’t fire on the new instances, and if you’re removing instances, these events don’t fire on the ones being shut down.
  2. In the Changing event, RoleEnvironment.CurrentRoleInstance.Role.Instances returns the original role instances, not the target role instances. There is no way of finding out the target role instances at this time.
  3. In the Changed event, RoleEnvironment.CurrentRoleInstance.Role.Instances returns the target role instances, not the original role instances. If you need to know about the original instances, you can save this information when the Changing event fires and access it from the Changed event (since these events are always fired in sequence).
  4. When instances are started, RoleEnvironment.CurrentRoleInstance.Role.Instances returns the target role instances, even if many of them are not yet started.
  5. When instances are stopped, RoleEnvironment.CurrentRoleInstance.Role.Instances returns the original role instances. There is no way of finding out about the target instances at this time. Also note that there’s no way that any instance can determine which instances are being shut down (it won’t necessarily be the instances with the highest ID number). If Stopping and OnStop get called, it’s you. If Changing gets called, it’s not!
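
Points 2 and 3 suggest saving the instance list in Changing and comparing it in Changed. One way this could look is sketched below. The TopologyTracker class is a hypothetical helper (not part of the Azure SDK) that works on plain instance ID strings, so it has no dependency on the service runtime, and the IDs in the usage note are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: remembers the instance IDs seen in Changing and
// compares them with the IDs seen in Changed.
public class TopologyTracker
{
    private HashSet<string> original = new HashSet<string>();

    // Call from the Changing handler, which still sees the original instances.
    public void RecordOriginal(IEnumerable<string> instanceIds)
    {
        original = new HashSet<string>(instanceIds);
    }

    // Call from the Changed handler, which sees the target instances.
    public List<string> Added(IEnumerable<string> instanceIds)
    {
        return instanceIds.Where(id => !original.Contains(id)).ToList();
    }

    // Call from the Changed handler to find instances that are no longer listed.
    public List<string> Removed(IEnumerable<string> instanceIds)
    {
        var target = new HashSet<string>(instanceIds);
        return original.Where(id => !target.Contains(id)).ToList();
    }
}
```

In the event handlers, the IDs would come from RoleEnvironment.CurrentRoleInstance.Role.Instances.Select(i => i.Id). For example, recording IN_0 and IN_1 in Changing and then passing IN_0 to IN_3 in Changed should report IN_2 and IN_3 as added.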

The above example assumed that the Changing event was not cancelled (with e.Cancel = true, which results in the instance being restarted before the configuration changes are applied). If you do choose to do this, the events that fire are quite different: Changed does not fire at all, but Stopping, OnStop and OnStart do. The following diagram shows what happens to instance IN_0 during a scale-up operation if the Changing event is cancelled.

One final note on these events: Although I didn’t show it in either diagram, if you have multiple roles in your service and make a topology change in a single role, the Changing and Changed events will fire across all roles, even those where the number of instances did not change. You can tell from the event data whether the topology change occurred for the current role or a different one using code similar to this:

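// Required at the top of the file:
using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

private void RoleEnvironmentChanging(object sender, RoleEnvironmentChangingEventArgs e)
{
    // Keep only the topology changes that affect the role this instance belongs to
    var changes = from ch in e.Changes.OfType<RoleEnvironmentTopologyChange>()
                  where ch.RoleName == RoleEnvironment.CurrentRoleInstance.Role.Name
                  select ch;
    if (changes.Any())
    {
        // Topology change occurred in the current role
    }
    else
    {
        // No topology change in the current role: the topology of a different
        // role changed, or the event was raised for a configuration setting change
    }
}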

Getting More Information

While the RoleEnvironment and the events listed above provide a lot of good information about changes to a service, there can be times when you need more information than the API provides. For example, I once worked on an Azure application where each instance needed to know which other instances had already started, and what their IP addresses were. I chose to use an Azure table to record key information about the running instances. Every time an instance started or stopped, it was responsible for recording these details in the table, which could be read by all other instances. While this solution worked well, it required some careful and defensive coding to deal with cases where the table may have contained stale or incorrect data due to ungraceful shutdowns. As such, you should only build solutions like this if absolutely necessary.

Conclusion

The ability to scale applications as needed is one of the great benefits of Windows Azure, and the Fabric Controller is able to provide detailed information about the current status of, and changes to, the role environment through RoleEntryPoint methods and RoleEnvironment events. For most applications you won’t need to write any special code to handle scaling operations, but if you’re dealing with more complex applications, we hope this information will help you understand how topology changes can be handled effectively by your applications.