One of the benefits of moving to the cloud is that you, our customer, don’t need to deal with hardware maintenance and repairs; you can focus your time on your business applications. Azure continuously monitors for hardware that shows signs of degradation or potential failure. When these conditions are detected, Azure will attempt to live migrate your virtual machines (VMs). If live migration isn’t possible, Azure will automatically redeploy VMs to a healthy machine. If you have a disaster recovery setup, which is highly recommended, the impact of this redeployment will be minimal. However, a redeployment to a healthy machine may be problematic for some applications that can’t tolerate disruption. We’ve received feedback that in this situation, when possible, customers prefer to control the time the redeployment to a healthy machine will occur.
We introduced Scheduled Events in Azure as a programmatic way to notify your VMs of upcoming maintenance events, such as a live migration, redeployment, or reboot, so that they can act on them. Upon receiving a scheduled event, customers can take actions such as failing over, saving state, draining sessions in the VMs, scheduling a time for manual maintenance, or notifying their own customers. We’re excited to announce that Scheduled Events will now be triggered when Azure predicts that hardware issues will require a redeployment to healthy hardware in the near future. The event provides a time window after which Azure will redeploy the VMs to healthy hardware if live migration is not possible. Customers can initiate the redeployment of their VMs themselves before Azure does it automatically.
Hardware failure prediction
Azure has drawn on insight from operating millions of servers in its data centers to identify when hardware health is degrading and, in many cases, to predict a failure before it happens. For example, Azure can detect degradation in disk I/O performance on a given node, or detect memory errors, and determine whether the condition will become fatal.
When Azure detects imminent hardware failure, VMs are proactively live migrated when possible. This should have minimal impact on your workloads; the customer experience is typically a freeze of a few seconds during the final phase. Subscribing to Scheduled Events allows your VM to be notified a few minutes before the live migration process starts. However, there are cases where live migration isn’t possible, for example on specialized computer hardware such as M-Series and G-Series, or on legacy hardware. In those cases the VMs are redeployed to a new instance. Some of our customers have expressed interest in controlling when a reallocation from the node is initiated and in controlling the experience during the process. Based on this feedback, we enhanced Scheduled Events to notify you when the hardware is detected as unhealthy and to give the time at which the VM will be moved to another machine, provided the hardware does not fail sooner. In many cases there can be multiple days before the hardware fails, and Azure applies mitigations to try to delay the failure. Because the time to failure varies, we recommend that customers move off degraded hardware as soon as possible.
How to listen to these Scheduled Events
Your VM must subscribe to Scheduled Events to get events related to maintenance. Watch this video to learn how to programmatically enable and react to Scheduled Events. You can also find code samples of how to listen to Scheduled Events and then approve them once you have done your mitigation.
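As a rough illustration of what listening looks like, here is a minimal Python sketch that polls for events from inside a VM. It assumes the metadata endpoint address, the `Metadata: true` header, and the response field names (`Events`, `EventId`, `EventType`, `EventStatus`, `NotBefore`) as described in the public Scheduled Events documentation; none of these are spelled out in this post, and the function names are made up for the example.

```python
import json
import urllib.request

# Assumption: endpoint and header follow the public Scheduled Events docs.
# The address is only reachable from inside an Azure VM.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2017-08-01"
)


def fetch_scheduled_events(timeout=5):
    """GET the current scheduled events document and parse it as JSON."""
    request = urllib.request.Request(
        SCHEDULED_EVENTS_URL, headers={"Metadata": "true"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_events(document):
    """Return one (EventId, EventType, EventStatus, NotBefore) tuple per event."""
    return [
        (
            event.get("EventId"),
            event.get("EventType"),
            event.get("EventStatus"),
            event.get("NotBefore"),
        )
        for event in document.get("Events", [])
    ]
```

On a VM you would call `summarize_events(fetch_scheduled_events())` on a timer and hand the result to whatever failover or draining logic you already have. Keeping the parsing separate from the HTTP call makes it possible to exercise the logic with a canned document off the VM.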
To listen to hardware-related events, you don’t have to do anything different! Hardware-related events are delivered as a redeploy event. The NotBefore time, which is the property that gives the time window before the maintenance is performed, can range from a few hours to a few days and can change depending on the severity of the hardware fault. As Azure’s estimate of the time to failure improves, the NotBefore time window will change to become more accurate. Note, however, that because you’re running on degraded hardware that can fail suddenly, you should initiate a redeployment or approve the scheduled event as soon as possible after initiating the corresponding automated or manual actions. Once you approve the request, your VM will be redeployed to a new physical machine. You can track the completion of the redeploy via Scheduled Events. If you don’t approve the scheduled event before the NotBefore time, you will no longer have control of the experience and Azure will redeploy your VM to a healthy machine.
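The approve-early flow described above could be sketched in Python roughly as follows. This is an illustration under assumptions, not a reference implementation: the endpoint, the `Metadata: true` header, the `"Redeploy"` and `"Scheduled"` values, the date format of NotBefore, and the `StartRequests` request body are taken from the public Scheduled Events documentation rather than from this post, and all function names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumption: same metadata endpoint is used for reading and approving events.
ENDPOINT = "http://169.254.169.254/metadata/scheduledevents?api-version=2017-08-01"
HEADERS = {"Metadata": "true"}


def pending_redeploys(document):
    """Return (event_id, not_before) pairs for redeploy events not yet started."""
    pending = []
    for event in document.get("Events", []):
        if event.get("EventType") != "Redeploy":
            continue
        if event.get("EventStatus") != "Scheduled":
            continue
        raw = event.get("NotBefore")
        # Assumption: NotBefore is an RFC 1123 style date string; may be empty.
        not_before = parsedate_to_datetime(raw) if raw else None
        pending.append((event["EventId"], not_before))
    return pending


def seconds_remaining(not_before, now=None):
    """Seconds left before Azure may start the event on its own."""
    now = now or datetime.now(timezone.utc)
    return (not_before - now).total_seconds()


def build_approval(event_ids):
    """Build the JSON body asking Azure to start the given events now."""
    return json.dumps({"StartRequests": [{"EventId": i} for i in event_ids]})


def approve(event_ids, timeout=5):
    """POST the approval from inside the VM; call after mitigation has begun."""
    body = build_approval(event_ids).encode("utf-8")
    request = urllib.request.Request(
        ENDPOINT, data=body, headers=HEADERS, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A typical loop would call `pending_redeploys` on each polled document, kick off failover or state saving for any event it returns, and then call `approve` with those event IDs rather than waiting for NotBefore to elapse. `seconds_remaining` is only there to log or alert on how much of the window is left; since degraded hardware can fail sooner, it should not be treated as a guarantee.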
Support for hardware degradation information via Scheduled Events is already available worldwide! There are no API changes, so this feature is available from api-version=2017-08-01.
If you are sensitive to platform maintenance events, I would highly encourage you to build automation by handling Scheduled Events. Try this out and let us know what you think in the comments below.