Microsoft Azure uses Error-Correcting Code memory for enhanced reliability and security
Posted on March 16, 2015
Memory devices are prone to errors from various sources, including electrical or magnetic interference, background radiation, and manufacturing defects. A common type of memory error is a single bit “flipping” from 0 to 1, or vice versa. Undetected or uncorrected memory errors can lead to reliability and security issues: corruption of code or data, system or application crashes, and weakened system security through unauthorized changes to security-sensitive code, data, or policy.
Microsoft Azure uses Error-Correcting Code (ECC) memory throughout our fleet of deployed systems to protect against these kinds of issues. ECC memory is a technology that can detect and correct many types of memory errors, including single-bit errors and certain cases of multiple-bit errors. With ECC memory, errors are corrected transparently to the application or service, so the application always reads the data that was originally written to memory. In cases where errors cannot be corrected, Microsoft Azure host operating systems are configured to immediately reboot and log a detailed event, preventing an application from being exposed to memory corruption. ECC memory is more expensive than non-ECC memory, but Microsoft considers this technology appropriate for hosting enterprise-grade customer data in Azure.
Microsoft has done extensive testing across the Azure production servers deployed within our datacenters for a memory process scaling issue referred to as Rowhammer. Rowhammer is a memory hardware problem that could affect a subset of the DRAM modules deployed in the marketplace. By issuing certain patterns of memory read operations, an attacker could potentially cause unauthorized writes, via “bit-flipping”, in privileged regions of memory. To attempt a Rowhammer exploit, an attacker must be able to run code of their choosing on a target system.
Most cloud services, such as Office 365, do not allow a customer to run arbitrary software on the systems hosting the service, so an attacker cannot directly execute a Rowhammer attack against them. Microsoft Azure, however, does allow customers to run arbitrary software, including potentially malicious software, for example in an IaaS virtual machine. Following testing conducted months ago, we determined that our production servers, customer VMs, and customer data are not impacted by Rowhammer. Microsoft Azure has deployed systems with ECC memory, which detects and corrects the types of Rowhammer errors induced and validated during testing. As an additional defense-in-depth measure, Microsoft Azure has monitoring and alerting deployed to detect corrected and uncorrectable memory error conditions as well as Rowhammer attack attempts. More information on ECC memory is here. More information on Rowhammer is here.
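To make the "detect and correct" behavior of ECC described above more concrete, here is a minimal sketch of a single-error-correcting, double-error-detecting (SECDED) code in Python. It is a toy illustration only, not the scheme used in Azure hardware: it protects a 4-bit value with an extended Hamming(7,4) code, whereas real ECC memory applies the same idea in hardware to much wider words. The function names `encode` and `decode` and the status strings are hypothetical, chosen for this example.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit SECDED codeword (extended Hamming(7,4))."""
    data = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8  # index 0: overall parity; indices 1..7: Hamming positions
    bits[3], bits[5], bits[6], bits[7] = data
    # Each parity bit covers the positions whose index has the matching bit set.
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    # Overall parity makes the number of 1 bits in the whole codeword even.
    for pos in range(1, 8):
        bits[0] ^= bits[pos]
    return sum(bit << i for i, bit in enumerate(bits))


def decode(word: int):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0  # XOR of the positions of all set bits; 0 for a valid codeword
    overall = 0   # parity of the whole 8-bit word; 0 for a valid codeword
    for pos in range(8):
        if bits[pos]:
            syndrome ^= pos
            overall ^= 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity means one bit flipped; the syndrome gives its position
        # (0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a nonzero syndrome means two bits flipped:
        # the error is detected but cannot be repaired.
        return None, "uncorrectable"
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status


word = encode(0b1011)
print(decode(word))               # (11, 'ok')
print(decode(word ^ (1 << 5)))    # one flipped bit:  (11, 'corrected')
print(decode(word ^ 0b11))        # two flipped bits: (None, 'uncorrectable')
```

The three cases mirror the behavior described in the post: clean reads pass through, a single flipped bit is repaired without the caller noticing, and a double flip is reported rather than silently returned as bad data, which corresponds to the situation where the host reboots and logs an event.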