Advancing in-datacenter critical environment infrastructure availability | Azure Blog


“One of the foundations of the Azure cloud computing platform is the availability of the critical environment (CE) infrastructure, which provides power and cooling to IT infrastructure in our datacenters around the world. For today’s post in our Advancing Reliability series, I’ve asked Reliability Engineering Director Randy Kong from our Cloud Operations and Innovation Engineering Team to explain how we identify and mitigate the various risks associated with these essential systems. What follows is a nuts-and-bolts highlight reel of the lengths that Microsoft and our partners go to, in this space, with a view to ensuring world-class reliability for the critical applications that our customers and partners run on the Azure platform.”—Mark Russinovich, CTO, Azure

There are many factors that can affect critical environment (CE) infrastructure availability—the reliability of the infrastructure building blocks, the controls during the datacenter construction stage, effective health monitoring and event detection schemes, a robust maintenance program, and operational excellence to ensure that every action is taken with careful consideration of related risk implications.

In addition to leading the industry in designing, constructing, commissioning, and operating a high-availability CE infrastructure, Microsoft also invests in developing and implementing best practices in the reliability engineering field, aiming to elevate and sustain CE infrastructure availability. In general terms, “reliability engineering” in this blog refers to the engineering discipline that focuses on proactively identifying and assessing time-latent risks, with their various failure modes, effects, and mechanisms, that can impact system functionality during its expected useful life. As one simple example, when a datacenter is cooled using more energy-efficient free air cooling, we make sure that the CE infrastructure still delivers a controlled temperature, humidity, and air quality environment. One purpose is to prevent corrosion-related failures, since electrical shorts or opens can occur under inappropriate conditions, for example on printed circuit board assemblies.

The CE for hyperscale cloud datacenters is typically constructed with a blend of off-the-shelf components from a diverse vendor base: equipment like uninterruptible power supplies (UPS), power distribution units (PDU), automatic transfer switches (ATS), air handling units (AHU), and generators. This multi-vendor, multi-technology environment introduces intrinsic complexity as well as variations in robustness, mostly dependent on vendor competencies and experience. Monitoring of CE infrastructure health is also limited, for example by insufficient information about the internals of these off-the-shelf components. Monitoring therefore mostly relies on high-level “failure effect” detection, such as from an electrical power management system (EPMS) or a building automation system (BAS).

However, failure effect-based detection schemes generally make it challenging to detect and pinpoint the offending element quickly, potentially delaying the recovery of services when failures do occur. The scale of global operations adds further difficulty: variations in local conditions can place different stresses on the equipment, which undermines the effectiveness of the typical time-based preventive maintenance approach. Since some geographic areas experience more pronounced changes in temperature, humidity, or air quality, a time-based maintenance approach could become wasteful, or miss the opportunity to “refresh” system reliability, if it does not account for the rate of degradation under the corresponding environmental stress conditions.
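To make the last point concrete, here is a minimal sketch of how a maintenance interval might be scaled by local environmental stress. It is not a description of Microsoft's actual method. It assumes a Peck-style temperature and humidity acceleration model, and the function names, the activation energy, the humidity exponent, and the site conditions are all hypothetical placeholders that would need to be fitted to real degradation data for a given failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def peck_acceleration(temp_c, rh_pct, ref_temp_c, ref_rh_pct,
                      activation_energy_ev=0.79, humidity_exponent=2.7):
    """Degradation rate at a site relative to a reference site.

    Peck-style model: a humidity power law multiplied by an Arrhenius
    temperature term. The default parameters are illustrative assumptions,
    not values taken from the article.
    """
    t_site_k = temp_c + 273.15
    t_ref_k = ref_temp_c + 273.15
    humidity_term = (rh_pct / ref_rh_pct) ** humidity_exponent
    thermal_term = math.exp(
        (activation_energy_ev / BOLTZMANN_EV_PER_K)
        * (1.0 / t_ref_k - 1.0 / t_site_k)
    )
    return humidity_term * thermal_term


def adjusted_interval_days(base_interval_days, acceleration):
    """Shorten (or lengthen) a time-based interval by the acceleration factor."""
    return base_interval_days / acceleration


if __name__ == "__main__":
    # Hypothetical sites: (name, mean temperature in C, mean relative humidity in %)
    reference = ("reference site", 22.0, 45.0)
    sites = [("warm, humid site", 30.0, 75.0), ("cool, dry site", 18.0, 30.0)]
    for name, temp_c, rh_pct in sites:
        af = peck_acceleration(temp_c, rh_pct, reference[1], reference[2])
        days = adjusted_interval_days(180.0, af)
        print(f"{name}: acceleration {af:.2f}, interval {days:.0f} days")
```

Under these assumptions, a site that is warmer and more humid than the reference gets an acceleration factor above 1 and a shorter interval, while a cooler, drier site gets a longer one. A fixed calendar interval would be too late for the first and wasteful for the second.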

Investing in CE reliability engineering

To address these kinds of challenges, Microsoft has invested in establishing the CE reliability engineering function. Many of the related reliability engineering methods and technologies have emerged from the electronics industry in the last few decades, and there are ample opportunities to adapt, expand, and innovate reliability best practices and solutions for the datacenter CE infrastructure. For example, for off-the-shelf CE components, we have sought a close partnership with CE vendors on application-specific reliability risk evaluations during our equipment selection and qualification stages. During the operation stage, we also collaborate closely with our vendor base to drive continuous improvement through in-depth physical and data analytics, ranging from root cause analyses for individual cases to fleet-wide health behavior data analytics. These partnerships give both Microsoft and our vendors clear, data-driven visibility into the related reliability trends, areas of potential risk, and underlying contributing factors, so that effective resolutions can be introduced, often proactively, before more consequential impact occurs.

While improving the understanding of potential CE infrastructure risks, reliability engineering is also focused on Research and Development (R&D) efforts that can result in more proactive and effective infrastructure health sensing, such as by capturing parameters correlated with the failure modes and mechanisms for high-criticality areas. Internal as well as external partnerships have been established for the R&D of relevant approaches, from data mining and machine learning to Physics-of-Failure (PoF) methodologies.

As Microsoft continues to introduce innovations into the CE space, reliability engineering is playing a central role in ensuring the robustness of these solutions by driving both designed-in and built-in reliability through early-stage risk analysis and mitigations. For instance, the Microsoft Azure Sphere-based IoT solution is developed to acquire data securely from the mechanical and electrical power CE system. Reliability engineering works closely with internal and external product design and manufacturing partners, applying both analytical and PoF-based testing approaches throughout the solution concept, prototype, design, process development, and product deployment stages. One case in point is the concern about packaging-related defects in the electronics, whether introduced during manufacture or assembly or developing over the usage lifecycle. A simulation tool based on finite element analysis (FEA) was employed to identify thermo-mechanical stress points, even while the design was still just a drawing on paper, as these stress points can result in failures within the expected useful life. These points are then closely tracked and characterized (for example, with strain gauges) during environmental stress tests or during manufacturing process steps that can introduce the related thermo-mechanical stresses. Even if the system is still functional after experiencing these stresses, samples are physically cross-sectioned at critical junctions to identify any early initiation of defects. These in-depth analyses enable concurrent product or process design changes to eliminate the failure likelihood, thereby elevating eventual system availability.
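For readers unfamiliar with why thermo-mechanical stress points arise in electronics packaging, the following is a first-order hand calculation, not FEA and not the tool described above. It uses the standard unconstrained mismatch relation, strain = |CTE_a - CTE_b| x temperature swing, to rank material interfaces by how much they strain against each other. The function names, the interface list, and the coefficient of thermal expansion (CTE) values are assumptions for illustration: rough textbook-order figures, not measurements from any Microsoft design.

```python
def mismatch_strain(cte_a_ppm_per_c, cte_b_ppm_per_c, delta_t_c):
    """Unconstrained thermal mismatch strain between two bonded materials.

    CTE values are in parts per million per degree C; the result is a
    dimensionless strain.
    """
    return abs(cte_a_ppm_per_c - cte_b_ppm_per_c) * 1e-6 * abs(delta_t_c)


def rank_interfaces(interfaces, delta_t_c):
    """Sort (name, cte_a, cte_b) tuples from highest to lowest mismatch strain."""
    scored = [
        (name, mismatch_strain(cte_a, cte_b, delta_t_c))
        for name, cte_a, cte_b in interfaces
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical interfaces with approximate CTE values in ppm per degree C.
    interfaces = [
        ("silicon die / laminate substrate", 2.6, 16.0),
        ("solder joint / laminate substrate", 22.0, 16.0),
        ("copper trace / laminate substrate", 17.0, 16.0),
    ]
    # Assumed temperature swing of 80 degrees C over one thermal cycle.
    for name, strain in rank_interfaces(interfaces, delta_t_c=80.0):
        print(f"{name}: mismatch strain {strain:.2e}")
```

A ranking like this only indicates where to look first. Real geometries, constraint from neighboring materials, and plastic deformation are what an FEA model and the strain gauge measurements described above are needed to capture.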

Similar design-for-excellence (DFX) strategies are also being explored for the complex CE infrastructure itself, to enable proactive risk identification and prevention opportunities prior to the physical deployment of the infrastructure wherever feasible. These CE-related investments and technology advances in the reliability engineering discipline will help Microsoft’s CE infrastructure to build additional robustness mechanisms, in order to meet our customers’ availability expectations for a world-class cloud computing service.