
Accelerating open-source infrastructure development for frontier AI at scale

""

Tech Community

Connect with a community to find answers, ask questions, build skills, and accelerate your learning.

Visit the Azure Infrastructure tech community
Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to advance innovation.

In the transition from building computing infrastructure for cloud scale to building cloud and AI infrastructure for frontier scale, the world of computing has experienced tectonic shifts in innovation. Throughout this journey, Microsoft has shared its learnings and best practices, optimizing our cloud infrastructure stack in cross-industry forums such as the Open Compute Project (OCP) Foundation.

Today, we see that the next phase of cloud infrastructure innovation is poised to be the most consequential period of transformation yet. In just the last year, Microsoft has added more than 2 gigawatts of new capacity and launched the world’s most powerful AI datacenter, which delivers 10x the performance of the world’s fastest supercomputer today. Yet, this is just the beginning.

Delivering AI infrastructure at the highest performance and lowest cost requires a systems approach, with optimizations across the stack to drive quality, speed, and resiliency at a level that can provide a consistent experience to our customers. In the quest to supply resilient, sustainable, secure, and widely scalable technology to handle the breadth of AI workloads, we’re embarking on an ambitious new journey: one not just of redefining infrastructure innovation at every layer of execution from silicon to systems, but one of tightly integrated industry alignment on standards that offer a model for global interoperability and standardization.

At this year’s OCP Global Summit, Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to further advance innovation in the industry.

Redefining power distribution for the AI era

As AI workloads scale globally, hyperscale datacenters are experiencing unprecedented power density and distribution challenges.

Last year, at the OCP Global Summit, we partnered with Meta and Google in the development of Mt. Diablo, a disaggregated power architecture. This year, we’re building on this innovation with the next step of our full-stack transformation of datacenter power systems: solid-state transformers. Solid-state transformers simplify the power chain with new conversion technologies and protection schemes that can accommodate future rack voltage requirements.

Training large models across thousands of GPUs also introduces variable and intense power draw patterns that can strain the grid, the utility, and traditional power delivery systems. These fluctuations not only risk hardware reliability and operational efficiency but also create challenges for capacity planning and sustainability goals.

Together with key industry partners, Microsoft is leading a power stabilization initiative to address this challenge. In a recently published paper with OpenAI and NVIDIA—Power Stabilization for AI Training Datacenters—we describe how full-stack innovations spanning rack-level hardware, firmware orchestration, predictive telemetry, and facility integration can smooth power spikes, reduce power overshoot by 40%, and mitigate operational risk and costs to enable predictable and scalable power delivery for AI training clusters.
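The rack-level smoothing the paper describes can be illustrated with a toy ramp-rate limiter. This is a hypothetical sketch of the general technique only, not the actual algorithm from the paper:

```python
def limit_ramp(draws_w, max_step_w):
    """Clamp step-to-step changes in a rack power-draw trace.

    draws_w: requested rack power per control interval (watts)
    max_step_w: maximum allowed change between intervals (watts)
    Returns the smoothed trace.
    """
    smoothed = [draws_w[0]]
    for target in draws_w[1:]:
        prev = smoothed[-1]
        # Move toward the requested draw, but never faster than the limit.
        delta = max(-max_step_w, min(max_step_w, target - prev))
        smoothed.append(prev + delta)
    return smoothed

# A synchronized training step swings the rack from idle to peak and back:
trace = [20_000, 20_000, 90_000, 90_000, 20_000, 90_000]
print(limit_ramp(trace, max_step_w=25_000))
# → [20000, 20000, 45000, 70000, 45000, 70000]
```

A real deployment enforces such limits in firmware, informed by predictive telemetry; the sketch only shows why bounding the step size turns an abrupt idle-to-peak swing into a gradual ramp that the upstream power chain can absorb.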

This year, at the OCP Global Summit, Microsoft is joining forces with industry partners to launch a dedicated power stabilization workgroup. Our goal is to foster open collaboration across hyperscalers and hardware partners, sharing our learnings from full-stack innovation and inviting the community to co-develop new methodologies that address the unique power challenges of AI training datacenters. By building on the insights from our recently published white paper, we aim to accelerate industry-wide adoption of resilient, scalable power delivery solutions for the next generation of AI infrastructure. Read more about our power stabilization efforts.

Cooling innovations for resiliency

As the power profile for AI infrastructure changes, we are also continuing to rearchitect our cooling infrastructure to support evolving needs around energy consumption, space optimization, and overall datacenter sustainability. Various cooling solutions must be implemented to support the scale of our expansion—as we seek to build new AI-scale datacenters, we are also utilizing Heat Exchanger Unit (HXU)-based liquid cooling to rapidly deploy new AI capacity within our existing air-cooled datacenter footprint.

Microsoft’s next generation HXU is an upcoming OCP contribution that enables liquid cooling for high-performance AI systems in air-cooled datacenters, supporting global scalability and rapid deployment. The modular HXU design delivers 2X the performance of current models and maintains >99.9% cooling service availability for AI workloads. No datacenter modifications are required, allowing seamless integration and expansion. Learn more about the next generation HXU here. 
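As a quick sanity check on what the “>99.9%” figure implies, availability translates into an annual downtime budget (simple arithmetic, not an additional Microsoft claim):

```python
availability = 0.999              # quoted floor for cooling service availability
hours_per_year = 365 * 24         # 8,760 hours in a non-leap year
downtime_hours = (1 - availability) * hours_per_year
print(f"allowed unavailability: {downtime_hours:.2f} hours/year")
# → allowed unavailability: 8.76 hours/year
```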

Meanwhile, we’re continuing to innovate across multiple layers of the stack to address changes in power and heat dissipation: utilizing facility water cooling at datacenter scale, circulating liquid in closed loops from server to chiller, and exploring on-chip cooling innovations like microfluidics to efficiently remove heat directly from the silicon.

Unified networking solutions for growing infrastructure demands 

Scaling hundreds of thousands of GPUs to operate as a single, coherent system poses a significant challenge: rack-scale interconnects must deliver low-latency, high-bandwidth fabrics that are both efficient and interoperable. As AI workloads grow exponentially and infrastructure demands intensify, we are exploring networking optimizations that can support these needs. To that end, we have developed scale-up, scale-out, and Wide Area Network (WAN) solutions to enable large-scale distributed training.

We partner closely with standards bodies such as the Ultra Ethernet Consortium (UEC) and UALink, focused on innovation in networking technologies for this critical element of AI systems. We are also driving forward the adoption of Ethernet for scale-up networking across the ecosystem and are excited to see the Ethernet for Scale-up Networking (ESUN) workstream launch under the OCP Networking Project. We look forward to promoting the adoption of cutting-edge networking solutions and enabling a multi-vendor ecosystem based on open standards.

Security, sustainability, and quality: Fundamental pillars for resilient AI operations

Defense in depth: Trust at every layer

Our comprehensive approach to scaling AI systems responsibly includes embedding trust and security into every layer of our platform. This year, we are introducing new security contributions that build on our existing body of work in hardware security and introduce new protocols that are uniquely fit to support new scientific breakthroughs that have been accelerated with the introduction of AI:

  • Building on past years’ contributions and Microsoft’s collaboration with AMD, Google, and NVIDIA, we have further enhanced Caliptra, our open-source silicon root of trust. The introduction of Caliptra 2.1 extends the hardware root-of-trust to a full security subsystem. Learn more about Caliptra 2.1 here.
  • We have also added Adams Bridge 2.0 to Caliptra to extend support for quantum-resilient cryptographic algorithms to the root-of-trust.
  • Finally, we are contributing OCP Layered Open-source Cryptographic Key Management (L.O.C.K)—a key management block for storage devices that secures media encryption keys in hardware. L.O.C.K was developed through collaboration between Google, Kioxia, Microsoft, Samsung, and Solidigm.
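The measurement pattern behind a silicon root of trust can be sketched as a hash-extend chain: each boot stage is folded into an append-only register before control passes to the next stage. This is an illustrative model of the general measured-boot technique, not Caliptra’s actual implementation, and the stage names are made up:

```python
import hashlib

def extend(register: bytes, measurement: bytes) -> bytes:
    """Extend a measurement register: new = H(old || H(component))."""
    return hashlib.sha384(register + hashlib.sha384(measurement).digest()).digest()

# Boot stages are measured in order; the final digest summarizes the chain.
register = bytes(48)  # register starts at all zeros
for stage in [b"rom-firmware", b"fmc", b"runtime-firmware"]:
    register = extend(register, stage)
print(register.hex())
```

Because the register can only be extended, a later stage cannot erase the measurement of an earlier one; a verifier that recomputes the chain from known-good component hashes can detect any substituted firmware.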

Advancing datacenter-scale sustainability 

Sustainability continues to be a major area of opportunity for industry collaboration and standardization through communities such as the Open Compute Project. Working collaboratively as an ecosystem of hyperscalers and hardware partners is one catalyst to address the need for sustainable datacenter infrastructure that can effectively scale as compute demands continue to evolve. This year, we are pleased to continue our collaborations as part of OCP’s Sustainability workgroup across areas such as carbon reporting, accounting, and circularity:

  • Announced at this year’s Global Summit, we are partnering with AWS, Google, and Meta to fund the Product Category Rule initiative under the OCP Sustainability workgroup, with the goal of standardizing carbon measurement methodology for devices and datacenter equipment.
  • Together with Google, Meta, OCP, Schneider Electric, and the iMasons Climate Accord, we are establishing the Embodied Carbon Disclosure Base Specification to establish a common framework for reporting the carbon impact of datacenter equipment.
  • Microsoft is advancing the adoption of waste heat reuse (WHR). In partnership with the NetZero Innovation Hub, NREL, and EU and US collaborators, Microsoft has published heat reuse reference designs and is developing an economic modeling tool that estimates, for datacenter operators and waste-heat offtakers, the cost of building WHR infrastructure under conditions such as the size and capacity of the WHR system, season, location, and the WHR mandates and subsidies in place. These region-specific solutions help operators convert excess heat into usable energy, meeting regulatory requirements and unlocking new capacity, especially in regions like Europe where heat reuse is becoming mandatory.
  • We have developed an open methodology for Life Cycle Assessment (LCA) at scale across large-scale IT hardware fleets to drive towards a “gold standard” in sustainable cloud infrastructure.
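To make the kind of economics such modeling weighs concrete, here is a deliberately simplified payback calculation with invented inputs; the actual tool, its parameters, and its outputs are not described in this post:

```python
def simple_payback_years(capex, annual_heat_mwh, price_per_mwh,
                         annual_opex, subsidy=0.0):
    """Toy payback model for waste-heat-reuse infrastructure.

    capex: up-front cost of heat-capture and export equipment
    annual_heat_mwh: recoverable heat sold per year (MWh)
    price_per_mwh: price the heat offtaker pays
    annual_opex: yearly operating cost
    subsidy: one-time grant reducing the up-front cost
    """
    net_annual = annual_heat_mwh * price_per_mwh - annual_opex
    if net_annual <= 0:
        return float("inf")  # revenue never covers operating cost
    return (capex - subsidy) / net_annual

print(simple_payback_years(capex=5_000_000, annual_heat_mwh=40_000,
                           price_per_mwh=30, annual_opex=200_000,
                           subsidy=500_000))
# → 4.5
```

A real model would also account for seasonality, heat-network connection fees, and discounting, which is why location and mandate conditions change the answer so much.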

Rethinking node management: Fleet operational resiliency for the frontier era

As AI infrastructure scales at an unprecedented pace, Microsoft is investing in standardizing how diverse compute nodes are deployed, updated, monitored, and serviced across hyperscale datacenters. In collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, we are driving a series of Open Compute Project (OCP) contributions focused on streamlining fleet operations, unifying firmware management and manageability interfaces, and enhancing diagnostics, debug, and RAS (Reliability, Availability, and Serviceability) capabilities. This standardized approach to lifecycle management lays the foundation for consistent, scalable node operations during this period of rapid expansion. Read more about our approach to resilient fleet operations.

Paving the way for frontier-scale AI computing 

As we enter a new era of frontier-scale AI development, Microsoft takes pride in leading the advancement of standards that will drive the future of globally deployable AI supercomputing. Our commitment is reflected in our active role in shaping the ecosystem that enables scalable, secure, and reliable AI infrastructure across the globe. We invite attendees of this year’s OCP Global Summit to connect with Microsoft at booth #B53 to discover our latest cloud hardware demonstrations. These demonstrations showcase our ongoing collaborations with partners throughout the OCP community, highlighting innovations that support the evolution of AI and cloud technologies.

Connect with Microsoft at the OCP Global Summit 2025 and beyond
