
Microsoft expands its AI-supercomputer lineup with general availability of the latest 80GB NVIDIA A100 GPUs in Azure, claims 4 spots on TOP500 supercomputers list

Posted on November 15, 2021

Senior Program Manager, Azure HPC and AI

Today, Microsoft announced the general availability of a brand-new virtual machine (VM) series in Azure, the NDm A100 v4 series, featuring NVIDIA A100 Tensor Core 80 GB GPUs. This expands Azure's leadership-class AI supercomputing scalability in the public cloud, building on our June general availability of the original ND A100 v4 instances and adding another public cloud first: the Azure ND A100 v4 VMs claiming four official places on the TOP500 supercomputing list. This milestone is thanks to a class-leading design with NVIDIA Quantum InfiniBand networking featuring In-Network Computing, 200 Gb/s of bandwidth and GPUDirect RDMA for each GPU, and an all-new PCIe Gen 4.0-based architecture.

We live in the era of large-scale AI models, and the demand for large-scale computing keeps growing. The original ND A100 v4 series features NVIDIA A100 Tensor Core GPUs, each equipped with 40 GB of HBM2 memory; the new NDm A100 v4 series doubles this to 80 GB, along with a 30 percent increase in GPU memory bandwidth, for today's most data-intensive workloads. RAM available to the virtual machine has also increased to 1,900 GB per VM, giving customers with large datasets and models a proportional increase in memory capacity to support novel data management techniques, faster checkpointing, and more.
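To put the upgrade in concrete terms, a quick back-of-the-envelope calculation shows where the doubling and the roughly 30 percent figure come from. The per-GPU bandwidth numbers below are NVIDIA's published A100 SXM specs (about 1,555 GB/s HBM2 for the 40 GB part and about 2,039 GB/s HBM2e for the 80 GB part), which are assumptions not stated in this post:

```python
# Back-of-the-envelope comparison of ND A100 v4 vs. NDm A100 v4 per-VM GPU memory.
# Bandwidth figures are NVIDIA's published A100 SXM specs (assumed, not from this post).
GPUS_PER_VM = 8

a100_40gb = {"hbm_gb": 40, "bandwidth_gb_s": 1555}  # A100 40 GB (HBM2)
a100_80gb = {"hbm_gb": 80, "bandwidth_gb_s": 2039}  # A100 80 GB (HBM2e)

# Aggregate GPU memory per VM doubles from 320 GB to 640 GB.
print(GPUS_PER_VM * a100_40gb["hbm_gb"], "GB ->", GPUS_PER_VM * a100_80gb["hbm_gb"], "GB")

# Per-GPU memory bandwidth rises by roughly 30 percent.
uplift = a100_80gb["bandwidth_gb_s"] / a100_40gb["bandwidth_gb_s"] - 1
print(f"bandwidth uplift: {uplift:.0%}")  # about 31%
```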

The high-memory NDm A100 v4 series brings AI-supercomputer power to the masses, creating opportunities for all businesses to use it as a competitive advantage. Cutting-edge AI customers are already using both the 40 GB ND A100 v4 and 80 GB NDm A100 v4 VMs at scale for production AI and machine learning workloads, and seeing impressive performance and scalability: OpenAI for research and products, Meta for their leading AI research, Nuance for their comprehensive AI-powered voice-enabled solutions, numerous Microsoft internal teams for large-scale cognitive science model training, and many more.

“Some of our research models can take dozens, or even hundreds of NVIDIA GPUs to train optimally, and Azure’s ND A100 v4 product helps address the growing training demands of large AI models. Modern training techniques require not only powerful accelerators, but also a communication fabric between them, and Azure’s implementation of NVIDIA Quantum InfiniBand 200 GB/s networking with GPUDirect RDMA between each NVIDIA A100 GPU has allowed us to use PyTorch and the communication libraries we’re already familiar with, without modification.”—Myle Ott, Research Engineer, Meta AI Research

“The pace of innovation in conversational AI is gated in part by experimental throughput and turnaround time. With the ND A100 v4, we are able to not only complete experiments in half the time vs the NDv2 but also benefit from significant per-experiment PAYG cost savings. This will be a critical accelerant for the advancement of our Dragon Ambient eXperience technologies.”—Paul Vozila, VP, Central Research at Nuance Communications

"We live in the era of large-scale AI models, like the recently announced MT-NLG 530B. Training state-of-the-art Turing models at this size presented unprecedented challenges to the underlying training infrastructure, at the same time significantly raised the bar for acceleration, networking, stability, and availability. Similar to the collaborative research effort with NVIDIA Selene supercomputing infrastructure, Azure NDm A100 v4 with 80 GB of high bandwidth memory can remove many existing limits in scaling up models, such as increasing the maximum number of parameters and reducing the number of nodes required. Its performance and agility can provide a serious competitive edge to Azure customers in the race of advancing AI."—Microsoft Turing

The new high-memory NDm A100 v4 for data-intensive GPU compute workloads reaffirms Microsoft’s commitment to rapidly adopting and shipping the latest scale-up and scale-out GPU accelerator technologies to the public cloud.

We can’t wait to see what you’ll build, analyze, and discover with the new Azure NDm A100 v4 platform.

 

| Size | Physical CPU Cores | Host Memory | GPUs | Local NVMe Temporary Disk | NVIDIA Quantum InfiniBand Network | Azure Network |
|---|---|---|---|---|---|---|
| Standard_ND96amsr_A100_v4 | 96 | 1,900 GB | 8 x 80 GB NVIDIA A100 | 6,400 GB | 200 Gb/s per GPU | 40 Gbps |
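For readers who want to try the new size, provisioning follows the standard Azure CLI flow. The fragment below is a minimal, hedged sketch: the resource group name, VM name, and Ubuntu-HPC marketplace image URN are illustrative assumptions, not details from this post.

```shell
# Hypothetical provisioning sketch; resource group, VM name, and image URN are assumptions.
az vm create \
  --resource-group my-rg \
  --name ndm-a100-demo \
  --size Standard_ND96amsr_A100_v4 \
  --image microsoft-dsvm:ubuntu-hpc:2004:latest \
  --generate-ssh-keys
```

Note that NDm A100 v4 capacity is region-dependent, so the target region of `my-rg` must offer the size.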
