SR-IOV availability on NCv3 Virtual Machines SKU
Published date: October 17, 2019
As part of Azure’s ongoing commitment to providing industry-leading performance, we are announcing enhancements that enable support for all MPI implementations and versions, as well as RDMA verbs, on InfiniBand-equipped virtual machines. The rollout begins with NCv3 in early November 2019.
The upgrade WILL INVOLVE SERVER DOWNTIME on a regional basis. If you intend to use the InfiniBand network with MPI, it also REQUIRES AN UPDATE TO YOUR VMs. Please read below for full details.
WHAT’S COMING?
With the rapid growth of multi-node computation and model training, customers’ needs have evolved, as has the software they use. This update expands our support to the entire MPI stack, so you can use the InfiniBand RDMA network, through SR-IOV, for low-latency, high-bandwidth communication between VMs.
Intel MPI version 5.x will continue to be supported, as will all subsequent Intel MPI versions. In addition, the following will be supported: all other MPI implementations supported by the OpenFabrics Enterprise Distribution (OFED), Open MPI, and NVIDIA’s NCCL2 library, which provides optimized performance for GPUs. These enhancements will provide customers with higher InfiniBand bandwidth, lower latencies, and, most importantly, better distributed application performance.
IMPACT
All users of NCv3 SKUs will be impacted on a region-by-region basis (see schedule below). The update involves changes to both server hardware and software, which requires downtime. During downtime:
- NCv3 machines in the region will be unavailable for a 3-hour period
- All VMs on NCv3 machines in the region will be removed and re-deployed after the update
- Data stored on local (ephemeral) disks will be lost; Storage Accounts are unaffected
ACTION REQUIRED
To avoid data loss and minimize potential impact to your service, please take the following steps:
- Ensure all jobs are complete and data is backed up to your Storage Account before the scheduled update. Any data stored locally will be lost.
- Review the NCv3 update schedule. If needed, consider temporarily migrating to an alternate region. If you do, check your existing quota or request new quota in the intended alternate region(s).
- If your scenarios do not require InfiniBand or MPI:
  - You do not need to make any changes to your image or configuration.
- If you do require InfiniBand or MPI, please do the following:
  - For managed services supporting InfiniBand scenarios, please see service-specific guidance (e.g., Azure Batch, Azure Machine Learning).
  - We strongly recommend that you update your OS to a version that includes inbox drivers for InfiniBand. However, if your current image already includes inbox driver support for InfiniBand, we encourage you to test it beforehand (see the last bullet below).
  - Download and install the latest OFED driver if it is not already included in your image (a limited set of images may include it out of the box). See this article for complete steps.
  - Test your updated image and drivers on HB or HC VMs, which are already SR-IOV enabled.
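As a rough aid for the testing step above, the following Python sketch lists the InfiniBand devices and port states that a Linux VM exposes after the driver is installed. It assumes the standard Linux sysfs location `/sys/class/infiniband`; the function name `list_infiniband_ports` is hypothetical and chosen only for illustration. It is a quick sanity check, not a substitute for running your actual MPI workload.

```python
from pathlib import Path


def list_infiniband_ports(sysfs_root="/sys/class/infiniband"):
    """Return {device: {port: state}} for InfiniBand devices under sysfs_root.

    Assumption: the image runs Linux and the InfiniBand driver exposes
    devices in the usual sysfs layout (<device>/ports/<n>/state).
    """
    root = Path(sysfs_root)
    devices = {}
    if not root.is_dir():
        # Driver not loaded, or not a Linux system: nothing to report.
        return devices
    for dev in sorted(root.iterdir()):
        ports = {}
        ports_dir = dev / "ports"
        if ports_dir.is_dir():
            for port in sorted(ports_dir.iterdir()):
                state_file = port / "state"
                if state_file.is_file():
                    # The kernel reports e.g. "4: ACTIVE"; keep only the name.
                    raw = state_file.read_text().strip()
                    ports[port.name] = raw.split(":")[-1].strip()
        devices[dev.name] = ports
    return devices


if __name__ == "__main__":
    found = list_infiniband_ports()
    if not found:
        print("No InfiniBand devices found; check that a driver is installed.")
    for dev, ports in found.items():
        for port, state in ports.items():
            print(f"{dev} port {port}: {state}")
```

If the script reports no devices, or a port that is not in the ACTIVE state, the image likely still needs driver or configuration work before it is used for MPI jobs.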
For any questions or concerns, please reach out to Azure GPU Feedback (azurenfeedback@microsoft.com) or your Customer Service Support representative.