Deep Learning, Simulation and HPC Applications with Docker and Azure Batch

The Azure Big Compute team is happy to announce version 1.0.0 of the Batch Shipyard toolkit, which enables easy deployment of batch-style Dockerized workloads to Azure Batch compute pools. Azure Batch enables you to run parallel jobs in the cloud without having to manage the infrastructure. It’s ideal for parametric sweeps, Deep Learning training with NVIDIA GPUs, and simulations using MPI and InfiniBand.

Whether you need to run your containerized jobs on a single machine or on hundreds or even thousands of machines, Batch Shipyard blends the features of Azure Batch (handling the complexities of large-scale VM deployment and management, high-throughput and highly available job scheduling, and auto-scaling so that you pay only for what you use) with the power of Docker containers for application packaging. Batch Shipyard brings the deployment consistency and isolation of containers to your batch-style and HPC workloads, and lets you run them at any scale without developing directly against the Azure Batch SDK.
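To make the parametric sweep scenario mentioned above concrete, here is a minimal sketch in plain Python of how a parameter grid expands into one containerized command per combination. It does not use Batch Shipyard or the Azure Batch SDK; the image name, program name, and parameter names are hypothetical and exist only for illustration.

```python
import itertools
import shlex


def sweep_commands(image, program, grid):
    """Return one `docker run` command line per point in the parameter grid."""
    names = sorted(grid)  # fixed ordering so the output is reproducible
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        args = [f"--{name}={value}" for name, value in zip(names, values)]
        parts = ["docker", "run", "--rm", image, program, *args]
        # Quote each part so the line is safe to paste into a shell.
        commands.append(" ".join(shlex.quote(part) for part in parts))
    return commands


# Hypothetical image, program, and parameters (assumptions, not from the toolkit).
grid = {"learning-rate": [0.1, 0.01], "batch-size": [32, 64, 128]}
commands = sweep_commands("example/trainer:latest", "train.py", grid)
for command in commands:
    print(command)
```

Each generated line is independent of the others, which is what makes a sweep a natural fit for a pool of compute nodes: every combination can run as its own task.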

The initial release of Batch Shipyard has the following major features:

- Automated Docker Host Engine installation tuned for Azure Batch compute nodes
- Automated deployment of required Docker images to compute nodes
- Accelerated Docker image deployment at scale to compute pools with a large number of VMs, via private peer-to-peer distribution of Docker images among the compute nodes
- Automated Docker Private Registry instance creation on compute nodes, with Docker images backed to Azure Storage if specified
- Automatic shared data volume support for:
  - Azure File Docker Volume Driver installation and share setup for SMB/CIFS backed to Azure Storage, if specified
  - GlusterFS distributed network file system installation and setup, if specified
- Seamless integration with Azure Batch job, task and file concepts, along with full pass-through of the Azure Batch API to containers executed on compute nodes
- Support for Azure Batch task dependencies, allowing complex processing pipelines and graphs with Docker containers
- Transparent support for GPU-accelerated Docker applications on Azure N-Series VM instances (Preview)
- Support for multi-instance tasks to accommodate Dockerized MPI and multi-node cluster applications on compute pools, with automatic job cleanup
- Transparent assistance for running Docker containers that use InfiniBand/RDMA for MPI on low-latency HPC Azure VM instances (i.e., STANDARD_A8 and STANDARD_A9)
- Automatic setup of SSH tunneling to Docker Hosts on compute nodes, if specified
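
The task dependency feature in the list above can be pictured as a directed acyclic graph of containerized tasks. The sketch below is an illustration only: it uses Python's standard `graphlib` module (Python 3.9+) rather than Batch Shipyard or Azure Batch, and the task ids and the `execution_waves` helper are hypothetical. It groups tasks into waves, where every task in a wave has all of its dependencies satisfied and could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task id maps to the set of task ids it depends on.
pipeline = {
    "preprocess": set(),
    "train-a": {"preprocess"},
    "train-b": {"preprocess"},
    "evaluate": {"train-a", "train-b"},
}


def execution_waves(dependencies):
    """Group tasks into waves; tasks within a wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished to unlock the next one
    return waves


waves = execution_waves(pipeline)
print(waves)
```

In this example the two training tasks land in the same wave because both depend only on the preprocessing task, while the evaluation task waits for both of them.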

We’ve also made available an initial set of recipes that enable scenarios such as Deep Learning, Computational Fluid Dynamics (CFD), Molecular Dynamics (MD) and Video Processing with Batch Shipyard. In fact, we are aiming to make Deep Learning on Azure Batch an easy, low-friction experience. Once you have the toolkit installed and have Azure Batch and Azure Storage credentials, you can get CNTK, Caffe or TensorFlow running in an Azure Batch compute pool in under 15 minutes.

[Screenshot: CNTK running on a GPU-enabled STANDARD_NC6 VM via Batch Shipyard, shown with nvidia-smi]

We hope to continue to expand the repertoire of recipes available for Batch Shipyard in the future.

The Batch Shipyard toolkit can be found on GitHub. We welcome any feedback and contributions!

Fred Park
