Deep Learning, Simulation and HPC Applications with Docker and Azure Batch

By Fred Park, Principal Software Engineering Manager

Posted on September 22, 2016

The Azure Big Compute team is happy to announce version 1.0.0 of the Batch Shipyard toolkit, which enables easy deployment of batch-style Dockerized workloads to Azure Batch compute pools. Azure Batch enables you to run parallel jobs in the cloud without having to manage the infrastructure. It’s ideal for parametric sweeps, Deep Learning training with NVIDIA GPUs, and simulations using MPI and InfiniBand.

Whether you need to run your containerized jobs on a single machine or on hundreds or even thousands of machines, Batch Shipyard blends the features of Azure Batch (large-scale VM deployment and management, high-throughput and highly available job scheduling, and auto-scaling so that you pay only for what you use) with the power of Docker containers for application packaging. Batch Shipyard lets you harness the deployment consistency and isolation of containers for your batch-style and HPC workloads, and run them at any scale without developing directly against the Azure Batch SDK.

The initial release of Batch Shipyard has the following major features:

- Automated Docker Host Engine installation tuned for Azure Batch compute nodes
- Automated deployment of required Docker images to compute nodes
- Accelerated Docker image deployment at scale to compute pools consisting of a large number of VMs via private peer-to-peer distribution of Docker images among the compute nodes
- Automated Docker Private Registry instance creation on compute nodes with Docker images backed to Azure Storage if specified
- Automatic shared data volume support for:
  - Azure File Docker Volume Driver installation and share setup for SMB/CIFS backed to Azure Storage if specified
  - GlusterFS distributed network file system installation and setup if specified
- Seamless integration with Azure Batch job, task and file concepts along with full pass-through of the Azure Batch API to containers executed on compute nodes
- Support for Azure Batch task dependencies allowing complex processing pipelines and graphs with Docker containers
- Transparent support for GPU-accelerated Docker applications on Azure N-Series VM instances (Preview)
- Support for multi-instance tasks to accommodate Dockerized MPI and multi-node cluster applications on compute pools with automatic job cleanup
- Transparent assistance for running Docker containers utilizing InfiniBand/RDMA for MPI on HPC low-latency Azure VM instances (i.e., STANDARD_A8 and STANDARD_A9)
- Automatic setup of SSH tunneling to Docker Hosts on compute nodes if specified
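
To illustrate the task-dependency feature listed above, here is a minimal Python sketch of how a dependency graph resolves into an execution order. It is not Batch Shipyard's or Azure Batch's API: the task ids (`preprocess`, `train-a`, `train-b`, `evaluate`) and the `execution_waves` helper are hypothetical, and the sketch only uses Python's standard `graphlib` module (Python 3.9+) to show the general idea that a task becomes runnable once everything it depends on has completed.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline (names are illustrative only): each key is a task id,
# each value is the set of task ids that must finish before it can start.
dependencies = {
    "preprocess": set(),
    "train-a": {"preprocess"},
    "train-b": {"preprocess"},
    "evaluate": {"train-a", "train-b"},
}


def execution_waves(deps):
    """Group tasks into waves.

    Every task in a wave has all of its dependencies satisfied by earlier
    waves, so the tasks within one wave could run in parallel.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(dependencies)
for index, wave in enumerate(waves):
    print(f"wave {index}: {wave}")
```

In this sketch the two training tasks land in the same wave because both depend only on `preprocess`, while `evaluate` waits for both of them. A real scheduler would release each task as soon as its own dependencies complete rather than waiting for a whole wave.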

We’ve also made available an initial set of recipes that enable scenarios such as Deep Learning, Computational Fluid Dynamics (CFD), Molecular Dynamics (MD) and Video Processing with Batch Shipyard. In fact, we are aiming to make Deep Learning on Azure Batch an easy, low-friction experience. Once you have the toolkit installed and have Azure Batch and Azure Storage credentials, you can get CNTK, Caffe or TensorFlow running in an Azure Batch compute pool in under 15 minutes. Below is a screenshot of CNTK running on a GPU-enabled STANDARD_NC6 VM via Batch Shipyard with nvidia-smi:

[Screenshot: CNTK running on a GPU-enabled STANDARD_NC6 VM via Batch Shipyard, with nvidia-smi output]

We hope to continue to expand the repertoire of recipes available for Batch Shipyard in the future.

The Batch Shipyard toolkit can be found on GitHub. We welcome any feedback and contributions!
