Imagine reducing your training time for an epoch from 30 minutes to 30 seconds, and testing many different hyper-parameter configurations in parallel. Available now in public preview, Batch AI is a new service that helps you train and test deep learning and other AI or machine learning models with the same scale and flexibility used by Microsoft's data scientists. Managed clusters of GPUs enable you to design larger networks and run experiments in parallel and at scale, reducing iteration time and making development easier and more productive. Spin up a cluster when you need GPUs, then turn it off when you're done and stop the bill.
Developing powerful AI involves combining large data sets for training with clusters of GPUs for experimenting with network design and optimization of hyper-parameters. Having access to this capability as a service helps data scientists and AI researchers get results faster and focus on building better models instead of managing infrastructure. This is where Batch AI comes in as part of the Microsoft AI platform.
“Deep learning researchers require increasing computing time to train complex neural networks with big data. Large computing clusters on Microsoft Azure is one of the solutions to resolve our researchers' pain, and Azure Batch AI will be the key solution to connect on-premises and cloud environments. Preferred Networks is excited to integrate Chainer & ChainerMN with this service.” –Hiroshi Maruyama, Chief Strategy Officer, Preferred Networks, Inc.
Joseph Sirosh, Corporate Vice President of the Cloud AI Platform, spoke at the recent Microsoft Ignite conference about delivering Cloud AI for every developer with a comprehensive family of infrastructure for AI in Azure, services for AI, and tools to make AI development easier. Batch AI is part of this infrastructure, enabling easy distributed computing on Azure for parallel training, testing, and scoring. Scale out to as many GPUs as you need.
There’s a great demo in Joseph’s Ignite talk (25 minutes in) that shows an end-to-end experience of data wrangling, training at scale, and using a trained AI model in Excel. The model was developed initially using a Data Science Virtual Machine in Azure, then scaled out to speed up experimentation, hyper-parameter tuning, and training. Using Batch AI, our data scientists were able to scale from 1 to 148 GPUs for the model, reducing training time per epoch from 30 minutes to 30 seconds. This made a huge difference in productivity when you need to run thousands of epochs. Our data scientists were able to experiment with the network design and hyper-parameter values and see results quickly. A version of the code behind this demo will be available as a tutorial to use with Batch AI and the Azure Machine Learning Services and Workbench.
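To make the hyper-parameter tuning pattern concrete, here is a minimal, purely illustrative sketch of a parallel sweep. The objective function and parameter names are hypothetical stand-ins for a real training run, and the local thread pool stands in for the fleet of GPU nodes that Batch AI would fan the work out to:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_and_score(params):
    """Stand-in for one training job; returns (loss, params).

    A real job would train a network on a GPU node. Here a toy
    quadratic 'loss' is minimized at lr=0.01, batch_size=64.
    """
    lr, batch_size = params
    loss = (lr - 0.01) ** 2 + (batch_size - 64) ** 2 / 1e4
    return loss, params

def sweep(learning_rates, batch_sizes):
    """Evaluate every (lr, batch_size) pair in parallel, return the best."""
    grid = list(product(learning_rates, batch_sizes))
    with ThreadPoolExecutor() as pool:  # Batch AI: many machines, not threads
        results = list(pool.map(train_and_score, grid))
    return min(results)  # smallest loss wins

best_loss, best_params = sweep([0.1, 0.01, 0.001], [32, 64, 128])
print(best_params)  # (0.01, 64)
```

Each grid point is independent, which is why this workload scales almost linearly with the number of GPUs you add.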
What is Batch AI?
Batch AI provides an API and services specialized for AI workflows. The key concepts are clusters and jobs.
A cluster describes the compute resources you want to use. Batch AI enables:
- Provisioning clusters of GPUs or CPUs on demand
- Installing software in a container or with a script
- Automatic or manual scaling to manage costs
- Access to low priority virtual machines for learning and experimentation
- Mounting shared storage volumes for training and output data
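The cluster options above can be pictured as a single declarative specification. The sketch below is illustrative only — the field names are hypothetical and not the real Batch AI schema — but it shows how VM size, priority, scaling bounds, container setup, and storage mounts fit together:

```python
# Hypothetical cluster specification mirroring the options listed above.
# Field names are illustrative, not the actual Batch AI API schema.
cluster_spec = {
    "name": "nc6-cluster",
    "vm_size": "STANDARD_NC6",         # GPU-equipped VM size
    "vm_priority": "lowpriority",      # cheaper, preemptible nodes
    "scale": {"min_nodes": 0, "max_nodes": 4, "auto": True},
    "setup": {"container_image": "tensorflow/tensorflow:1.4.0-gpu"},
    "mounts": [{"file_share": "training-data", "path": "/mnt/data"}],
}

def validate(spec):
    """Basic sanity checks a provisioning layer might perform."""
    scale = spec["scale"]
    assert 0 <= scale["min_nodes"] <= scale["max_nodes"]
    assert spec["vm_priority"] in ("dedicated", "lowpriority")
    assert all(m["path"].startswith("/") for m in spec["mounts"])
    return True

validate(cluster_spec)
```

Scaling the minimum down to zero is what lets a cluster cost nothing between experiments.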
A job is the code you want to run — a command line with parameters. Batch AI supports:
- Using any deep learning framework or machine learning tools
- Direct configuration of options for popular frameworks
- Priority-based job queues for sharing a GPU quota or reserved instances
- Restarting jobs if a virtual machine becomes unavailable
- SDK, command-line, portal, and tools integration
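The priority-based queueing mentioned above can be sketched with a few lines of standard-library Python. This is a toy model, not the Batch AI scheduler: lower numbers run first, and ties are broken by submission order, which is one common way such a queue behaves.

```python
import heapq
import itertools

class JobQueue:
    """Toy priority queue: lower priority value runs first, FIFO on ties."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserving order

    def submit(self, name, priority=100):
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def next_job(self):
        _priority, _seq, name = heapq.heappop(self._heap)
        return name

q = JobQueue()
q.submit("hyperparam-sweep-7")         # default priority
q.submit("urgent-regression-test", 1)  # jumps ahead of the queue
q.submit("nightly-training")           # default priority, after sweep-7

print(q.next_job())  # urgent-regression-test
```

A shared queue like this is what lets several people share one GPU quota without urgent work waiting behind long sweeps.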
Building systems of intelligence
Dr. Yogendra Narayan Pandey, Data Scientist at Halliburton Landmark, used Azure Batch AI and Azure Data Lake to develop predictive deep learning algorithms for static reservoir modeling to reduce the time and risk in oil field exploration compared to traditional simulation. He shared his work at the Landmark Innovation Forum & Expo 2017.
“With the huge amounts of storage and compute power of the Azure cloud, we are entering the age of predictive model-based discovery. Batch AI makes it straightforward for data scientists to use the tools they already know. Without Azure Batch AI and GPUs, it would have taken hours if not days for each model training job to complete.”
Batch AI includes recipes for popular AI frameworks that help you get started quickly without needing to learn the details of working with Azure virtual machines, storage, and networking. The recipes include cluster and job templates to use with the Azure CLI, as well as Jupyter Notebooks that demonstrate using the Python API.
End-to-end productivity
The Batch AI team is working to integrate with Microsoft AI tools including the Azure Machine Learning services and Workbench for data wrangling, experiment management, deployment of trained models, and Visual Studio Code Tools for AI.
“We have long needed a service like Azure Batch AI. It is an appealing solution for deep learning engineers to speed up deep neural network training & hyper parameter search. I’m looking forward to creating end-to-end solutions by integrating our deep learning service CSLAYER and Azure Batch AI.” –Ryo Shimizu, President & CEO of UEI Corporation
Getting started
We invite you to try Batch AI for training your models in parallel and at scale in Azure. We have sample recipes for popular AI frameworks to help you get started. We recommend starting with low priority virtual machines to minimize costs.
With Batch AI, you only pay for the compute and storage used for your training; there is no additional charge for cluster management and job scheduling. Using low priority virtual machines is the most cost-effective way to learn and develop until you are ready to scale up.
The team would like to hear any feedback or suggestions you have. We’re listening on Azure Feedback, Stack Overflow, MSDN, and by email.