The ability to run Spark on a GPU-enabled cluster demonstrates a unique convergence of big data and high-performance computing (HPC) technologies. In the past several years, we've seen the GPU market explode as companies all over the world integrate AI and other HPC workflows into their businesses. TensorFlow, a framework designed to utilize GPUs for numerical computation and neural networks, has skyrocketed in popularity, a testament to the rise of AI and, consequently, the demand for GPUs. Simultaneously, the need for big data and powerful data processing engines has never been greater, as hundreds of companies collect data in the petabyte range.
By pairing high-performance hardware such as GPUs with big data engines such as Spark, data scientists and data engineers can enable many scenarios that would otherwise be difficult to achieve.
Along with the recent release of our latest GPU SKUs, I'm excited to share that we now support running Spark on a GPU-enabled cluster using the Azure Distributed Data Engineering Toolkit (AZTK). With a single command, AZTK allows you to provision on-demand GPU-enabled Spark clusters on top of Azure Batch's infrastructure, helping you take high-performance implementations that are usually single-node only and distribute them across your Spark cluster.
For this release, we have created several additional GPU-enabled Docker images for AZTK, including a Python image that comes packaged with Anaconda, Jupyter, and PySpark, and an R image that comes packaged with Tidyverse, RStudio Server, and SparklyR.
These images use the NVIDIA Docker Engine to give our Docker containers access to the host's GPUs. Because AZTK runs Spark in a completely containerized fashion, users can customize their own GPU Docker images to fit their specific needs. However, users who simply want to run Spark on a GPU-enabled cluster can do so without worrying about Docker at all: AZTK will automatically pull the appropriate image, giving you GPU access whenever GPUs are detected on the host machine.
Here's an example of how you can create a four-node GPU-enabled Spark cluster (a total of 224 GB of memory, four GPUs where one GPU = one-half of a K80 card, and 24 vCPUs) with AZTK:
$ aztk spark cluster create --id my_gpu_cluster --size 4 --vm-size standard_nc6
Since AZTK is aware that the Standard NC6 VMs come with NVIDIA's Tesla K80s, AZTK automatically selects one of the GPU-enabled Docker images when provisioning your cluster. Alternatively, you can also manually specify which image to use by setting the --docker-repo flag.
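For example, if you have published a custom GPU image to a Docker registry (the repository name below is illustrative, not an official AZTK image), you can provision the same cluster with it:
$ aztk spark cluster create --id my_gpu_cluster --size 4 --vm-size standard_nc6 --docker-repo myregistry/my-gpu-spark-image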
We also provide a sample that compares a simple PySpark job running on GPUs (via Numba) against the same job running on CPUs, to highlight the performance gain you can get when moving your Spark jobs to GPUs.
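To give a flavor of what such a job looks like, here is a minimal sketch (not the sample itself) of a PySpark job that offloads a numeric kernel to the GPU with Numba. It assumes numba and a CUDA-capable GPU are available on each worker, as they are on the GPU-enabled AZTK images; the kernel and names such as gpu_square are illustrative:

import numpy as np
from pyspark.sql import SparkSession

def gpu_square(iterator):
    # Compile the kernel on the worker itself so nothing CUDA-specific
    # needs to be serialized from the driver.
    from numba import vectorize

    @vectorize(['float32(float32)'], target='cuda')
    def square(x):
        return x * x

    # Batch the whole partition into one NumPy array so a single kernel
    # launch covers many rows instead of one launch per row.
    values = np.fromiter(iterator, dtype=np.float32)
    return square(values).tolist()

spark = SparkSession.builder.appName("numba-gpu-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(10 * 1000 * 1000), 4)
print(rdd.mapPartitions(gpu_square).sum())

Swapping target='cuda' for target='cpu' gives a CPU baseline for comparison. Once saved to a file, the script can be submitted to the cluster created above with AZTK's submit command (the script and job names below are placeholders):
$ aztk spark cluster submit --id my_gpu_cluster --name gpu-demo gpu_demo.py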
Whether you will be using Spark with GPUs for AI workflows such as TensorFlow/TensorFrames or distributed CNTK, or simply using it to speed up your computationally expensive Spark jobs, please let us know how you plan to take advantage of this unique convergence of HPC and big data technologies.
We look forward to you using these capabilities and hearing your feedback. Please contact us at askaztk@microsoft.com with feedback, or feel free to contribute to our GitHub repository.
Additional information
- Download and get started with the Azure Distributed Data Engineering Toolkit (AZTK)
- Please feel free to submit issues via GitHub
Additional resources
- See Azure Batch, the underlying Azure service used by the Azure Distributed Data Engineering Toolkit
- More on general-purpose HPC on Azure