Apache Spark jobs gain up to a 9x speedup with HDInsight IO Cache

Today, we are pleased to announce the preview of HDInsight IO Cache, a new transparent data caching feature of Azure HDInsight that provides customers with up to a 9x performance improvement for Apache Spark jobs. We know from our customers that, when it comes to analytics, the cost efficiency of managed cloud-based Apache Hadoop and Spark services is one of their major attractions. HDInsight IO Cache strengthens this key value proposition by improving performance without a corresponding increase in cost.

Architecture

Azure HDInsight is a cloud platform service for open source analytics that aims to bring together the best open source projects and integrate them natively on Azure. Many open source caching projects exist in the ecosystem: Alluxio, Ignite, and RubiX, to name a few prominent ones.

HDInsight IO Cache is based on RubiX. RubiX is one of the more recent projects and has a distinct architecture. Unlike other caching projects, it doesn’t reserve operating memory for caching purposes. Instead, it relies on recent advances in SSD technology to make explicit memory management unnecessary. Modern SSDs routinely provide more than 1 GB per second of bandwidth. Coupled with the operating system's automatic in-memory file cache, this provides more than enough bandwidth to feed big data processing engines such as Apache Spark. The operating memory in this architecture remains available for Apache Spark to process heavily memory-dependent tasks, such as shuffles, which allows it to achieve the highest resource utilization and further improves performance. Overall, this results in a large speedup for jobs that read data from remote cloud storage, which is the dominant architecture pattern in the cloud.
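To put the bandwidth claim in perspective, here is a rough back-of-the-envelope sketch. It assumes each worker's SSD sustains exactly 1 GB per second, that the data is already fully cached and spread evenly across workers, and it borrows the 16-worker, 1 TB setup from the benchmark later in this post. The function name `min_scan_seconds` is made up for illustration, and the result is an idealized lower bound, not a measurement.

```python
def min_scan_seconds(dataset_gb: float, nodes: int, gb_per_sec_per_node: float) -> float:
    """Idealized lower bound on the time to read a fully cached dataset once."""
    # Assumption: every node reads its share in parallel at full SSD bandwidth.
    aggregate_gb_per_sec = nodes * gb_per_sec_per_node
    return dataset_gb / aggregate_gb_per_sec


# Assumed figures: 1 TB taken as 1000 GB, 16 workers, 1 GB/s per worker.
print(min_scan_seconds(1000, 16, 1.0))  # 62.5
```

Real jobs rarely scan the whole dataset and are also bound by CPU and shuffle costs, so this only shows that local SSD bandwidth is unlikely to be the limiting factor.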

Getting started

Azure HDInsight IO Cache is available on Azure HDInsight 3.6 and 4.0 Spark clusters running the latest version of Apache Spark 2.3. During the preview, this feature is deactivated by default. To activate it, open the Ambari management UI of the cluster, select the HDInsight IO Cache service, then click Actions > Activate.

Figure: Ambari IO Cache service, Actions > Activate

Proceed by confirming restart of the affected services on the cluster.

Once activated, HDInsight IO Cache launches and manages RubiX Cache Metadata Servers on each worker node of the cluster. It also configures all services of the cluster for transparent use of RubiX cache. This allows you to benefit from caching without making any changes to Spark jobs. For example, when IO Cache is deactivated, this Spark code would perform a regular remote read from Azure Blob Storage:

spark.read.load('wasbs:///myfolder/data.parquet').count()

When IO Cache is activated, the same line of code performs a cached read through the RubiX cache, and subsequent reads fetch the data locally from SSD. Worker nodes on HDInsight clusters are equipped with locally attached, dedicated SSD drives. HDInsight IO Cache uses these local SSDs for caching, which minimizes latency and maximizes bandwidth.
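One way to observe this behavior is to time the same read several times. The sketch below is a generic timing helper, not part of HDInsight or RubiX. The name `time_reads` is hypothetical, and the commented usage assumes an active `spark` session on the cluster.

```python
import time
from typing import Callable, List


def time_reads(action: Callable[[], object], runs: int = 3) -> List[float]:
    """Run `action` `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        action()  # for example, a Spark read followed by count()
        durations.append(time.perf_counter() - start)
    return durations


# Hypothetical usage on a cluster where `spark` is an active SparkSession:
#   timings = time_reads(
#       lambda: spark.read.load('wasbs:///myfolder/data.parquet').count())
#   print(timings)
```

Later runs can be faster for reasons unrelated to IO Cache, such as Spark or operating system caching, so a fair comparison repeats the measurement with the service both activated and deactivated.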

Performance benchmarking

We compared the performance of an Azure HDInsight Spark cluster with IO Cache and corresponding optimized settings to the previous version of the cluster without the IO Cache feature. We used a benchmark derived from TPC-DS, consisting of 99 SQL queries that analyze a 1 TB dataset. The cluster configuration is the same in both cases: 16 D14 v2 worker node VMs running HDInsight 3.6 with Spark 2.3. The results show up to a 9x improvement in individual query run time and a 2.25x improvement in the geometric mean (geomean) across queries.
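For readers unfamiliar with the geomean metric, the sketch below shows how a per-query speedup and a geometric-mean speedup can be computed. The run times are invented purely for illustration and are not the benchmark's measurements. The helper names are hypothetical as well.

```python
from statistics import geometric_mean

# Hypothetical run times in seconds, NOT the measurements behind this post.
baseline_seconds = [120.0, 45.0, 300.0, 80.0]  # without IO Cache
cached_seconds = [40.0, 30.0, 60.0, 64.0]      # with IO Cache


def speedups(baseline, cached):
    """Per-query speedup: baseline time divided by cached time."""
    return [b / c for b, c in zip(baseline, cached)]


def geomean_speedup(baseline, cached):
    """Geometric mean of the per-query speedups."""
    return geometric_mean(speedups(baseline, cached))


per_query = speedups(baseline_seconds, cached_seconds)
print("best single-query speedup:", max(per_query))
print("geomean speedup:", geomean_speedup(baseline_seconds, cached_seconds))
```

The geometric mean is commonly preferred over the arithmetic mean for ratios like these because a single very large speedup does not dominate the summary. It also equals the ratio of the two geometric means of the raw run times.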

Figure: Top 20 TPC-DS queries

Figure: Total running time

Summary

HDInsight IO Cache is now available in preview on the latest Azure HDInsight Apache Spark clusters. Once enabled, it improves the performance of Spark jobs transparently, with no changes to the jobs required. This provides an excellent cost-to-performance ratio for cloud-based Spark deployments. Try it now on Azure HDInsight. For questions and feedback, please reach out to AskHDInsight@microsoft.com.

About HDInsight

Azure HDInsight is Microsoft’s premium managed offering for running open source workloads on Azure. Apache Spark clusters on HDInsight provide best-in-class security features, authorization and audit controls, and development tools for authoring and diagnosing Spark jobs. Azure HDInsight powers mission-critical applications for some of our top customers in a wide range of sectors, including manufacturing, retail, education, nonprofit, government, healthcare, media, banking, telecommunication, and insurance. These industries bring a range of use cases, including ETL, data warehousing, machine learning, IoT, and many more.