Apache Spark for Azure HDInsight

Apache Spark in the cloud for mission-critical deployments

What is Apache Spark?

Apache Spark is an open-source processing framework that runs large-scale data analytics applications. Spark is built on an in-memory compute engine, which enables high-performance querying on big data. It takes advantage of a parallel data-processing framework that persists data in memory and on disk if needed. This allows Spark to deliver up to 100 times faster performance and a common execution model for tasks such as extract, transform, load (ETL), batch processing, interactive queries and others on data in an Apache Hadoop Distributed File System (HDFS). Azure makes Apache Spark easy and cost-effective to deploy, with no hardware to buy, no software to configure, a full notebook experience for authoring compelling narratives and integration with partner business intelligence tools.

The Apache Spark core engine provides a processing framework that can combine different types of processing, including Spark SQL, Spark Streaming, MLlib (machine learning) and GraphX (graph computation).

One execution model for multiple tasks

Apache Spark takes advantage of a common execution model for doing multiple tasks such as ETL, batch queries, interactive queries, real-time streaming, machine learning and graph processing on data stored in Azure Data Lake Store. This allows you to use Spark for Azure HDInsight to solve big data challenges in near real-time, such as fraud detection, click-stream analysis, financial alerts, telemetry from Internet of Things (IoT) sensors and devices, social analytics, always-on ETL pipelines and network monitoring.

In-memory processing for interactive scenarios

Today’s users expect quick answers to their questions instead of waiting minutes, hours or days. Apache Spark delivers by persisting data in memory, which makes queries up to 100 times faster when processing large datasets in Hadoop. This makes Spark for Azure HDInsight ideal for speeding up intensive big data applications.

Use IntelliJ IDEA for native developer experiences and remote debugging

To make development on Spark easier, we’ve introduced deep integration with IntelliJ IDEA so that you can code with native authoring support for Scala and Java. You can debug remotely, which gives you flexibility in your development life cycle, and submit the application to Azure when it is ready. Spark for HDInsight clusters also come pre-loaded with the most popular Python libraries (Anaconda) for machine learning.

Take advantage of BI tools to interactively analyse big data

For business analysts, we offer integration with Power BI alongside other business intelligence tools such as Tableau, SAP BusinessObjects Lumira and QlikView. This lets you build interactive visualisations over data of any size. In addition to the traditional dashboards, Power BI gives you a streaming connector that integrates with Spark, which allows you to publish real-time events from Spark Streaming directly to Power BI.

Out-of-the-box notebook experience

Unlike other Spark offerings, which require you to install your own notebooks or use proprietary ones, Spark for HDInsight has out-of-the-box integration with Jupyter (IPython), the most popular open-source notebook on the market. This allows you to create narratives that combine code, statistical equations and visualisations to tell a story about the data. To make integration easier for you, we worked with the Jupyter community to enhance the kernel and allow Spark execution through a REST endpoint, which gives a compelling experience for data scientists.

Integrated with R Server – a large R-compatible parallel analytics and machine-learning library

Use Spark for Azure HDInsight as an engine to run R Server, which has a large parallel analytics and machine-learning library built to work with the open-source R language. This lets you combine the familiarity of R with the enterprise scale of R Server running on Spark. Multi-threaded maths libraries and transparent parallelisation in R Server, combined with Spark, mean that you can handle up to 1000 times more data and run up to 50 times faster than open-source R, which helps you to train more accurate models for better predictions than before.

Highest availability for business continuity

To run Spark at the highest scale, Microsoft offers the industry’s highest availability SLA, at 99.9%, to ensure your business continuity and protection against catastrophic events. We co-led the Livy project with Cloudera to create an open-source, Apache-licensed REST web service for managing long-running Spark contexts and submitting Spark jobs. This new capability is designed to make Spark a more robust back-end for running interactive notebooks and to allow other applications to take advantage of Spark for their interactive workloads.
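
To give a feel for what managing Spark contexts and submitting jobs through a REST web service can look like, here is a minimal Python sketch that builds (but does not send) requests for Livy’s `/sessions`, `/sessions/{id}/statements` and `/batches` endpoints. The cluster URL, credentials, file path and class name are hypothetical placeholders, and the use of basic authentication and an `X-Requested-By` header is an assumption about how the cluster gateway is configured, not a statement of the service’s required setup.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint for illustration; replace with your own cluster's Livy URL.
LIVY_URL = "https://example-cluster.azurehdinsight.net/livy"


def livy_request(path, payload, user, password, base_url=LIVY_URL):
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    # Assumption: the gateway in front of Livy accepts HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
            # Assumption: Livy may reject POSTs without this header when
            # CSRF protection is enabled.
            "X-Requested-By": user,
        },
        method="POST",
    )


def new_session(user, password, kind="pyspark"):
    """Request a long-running interactive Spark context (a Livy session)."""
    return livy_request("/sessions", {"kind": kind}, user, password)


def run_statement(session_id, code, user, password):
    """Request execution of a code snippet inside an existing session."""
    path = f"/sessions/{session_id}/statements"
    return livy_request(path, {"code": code}, user, password)


def submit_batch(file, class_name, args, user, password):
    """Request submission of a packaged Spark application as a batch job."""
    payload = {"file": file, "className": class_name, "args": list(args)}
    return livy_request("/batches", payload, user, password)
```

Passing any of the returned objects to `urllib.request.urlopen` would send the request; that step needs a real cluster and valid credentials, so it is left out of the sketch.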

Analyse any data of any size without changes as data grows

To make sure that Spark runs at scale, we integrated Spark with Azure Data Lake Store. This integration is uniquely available from Microsoft and allows Spark to store and process data that scales to any size, without forcing changes to your application as data grows. Through this integration, you can implement role-based data access controls at the storage level.

Real-time processing for real-time scenarios

Today’s connected world is defined by big data that arrives in real time. Spark Streaming for HDInsight is ideal for challenging real-time scenarios. It opens up opportunities such as Internet of Things (IoT) scenarios, real-time remote management and monitoring, and getting insights from devices such as mobile phones or connected cars.

Easy setup, fast results

There’s no time-consuming installation or setup with Spark for HDInsight. Azure does it for you. You’ll be up and running in minutes and can deploy Spark without buying new hardware or paying other up-front costs.

Elastic capacity for big data

Spark for HDInsight takes advantage of the power of Azure, which makes it easier for you to create clusters of any size to process any amount of data on demand. You only pay for the compute and storage that you use.

Try HDInsight for free