Apache Spark for Azure HDInsight

Apache Spark in the cloud for mission-critical deployments

What is Apache Spark?

Apache Spark is an open-source processing framework that runs large-scale data analytics applications. Built on an in-memory compute engine, Spark enables high-performance querying on big data. It uses a parallel data processing framework that persists data in memory and spills to disk when needed. This lets Spark run workloads up to 100 times faster and apply a common execution model to tasks such as extract, transform, load (ETL), batch processing and interactive queries on data in a Hadoop Distributed File System (HDFS). Azure makes Apache Spark easy and cost-effective to deploy, with no hardware to buy, no software to configure, a full notebook experience for authoring compelling narratives and integration with partner business intelligence tools.

The Apache Spark core engine provides a processing framework that can combine different types of processing, including Spark SQL, Spark Streaming, MLlib (machine learning) and GraphX (graph computation).

One execution model for multiple tasks

Apache Spark leverages a common execution model for doing multiple tasks such as ETL, batch queries, interactive queries, real-time streaming, machine learning and graph processing on data stored in Azure Data Lake Store. This allows you to use Spark for Azure HDInsight to solve big data challenges in near real-time, such as fraud detection, click stream analysis, financial alerts, telemetry from connected sensors and devices (Internet of Things, IoT), social analytics, always-on ETL pipelines and network monitoring.

In-memory processing for interactive scenarios

Today’s users expect quick answers to their questions instead of waiting minutes, hours or days. Apache Spark delivers by persisting data in memory, which makes queries over large datasets in Hadoop up to 100 times faster. This makes Spark for Azure HDInsight ideal for speeding up intensive big data applications.

Native developer experiences and remote debugging using IntelliJ IDEA

To make development on Spark easier, we’ve introduced deep integration with IntelliJ IDEA so that developers can code with native authoring support for Scala and Java. Remote debugging gives you flexibility in your development life cycle, and you can submit the application to Azure when it is ready. Spark for HDInsight clusters also come pre-loaded with the most popular Python libraries (Anaconda) for machine learning.

Leverage BI tools to interactively analyse big data

For business analysts, we offer integration with Power BI alongside other business intelligence tools such as Tableau, SAP Lumira and QlikView. This lets you build interactive visualisations over data of any size. In addition to traditional dashboards, Power BI offers a streaming connector that integrates with Spark, allowing you to publish real-time events from Spark Streaming directly to Power BI.

Out-of-the-box notebook experience

Unlike other Spark offerings, which require you to install your own notebooks or use proprietary ones, Spark for HDInsight has out-of-the-box integration with Jupyter (IPython), the most popular open-source notebook on the market. This allows you to create narratives that combine code, statistical equations and visualisations to tell a story about the data. To simplify the integration for our customers, we worked with the Jupyter community to enhance the kernel so that it runs Spark code through a REST endpoint, which gives a compelling experience for data scientists.

Integrated with R Server – the largest R-compatible parallel analytics and ML library

Spark for Azure HDInsight can be used as an engine to run R Server, which has the largest parallel analytics and machine-learning library built to work with the open-source R language. This lets you combine the familiarity of R with the enterprise scale of R Server running on Spark. Multi-threaded maths libraries and transparent parallelisation in R Server, combined with Spark, let you handle up to 1000 times more data at up to 50 times the speed of open-source R, helping you train more accurate models for better predictions than previously possible.

Highest availability guarantee for business continuity

To run Spark at the highest scale, Microsoft provides the industry’s highest availability guarantee, a 99.9% SLA, to ensure your business continuity and protect against catastrophic events. To support this, we co-led the Livy project with Cloudera to create an open-source, Apache-licensed REST web service for managing long-running Spark contexts and submitting Spark jobs. This capability was designed to make Spark a more robust back-end for running interactive notebooks and to allow other applications to use Spark for their interactive workloads.
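
Because Livy is a REST service, submitting work to Spark comes down to posting JSON. The sketch below builds, without sending, the request for a batch job and the request for a statement in an interactive session. The batches and sessions/{id}/statements paths and the file, className, args and code fields follow the Livy REST API. The https://<cluster>.azurehdinsight.net/livy base URL, basic authentication and the X-Requested-By header are assumptions about how an HDInsight cluster exposes Livy, and the cluster name, credentials, jar path and class name are hypothetical placeholders.

```python
import base64
import json
import urllib.request


def livy_request(cluster, path, payload, user, password):
    """Build (but do not send) a JSON POST request for a Livy endpoint.

    Assumes the cluster exposes Livy under /livy and accepts basic auth.
    """
    url = f"https://{cluster}.azurehdinsight.net/livy/{path.lstrip('/')}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
            "X-Requested-By": user,  # assumed to be required by the cluster
        },
        method="POST",
    )


def batch_job(cluster, jar_path, class_name, args, user, password):
    """Request that submits a packaged Spark application as a Livy batch."""
    payload = {"file": jar_path, "className": class_name, "args": list(args)}
    return livy_request(cluster, "batches", payload, user, password)


def statement(cluster, session_id, code, user, password):
    """Request that runs a code snippet in an existing interactive session."""
    path = f"sessions/{session_id}/statements"
    return livy_request(cluster, path, {"code": code}, user, password)


# Hypothetical cluster, credentials and application, for illustration only.
req = batch_job("mycluster", "wasb:///example/jars/app.jar",
                "com.example.App", ["10"], "admin", "secret")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the returned object to urllib.request.urlopen would send it. Keeping construction separate from sending makes the payload easy to inspect before anything reaches a live cluster.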

Analyse any data of any size without changes as data grows

To make sure that Spark runs at scale, we integrated Spark with Azure Data Lake Store. This integration is uniquely available from Microsoft and allows Spark to store and process data that scales to any size without forcing changes to your application as data grows. Through this integration, you can also implement role-based data access controls at the storage level.
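
One way to read "without changes to your application" is that a job addresses its data by URI, so the same path keeps working as the folder behind it grows. A minimal sketch, assuming the adl:// scheme and azuredatalakestore.net host name used by Azure Data Lake Store; the account name and folder are hypothetical.

```python
def adl_uri(account, path):
    """Return an adl:// URI for a path in an Azure Data Lake Store account."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


# Hypothetical account and folder; a Spark job would hand this string
# to whichever reader it uses to load the data.
CLICKSTREAM = adl_uri("contosolake", "/clickstream/2016/")
print(CLICKSTREAM)
```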

Real-time processing for real-time scenarios

Today’s connected world is defined by big data that arrives in real time. Spark Streaming for HDInsight is ideal for challenging real-time scenarios. It supports IoT scenarios such as real-time remote management and monitoring, as well as gaining insights from devices such as mobile phones or connected cars.

Easy setup, fast results

With Spark for HDInsight, there’s no time-consuming installation or setup. Azure does it for you. You’ll be up and running in minutes and can deploy Spark without buying new hardware or facing other up-front costs.

Elastic capacity for big data

Spark for HDInsight leverages the power of the Azure cloud, making it easier to create clusters of any size to process any amount of data on demand. We only charge for the compute and storage you actually use.
