Apache Spark for Azure HDInsight
Apache Spark in the cloud for mission-critical deployments
- Managed and supported with the highest SLA availability
- Analyze any data of any size without changes as your data grows
- Make your data more secure with role-based access controls at the storage level
- Out-of-the-box notebook integration to interactively explore patterns
- Integrated with popular business intelligence tools
- Use IntelliJ IDEA for native developer experiences and remote debugging
- Speed up querying on big data by up to 100x with in-memory technologies
- Querying, streaming, machine learning, and graph processing with the same execution model
What is Apache Spark?
Apache Spark is an open-source processing framework that runs large-scale data analytics applications. Spark is built on an in-memory compute engine, which enables high-performance querying on big data. It uses a parallel data processing framework that persists data in memory and on disk if needed. This allows Spark to deliver queries up to 100x faster and a common execution model for tasks such as extract, transform, load (ETL), batch, interactive queries, and others on data in an Apache Hadoop Distributed File System (HDFS). Azure makes Apache Spark easy and cost-effective to deploy with no hardware to buy, no software to configure, a full notebook experience to author compelling narratives, and integration with partner business intelligence tools.
One execution model for multiple tasks
Apache Spark uses a common execution model for multiple tasks such as ETL, batch queries, interactive queries, real-time streaming, machine learning, and graph processing on data stored in Azure Data Lake Store. This allows you to use Spark for Azure HDInsight to solve big data challenges in near real time, such as fraud detection, clickstream analysis, financial alerts, telemetry from Internet of Things (IoT) sensors and devices, social analytics, always-on ETL pipelines, and network monitoring.
In-memory processing for interactive scenarios
Customers today expect quick answers to their questions, instead of waiting minutes, hours, or days. Apache Spark delivers by persisting data in memory to run queries up to 100x faster while processing large datasets in Hadoop. This makes Spark for Azure HDInsight ideal for speeding up intensive big data applications.
Use IntelliJ IDEA for native developer experiences and remote debugging
To make development on Spark easier, we introduced deep integration with IntelliJ IDEA to allow you to code with native authoring support for Scala and Java. You can do remote debugging, which gives you flexibility in your development lifecycle and the ability to submit the application to Azure when ready. Spark for HDInsight clusters also come pre-loaded with the most popular Python libraries (Anaconda) for machine learning.
Take advantage of BI tools to interactively analyze big data
For business analysts, we offer integration with Power BI alongside other business intelligence tools like Tableau, SAP BusinessObjects Lumira, and QlikView. This lets you build interactive visualizations over data of any size. In addition to the traditional dashboards, Power BI gives you a streaming connector that integrates with Spark, which allows you to publish real-time events from Spark Streaming directly to Power BI.
Out-of-the-box notebook experience
Unlike other Spark offerings, which require you to install your own notebooks or use proprietary ones, Spark for HDInsight has out-of-the-box integration with Jupyter (IPython), the most popular open-source notebook in the market. This allows you to create narratives that combine code, statistical equations, and visualizations that tell a story about the data. To make integration easier for you, we worked with the Jupyter community to enhance the kernel and allow Spark execution through a REST endpoint, which gives a compelling experience for data scientists.
Integrated with R Server, a large R-compatible parallel analytics and machine learning library
Use Spark for Azure HDInsight as an engine to run R Server, which has a large parallel analytics and machine learning library built to work with the open-source R language. This lets you take advantage of the familiarity of R, with the enterprise scale of R Server running on Spark. Multithreaded math libraries and transparent parallelization in R Server, combined with Spark, mean handling up to 1000x more data at up to 50x faster speeds than open-source R, which helps you train more accurate models for better predictions than before.
Highest availability for business continuity
To run Spark at the highest scale, Microsoft gives you the industry’s highest availability SLA at 99.9% to ensure your business continuity and protection against catastrophic events. We co-led the Livy project with Cloudera to create an open-source, Apache-licensed REST web service for managing long-running Spark contexts and submitting Spark jobs. This capability is designed to make Spark a more robust back end for running interactive notebooks and to allow other applications to take advantage of Spark for their interactive workloads.
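As a rough illustration of what the Livy REST service looks like from a client, the Python sketch below builds (without sending) the HTTP requests for starting a long-running interactive session, running a statement in it, and submitting a batch job. The `/sessions`, `/sessions/{id}/statements`, and `/batches` paths and the `kind`, `code`, `file`, `className`, and `args` fields follow the Apache Livy REST API. The base URL pattern, the credentials, and the application path are placeholders and assumptions, not values taken from this page.

```python
import base64
import json
import urllib.request

# Assumed URL pattern for the Livy endpoint of an HDInsight cluster;
# CLUSTERNAME is a placeholder for your own cluster name.
EXAMPLE_BASE_URL = "https://CLUSTERNAME.azurehdinsight.net/livy"


def livy_request(base_url, path, payload, user, password):
    """Build (but do not send) an authenticated JSON POST for a Livy path."""
    body = json.dumps(payload).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
            # Livy can be configured to reject POSTs that lack this header
            # (CSRF protection); sending it is harmless otherwise.
            "X-Requested-By": user,
        },
    )


def new_session_request(base_url, user, password, kind="pyspark"):
    """Request that asks Livy to start a long-running Spark context."""
    return livy_request(base_url, "/sessions", {"kind": kind}, user, password)


def statement_request(base_url, session_id, code, user, password):
    """Request that runs a code snippet inside an existing session."""
    path = f"/sessions/{session_id}/statements"
    return livy_request(base_url, path, {"code": code}, user, password)


def batch_request(base_url, app_file, user, password, class_name=None, args=()):
    """Request that submits a packaged Spark application as a batch job."""
    payload = {"file": app_file, "args": list(args)}
    if class_name:
        payload["className"] = class_name  # only needed for JVM applications
    return livy_request(base_url, "/batches", payload, user, password)
```

To actually submit a request, pass the returned object to `urllib.request.urlopen`. Keeping request construction separate from sending makes the payloads easy to inspect or log before anything reaches the cluster.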
Analyze any data of any size without changes as data grows
To make sure Spark runs at scale, we integrated Spark with Azure Data Lake Store. This integration is uniquely available from Microsoft and allows Spark to store and process data that scales to any size, without forcing changes to your application as data grows. Through this integration, you can implement role-based data access controls at the storage level.
Real-time processing for real-time scenarios
Today’s connected world is defined by big data that arrives in real time. Spark Streaming for HDInsight is ideal for challenging real-time scenarios. It opens up opportunities including Internet of Things (IoT) scenarios, real-time remote management and monitoring, and getting insights from devices like mobile phones or connected cars.
Easy setup, fast results
There’s no time-consuming installation or setup with Spark for HDInsight. Azure does it for you. You’ll be up and running in minutes, and can deploy Spark without buying new hardware or paying other up-front costs.
Elastic capacity for big data
Spark for HDInsight takes advantage of the power of Azure, which makes it easier for you to create clusters of any size to process any amount of data on demand. You only pay for the compute and storage that you use.