Today, we are pleased to announce that Apache Spark v1.6.1 for Azure HDInsight is generally available. Since we announced the public preview, Spark for HDInsight has seen rapid adoption and now accounts for 50% of all new HDInsight clusters deployed. With GA, we are revealing improvements we’ve made to harden Spark for the enterprise and make it easier for your users. These include improvements to the availability, scalability, and productivity of our managed Spark service.
What is Apache Spark?
Apache Spark is an open source processing framework that runs large-scale data analytics applications in-memory. This allows Spark to deliver queries up to 100 times faster than traditional big data solutions, along with a common execution model for tasks such as extract-transform-load (ETL) processes, batch queries, interactive queries, real-time streaming, machine learning, and graph processing on stored data.
What is Microsoft’s Apache Spark for Azure HDInsight?
Microsoft has been on a journey to make big data easy and more approachable. This is encompassed in Cortana Intelligence, our big data and analytics suite. As part of this solution, we offer Azure HDInsight, Microsoft’s managed Hadoop and Spark cloud service that runs the Hortonworks Data Platform. Spark for Azure HDInsight offers customers an enterprise-ready Spark solution that’s fully managed, secured, and highly available and made simpler for users with compelling and interactive experiences.
- Enterprise-ready Spark implementation: We’ve drawn on years of experience working with enterprise customers, running some of the largest data projects in the world. For Spark to run at this scale, Microsoft had to ensure that it is highly available, scalable, and secure.
  - For high availability, Microsoft worked with Hortonworks to add capabilities to the YARN resource manager and co-led “Project Livy” with Cloudera and other organizations to create an open source, Apache-licensed REST web service for managing long-running Spark contexts and submitting Spark jobs. This new capability was designed to make Spark a more robust back-end for running interactive notebooks and to allow other applications to leverage Spark for their interactive workloads. By ensuring high availability with Spark, we now offer the highest guarantee for Spark in the market with a 99.9% service level agreement.
  - To ensure that Spark will run at scale, we are announcing integration between Spark and Azure Data Lake Store. This will allow Spark to store and process data of any size on a repository designed for the cloud, one that captures data of any size, type, and speed without forcing changes to your application as data scales.
  - For securing Spark, we are enabling role-based data access at the storage level through the integration of Spark and Data Lake Store.
- Spark made simpler: Our goal with big data is to make it accessible for everybody. With Spark for HDInsight, we have designed new productivity experiences for the different audiences that use Spark including the data engineer working on ETL jobs, the data scientists who are performing experimentation and the business analysts who are creating dashboards.
  - For data engineers and developers, we introduced deep integration with the IntelliJ IDE. This allows developers to code with native authoring support for Scala and Java, local testing, remote debugging, and the ability to submit Spark applications to the Azure cloud.
  - For data scientists, we introduced out-of-the-box integration with Jupyter (IPython) notebooks, allowing you to create narratives that combine code, statistical equations, and visualizations that tell a story about the data. This environment is ideal for extracting data from any source and iteratively building ML models while writing exploratory queries to visualize and understand properties of the data. We made this possible by working with the Jupyter OSS community to enhance the kernel to allow Spark execution through a REST endpoint. As a result, Jupyter notebooks are now accessible within HDInsight without additional setup.
  - For business analysts, we offer integration with Power BI alongside other BI tools like Tableau, SAP Lumira, and QlikView. This lets you build interactive visualizations over data of any size. In addition to the traditional dashboards, Power BI offers a streaming connector that integrates with Spark, allowing you to publish real-time events from Spark Streaming directly to Power BI.
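To make the Livy REST service mentioned above more concrete, here is a minimal Python sketch of how a batch job submission might be assembled. It is an illustration under assumptions, not the service's documented client: the `build_livy_batch_request` helper, the cluster name, the credentials, the jar path, and the class name are all hypothetical, and the URL pattern, basic authentication, and `X-Requested-By` header are assumptions about how a given cluster exposes Livy. The request is only built, not sent.

```python
import base64
import json
import urllib.request


def build_livy_batch_request(cluster_name, user, password, jar_path, class_name, args=None):
    """Build (but do not send) a POST request for Livy's /batches endpoint.

    All argument values are placeholders supplied by the caller.
    """
    # Assumed URL pattern for the Livy endpoint of an HDInsight cluster.
    url = f"https://{cluster_name}.azurehdinsight.net/livy/batches"

    # Livy batch submissions take a JSON body naming the application file,
    # its main class, and any command-line arguments.
    payload = {
        "file": jar_path,
        "className": class_name,
        "args": list(args or []),
    }

    # Assumption: the cluster gateway uses HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
        # Assumption: some Livy configurations require this header on POSTs.
        "X-Requested-By": user,
    }

    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    req = build_livy_batch_request(
        "mycluster",
        "admin",
        "example-password",
        "wasb:///example/jars/app.jar",
        "com.example.App",
        ["10"],
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually submit the job, pass the request to urllib.request.urlopen(req).
```

Building the request separately from sending it keeps the sketch safe to run anywhere and makes the payload easy to inspect before pointing it at a real cluster.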
How do I get started?
To get started, customers will need to have an Azure subscription or a free trial to Azure. With this in hand, you should be able to get a Spark cluster up and running in minutes by going through this getting started guide.
Also, watch the related Channel 9 video from Azure Friday.
Documentation and How-To’s:
- Overview of Apache Spark for Azure HDInsight
- Getting started with Spark for Azure HDInsight
- Using BI tools with Spark for HDInsight to do interactive analysis on big data
- Kernels available for Jupyter notebooks with HDInsight Spark
- Use external packages with Jupyter notebooks in Apache Spark clusters
- Install Jupyter notebook on your computer and connect to Apache Spark cluster on Azure HDInsight
- Build Machine Learning applications to run on Apache Spark on HDInsight
- Predictive analysis on food inspection data using MLlib with Spark
- Use HDInsight Tools Plugin for IntelliJ IDEA to create Spark Scala applications
- Create a standalone Scala application to run on HDInsight Spark
- Spark Streaming: Process events from Azure Event Hubs with Apache Spark
- Analyze website logs using a custom library with HDInsight Spark on Linux
- Submit Spark jobs remotely to an HDInsight Spark cluster on Linux using Livy
- Manage resources for the Apache Spark cluster on HDInsight Linux