Update as of 5/2/2016: Azure HDInsight now supports Apache Spark out of the box. Find more information here.
Today we are pleased to announce the refresh of Apache Spark support on Azure HDInsight clusters. Spark is available on HDInsight through a custom script action, and today we are updating it to support the latest version, Spark 1.2; the previous version supported Spark 1.0. This update also adds Spark SQL support to the package. The Spark 1.2 script action requires the latest version of HDInsight clusters, 3.2. Older HDInsight clusters will get the previous version, Spark 1.0, when customized with the Spark script action. Follow the steps below to create a Spark cluster using the Azure Portal:
- Choose New HDInsight Hadoop cluster (other cluster types are also supported) using the custom create option.
- Select version 3.2 of the cluster.
- Complete the rest of the wizard's steps to specify the cluster details, such as the cluster name, its storage account, and other configuration.
- In the last step of the configuration wizard, add the Spark script action: https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv03/spark-installer-v03.ps1.
- Click the check mark to create the cluster. When the operation completes, your HDInsight cluster will have Spark 1.2 installed on it.
Running Spark SQL queries in Spark Shell
The new version of the Spark package includes Spark SQL. Spark SQL allows you to use Spark to run relational queries expressed in SQL, HiveQL, or Scala. Using this functionality, you can run Hive queries in the Spark shell.
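Before walking through the Hive example on the cluster, here is a minimal sketch of Spark SQL over a plain RDD, along the lines of the Spark 1.2 programming guide. It assumes a Spark 1.2 shell where sc is the SparkContext; the Person case class and the people.txt input path are hypothetical placeholders:
// Hypothetical schema and input file, for illustration only
case class Person(name: String, age: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Brings in the implicit conversion from RDD[Person] to SchemaRDD
import sqlContext.createSchemaRDD
val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")
// Run a relational query over the registered table
sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect().foreach(println)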
- Open Remote Desktop Connection to the cluster. For instructions, see Connect to HDInsight clusters using RDP.
- Open the Hadoop Command Line using the desktop shortcut, and navigate to the location where Spark is installed, C:\apps\dist\spark-1.2.0.
- Run the following command to start the Spark shell.
.\bin\spark-shell --master yarn
- At the Scala prompt, set the Hive context. This is required to work with Hive queries using Spark.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
- Run a Hive query and print the output to the console. The query retrieves data from a sample Hive table that exists on every HDInsight cluster; it selects devices of a specific make and limits the number of records retrieved to 20. The triple quotes are Scala syntax for embedding quotes in a string.
hiveContext.sql("""SELECT * FROM hivesampletable WHERE devicemake LIKE "HTC%" LIMIT 20""").collect().foreach(println)
- You should see the matching rows from hivesampletable printed to the console.
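From here you can keep working with the result set in the same shell session. The sketch below, which assumes the same hivesampletable, registers a query result as a temporary table and runs an aggregation over it; the htcdevices table name is just an illustration:
val htcDevices = hiveContext.sql("SELECT devicemake, deviceplatform FROM hivesampletable WHERE devicemake LIKE 'HTC%'")
// Expose the result set to further SQL statements in this session
htcDevices.registerTempTable("htcdevices")
// Count the HTC devices per platform and print each row
hiveContext.sql("SELECT deviceplatform, COUNT(*) AS cnt FROM htcdevices GROUP BY deviceplatform").collect().foreach(println)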
You can find more information on installation steps and usage of Spark in the documentation links below. You can also install R, Solr, and Giraph using script actions, as well as create your own: