How to install Spark 1.2 on Azure HDInsight clusters

By Maxim Lukiyanov Principal Program Manager, Azure Machine Learning

How to install Spark 1.2 on Azure HDInsight clusters • 2 min read

Posted on March 16, 2015
2 min read

Update as of 5/2/2016: Azure HDInsight now supports Apache Spark out of the box. Find more information here.

Today we are pleased to announce the refresh of the Apache Spark support on Azure HDInsight clusters. Spark is available on HDInsight through custom script action and today we are updating it to support the latest version of Spark 1.2. The previous version supported version 1.0. This update also adds Spark SQL support to the package. Spark 1.2 script action requires latest version of HDInsight clusters 3.2. Older HDInsight clusters will get previous version of Spark 1.0 when customized with Spark script action. Follow the below steps to create Spark cluster using Azure Portal:

Chose New HDinsight Hadoop cluster (other cluster types are also supported) using the custom create option.
Select version 3.2 of the cluster.
Complete the rest of the steps of the wizard to specify cluster name, such as it’s storage account and other configuration.
In the last step of the configuration wizard, add the Spark script action: https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv03/spark-installer-v03.ps1.

Click the check mark to create the cluster. When the operation completes, your HDInsight cluster will have Spark 1.2 installed on it.

Running Spark SQL queries in Spark Shell

The new version of Spark package includes Spark SQL. Spark SQL allows you to use Spark to run relational queries expressed in SQL, HiveQL, or Scala. Using this functionality, you can run Hive queries in Spark shell.

Open Remote Desktop Connection to the cluster. For instructions, see Connect to HDInsight clusters using RDP.
Open the Hadoop Command Line using a Desktop shortcut, and navigate to the location where Spark is installed, C:appsdistspark-1.2.0.
Run the following command to start the Spark shell.

.binspark–shell −−master yarn

On the Scala prompt, set the Hive context. This is required to work with Hive queries using Spark.

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

Run a Hive query and print the output to the console. The query retrieves data from a sample Hive table that exists on every HDInsight cluster. It queries devices of a specific make and limits the number of records retrieved to 20. The triple quotes is a Scala syntax to allow embedded quotes in the string.

hiveContext.sql(“””SELECT * FROM hivesampletable WHERE devicemake LIKE “HTC%” LIMIT 20″””).collect().foreach(println)

You should see an output like the following:

You can find more information on installation steps and usage of Spark in the documentation links below. You can also install R, Solr and Giraph using script actions as well as create your own:

How to install Spark 1.2 on Azure HDInsight clusters

Running Spark SQL queries in Spark Shell

Explore

Related posts

Gérer vos besoins big data avec HDInsight sur AKS

Monitoring on Azure HDInsight part 4: Workload metrics and logs

Migrate to Azure HDInsight in as little as 12 weeks

Azure.Source – Volume 87

Join the conversation

Sélection

IA + Machine Learning

Analyse

Calcul

Conteneurs

Bases de données

DevOps

Outils de développement

Hybride + multicloud

Identité

Intégration

Internet des Objets

Gestion et gouvernance

Données multimédias

Migration

Réalité mixte

Mobile

Mise en réseau

Sécurité

Stockage

Web

Bureau virtuel Windows

Cas d'utilisation

Développement d’applications

IA

Migration et modernisation cloud

Données et analyse

Cloud hybride et infrastructure

Internet des Objets

Sécurité et gouvernance

Type d’organisation

Ressources

Running Spark SQL queries in Spark Shell

Explore

Related posts

Join the conversation