• 2 min read

Azure HDInsight clusters can now be customized to run a variety of Hadoop projects including Spark and R Modules

Today, we are excited to announce the ability for you to customize your Azure HDInsight clusters with projects available from the Apache Hadoop ecosystem.

Today, we are excited to announce the ability for you to customize your Azure HDInsight clusters with projects available from the Apache Hadoop ecosystem. This ecosystem is a portfolio of fast-moving open source projects that are evolving quickly. With this new feature, you can now experiment and deploy Hadoop projects to Azure HDInsight that were not possible before. This is enabled through the Script Action feature that can modify Hadoop clusters in arbitrary ways using custom scripts. This customization is available on all types of HDInsight clusters including Hadoop, HBase and Storm. To demonstrate the power of this capability, we documented the process to install the popular Spark and R modules.

Apache Spark is an open source processing framework that can run large-scale data analytics applications. Spark has been gaining popularity for its ability to handle both batch and stream processing as well as supporting in-memory and conventional disk processing. R is a free software programming language developed for statistical computing and machine learning. R’s popularity amongst statisticians and data miners have increased substantially in recent years.

To make this customization, users will need to download the latest Azure PowerShell and specify the PowerShell script that will be executed on cluster nodes as part of the cluster setup. This script needs to be written with requirements of the managed cloud environment where Azure patches nodes with OS updates, does security patches and can replace a misbehaving node at times. The script needs to be able to run and apply the customization at any time after the node was updated. Presently, customized clusters can be created using PowerShell and the .NET SDK.

To install Spark on the HDInsight Hadoop cluster you can run the following PowerShell script where “spark-installer-v01.ps1” is Script Action that installs Spark on HDInsight:

New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes

     | Set-AzureHDInsightDefaultStorage -StorageAccountName $storageAccountName
-StorageAccountKey $storageAccountKey -StorageContainerName $containerName

     | Add-AzureHDInsightScriptAction -Name "Install Spark"
-ClusterRoleCollection HeadNode,DataNode
-Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv01/spark-installer-v01.ps1

     | New-AzureHDInsightCluster -Name $clusterName -Location $location

Read more on installing and using Spark on HDInsight here:

Read more on Script Action to make other customizations here :

For more information on Azure HDInsight: