Windows Azure HDInsight is now Generally Available!

Today we announced the general availability of the HDInsight Service for Windows Azure. HDInsight is a Hadoop-based service from Microsoft that brings a 100 percent Apache Hadoop solution to the cloud.

HDInsight offers the following benefits:

  • Insights with familiar tools:  Through deep integration with Microsoft BI tools such as PowerPivot and Power View, HDInsight enables you to easily find insights in your data using Hadoop.  Seamlessly combine data from several sources, including HDInsight, with Power Query, and easily map your data with Power Map, a new 3D mapping tool in Excel 2013.
  • Agility:  HDInsight offers the agility to meet the changing needs of your organization.  With a rich array of PowerShell scripts you can deploy and provision a Hadoop cluster in minutes instead of hours or days.  If you need a larger cluster, simply delete your cluster and create a bigger one in minutes without losing any data.
  • Enterprise-ready Hadoop:  HDInsight offers enterprise-class security and manageability.  Thanks to a dedicated Secure Node, HDInsight helps you secure your Hadoop cluster.  In addition, we simplify manageability of your Hadoop cluster through extensive support for PowerShell scripting.
  • Rich developer experience:  HDInsight offers powerful programming capabilities with a choice of languages including .NET and Java.  .NET developers can exploit the full power of language-integrated query with LINQ to Hive.

Getting Started with HDInsight

An HDInsight cluster can be created from the Windows Azure Management portal by clicking the New button and selecting HDInsight from the Data Services menu. To create an HDInsight cluster, specify a name for the cluster, the size of the cluster (in number of data nodes), and a password for logging in.

A cluster must have at least one storage account associated with it; that account serves as the permanent storage mechanism for the cluster, and the cluster is always created in the same region as the storage account you choose. At general availability, the storage account must reside in West US, East US, or North Europe to be associated with an HDInsight cluster. Additional storage accounts can be associated with a cluster using the custom create option.
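If you prefer scripting over the portal, a cluster can also be provisioned with the HDInsight PowerShell cmdlets. The sketch below is a minimal, illustrative example; the storage account name, key, and container name are placeholders you would replace with your own values.

```powershell
# Placeholder storage account, key, and container for the cluster's default storage.
$storageAccount = "mystorageaccount"
$storageKey = "<your-storage-account-key>"

New-AzureHDInsightCluster -Name "HadoopIsAwesome" `
    -Location "West US" `
    -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
    -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName "mycontainer" `
    -ClusterSizeInNodes 4 `
    -Credential (Get-Credential)   # prompts for the cluster admin user name and password
```

As with the portal, the cluster's location must match the region of the storage account you associate with it.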

It will take a few minutes for the cluster to be deployed and configured, but once it is ready you will be presented with a Getting Started screen that provides links to additional help content as well as some sample code for running your first Hadoop job using HDInsight.

If you select the Dashboard tab on the HDInsight page for your cluster, you will see basic information on the current status of your cluster, including usage in number of cores, job history, and linked storage accounts.

Submitting Your First MapReduce Job

Before you submit your first job you must prepare your development environment to use the HDInsight PowerShell cmdlets. The PowerShell cmdlets require two main components to be installed and configured: Windows Azure PowerShell and the HDInsight PowerShell tools. Follow the links in step 1 of the Getting Started screen to set up your environment.
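Once both components are installed, connecting PowerShell to your subscription looks roughly like the following; the subscription name is a placeholder for your own.

```powershell
# Sign in to Windows Azure and select the subscription that owns the cluster.
Add-AzureAccount
Select-AzureSubscription -SubscriptionName "My Azure Subscription"   # placeholder name

# Verify which subscription is currently active.
Get-AzureSubscription -Current
```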

The Getting Started page has a screen that shows sample commands for submitting either a Hive or MapReduce job and we will start by submitting a MapReduce job.

Run the sample using these commands to create the job definition. The job definition contains all the information for your job, such as which mapper and reducer to use, which data to use as input, and where to store the output. In this example we are going to use a sample MapReduce program and sample file that are included with the cluster. We will create an output directory under the samples directory to store the results.

$jarFile = "/example/jars/hadoop-examples.jar"
$className = "wordcount"
$statusDirectory = "/samples/wordcount/status"
$outputDirectory = "/samples/wordcount/output"
$inputDirectory = "/example/data/gutenberg"
$wordCount = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile -ClassName $className -Arguments $inputDirectory, $outputDirectory -StatusFolder $statusDirectory

Run these commands to get your subscription information and start execution of the MapReduce program. MapReduce jobs are typically long-running, so this example shows how to use the asynchronous commands to kick off execution of the job.

$subscriptionId = (Get-AzureSubscription -Current).SubscriptionId
$wordCountJob = $wordCount | Start-AzureHDInsightJob -Cluster HadoopIsAwesome -Subscription $subscriptionId | Wait-AzureHDInsightJob -Subscription $subscriptionId
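Because the job is started asynchronously, you can also drop the Wait-AzureHDInsightJob step entirely and check on the job later. A rough sketch, assuming the same cluster name and the variables defined above:

```powershell
# Kick off the job without blocking the shell...
$wordCountJob = $wordCount | Start-AzureHDInsightJob -Cluster HadoopIsAwesome -Subscription $subscriptionId

# ...then poll for its status whenever you like.
Get-AzureHDInsightJob -Cluster HadoopIsAwesome -Subscription $subscriptionId -JobId $wordCountJob.JobId
```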

Finally, run this command to retrieve the results of execution and display those on the PowerShell command line.

Get-AzureHDInsightJobOutput -Subscription (Get-AzureSubscription -Current).SubscriptionId -Cluster HadoopIsAwesome -JobId $wordCountJob.JobId -StandardError

What this command retrieves is information on the execution of the job itself; the actual results of the MapReduce computation are written to storage.

The output of the job was placed in your storage account in the "/samples/wordcount/output" directory. Open the storage viewer in the Windows Azure Portal and navigate to this directory to download and view the output file.
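If you would rather stay in PowerShell, the output blob can also be downloaded with the Windows Azure storage cmdlets. This is a sketch only: the storage account name, key, and container are placeholders, and the exact part-file name Hadoop produces may differ on your cluster.

```powershell
# Placeholder account, key, and container for the cluster's default storage.
$ctx = New-AzureStorageContext -StorageAccountName "mystorageaccount" -StorageAccountKey "<key>"

# Hadoop writes results as part-* files under the output directory;
# list the directory first if the file name differs.
Get-AzureStorageBlobContent -Container "mycontainer" `
    -Blob "samples/wordcount/output/part-r-00000" `
    -Context $ctx -Destination .
```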

Submitting Your First Hive Job

The Getting Started page also has a screen that shows some sample commands for connecting to your cluster and submitting a Hive job. Click the Hive button in the Job type section to see the sample.

Run this sample now by first executing this command in PowerShell to connect to your cluster.

Use-AzureHDInsightCluster HadoopIsAwesome (Get-AzureSubscription -Current).SubscriptionID

Next run this command to submit a HiveQL statement to the cluster. The statement uses a sample Hive table that is set up on the cluster by default when it is created.

Invoke-Hive "select country, state, count(*) as records from hivesampletable group by country, state order by records desc limit 5"

The query is a fairly simple SELECT with a GROUP BY; when it completes, the results are displayed on the PowerShell command line.

Learn More

In this blog we showed you just how easy it is to get up and running with an HDInsight cluster and begin analyzing your data.  There is a lot more you can do and learn with HDInsight, like uploading your own data sets, running more sophisticated jobs, and analyzing your results. For more details on using HDInsight, visit the HDInsight documentation page.

For details on pricing visit the HDInsight pricing details page.