Getting Started with HDInsight

Editor’s Note: This post comes from Shayne Burgess of the Windows Azure HDInsight Team.

Yesterday we released an important preview of HDInsight Service on Windows. This post, the second in our 5-part series, provides a quick walkthrough of this updated HDInsight Service.

HDInsight provides everything you need to quickly deploy, manage and use Hadoop clusters running on Windows Azure.

If you have a Windows Azure account you can request access to the HDInsight Preview and then create an HDInsight cluster within the Windows Azure Management Portal.

Log in to the portal and select HDInsight from the menu that appears after clicking the New button in the bottom left corner. Specify a name for the cluster, a password for logging in to the cluster, and the size of the cluster you need. The size of the cluster determines its price, so choose your cluster size carefully.

A storage account is required to create a cluster, and in the current public preview the storage account must reside in the East US region. The Azure Storage account you associate with your cluster is where you will store the data that you will analyze in HDInsight.

HDInsight Clusters

Creating a cluster takes a few minutes, during which the necessary Virtual Machines (VMs) that together make up your HDInsight cluster are created and configured. The Hadoop components installed as part of an HDInsight cluster are outlined here. Once the cluster is created, drill into the dashboard view to see the cluster quick glance screen. The quick glance screen shows basic information about your cluster and gives you a simple way to connect to it.

To open the cluster’s main dashboard page, click the Manage button. The cluster will ask you to log in using the username and password you specified when creating the cluster (if you used the quick create option, the default username is admin).

The cluster dashboard page will open; this page contains a number of tiles that provide information about the cluster and can be used to perform additional tasks. The Create Job tile opens a MapReduce job submission form that you can use to submit MapReduce jobs as JAR files. The Interactive Console tile opens a console that lets you execute JavaScript and Hive queries directly against your cluster.

Running Samples

Your cluster’s main portal page also contains a Samples tile that you can use to learn some of the basics of using Hadoop.

Each sample highlights a different HDInsight scenario – exploring the samples will give you an overview of some of the capabilities of HDInsight and teach you how to do things such as executing Hive queries and setting up Sqoop connectors.

The WordCount sample, for instance, shows you how to execute a MapReduce job that calculates the number of times a word occurs in a text file. The samples all contain a Deploy to your cluster button that will execute the sample MapReduce job on your cluster.
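The logic behind WordCount is the classic MapReduce pattern: a map phase emits a (word, 1) pair for every word it sees, and a reduce phase sums the counts for each distinct word. A minimal local sketch of that pattern in JavaScript (an illustration only – the actual sample runs as a Java JAR on the cluster):

```javascript
// Map phase: emit a (word, 1) pair for every word in the input text.
function map(text) {
  var pairs = [];
  var words = text.toLowerCase().split(/\W+/).filter(Boolean);
  words.forEach(function (word) {
    pairs.push({ key: word, value: 1 });
  });
  return pairs;
}

// Reduce phase: sum the values emitted for each distinct key (word).
function reduce(pairs) {
  var counts = {};
  pairs.forEach(function (pair) {
    counts[pair.key] = (counts[pair.key] || 0) + pair.value;
  });
  return counts;
}

var counts = reduce(map("the notebooks of the master"));
console.log(counts.the); // 2
```

On a real cluster the map and reduce phases run in parallel across many VMs, with Hadoop shuffling the intermediate (word, 1) pairs so that all pairs for the same word reach the same reducer.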

Examining Output in the Interactive Console

Run the WordCount sample to start a MapReduce job that will calculate the number of times each word appears in the Project Gutenberg eBook of The Notebooks of Leonardo Da Vinci. When the job completes you can use the Interactive Console to view the output that has been stored in your Blob Storage account.

To view the word count, enter the command file = fs.read("asv:///DaVinciAllTopWords") at the console prompt. Scroll back up to see the long list of words and their summary counts.
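The output of a word-count job is conventionally plain text with one tab-separated word/count pair per line. As a hedged local sketch (the parseWordCounts and topWords helpers below are illustrative names, not part of the HDInsight console), here is how such output could be tabulated in JavaScript:

```javascript
// Parse tab-separated "word<TAB>count" lines, the conventional text
// output of a word-count MapReduce job, into an array of objects.
function parseWordCounts(text) {
  return text
    .split("\n")
    .filter(function (line) { return line.trim() !== ""; })
    .map(function (line) {
      var parts = line.split("\t");
      return { word: parts[0], count: parseInt(parts[1], 10) };
    });
}

// Sort descending by count to see the most frequent words first.
function topWords(rows, n) {
  return rows
    .slice()
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, n);
}

var sample = "the\t1200\nof\t950\nand\t800";
console.log(topWords(parseWordCounts(sample), 2));
```

This is just a convenience for eyeballing results; for real analysis you would typically query the output with Hive rather than parse it by hand.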

Learn More

To continue learning about HDInsight, visit our Getting Started page.

We hope you will find HDInsight a valuable new service and are looking forward to your feedback.

Visit us on Wednesday for the next blog in our 5-part series that will focus on HDInsight and Azure Storage.