Get started with Hadoop in HDInsight
HDInsight makes Apache Hadoop, a MapReduce software framework, available in a simpler, more scalable, and more cost-efficient Azure environment. HDInsight also provides a cost-efficient approach to the managing and storing of data using Azure Blob storage.
In conjunction with the general availability of Azure HDInsight, Microsoft also provides HDInsight Emulator for Azure, formerly known as Microsoft HDInsight Developer Preview. The Emulator targets developer scenarios and only supports single-node deployments. For using HDInsight Emulator, see Get Started with the HDInsight Emulator.
What does this tutorial achieve?
Assume you have a large unstructured data set and you want to run queries on it to extract some meaningful information. That's exactly what we are going to do in this tutorial. Here's how we achieve this:
You can also watch a demo video of this tutorial:
Before you begin this tutorial, you must have the following:
- An Azure subscription. For more information about obtaining a subscription, see Purchase Options, Member Offers, or Free Trial.
- A computer with Office 2013 Professional Plus, Office 365 Pro Plus, Excel 2013 Standalone, or Office 2010 Professional Plus.
Estimated time to complete: 30 minutes
In this tutorial
Create an Azure Storage account
HDInsight uses Azure Blob Storage for storing data. It is called WASB or Azure Storage - Blob. WASB is Microsoft's implementation of HDFS on Azure Blob storage. For more information see Use Azure Blob storage with HDInsight.
When you provision an HDInsight cluster, you specify an Azure Storage account. A specific Blob storage container from that account is designated as the default file system, just like in HDFS. The HDInsight cluster is by default provisioned in the same data center as the storage account you specify.
In addition to this storage account, you can add additional storage accounts when you custom-configure an HDInsight cluster. This additional storage account can either be from the same Azure subscription or different Azure subscriptions. For instructions, see Provision HDInsight clusters using custom options.
To simplify this tutorial, only the default blob container and the default storage account are used. In practice, the data files are usually stored in a designated storage account.
To create an Azure Storage account
- Sign in to the Azure Management Portal.
Click NEW on the lower left corner, point to DATA SERVICES, point to STORAGE, and then click QUICK CREATE.
Enter URL, LOCATION and REPLICATION, and then click CREATE STORAGE ACCOUNT. Affinity groups are not supported. You will see the new storage account in the storage list.
The quick-create option to provision an HDInsight cluster, like the one we use in this tutorial, does not ask for a location while provisioning the cluster. Instead, it by default co-locates the cluster in the same data center as the storage account. So, make sure you create your storage account in the locations supported for the cluster, which are: East Asia, Southeast Asia, North Europe, West Europe, East US, West US, North Central US, South Central US.
Wait until the STATUS of the new storage account is changed to Online.
Select the new storage account from the list and click MANAGE ACCESS KEYS from the bottom of the page.
Make a note of the STORAGE ACCOUNT NAME and the PRIMARY ACCESS KEY (or the SECONDARY ACCESS KEY. Either of the keys work). You will need them later in the tutorial.
For more information, see How to Create a Storage Account and Use Azure Blob Storage with HDInsight.
Provision an HDInsight cluster
When you provision an HDInsight cluster, you provision Azure compute resources that contains Hadoop and related applications. In this section you provision a HDInsight cluster version 3.1, which is based on Hadoop version 2.4. You can also create Hadoop clusters for other versions using the Azure portal, HDInsight PowerShell cmdlets, or the HDInsight .NET SDK. For instructions, see Provision HDInsight clusters using custom options. For information about different HDInsight versions and their SLA, see HDInsight component versioning page.
To provision an HDInsight cluster
Sign in to the Azure Management Portal.
Click HDInsight on the left to list the status of the clusters in your account. In the following screenshot, there are no existing HDInsight clusters.
Click NEW on the lower left side, click Data Services, click HDInsight, and then click Hadoop.
Enter or select the following values:
|Cluster Name||Name of the cluster|
|Cluster Size||Number of data nodes you want to deploy. The default value is 4. But the option to use 1 or 2 data nodes is also available from the drop-down. Any number of cluster nodes can be specified by using the Custom Create option. Pricing details on the billing rates for various cluster sizes are available. Click the ? symbol just above the dropdown box and follow the link on the pop up.|
|Password||The password for the admin account. The cluster user name "admin" is specified when you are not using the Custom Create option. Note that this is NOT the Windows Administrator account for the VMs on which the clusters are provisioned. The account name can be changed by using the Custom Create wizard.|
|Storage Account||Select the storage account you created from the dropdown box. |
Once a storage account is chosen, it cannot be changed. If the storage account is removed, the cluster will no longer be available for use. The HDInsight cluster is co-located in the same datacenter as the storage account.
Keep a copy of the cluster name. You will need it later in the tutorial.
Click Create HDInsight Cluster. When the provisioning completes, the status column shows Running.
Run a Hive job
Now that you have an HDInsight cluster provisioned, the next step is to run a Hive job to query a sample Hive table, hivesampletable, which comes with HDInsight clusters. The table contains data on mobile device manufacturer, platforms, and models. We query this table to retrieve data for mobile devices by a specific manufacturer.
To run a Hive job from cluster dashboard
- Sign in to the Azure Management Portal.
- Click HDINSIGHT from the left pane. You shall see a list of clusters created, including the one you just created in the last section.
- Click the cluster name where you want to run the Hive job and then click MANAGE CLUSTER from the bottom of the page.
It opens a Web page on a different browser tab. Enter the Hadoop user account and password. The default user name is admin; the password is what you entered while provisioning the cluster. The dashboard looks like :
There are several tabs on the top. The default tab is Hive Editor, while the other tabs are Job History and File Browser. Using the dashboard, you can submit Hive queries, check Hadoop job logs, and browse WASB files.
Note that the URL of the Web page is <ClusterName>.azurehdinsight.net. So, instead of opening the dashboard from the Management portal, you can also open the dashboard from a Web browser using the URL.
On the Hive Editor tab, for Query Name, enter HTC20. The query name is the job title.
In the query pane, enter the following query:
SELECT * FROM hivesampletable
WHERE devicemake LIKE "HTC%"
Click Submit. It takes a few moments to get the results back. The screen refreshes every 30 seconds. You can also click Refresh to refresh the screen.
Once completed, the screen looks like:
Click the query name on the screen to see the output. Make a note of Job Start Time (UTC). You will need it later.
The page also shows the Job Output and the Job Log. You also have the option to download the output file (_stdout) and the log file (_stderr).
The Job Session table on the Hive Editor tab lists completed or running jobs as long as you stay on that tab. The table does not list any jobs if you navigate away from the page. The Job History tab maintains a list of all jobs, completed or running.
To browse to the output file
- From the cluster dashboard, click File Browser at the top.
- Click your storage account name, click your container name (which is the same as your cluster name), and then click user.
Click admin and then click the GUID which has the last modified time a little after the job start time you noted earlier. Make a note of this GUID. You will need it in the next section.
Connect to Microsoft business intelligence tools
You can use the Power Query add-in for Microsoft Excel to import the job output from HDInsight into Excel, where Microsoft Business Intelligence (BI) tools can be used to further analysis of results.
You must have Excel 2010 or 2013 installed to complete this part of the tutorial.
To download Microsoft Power Query for Excel
To import HDInsight data
- Open Excel, and create a new blank workbook.
Click the Power Query menu, click From Other Sources, and then click From Azure HDInsight.
Enter the Account Name of the Azure Blob Storage Account associated with your cluster, and then click OK. This is the storage account you created earlier in the tutorial.
Enter the Account Key for the Azure Blob Storage Account, and then click Save.
In the Navigator pane on the right, double-click the Blob storage container name. By default the container name is the same name as the cluster name.
Locate stdout in the Name column. Verify the GUID in the corresponding Folder Path column matches the GUID you noted down earlier. A match suggests that the output data corresponds to the job you submitted. Click Binary on the left of stdout.
Click Close & Load in the upper left corner to import the Hive job output into Excel.
In this tutorial, you have learned how to provision a cluster with HDInsight, run a MapReduce job on it, and import the results into Excel where they can be further processed and graphically displayed using BI tools. To learn more, see the following articles: