Get started using Hadoop 2.2 in HDInsight
HDInsight makes Apache Hadoop available as a service in the cloud. The MapReduce software framework is available in a simpler, more scalable, and more cost-efficient Azure environment. HDInsight also provides a cost-efficient approach to managing and storing data by using Azure Blob storage.
In this tutorial, you will provision a Hadoop cluster in HDInsight using the Azure Management Portal, submit a Hive job to query a sample Hive table from the cluster dashboard, and then import the Hive job output into Excel for examination.
In conjunction with the general availability of Azure HDInsight, Microsoft also released the HDInsight Emulator for Azure, formerly known as the Microsoft HDInsight Developer Preview. This product targets developer scenarios and as such supports only single-node deployments. To use the HDInsight Emulator, see Get Started with the HDInsight Emulator.
Before you begin this tutorial, you must have the following:
- An Azure subscription. For more information about obtaining a subscription, see Purchase Options, Member Offers, or Free Trial.
- A computer with Office 2013 Professional Plus, Office 365 Pro Plus, Excel 2013 Standalone, or Office 2010 Professional Plus.
Estimated time to complete: 30 minutes
In this tutorial
Provision an HDInsight cluster
HDInsight uses Azure Blob storage for storing data. This storage is exposed as WASB (Azure Storage - Blob), Microsoft's implementation of HDFS on Azure Blob storage. For more information, see Use Azure Blob storage with HDInsight.
When you provision an HDInsight cluster, an Azure Storage account and a specific Blob storage container from that account are designated as the default file system, just as in HDFS. The storage account must be located in the same data center as the HDInsight compute resources. Currently, you can provision HDInsight clusters only in the following data centers:
- Southeast Asia
- North Europe
- West Europe
- East US
- West US
In addition to this storage account, you can add storage accounts from either the same Azure subscription or different Azure subscriptions. For instructions, see Provision HDInsight clusters.
To simplify this tutorial, only the default blob container and the default storage account are used, and all of the files are stored in the default file system container at /tutorials/getstarted/. In practice, data files are usually stored in a designated storage account.
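Files stored this way are addressed with WASB URIs of the form wasb://&lt;container&gt;@&lt;account&gt;.blob.core.windows.net/&lt;path&gt;. A minimal sketch in Python that builds such a URI (the account and container names below are hypothetical, for illustration only):

```python
def wasb_uri(container, account, path):
    """Build a WASB URI for a blob in an Azure Storage container.

    The general form is wasb://<container>@<account>.blob.core.windows.net/<path>.
    """
    return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(
        container, account, path.lstrip("/"))

# Hypothetical account and container names for illustration.
print(wasb_uri("mycontainer", "mystorage", "/tutorials/getstarted/data.txt"))
# wasb://mycontainer@mystorage.blob.core.windows.net/tutorials/getstarted/data.txt
```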
To create an Azure Storage account
- Sign in to the Azure Management Portal.
- Click NEW in the lower-left corner, point to DATA SERVICES, point to STORAGE, and then click QUICK CREATE.
- Enter URL, LOCATION, and REPLICATION, and then click CREATE STORAGE ACCOUNT. Affinity groups are not supported. You will see the new storage account in the storage list.
- Wait until the STATUS of the new storage account changes to Online.
- Click the new storage account in the list to select it.
- Click MANAGE ACCESS KEYS at the bottom of the page.
- Make a note of the STORAGE ACCOUNT NAME and the PRIMARY ACCESS KEY (or the SECONDARY ACCESS KEY; either key works). You will need them later in the tutorial.
For more information, see How to Create a Storage Account and Use Azure Blob Storage with HDInsight.
To provision an HDInsight cluster
- Sign in to the Azure Management Portal.
- Click HDINSIGHT on the left to list the status of the clusters in your account. In the following screenshot, there is no existing HDInsight cluster.
- Click NEW in the lower-left corner, click Data Services, click HDInsight, and then click Quick Create.
- Enter or select the following values:
|Field|Value|
|---|---|
|Cluster Name|Name of the cluster.|
|Cluster Size|Number of data nodes you want to deploy. The default value is 4, but 8, 16, and 32 data node clusters are also available from the dropdown menu. Any number of data nodes can be specified when using the Custom Create option. For billing rates for the various cluster sizes, click the ? symbol just above the dropdown box and follow the link in the pop-up.|
|Password|The password for the admin account. The cluster user name is "admin" by default when using the Quick Create option. Note that this is NOT the Windows Administrator account for the VM. The account name can be changed by using the Custom Create wizard. The password must be at least 10 characters long and must contain an uppercase letter, a lowercase letter, a number, and a special character.|
|Storage Account|Select the storage account you created from the dropdown box.|
Once a storage account is chosen, it cannot be changed. If the storage account is removed, the cluster will no longer be available for use. The HDInsight cluster location will be the same as the storage account.
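The password rules described above can be checked locally before you start provisioning. A quick sketch in Python (this helper is illustrative, not part of any Azure SDK):

```python
import re

def is_valid_cluster_password(password):
    """Check the Quick Create password rules: at least 10 characters,
    with an uppercase letter, a lowercase letter, a number, and a
    special (non-alphanumeric) character."""
    return (len(password) >= 10
            and re.search(r"[A-Z]", password) is not None
            and re.search(r"[a-z]", password) is not None
            and re.search(r"[0-9]", password) is not None
            and re.search(r"[^A-Za-z0-9]", password) is not None)

print(is_valid_cluster_password("Short1!"))       # False: fewer than 10 characters
print(is_valid_cluster_password("LongerPass1!"))  # True: meets all four rules
```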
Keep a copy of the cluster name. You will need it later in the tutorial.
The Quick Create method creates an HDInsight version 2.1 cluster. To create version 1.6 or 3.0 clusters, use the Custom Create method in the Management Portal, or use Azure PowerShell.
- Click Create HDInsight Cluster on the lower right. When the provisioning process completes, the status column shows Running.
For information on using the CUSTOM CREATE option, see Provision HDInsight Clusters.
Run a Hive job
Now that you have provisioned an HDInsight cluster, the next step is to run a Hive job to query a sample Hive table that comes with HDInsight clusters. The table is named hivesampletable.
To open the cluster dashboard
- Sign in to the Azure Management Portal.
- Click HDINSIGHT in the left pane. You will see a list of clusters, including the one you created in the last section.
- Click the cluster name where you want to run the Hive job.
- Click MANAGE CLUSTER at the bottom of the page to open the cluster dashboard. The dashboard opens in a new browser tab.
- Enter the Hadoop User account username and password. The default username is admin; the password is the one you entered during provisioning. The dashboard looks like this:
There are several tabs along the top. The default tab is Hive Editor; other tabs include Jobs and Files. From the dashboard, you can submit Hive queries, check Hadoop job logs, and browse WASB files.
[wacom.note] Notice that the URL is <ClusterName>.azurehdinsight.net. Instead of opening the dashboard from the Management Portal, you can also open it from a Web browser by using this URL.
To run a Hive query
- From HDInsight cluster dashboard, click Hive Editor from the top.
- In Query Name, enter HTC20. The query name is used as the job title.
- In the query pane, enter the following query:

        SELECT * FROM hivesampletable
        WHERE devicemake LIKE "HTC%"
- Click Submit. It takes a few moments to get the results back. The screen refreshes every 30 seconds, or you can click Refresh to refresh it manually.
- Once the job completes, the screen looks like this:
- Make a note of Job Start Time (UTC). You will need it later.
- Scroll down a little more and you will see Job Log. Job Output contains stdout; Job Log contains stderr.
- To reopen the log file later, click Jobs at the top of the screen, and then click the job title (the query name), HTC20 in this case.
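The WHERE clause in the query performs a prefix match on the devicemake column. As a quick illustration of the same filter in Python (the sample rows below are made up for illustration, not actual hivesampletable data):

```python
# Hypothetical sample rows standing in for hivesampletable records.
rows = [
    {"devicemake": "HTC", "devicemodel": "Desire"},
    {"devicemake": "Samsung", "devicemodel": "SGH-i917"},
    {"devicemake": "HTC", "devicemodel": "Incredible"},
]

# Equivalent of: WHERE devicemake LIKE "HTC%" (a prefix match)
htc_rows = [r for r in rows if r["devicemake"].startswith("HTC")]
print(len(htc_rows))  # 2
```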
To browse the output file
- From the cluster dashboard, click Files from the top.
- Click Templeton-Job-Status.
- Click the GUID folder whose Last Modified time is a little after the Job Start Time you wrote down earlier. Make a note of this GUID. You will need it in the next section.
- The stdout file has the data you need in the next section. You can click stdout to download a copy of the data file if you want.
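Hive job output is written as tab-delimited text, so a downloaded stdout file can be inspected with Python's csv module. A minimal sketch (the sample string below stands in for a real downloaded file):

```python
import csv
import io

# Stand-in for the contents of a downloaded stdout file; Hive output
# is tab-delimited text, one record per line.
sample_stdout = "HTC\tDesire\ten-GB\nHTC\tIncredible\ten-US\n"

# Parse the tab-delimited text into rows of fields.
reader = csv.reader(io.StringIO(sample_stdout), delimiter="\t")
rows = list(reader)
print(len(rows))   # 2
print(rows[0][0])  # HTC
```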
Connect to Microsoft business intelligence tools
The Power Query add-in for Excel can be used to export output from HDInsight into Excel where Microsoft Business Intelligence (BI) tools can be used to further process or display the results.
You must have Excel 2010 or 2013 installed to complete this part of the tutorial.
To download Microsoft Power Query for Excel
To import HDInsight data
- Open Excel, and create a new blank workbook.
- Click the Power Query menu, click From Other Sources, and then click From Azure HDInsight.
- Enter the Account Name of the Azure Blob storage account associated with your cluster, and then click OK. This is the storage account you created earlier in the tutorial.
- Enter the Account Key for the Azure Blob storage account, and then click Save.
- In the Navigator pane on the right, double-click the Blob storage container name. By default, the container name is the same as the cluster name.
- Locate stdout in the Name column (the path is .../Templeton-Job-Status/), and then click Binary to the left of stdout. The GUID in the path must match the one you wrote down in the last section.
- Click Apply & Close in the upper-left corner. The query then imports the Hive job output into Excel.
In this tutorial, you have learned how to provision a cluster with HDInsight, run a Hive job on it, and import the results into Excel, where they can be further processed and graphically displayed using BI tools. To learn more, see the following articles: