Get started using Hadoop 2.2 in HDInsight
HDInsight makes Apache Hadoop, a MapReduce software framework, available in a simpler, more scalable, and more cost-efficient Azure environment. HDInsight also provides a cost-efficient way to manage and store data using Azure Blob storage.
In conjunction with the general availability of Azure HDInsight, Microsoft also provides the HDInsight Emulator for Azure, formerly known as Microsoft HDInsight Developer Preview. The Emulator targets developer scenarios and supports only single-node deployments. To use the HDInsight Emulator, see Get Started with the HDInsight Emulator.
What does this tutorial achieve?
Assume you have a large unstructured data set and you want to run queries on it to extract meaningful information. That is what this tutorial does: you create an Azure Storage account, provision an HDInsight cluster, run a Hive job on sample data, and import the results into Excel.
Before you begin this tutorial, you must have the following:
- An Azure subscription. For more information about obtaining a subscription, see Purchase Options, Member Offers, or Free Trial.
- A computer with Office 2013 Professional Plus, Office 365 Pro Plus, Excel 2013 Standalone, or Office 2010 Professional Plus.
Estimated time to complete: 30 minutes
Create an Azure Storage account
HDInsight uses Azure Blob storage for storing data. It accesses that storage through WASB (Windows Azure Storage - Blob), Microsoft's implementation of HDFS on Azure Blob storage. For more information, see Use Azure Blob storage with HDInsight.
When you provision an HDInsight cluster, you specify an Azure Storage account. A specific Blob storage container from that account is designated as the default file system, just like in HDFS. The HDInsight cluster is by default provisioned in the same data center as the storage account you specified.
In addition to this storage account, you can add more storage accounts when you custom-configure an HDInsight cluster. These additional storage accounts can be from the same Azure subscription or from different Azure subscriptions. For instructions, see Provision HDInsight clusters using custom options.
To simplify this tutorial, only the default blob container and the default storage account are used. In practice, the data files are usually stored in a designated storage account.
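To make the relationship between a storage account, a container, and a file path concrete, here is a minimal sketch that builds a WASB address. It assumes the commonly documented form `wasb[s]://<container>@<account>.blob.core.windows.net/<path>`; the function name and the sample account, container, and path are hypothetical and used only for illustration.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Build a WASB URI for a blob path in a container of a storage account.

    Assumes the form wasb[s]://<container>@<account>.blob.core.windows.net/<path>.
    """
    scheme = "wasbs" if secure else "wasb"
    # Drop any leading slash so the separator is not doubled in the result.
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical names: a container named after the cluster, in the default account.
    print(wasb_uri("mycluster", "mystorageaccount", "example/data/sample.log"))
```

In this tutorial the container is the cluster's default container, so a path in the default file system maps to a blob in that container.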
To create an Azure Storage account
- Sign in to the Azure Management Portal.
- Click NEW in the lower-left corner, point to DATA SERVICES, point to STORAGE, and then click QUICK CREATE.
- Enter URL, LOCATION, and REPLICATION, and then click CREATE STORAGE ACCOUNT. Affinity groups are not supported. You will see the new storage account in the storage list.
An HDInsight cluster and its associated Azure storage account must be in the same datacenter, so create your storage account in one of the locations supported for the cluster: East Asia, Southeast Asia, North Europe, West Europe, East US, West US, North Central US, South Central US.
- Wait until the STATUS of the new storage account changes to Online.
- Select the new storage account from the list and click MANAGE ACCESS KEYS at the bottom of the page.
- Make a note of the STORAGE ACCOUNT NAME and the PRIMARY ACCESS KEY (or the SECONDARY ACCESS KEY; either key works). You will need them later in the tutorial.
For more information, see How to Create a Storage Account and Use Azure Blob Storage with HDInsight.
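The location rule above (storage account and cluster in the same datacenter, chosen from the supported list) can be expressed as a small pre-flight check. This is only an illustrative sketch: the region names are copied from this tutorial, while the function name and the idea of checking before you open the portal are assumptions, not part of any Azure tooling.

```python
# Regions listed in this tutorial as supported for HDInsight clusters.
SUPPORTED_REGIONS = frozenset({
    "East Asia", "Southeast Asia", "North Europe", "West Europe",
    "East US", "West US", "North Central US", "South Central US",
})


def check_locations(storage_location: str, cluster_location: str) -> list:
    """Return a list of problems; an empty list means the pair looks valid."""
    problems = []
    for label, location in (("storage account", storage_location),
                            ("cluster", cluster_location)):
        if location not in SUPPORTED_REGIONS:
            problems.append(f"{label} location {location!r} is not in the supported list")
    if storage_location != cluster_location:
        problems.append("storage account and cluster must be in the same location")
    return problems


if __name__ == "__main__":
    for issue in check_locations("West US", "East US"):
        print(issue)
```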
Provision an HDInsight cluster
When you provision an HDInsight cluster, you provision Azure compute resources that contain Hadoop and related applications. In this section you provision an HDInsight cluster version 3.0, which is based on Hadoop version 2.2. If you want to provision an HDInsight cluster with Hadoop version 2.4, click the specific version tab at the beginning of this article. You can also create Hadoop clusters for other versions by using HDInsight PowerShell cmdlets or the HDInsight .NET SDK. For instructions, see Provision HDInsight clusters using custom options. For information about the different HDInsight versions and their SLAs, see the HDInsight component versioning page.
To provision an HDInsight cluster
Sign in to the Azure Management Portal.
Click HDInsight on the left to list the status of the clusters in your account. In the following screenshot, there are no existing HDInsight clusters.
Click NEW on the lower left side, click Data Services, click HDInsight, and then click Custom Create.
On the Cluster Details page, enter or select the values for your cluster, and then click the right arrow.
On the Configure Cluster page, enter or select the following values:
|Data nodes||Number of data nodes you want to deploy. For testing purposes, create a single-node cluster.|
The cluster size limit varies for Azure subscriptions. Contact Azure billing support to increase the limit.
|Region/Virtual network||Choose the same region as the storage account you created in the last procedure. HDInsight requires the storage account to be located in the same region as the cluster. Later in the configuration, you can choose only a storage account that is in the region you specify here. The available regions are: East Asia, Southeast Asia, North Europe, West Europe, East US, West US, North Central US, South Central US.|
Click the right arrow.
On the Configure Cluster User page, provide the following values:
|User name ||Specify the HDInsight cluster user name.|
|Password/Confirm Password ||Specify the HDInsight cluster user password.|
|Enter Hive/Oozie Metastore ||Select this checkbox to specify a SQL database in the same data center as the cluster, to be used as the Hive/Oozie metastore. This is useful if you want to retain the metadata about Hive/Oozie jobs even after a cluster has been deleted.|
|Metastore Database ||Specify the Azure SQL database that will be used as the metastore for Hive/Oozie. This SQL database must be in the same data center as the HDInsight cluster. The list box lists only the SQL databases in the data center that you specified on the Cluster Details page.|
|Database user ||Specify the SQL database user that will be used to connect to the database.|
|Database user password ||Specify the SQL database user password.|
The Azure SQL database used for the metastore must allow connectivity to other Azure services, including Azure HDInsight. On the right side of the Azure SQL database dashboard, click the server name. This is the server on which the SQL database instance is running. In the server view, click Configure, then for Windows Azure Services click Yes, and then click Save.
Click the right arrow.
On the Storage Account page, provide the following values:
|Storage Account ||Specify the Azure storage account that will be used as the default file system for the HDInsight cluster. You can choose one of three options: Use Existing Storage, Create New Storage, or Use Storage From Another Subscription.|
|Account Name ||If you chose Use Existing Storage, select an existing storage account. The drop-down lists only the storage accounts located in the same data center where you chose to provision the cluster. If you chose Create New Storage or Use Storage From Another Subscription, you must provide the storage account name.|
|Account Key ||If you chose the Use Storage From Another Subscription option, specify the account key for that storage account.|
|Default container ||Specifies the default container on the storage account that is used as the default file system for the HDInsight cluster. If you chose Use Existing Storage for the Storage Account field, and there are no existing containers in that account, the container is created by default with the same name as the cluster. If a container with the name of the cluster already exists, a sequence number is appended to the container name, for example, mycontainer1, mycontainer2, and so on. However, if the existing storage account has a container with a name different from the cluster name you specified, you can use that container as well.|
|Additional Storage Accounts ||HDInsight supports multiple storage accounts. There is no limit on the number of additional storage accounts that a cluster can use. However, if you create a cluster using the Management Portal, you are limited to seven because of UI constraints. Each additional storage account you specify adds an extra Storage Account page to the wizard, where you can specify the account information. For example, if you select one additional storage account, page 5 is added to the dialog.|
If you opted for additional storage accounts, click the right arrow. If not, click the check mark to start provisioning the cluster. When the provisioning completes, the status column shows Running.
On the Storage Account page, enter the account information for the additional storage account if you opted for it:
Here again, you have the option to choose from existing storage, create new storage, or use storage from another Azure subscription. The procedure to provide the values is similar to the previous step.
Click the check mark to start provisioning the cluster. When the provisioning completes, the status column shows Running.
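The default container naming rule described for the Storage Account page (use the cluster name, and append a sequence number if that name is taken) can be sketched as follows. This is an illustration of the rule as stated in this tutorial, not the portal's actual implementation; in particular, starting the sequence at 1 and the function name are assumptions.

```python
def default_container_name(cluster_name: str, existing_containers: set) -> str:
    """Pick a default container name following the rule described above.

    Assumption: the sequence number starts at 1 and the first free number is used.
    """
    if cluster_name not in existing_containers:
        return cluster_name
    sequence = 1
    # Try mycontainer1, mycontainer2, ... until an unused name is found.
    while f"{cluster_name}{sequence}" in existing_containers:
        sequence += 1
    return f"{cluster_name}{sequence}"


if __name__ == "__main__":
    # Hypothetical container names, for illustration only.
    print(default_container_name("mycontainer", {"mycontainer", "mycontainer1"}))
```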
Run a Hive job
Now that you have an HDInsight cluster provisioned, the next step is to run a Hive job to query a sample Hive table, hivesampletable, which comes with HDInsight clusters. The table contains data on mobile device manufacturers, platforms, and models. We query this table to retrieve data for mobile devices from a specific manufacturer.
To run a Hive job from the cluster dashboard
- Sign in to the Azure Management Portal.
- Click HDINSIGHT in the left pane. You will see a list of clusters, including the one you created in the last section.
- Click the cluster name where you want to run the Hive job and then click MANAGE CLUSTER from the bottom of the page.
- The dashboard opens in a new browser tab. Enter the Hadoop user account and password. The default user name is admin; the password is the one you entered while provisioning the cluster.
There are several tabs on the top. The default tab is Hive Editor, while the other tabs are Job History and File Browser. Using the dashboard, you can submit Hive queries, check Hadoop job logs, and browse WASB files.
Note that the URL of the Web page is <ClusterName>.azurehdinsight.net. So, instead of opening the dashboard from the Management portal, you can also open the dashboard from a Web browser using the URL.
- On the Hive Editor tab, for Query Name, enter HTC20. The query name is the job title.
- In the query pane, enter the following query:
SELECT * FROM hivesampletable
WHERE devicemake LIKE "HTC%"
- Click Submit. It takes a few moments to get the results back. The screen refreshes every 30 seconds. You can also click Refresh to refresh the screen.
- When the job completes, click the query name on the screen to see the output. Make a note of Job Start Time (UTC). You will need it later.
The page also shows the Job Output and the Job Log. You also have the option to download the output file (_stdout) and the log file (_stderr).
The Job Session table on the Hive Editor tab lists completed or running jobs as long as you stay on that tab. The table does not list any jobs if you navigate away from the page. The Job History tab maintains a list of all jobs, completed or running.
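To clarify what the `LIKE "HTC%"` condition in the query selects, the following sketch reproduces the pattern semantics in Python: `%` matches any run of characters, `_` matches exactly one character, and matching is case-sensitive. The helper names and the sample rows are hypothetical; they are not taken from hivesampletable.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence of characters, including none
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value: str, pattern: str) -> bool:
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


if __name__ == "__main__":
    # Hypothetical rows standing in for the table's devicemake column.
    rows = [{"devicemake": "HTC"}, {"devicemake": "HTC Corp"}, {"devicemake": "Samsung"}]
    selected = [row for row in rows if sql_like(row["devicemake"], "HTC%")]
    print(selected)
```

Because the pattern has no leading `%`, only values that start with HTC are selected.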
To browse to the output file
- From the cluster dashboard, click File Browser at the top.
- Click your storage account name, click your container name (which is the same as your cluster name), and then click user.
- Click admin, and then click the GUID whose last modified time is shortly after the job start time you noted earlier. Make a note of this GUID. You will need it in the next section.
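The rule for picking the right output folder (the GUID folder modified soonest after the job start time) can be sketched as a small selection function. The folder names and timestamps below are invented for illustration, and the function name is hypothetical; the File Browser itself does not expose this as an API in this tutorial.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple


def find_job_folder(folders: Iterable[Tuple[str, datetime]],
                    job_start: datetime) -> Optional[str]:
    """Return the folder whose last-modified time is the earliest one at or
    after job_start, or None if no folder qualifies."""
    candidates = [(modified, name) for name, modified in folders if modified >= job_start]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    # Hypothetical folder listing: (folder name, last modified time in UTC).
    listing = [
        ("guid-older-job", datetime(2014, 1, 1, 9, 0)),
        ("guid-this-job", datetime(2014, 1, 1, 10, 2)),
        ("guid-later-job", datetime(2014, 1, 1, 11, 30)),
    ]
    print(find_job_folder(listing, datetime(2014, 1, 1, 10, 0)))
```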
Connect to Microsoft business intelligence tools
You can use the Power Query add-in for Microsoft Excel to import the job output from HDInsight into Excel, where Microsoft Business Intelligence (BI) tools can be used to further analyze the results.
You must have Excel 2010 or 2013 installed to complete this part of the tutorial.
To download Microsoft Power Query for Excel
To import HDInsight data
- Open Excel, and create a new blank workbook.
- Click the Power Query menu, click From Other Sources, and then click From Azure HDInsight.
- Enter the Account Name of the Azure Blob Storage Account associated with your cluster, and then click OK. This is the storage account you created earlier in the tutorial.
- Enter the Account Key for the Azure Blob Storage Account, and then click Save.
- In the Navigator pane on the right, double-click the Blob storage container name. By default the container name is the same name as the cluster name.
- Locate stdout in the Name column. Verify the GUID in the corresponding Folder Path column matches the GUID you noted down earlier. Click Binary on the left of stdout.
- Click Close & Load in the upper left corner to import the Hive job output into Excel.
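If you want to inspect the same output outside Excel, the sketch below shows where the file sits in the container and how its rows could be split into fields. The path layout follows the folders you browsed in this tutorial (user, then the Hadoop user name, then the job GUID, then stdout). The assumption that the fields are tab-separated, and both function names, are not confirmed by this tutorial, so treat this as illustrative only.

```python
import csv
import io


def stdout_blob_path(hadoop_user: str, job_guid: str) -> str:
    """Path of the Hive job output inside the default container."""
    return f"user/{hadoop_user}/{job_guid}/stdout"


def parse_rows(stdout_text: str) -> list:
    """Split job output into rows of fields.

    Assumption: one record per line, fields separated by tab characters.
    """
    reader = csv.reader(io.StringIO(stdout_text), delimiter="\t")
    return [row for row in reader if row]


if __name__ == "__main__":
    # Hypothetical GUID placeholder and sample text, for illustration only.
    print(stdout_blob_path("admin", "<job-guid>"))
    print(parse_rows("field1\tfield2\nfield3\tfield4\n"))
```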
In this tutorial, you have learned how to provision a cluster with HDInsight, run a Hive job on it, and import the results into Excel, where they can be further processed and graphically displayed using BI tools. To learn more, see the following articles: