Get started using Hadoop with Hive in HDInsight
To get you started quickly using HDInsight, this tutorial shows you how to run a Hive query to extract meaningful information about mobile handset use from unstructured data in a Hadoop cluster. Then, you’ll analyze the results in Microsoft Excel.
In conjunction with the general availability of Azure HDInsight, Microsoft also provides HDInsight Emulator for Azure, formerly known as Microsoft HDInsight Developer Preview. The Emulator targets developer scenarios and only supports single-node deployments. For information about using HDInsight Emulator, see Get Started with the HDInsight Emulator.
What does this tutorial achieve?
Assume you have a large unstructured data set and you want to run queries on it to extract some meaningful information. That's exactly what we are going to do in this tutorial. Here's how we achieve this:
You can also watch a demo video of this tutorial:
Before you begin this tutorial, you must have the following:
- An Azure subscription. For more information about obtaining a subscription, see Purchase Options, Member Offers, or Free Trial.
- A computer with Office 2013 Professional Plus, Office 365 Pro Plus, Excel 2013 Standalone, or Office 2010 Professional Plus.
Estimated time to complete this tutorial: 30 minutes
In this tutorial
Create an Azure Storage account
HDInsight uses Azure Blob Storage for storing data. For more information, see Use Azure Blob storage with HDInsight.
When you provision an HDInsight cluster, you specify an Azure Storage account. A specific blob container from that account is designated as the default file system, like in the Hadoop distribute file system (HDFS). By default, the HDInsight cluster is provisioned in the same datacenter as the storage account you specify.
Don't share a default Blob storage container with multiple HDInsight clusters.
In addition to this storage account, you can add additional storage accounts when you custom configure an HDInsight cluster. This additional storage account can be from the same Azure subscription or from different Azure subscriptions. For instructions, see Provision HDInsight clusters using custom options.
To simplify this tutorial, only the default blob and the default storage account are used. In practice, the data files are usually stored in a designated storage account.
To create an Azure Storage account
- Sign in to the Azure Portal.
Click NEW in the lower-left corner, point to DATA SERVICES, point to STORAGE, and then click QUICK CREATE.
Enter URL, LOCATION, and REPLICATION information, and then click CREATE STORAGE ACCOUNT. (Affinity groups are not supported.) You will see the new storage account in the storage list.
The QUICK CREATE option to provision an HDInsight cluster (like the one we use in this tutorial) does not ask for a location when you are provisioning the cluster. Instead, by default, it co-locates the cluster in the same datacenter as the storage account. So, make sure you create your storage account in a location that is supported for the cluster. These are: East Asia, Southeast Asia, North Europe, West Europe, East US, West US, North Central US, South Central US.
Wait until the STATUS of the new storage account changes to Online.
Select the new storage account from the list and click MANAGE ACCESS KEYS at the bottom of the page.
Make a note of the STORAGE ACCOUNT NAME and the PRIMARY ACCESS KEY (or the SECONDARY ACCESS KEY—either of the keys work). You will need them later in the tutorial.
For more information, see How to Create a Storage Account and Use Azure Blob Storage with HDInsight.
Provision an HDInsight cluster
When you provision an HDInsight cluster, you provision Azure compute resources that contain Hadoop and related applications. In this section, you provision an HDInsight version 3.1 cluster, which is based on Hadoop version 2.4. You can also create Hadoop clusters for other versions by using the Azure portal, HDInsight PowerShell cmdlets, or the HDInsight .NET SDK. For instructions, see Provision HDInsight clusters using custom options. For information about HDInsight versions and their SLAs, see HDInsight component versioning.
To provision an HDInsight cluster
Sign in to the Azure Portal.
Click HDInsight in the left pane to list the status of the clusters in your account. In the following screenshot, there are no existing HDInsight clusters.
Click NEW in the lower-left corner, click Data Services, click HDInsight, and then click Hadoop.
Enter or select the following values:
|Cluster Name||Name of the cluster.|
|Cluster Size||Number of data nodes you want to deploy. The default value is 4. But the option to use 1 or 2 data nodes is also available from the drop-down list. Any number of cluster nodes can be specified by using the Custom Create option. Pricing details about the billing rates for various cluster sizes are available. Click the ? symbol above the drop-down list and follow the link that appears.|
|Password||The password for the admin account. The cluster user name "admin" is specified when you are not using the Custom Create option. Note that this is NOT the Windows Administrator account for the VMs on which the clusters are provisioned. The account name can be changed by using the Custom Create wizard.|
|Storage Account||Click the drop-down list, and select the storage account that you created. |
When a storage account is chosen, it cannot be changed. If the storage account is removed, the cluster will no longer be available for use. The HDInsight cluster is located in the same datacenter as the storage account.
Keep a copy of the cluster name. You will need it later in the tutorial.
Click Create HDInsight Cluster. When the provisioning completes, the status column shows Running.
Run samples from the portal
A successfully provisioned HDInsight cluster provides a query console that includes a Getting Started gallery to run samples directly from the portal. You can use the samples to learn how to work with HDInsight by walking through some basic scenarios. These samples come with all the required components, such as the data to analyze and the queries to run on the data. To learn more about the samples in the Getting Started gallery, see Learn Hadoop in HDInsight using the HDInsight Getting Started Gallery.
To run the sample, from the Azure portal, click the cluster name where you want to run the sample, and then click Query Console at the bottom of the page. From the webpage that opens, click the Getting Started Gallery tab, and then under the Samples category, click the sample that you want to run. Follow the instructions on the webpage to finish the sample. The following table lists a couple of samples and provides more information about what each sample does.
|Sample ||What does it do? |
|Sensor data analysis ||Learn how to use HDInsight to process historical data that is produced by heating, ventilation, and air conditioning (HVAC) systems to identify systems that are not able to reliably maintain a set temperature. |
|Website log analysis ||Learn how to use HDInsight to analyze website log files to get insight into the frequency of visits to the website in a day from external websites, and a summary of website errors that the users experience. |
|Twitter trend analysis ||Learn how to use HDInsight to analyze trends in Twitter. |
Run a HIVE query from the portal
Now that you have provisioned an HDInsight cluster, the next step is to run a Hive job to query a sample Hive table. We will use hivesampletable, which comes with HDInsight clusters. The table contains data about mobile device manufacturers, platforms, and models. We query this table to retrieve data for mobile devices by a specific manufacturer.
HDInsight Tools for Visual Studio comes with the Azure SDK for .NET version 2.5 or later. By using the tools in Visual Studio, you can connect to HDInsight cluster, create Hive tables, and run Hive queries. For more information, see Get started using HDInsight Hadoop Tools for Visual Studio.
To run a Hive job from the cluster dashboard
- Sign in to the Azure Portal.
- Click HDINSIGHT from the left pane. You will see a list of clusters, including the cluster you just created in the previous section.
- Click the name of the cluster that you want to use to run the Hive job, and then click QUERY CONSOLE at the bottom of the page.
A webpage opens in a different browser tab. Enter the Hadoop user account and password. The default user name is admin; the password is what you entered while provisioning the cluster. The dashboard looks like this:
There are several tabs at the top of the page. The default tab is Hive Editor, and the other tabs are Job History and File Browser. By using the dashboard, you can submit Hive queries, check Hadoop job logs, and browse files in storage.
Note that the URL of the webpage is <ClusterName>.azurehdinsight.net. So instead of opening the dashboard from the portal, you can open the dashboard from a web browser by using the URL.
On the Hive Editor tab, for Query Name, enter HTC20. The query name is the job title.
In the query pane, enter the following query:
SELECT * FROM hivesampletable
WHERE devicemake LIKE "HTC%"
Click Submit. It takes a few moments to get the results back. The screen refreshes every 30 seconds. You can also click Refresh to refresh the screen.
When the job completes, the screen looks like this:
Click the query name on the screen to see the output. Make a note of Job Start Time (UTC). You will need it later.
The page also shows the Job Output and the Job Log. You also have the option to download the output file (_stdout) and the log file (_stderr).
The Job Session table on the Hive Editor tab lists completed or running jobs if you stay on that tab. The table does not list any jobs if you navigate away from the page. The Job History tab maintains a list of all jobs, completed or running.
To browse to the output file
- On the cluster dashboard, click File Browser.
- Click your storage account name, click your container name (which is the same as your cluster name), and then click user.
Click admin and then click the GUID that has the last modified time (a little after the job start time you noted earlier). Copy this GUID. You will need it in the next section.
Connect to Microsoft business intelligence tools
You can use the Power Query add-in for Microsoft Excel to import the job output from HDInsight into Excel, where Microsoft business intelligence tools can be used to further analyze the results.
You must have Excel 2013 or 2010 installed to complete this part of the tutorial.
To download Microsoft Power Query for Excel
To import HDInsight data
- Open Excel, and create a new workbook.
Click the Power Query menu, click From Other Sources, and then click From Azure HDInsight.
Enter the Account Name of the Azure Blob Storage account that is associated with your cluster, and then click OK. (This is the storage account you created earlier in the tutorial.)
Enter the Account Key for the Azure Blob Storage account, and then click Save.
In the right pane, double-click the blob name. By default the blob name is the same as the cluster name.
Locate stdout in the Name column. Verify that the GUID in the corresponding Folder Path column matches the GUID you copied earlier. A match suggests that the output data corresponds to the job you submitted. Click Binary in the column left of stdout.
Click Close & Load in the upper-left corner to import the Hive job output into Excel.
In this tutorial, you learned how to provision a cluster with HDInsight, run a MapReduce job on it, and import the results into Excel, where they can be further processed and graphically displayed by using business intelligence tools. To learn more, see the following articles: