To help you learn Hadoop on Windows and start using HDInsight, this tutorial shows you how to run a Hive query on unstructured data in a Hadoop cluster and then analyze the results in Microsoft Excel.
Assume you have a large unstructured data set and you want to run a Hive query on it to extract some meaningful information. That's exactly what we are going to do in this tutorial. Here's how we achieve this:
Watch a demo video of this tutorial to learn Hadoop on HDInsight:
In conjunction with the general availability of Azure HDInsight, Microsoft also provides HDInsight Emulator for Azure, formerly known as Microsoft HDInsight Developer Preview. The Emulator targets developer scenarios and only supports single-node deployments. For information about using HDInsight Emulator, see Get Started with the HDInsight Emulator.
For instructions on how to provision an HBase cluster, see Provision HBase cluster in HDInsight. See What's the difference between Hadoop and HBase? to understand why you might choose one database over the other.
Before you begin this tutorial for Hadoop on Windows, you must have the following:
Estimated time to complete this tutorial: 30 minutes
When you provision a Hadoop cluster in HDInsight, you specify an Azure Storage account. A specific Blob storage container from that account is designated as the default file system, like in the Hadoop distributed file system (HDFS). By default, the HDInsight cluster is provisioned in the same data center as the storage account you specify. For more information, see Use Azure Blob storage with HDInsight
In addition to this storage account, you can add additional storage accounts when you custom-configure a cluster. This additional storage account can be from the same Azure subscription or from different Azure subscriptions. For instructions, see Provision HDInsight clusters using custom options.
This tutorial uses only the default blob and the default storage account.
To create an Azure Storage account
Click NEW in the lower-left corner and then enter the values as shown in the image.
Make sure you create the storage account in a location that is supported for the cluster. These are: East Asia, Southeast Asia, North Europe, West Europe, East US, West US, North Central US, South Central US.
Select the new storage account from the list and click MANAGE ACCESS KEYS at the bottom of the page. Make a note of the PRIMARY ACCESS KEY (or the SECONDARY ACCESS KEY—either of the keys work). You will need this later in the tutorial. For more information, see How to Create a Storage Account .
When you provision a cluster, you provision Azure compute resources that contain Hadoop and related applications. In this section, you provision an HDInsight version 3.1 cluster, which is based on Hadoop version 2.4. You can also create Hadoop clusters for other versions by using the Azure portal, HDInsight PowerShell cmdlets, or the HDInsight .NET SDK. For instructions, see Provision HDInsight clusters using custom options. For information about HDInsight versions and their SLAs, see HDInsight component versioning.
The steps in this article create an HDInsight cluster by using basic configuration settings. For information about other cluster configuration settings (such as using Azure virtual network or a metastore for Hive and Oozie), see Provision Hadoop clusters in HDInsight by using custom options.
To provision a Hadoop cluster
Sign in to the Azure Portal.
Click NEW in the lower-left corner and then enter the values as shown in the image.
These steps provision an HDInsight cluster with version 3.1. To create cluster with other versions, use the Custom Create method from the portal or use Azure PowerShell. For information about what's different between each version, see What's new in the cluster versions provided by HDInsight?. For information about using the CUSTOM CREATE option, see Provision HDInsight clusters using custom options.
A successfully provisioned HDInsight cluster provides a query console that includes a Getting Started gallery to run samples directly from the portal. You can use the samples to learn how to work with HDInsight by walking through some basic scenarios. These samples come with all the required components, such as the data to analyze and the queries to run on the data. To learn more about the samples in the Getting Started gallery, see Learn Hadoop in HDInsight using the HDInsight Getting Started Gallery.
To run the sample, from the Azure portal, click the cluster name where you want to run the sample, and then click Query Console at the bottom of the page. From the webpage that opens, click the Getting Started Gallery tab, and then under the Samples category, click the sample that you want to run. Follow the instructions on the webpage to finish the sample. The following table lists a couple of samples and provides more information about what each sample does.
|Sample||What does it do?|
|Sensor data analysis||Learn how to use HDInsight to process historical data that is produced by heating, ventilation, and air conditioning (HVAC) systems to identify systems that are not able to reliably maintain a set temperature.|
|Website log analysis||Learn how to use HDInsight to analyze website log files to get insight into the frequency of visits to the website in a day from external websites, and a summary of website errors that the users experience.|
|Twitter trend analysis||Learn how to use HDInsight to analyze trends in Twitter.|
Now that you have provisioned an HDInsight cluster, the next step is to run a Hive job to query a sample Hive table. We will use hivesampletable, which comes with HDInsight clusters. The table contains data about mobile device manufacturers, platforms, and models. A Hive query on this table retrieves data for mobile devices by a specific manufacturer.
HDInsight Tools for Visual Studio comes with the Azure SDK for .NET version 2.5 or later. By using the tools in Visual Studio, you can connect to HDInsight cluster, create Hive tables, and run Hive queries. For more information, see Get started using HDInsight Hadoop Tools for Visual Studio.
To run a Hive job from the cluster dashboard
A webpage opens in a different browser tab. Enter the Hadoop user account and password. The default user name is admin; the password is what you entered while provisioning the cluster. The dashboard looks like this:
There are several tabs at the top of the page. The default tab is Hive Editor, and the other tabs are Job History and File Browser. By using the dashboard, you can submit Hive queries, check Hadoop job logs, and browse files in storage.
On the Hive Editor tab, for Query Name, enter HTC20. The query name is the job title. In the query pane, enter the Hive query as shown in the image:
Click Submit. It takes a few moments to get the results back. The screen refreshes every 30 seconds. You can also click Refresh to refresh the screen.
After the status shows that the job is completed, click the query name on the screen to see the output. Make a note of Job Start Time (UTC). You will need it later.
The page also shows the Job Output and the Job Log. You also have the option to download the output file (_stdout) and the log file (_stderr).
To browse to the output file
Click admin and then click the GUID that has the last modified time (a little after the job start time you noted earlier). Copy this GUID. You will need it in the next section.
You can use the Power Query add-in for Microsoft Excel to import the job output from HDInsight into Excel, where Microsoft business intelligence tools can be used to further analyze the results.
You must have Excel 2013 or 2010 installed to complete this part of the tutorial.
To download Microsoft Power Query for Excel
To import HDInsight data
Click the Power Query menu, click From Other Sources, and then click From Azure HDInsight.
Enter the Account Name of the Azure Blob Storage account that is associated with your cluster, and then click OK. (This is the storage account you created earlier in the tutorial.)
Enter the Account Key for the Azure Blob Storage account, and then click Save.
In the right pane, double-click the blob name. By default the blob name is the same as the cluster name.
Locate stdout in the Name column. Verify that the GUID in the corresponding Folder Path column matches the GUID you copied earlier. A match suggests that the output data corresponds to the job you submitted. Click Binary in the column left of stdout.
Click Close & Load in the upper-left corner to import the Hive job output into Excel.
In this Hadoop tutorial, you learned how to provision a Hadoop cluster on Windows in HDInsight, run a Hive query on data, and import the results into Excel, where they can be further processed and graphically displayed with business intelligence tools. To learn more, see the following tutorials: