Use Hive with Hadoop in HDInsight
Apache Hive provides a means of running MapReduce job through an SQL-like scripting language, called HiveQL. Hive is a data warehouse system for Hadoop, which enables data summarization, querying, and analysis of large volumes of data. In this article, you use HiveQL to query a sample data file that is provided as part of HDInsight cluster provisioning.
In this article
The Hive usage case
Hive projects structure on a largely unstructured data, and then lets you query that data. Hive provides a layer of abstraction over the Java-based MapReduce framework, enabling users to query data without knowledge of Java or MapReduce. HiveQL, the Hive query language, allows you to write queries with statements that are similar to T-SQL. HiveQL queries are compiled to MapReduce for you by HDInsight and run on the cluster. Other benefits of Hive are:
- Hive allows programmers who are familiar with the MapReduce framework to plug in custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the HiveQL language.
- Hive is best suited for the batch processing of large amounts of immutable data (such as web logs). It is not appropriate for transaction applications that need very fast response times, such as database management systems.
- Hive is optimized for scalability (more machines can be added dynamically to the Hadoop cluster), extensibility (within the MapReduce framework and with other programming interfaces), and fault-tolerance. Latency is not a key design consideration.
Upload data for Hive tables
HDInsight uses an Azure Blob storage container as the default file system for Hadoop clusters. Some sample data files are added to the blob storage as part of cluster provisioning. This article uses a log4j sample file that is distributed with the HDInsight cluster and is stored at /example/data/sample.log under your blob storage container. Each log inside the file consists of a line of fields that contains a
[LOG LEVEL] field to show the type and the severity. For example:
2012-02-03 20:26:41 SampleClass3 [ERROR] verbose detail for id 1527353937
In the example above, the log level is ERROR.
HDInsight can access files stored in blob storage using the wasb prefix. For example, to access the sample.log file, you would use the following syntax:
Since WASB is the default storage for HDInsight, you can also access the file using /example/data/sample.log.
The above syntax, wasb:///, is used to access files stored on the default storage container for your HDInsight cluster. If you specified additional storage accounts when you provisioned your cluster, and want to access files stored on these accounts, you can access the data by specifying the container name and storage account address. For example, wasb://email@example.com/example/data/sample.log.
Run Hive queries using PowerShell
Hive queries can be run in PowerShell either by using the Start-AzureHDInsightJob cmdlet or the Invoke-Hive cmdlet.
Start-AzureHDInsightJob is a generic job runner, used to start Hive, Pig, and MapReduce jobs on an HDInsight cluster. Start-AzureHDInsightJob is asynchronous, and returns before the job has completed. Information about the job is returned, and can be used with cmdlets such as Wait-AzureHDInsightJob, Stop-AzureHDInsightJob, and Get-AzureHDInsightJobOutput. Get-AzureHDInsightJobOutput must be used to retrieve information written to STDOUT or STDERR by the job.
Invoke-Hive runs a Hive query, waits on it to complete, and retrieves the output of the query as one action.
Open an Azure PowerShell console windows. The instructions can be found in Install and configure Azure PowerShell.
Run the following command to connect to your Azure subscription:
You will be prompted to enter your Azure account credentials.
Set the variables in the following script and run it.
# Provide Azure subscription name, and the Azure Storage account and container that is used for the default HDInsight file system.
$subscriptionName = "<SubscriptionName>"
$storageAccountName = "<AzureStorageAccountName>"
$containerName = "<AzureStorageContainerName>"
# Provide HDInsight cluster name Where you want to run the Hive job
$clusterName = "<HDInsightClusterName>"
Run the following script to create a new table named log4jLogs using the sample data.
# Create an EXTERNAL table
$queryString = "DROP TABLE log4jLogs;" +
"CREATE EXTERNAL TABLE log4jLogs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION 'wasb:///example/data/';" +
"SELECT t4 AS sev, COUNT(*) AS cnt FROM log4jLogs WHERE t4 = '[ERROR]' GROUP BY t4;"
The HiveQL statements perform the following actions
- DROP TABLE - deletes the table and the data file, in case the table already exists
- CREATE EXTERNAL TABLE - creates a new 'external' table in Hive. External tables only store the table definition in Hive - the data is left in the original location
- ROW FORMAT - tells Hive how the data is formatted. In this case, the fields in each log are separated by a space
- STORED AS TEXTFILE LOCATION - tells Hive where the data is stored (the example/data directory,) and that it is stored as text
- SELECT - select a count of all rows where column t4 contain the value [ERROR]. This should return a value of 3 as there are three rows that contain this value
External tables should be used when you expect the underlying data to be updated by an external source, such as an automated data upload process, or by another MapReduce operation, but always want Hive queries to use the latest data.
Dropping an external table does not delete the data, only the table definition.
Run the following script to create a Hive job definition from the previous query.
# Create a Hive job definition
$hiveJobDefinition = New-AzureHDInsightHiveJobDefinition -Query $queryString
You can also use the -File switch to specify a HiveQL script file on HDFS.
Run the following script to submit the Hive job:
# Submit the job to the cluster
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $hiveJobDefinition
Run the following script to wait for the Hive job to complete:
# Wait for the Hive job to complete
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Run the following script to print the standard output:
# Print the standard error and the standard output of the Hive job.
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardOutput
The result is:
which means there were three instances of ERROR logs in the sample.log file.
To use Invoke-Hive, you must first set the cluster to use.
# Connect to the cluster
Use the following script to create a new 'internal' table named errorLogs using the Invoke-Hive cmdlet.
# Run a query to create an 'internal' Hive table
$response = Invoke-Hive -Query @"
CREATE TABLE IF NOT EXISTS errorLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) STORED AS ORC;
INSERT OVERWRITE TABLE errorLogs SELECT t1, t2, t3, t4, t5, t6, t7 FROM log4jLogs WHERE t4 = '[ERROR]';
# print the output on the console
These statements perform the following actions.
- CREATE TABLE IF NOT EXISTS - creates a table, if it does not already exist. Since the EXTERNAL keyword is not used, this is an 'internal' table, which is stored in the Hive data warehouse and is managed completely by Hive
- STORED AS ORC - stores the data in Optimized Row Columnar (ORC) format. This is a highly optimized and efficient format for storing Hive data
- INSERT OVERWRITE ... SELECT - selects rows from the log4jLogs table that contain [ERROR], then insert the data into the errorLogs table
Unlike EXTERNAL tables, dropping an internal table will delete the underlying data as well.
The output will look like the following.
For longer HiveQL queries, you can use PowerShell Here-Strings or HiveQL script files. The following snippet shows how to use the Invoke-Hive cmdlet to run a HiveQL script file. The HiveQL script file must be uploaded to WASB.
Invoke-Hive -File "wasb://<ContainerName>@<StorageAccountName>/<Path>/query.hql"
For more information about Here-Strings, see Using Windows PowerShell Here-Strings.
To verify that only rows containing [ERROR] in column t4 were stored to the errorLogs table, use the following statement to return all the rows from errorLogs.
#Select all rows
$response = Invoke-Hive -Query "SELECT * from errorLogs;"
Three rows of data should be returned, all containing [ERROR] in column t4.
Run Hive queries using Visual Studio
HDInsight Tools for Visual Studio comes with Azure SDK for .NET version 2.5 or later. Using the tools from Visual Studio, you can connect to HDInsight cluster, create Hive tables, and run Hive queries. For more information see Get started using HDInsight Hadoop Tools for Visual Studio.
Using Tez For Improved Performance
Apache Tez is a framework that allows for data intensive applications like Hive to execute much more efficiently at scale. In the latest release of HDInsight, Hive now supports running on Tez. This is currently off by default and must be enabled. In future cluster versions, this will be set to be on by default. In order to take advantage of Tez, the following value must be set for a Hive query:
This can submitted on a per query basis by placing this at the beginning of your query. One can also set this to be on by default on a cluster by setting the configuration value at cluster creation time. You can find more details in Provisioning HDInsight Clusters.
The Hive on Tez design documents contain a number of details on the implementation choices and tuning configurations.
While Hive makes it easy to query data using a SQL-like query language, other components available with HDInsight provide complementary functionality such as data movement and transformation. To learn more, see the following articles: