Use Hive with Hadoop on HDInsight
Apache Hive is a data warehouse system for Hadoop, which enables data summarization, querying, and analysis of data using HiveQL - a SQL-Like query language. Hive can be used to interactively explorer your data, or to create reusable batch processing jobs.
In this article, you will learn how you can use Hive with HDInsight.
Why use Hive?
Hive allows you to project structure on largely unstructured data. Once you define the structure, you can then use Hive to query that data without knowledge of Java or MapReduce. HiveQL, the Hive query language, allows you to write queries with statements that are similar to T-SQL.
Hive understands how to work with structured/semi-structured documents, such as text files where the fields are delimited by a specific characters. Hive also supports custom serializer/deserializers (SerDe) for complex, or irregularly structured data. For more information, see How to use a custom JSON SerDe with HDInsight.
Hive can also be extended through User Defined Functions (UDF). A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL. For an example of using a UDF with Hive, see Using Python with Hive and Pig in HDInsight and How to add a custom Hive UDF to HDInsight.
About the sample data
This example uses a log4j sample file, which is stored at /example/data/sample.log under your blob storage container. Each log inside the file consists of a line of fields that contains a
[LOG LEVEL] field to show the type and the severity. For example:
2012-02-03 20:26:41 SampleClass3 [ERROR] verbose detail for id 1527353937
In the example above, the log level is ERROR.
The sample data is stored in Azure Blob storage, which HDInsight uses as the default file system. HDInsight can access files stored in blob storage using the wasb prefix. For example, to access the sample.log file, you would use the following syntax:
Since WASB is the default storage for HDInsight, you can also access the file using /example/data/sample.log from HiveQL.
The above syntax, wasb:///, is used to access files stored on the default storage container for your HDInsight cluster. If you specified additional storage accounts when you provisioned your cluster, and want to access files stored on these accounts, you can access the data by specifying the container name and storage account address. For example, wasb://email@example.com/example/data/sample.log.
About the sample job
The following HiveQL statements will be used to project columns onto delimited data stored in the wasb:///example/data directory.
DROP TABLE log4jLogs;
CREATE EXTERNAL TABLE log4jLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION 'wasb:///example/data/';
SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs WHERE t4 = '[ERROR]' GROUP BY t4;
The HiveQL statements perform the following actions
- DROP TABLE - deletes the table and the data file, in case the table already exists
- CREATE EXTERNAL TABLE - creates a new 'external' table in Hive. External tables only store the table definition in Hive - the data is left in the original location, in the original format
- ROW FORMAT - tells Hive how the data is formatted. In this case, the fields in each log are separated by a space
- STORED AS TEXTFILE LOCATION - tells Hive where the data is stored (the example/data directory,) and that it is stored as text. The data can be in one file, or spread across multiple files within the directory
- SELECT - select a count of all rows where column t4 contain the value [ERROR]. This should return a value of 3 as there are three rows that contain this value
External tables should be used when you expect the underlying data to be updated by an external source, such as an automated data upload process, or by another MapReduce operation, but always want Hive queries to use the latest data.
Dropping an external table does not delete the data, only the table definition.
After creating the external table, the following statements will be used to create an internal table.
CREATE TABLE IF NOT EXISTS errorLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
STORED AS ORC;
INSERT OVERWRITE TABLE errorLogs
SELECT t1, t2, t3, t4, t5, t6, t7 FROM log4jLogs WHERE t4 = '[ERROR]';
These statements perform the following actions.
- CREATE TABLE IF NOT EXISTS - creates a table, if it does not already exist. Since the EXTERNAL keyword is not used, this is an 'internal' table, which is stored in the Hive data warehouse and is managed completely by Hive
- STORED AS ORC - stores the data in Optimized Row Columnar (ORC) format. This is a highly optimized and efficient format for storing Hive data
- INSERT OVERWRITE ... SELECT - selects rows from the log4jLogs table that contain [ERROR], then insert the data into the errorLogs table
Unlike EXTERNAL tables, dropping an internal table will delete the underlying data as well.
Using Tez For Improved Performance
Apache Tez is a framework that allows for data intensive applications like Hive to execute much more efficiently at scale. In the latest release of HDInsight, Hive now supports running on Tez. This is currently off by default and must be enabled. In future cluster versions, this will be set to be on by default. In order to take advantage of Tez, the following value must be set for a Hive query:
This can submitted on a per query basis by placing this at the beginning of your query. One can also set this to be on by default on a cluster by setting the configuration value at cluster creation time. You can find more details in Provisioning HDInsight Clusters.
The Hive on Tez design documents contain a number of details on the implementation choices and tuning configurations.
Run the HiveQL job
HDInsight can run HiveQL jobs using a variety of methods. Use the following table to decide which method is right for you, then follow the link for a walkthrough.
Now that you have learned how to use Hive with HDInsight, use the links below to explore other ways to work with Azure HDInsight.