A UDF that converts various date/time formats to Hive timestamp format

podle rban1
Poslední aktualizace: 24. 8. 2016
Upravit na GitHubu

A basic Java-based User Defined Function (UDF) for Hive. This UDF converts various timestamps(SQL,ISO,.NET formats) into the Hive Timestamp format.

The various formats that it converts are:

  • y/m/d
  • y/d/m
  • m/d/y
  • m/y/d
  • d/y/m
  • d/m/y
  • yyyy-mm-ddthh:mm:ss[.mmm]
  • YYYYMMDD[ hh:mm:ss[.mmm]]

About the code

The file timestampconv.java uses regex to convert various different timestamp formats, and extracts the relavant information from it. This information is then parsed into the Hive timestamp format.

The file timestampconvTest.java contains the JUNIT test cases which, validate the various formats that this example handles.

Build the sample

NOTE: This sample is built using Apache Maven, and also requires Java 1.7 or higher.

  1. Clone or download this project.

  2. From a command-line in the project directory, use the following to build, test, and package the project.

    mvn clean package

Once the process completes, a file named hiveudf-1.0-SNAPSHOT.jar can be found in the target directory.

Deploy the sample to HDInsight

Copy the hiveudf-1.0-SNAPSHOT.jar file to your HDInsight cluster. There are a variety of methods that can be used to do this. See Upload data for Hadoop jobs for information on uploading the file directly to the storage account used by HDInsight.

You can also simply use SCP to upload the data to the head node of a Linux-based cluster, then connect with SSH and use the hdfs dfs -put command to copy the data to storage for the cluster.

Ideally, you want the file in the default storage for the cluster; this way, it is accessible from all nodes in the cluster and is persisted when you delete the cluster.

Use the sample

The following example assumes the following:

  • The hiveudf-1.0-SNAPSHOT.jar has been stored into the /example/jars folder in the default storage account for the cluster.

  • The hiveudf1 table contains a column named a1 that contains a timestamp value that needs to be converted to Hive timestamp format.

    add jar wasb:///example/jars/hiveudf-1.0-SNAPSHOT.jar; CREATE TEMPORARY FUNCTION timeconv AS 'com.microsoft.example.timestampconv'; select cast (timeconv(a1,"yyyy-mm-ddthh:mm:ss[.mmm]") as timestamp) from hiveudf1;