Windows Azure HDInsight provides the capability to dynamically provision clusters running Apache Hadoop to process Big Data. You can find more information here in the initial blog post for this series, and you can click here to get started using it in the Windows Azure portal. This post enumerates the different ways a developer can interact with HDInsight, first by discussing the two main scenarios and then diving into the variety of capabilities in HDInsight. Because HDInsight is built on top of Apache Hadoop, there is a broad and rich ecosystem of tools and capabilities that you can leverage.
As we’ve worked with customers, two distinct scenarios have emerged: authoring jobs, where you use the tools to process big data, and integrating HDInsight with your application, where the input and output of jobs are incorporated into a larger application architecture. One key design aspect of HDInsight is its integration with Windows Azure Blob Storage as the default file system. This means that to interact with data, you can use existing tools and APIs for accessing data in blob storage. This blog post goes into more detail on our utilization of Blob Storage.
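For example, you can stage a job’s input data with the standard Windows Azure storage client library and then read it from Hadoop through the blob-backed file system. The sketch below is illustrative only: the account name, key, container, and paths are placeholders, and it assumes the Microsoft.WindowsAzure.Storage NuGet package.

```csharp
// A minimal sketch of staging input data in blob storage for HDInsight.
// The storage account, key, container, and file names are placeholders.
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class UploadInput
{
    static void Main()
    {
        // Hypothetical credentials; substitute your own storage account.
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=mystorage;AccountKey=...");
        CloudBlobClient client = account.CreateCloudBlobClient();

        // The container backing your cluster's default file system.
        CloudBlobContainer container = client.GetContainerReference("mycontainer");
        CloudBlockBlob blob = container.GetBlockBlobReference("input/data.txt");

        // Upload a local file. A job on the cluster can then read it through
        // the default file system, e.g. (in the current preview) an asv:// URI
        // such as asv://mycontainer@mystorage.blob.core.windows.net/input/data.txt
        using (var stream = File.OpenRead(@"C:\data\data.txt"))
        {
            blob.UploadFromStream(stream);
        }
    }
}
```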
Within the context of authoring jobs, there is a wide array of tools available. At a high level, there are tools that are part of the existing Hadoop ecosystem, a set of projects we’ve built to get .NET developers started with Hadoop, and work we’ve begun on leveraging JavaScript for interacting with Hadoop.
Job Authoring
Existing Hadoop Tools
As HDInsight leverages Apache Hadoop via the Hortonworks Data Platform, there is a high degree of fidelity with the Hadoop ecosystem, and many capabilities work “as-is.” This means that your investments and knowledge in any of the following tools carry over to HDInsight. Clusters are created with the following Apache projects for distributed processing:
- Map/Reduce
- Map/Reduce is the foundation of distributed processing in Hadoop. You can write jobs in Java or leverage other languages and runtimes through Hadoop Streaming (a minimal streaming sketch in C# appears after this list).
- A simple guide to writing Map/Reduce jobs on HDInsight is available here.
- Hive
- Hive uses a SQL-like syntax to express queries that compile to a set of Map/Reduce programs. Hive supports many of the constructs you would expect in SQL (aggregation, grouping, filtering, etc.) and parallelizes easily across the nodes in your cluster.
- A guide to using Hive is here.
- Pig
- Pig is a platform for dataflow programs: scripts written in its language, Pig Latin, compile to a series of Map/Reduce programs.
- A guide to getting started with Pig on HDInsight is here.
- Oozie
- Oozie is a workflow scheduler for managing a directed acyclic graph of actions, where actions can be Map/Reduce, Pig, Hive or other jobs. You can find more details in the quick start guide here.
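As referenced above, Hadoop Streaming lets you express the Map step as any executable that reads lines from stdin and writes tab-separated key/value pairs to stdout. Below is a minimal word-count mapper written as a .NET console application; it is a sketch only, and you would still wire the compiled executable (and typically a reducer) into the streaming jar when submitting the job.

```csharp
// Minimal word-count mapper for Hadoop Streaming.
// Streaming feeds input records on stdin, one line at a time, and expects
// "key<TAB>value" pairs on stdout; the framework sorts and groups the pairs
// by key before invoking the reducer.
using System;

class WordCountMapper
{
    static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            foreach (var word in line.Split(new[] { ' ', '\t' },
                         StringSplitOptions.RemoveEmptyEntries))
            {
                Console.WriteLine("{0}\t{1}", word.ToLowerInvariant(), 1);
            }
        }
    }
}
```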
You can find an updated list of Hadoop components here. The table below represents the versions for the current preview:
| Component | Version |
| --- | --- |
| Apache Hadoop | 1.0.3 |
| Apache Hive | 0.9.0 |
| Apache Pig | 0.9.3 |
| Apache Sqoop | 1.4.2 |
| Apache Oozie | 3.2.0 |
| Apache HCatalog | 0.4.1 |
| Apache Templeton | 0.1.4 |
Additionally, other projects in the Hadoop space, such as Mahout (see this sample) or Cascading, can easily be used on top of HDInsight. We will be publishing additional blog posts on these topics in the future.
.NET Tooling
We’re working to build out a portfolio of tools that let developers apply their skills and investments in .NET to Hadoop. These projects are hosted on CodePlex, with packages available from NuGet for authoring jobs to run on HDInsight. For instructions, see the getting-started pages on the CodePlex site.
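To give a flavor of the programming model, here is a hedged sketch of a mapper written against the Map/Reduce package from the CodePlex SDK. It follows the MapperBase pattern shown in the SDK’s getting-started samples, but class and member names may differ between preview releases, so treat it as illustrative rather than definitive.

```csharp
// Illustrative sketch only: assumes the Microsoft.Hadoop.MapReduce NuGet
// package. The MapperBase/EmitKeyValue pattern follows the SDK's samples,
// but exact names may vary across preview releases.
using Microsoft.Hadoop.MapReduce;

public class WordCountMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Emit one (word, "1") pair per token; the framework groups the
        // pairs by key before the reduce phase.
        foreach (var word in inputLine.Split(' '))
        {
            context.EmitKeyValue(word.ToLowerInvariant(), "1");
        }
    }
}
```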
Running Jobs
In order to run any of these jobs, there are a few options:
- Run them directly from the head node. To do this, RDP to your cluster, open the Hadoop command prompt, and use the command-line tools directly.
- Submit them remotely using the REST APIs on the cluster (see the following section on integrating HDInsight with your applications for more details).
- Leverage tools on the HDInsight dashboard. After you create your cluster, there are a few capabilities in the cluster dashboard for submitting jobs:
- Create Job
- Interactive Console
Integrating HDInsight with your Applications
Open REST APIs
To provide a simple surface for client apps to integrate with, we’ve worked to ensure that all capabilities on a cluster are surfaced via a set of secured REST APIs.
- WebHCatalog — Metadata management as well as remote job submission, history and management
- Ambari — Monitoring of a running cluster
- Oozie — Managing and Scheduling Oozie workflows
We currently provide .NET clients for these APIs, available here, and you can easily build clients using the HTTP stacks in other languages as well (see the sketch below).
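As an illustration, the following sketch submits a Hive query through the WebHCat (Templeton) endpoint using the standard .NET HTTP stack. The cluster URL, port, credentials, query, and status directory are all placeholders; the request shape follows the Apache Templeton documentation for the /templeton/v1/hive resource.

```csharp
// Hedged sketch of remote Hive job submission via the WebHCat REST API.
// All names here (cluster URL, credentials, table) are placeholders.
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;

class SubmitHiveJob
{
    static void Main()
    {
        var handler = new HttpClientHandler
        {
            // HDInsight secures the endpoint with the credentials chosen
            // when the cluster was created.
            Credentials = new NetworkCredential("admin", "password")
        };
        using (var client = new HttpClient(handler))
        {
            var body = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                { "user.name", "admin" },
                // HiveQL to execute; WebHCat compiles it to Map/Reduce.
                { "execute", "SELECT market, COUNT(*) FROM sample_table GROUP BY market" },
                // Directory (in blob storage) for the job's stdout/stderr.
                { "statusdir", "hivejobstatus" }
            });

            HttpResponseMessage response = client.PostAsync(
                "https://mycluster.azurehdinsight.net:563/templeton/v1/hive",
                body).Result;

            // On success WebHCat returns JSON containing a job id,
            // which you can poll to track the job's progress.
            Console.WriteLine(response.Content.ReadAsStringAsync().Result);
        }
    }
}
```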
Connectivity via ODBC
By leveraging the ODBC driver (instructions here), you can easily integrate existing applications, such as Excel, with data stored in Hive tables in HDInsight.
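The same driver also works from code. The sketch below queries a Hive table over a System.Data.Odbc connection; the DSN name and table are hypothetical, and the DSN itself would be configured per the driver instructions above.

```csharp
// Minimal sketch of querying Hive over ODBC from .NET.
// "HiveDSN" and "sample_table" are hypothetical; configure a DSN that
// points at your cluster per the Hive ODBC driver instructions.
using System;
using System.Data.Odbc;

class QueryHive
{
    static void Main()
    {
        using (var conn = new OdbcConnection("DSN=HiveDSN"))
        {
            conn.Open();
            var cmd = new OdbcCommand(
                "SELECT market, COUNT(*) AS hits FROM sample_table GROUP BY market",
                conn);
            using (OdbcDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}\t{1}", reader[0], reader[1]);
                }
            }
        }
    }
}
```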
Debugging/Testing
To provide an experience where you can work disconnected from a cluster running in Azure, we have provided the HDInsight Developer Preview, a one-box setup easily installed from the Web Platform Installer. You can use it to experiment, debug, and test all of the technologies above on a smaller set of data, then deploy the artifacts to Azure and run against your big data in Blob Storage. To install it, simply search for HDInsight inside the Web Platform Installer, or click here to install directly from the web.
Summary
This post covered the wide array of options you have for writing Hadoop jobs and for integrating HDInsight into your applications. HDInsight enables you to develop with the platform and tools of your choice, from Java to .NET to JavaScript, on top of clusters that are easily deployed and managed using Windows Azure.
The final post in our 5-part series on HDInsight will explore how to analyze data from HDInsight with Excel. Stay tuned!