Azure HDInsight and Azure Storage

In our last blog post we provided a walkthrough of the updated HDInsight Service on Windows Azure. Today’s post, which focuses on HDInsight and Azure Storage, is the third in our five-part series on HDInsight.

One of the interesting and differentiating aspects of the HDInsight Service on Windows Azure is the ability to choose where you want to store your data. You can store it in the native HDFS file system that is local to the compute nodes, or you can use an Azure Blob Store container as an HDFS file system. In fact, when you provision your HDInsight cluster, it will by default create an Azure Blob Store container in your storage account and use it as the default HDFS file system.

Alternatively, you can choose an existing Azure Blob Store container as your default HDFS file system by creating your cluster through the custom create option. For example, the screenshot shows how you can specify a Blob Store container with the name ‘netflix’ as the default file system.

This container could have been previously provisioned as an HDInsight HDFS file system, or it could have just been an arbitrary Azure Blob Store container that happens to contain data that you want to analyze!

In our case, the netflix container contains three blobs that use a folder naming scheme:

Benefits of using an Azure Storage Container

While the storage container is not local to the compute nodes, and thus seems to violate the Hadoop paradigm of co-locating compute with storage, there are several benefits associated with storing the data in an Azure Blob Store container:

Data Reuse and Sharing: The data inside the compute nodes is “locked” behind the HDFS APIs. This means that only applications that have knowledge of HDFS and have access to the compute cluster can use the data. The data in an Azure Storage container can be accessed either through the HDFS APIs or through the Azure Blob Store REST APIs. Thus, a larger set of applications and tools can produce and consume the data, and one application can produce data that others consume.

Data Archiving: Since the data inside the compute nodes only lives as long as you have provisioned your HDInsight cluster, you either have to keep your cluster alive beyond your compute time, or you have to reload your data into the cluster every time you provision one to perform your computations. In an Azure Storage container you can keep the data stored for as long as you wish.

Data Storage Cost: Storing data inside an active HDInsight cluster for the long term will be more costly than storing the data in an Azure Storage container, since the cost of a compute cluster is higher than the cost of an Azure Blob Store container. In addition, since the data does not have to be reloaded every time you create a new compute cluster, you save on data loading costs as well.

Elastic scale-out: While the HDInsight cluster provides you with a scaled-out file system, the scale is determined by the number of nodes that you provision for your cluster. Changing the scale can become a more complicated process than relying on the Azure Blob Store’s elastic scaling capabilities that you get automatically when using an Azure Storage container.

Geo Replication: Your Azure Blob Store containers can be geo-replicated through the Azure Portal! While this gives you geographic recovery and data redundancy, a fail-over to the geo-replicated location will severely impact your performance and may incur additional costs. So our recommendation is to choose geo-replication wisely, and only if the value of the data is worth the additional costs.

Furthermore, the implied performance cost of not having compute and storage co-located is mitigated by the way the compute clusters are provisioned close to the storage account resources inside the Azure data center, where the high-speed network makes it very efficient for the compute nodes to access the data in the Azure Storage container. Depending on general load, compute and access patterns, we have observed only slight performance degradation and often even faster access!

And please don’t forget, by not having to reload the data into the file system every time you provision an HDInsight cluster, you can save on data load times and data movement charges!
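
To make the Data Reuse and Sharing point above more concrete, here is a minimal Python sketch of how a blob inside the container can be addressed over the Blob Store REST endpoint, without going through HDFS at all. The helper name blob_rest_url is our own, the storage account name is borrowed from the example later in this post, and the blob name is a made-up placeholder. Actually reading the blob also requires either public read access on the container or a signed request, which this sketch does not cover.

```python
from urllib.parse import quote


def blob_rest_url(account: str, container: str, blob_name: str) -> str:
    """Return the HTTPS URL of a blob in an Azure Storage container.

    blob_rest_url is a hypothetical helper for illustration only.
    Blob names may contain "/" as a virtual folder separator, so the
    slash is left unescaped while other special characters are quoted.
    """
    host = f"{account}.blob.core.windows.net"
    return f"https://{host}/{container}/{quote(blob_name, safe='/')}"


if __name__ == "__main__":
    # "movie/ratings.txt" is a placeholder blob name, not one from the post.
    print(blob_rest_url("mryshadoop", "netflix", "movie/ratings.txt"))
```

Any tool that can issue HTTP requests can work against such a URL, which is what lets applications outside the cluster produce or consume the same data that your Hadoop jobs use.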

How to use Azure Storage

Let’s look at a simple example using the Azure Blob Store container we designated as the default file system above. We can check the content of the file system through the JavaScript console using the standard HDFS file system command:

As you can see, a couple of additional directories got created, but the file system looks like any other HDFS file system. I can of course also address it with the explicit URI scheme that we designed to address Azure Blob Store containers:

I have the option of using the asvs scheme with SSL as above, or without, by specifying the command as #lsr asv://netflix@mryshadoop.blob.core.windows.net/movie. Note that creation dates of directories that are only implied by the name of the blob are currently shown as 1970-01-01 00:00. This may change in the future.
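
The two points above, the URI scheme and the directories that are only implied by blob names, can be sketched in a few lines of Python. This is an illustration under our own assumptions rather than code that ships with HDInsight: the helpers asv_uri and implied_directories are hypothetical, and the blob names in the usage example are placeholders.

```python
from posixpath import dirname


def asv_uri(container: str, account: str, path: str = "", secure: bool = True) -> str:
    """Build an asvs:// (SSL) or asv:// (no SSL) URI for a container path.

    Hypothetical helper; it only mirrors the URI shape shown in this post.
    """
    scheme = "asvs" if secure else "asv"
    host = f"{account}.blob.core.windows.net"
    return f"{scheme}://{container}@{host}/{path.lstrip('/')}"


def implied_directories(blob_names):
    """Return the folders implied by "/" separators in flat blob names.

    The container stores no object for these folders, which is why a
    listing has no real creation date to show for them.
    """
    dirs = set()
    for name in blob_names:
        parent = dirname(name)
        while parent:
            dirs.add(parent)
            parent = dirname(parent)
    return sorted(dirs)


if __name__ == "__main__":
    print(asv_uri("netflix", "mryshadoop", "movie", secure=False))
    # Placeholder blob names, not the actual contents of the container.
    print(implied_directories(["movie/ratings/part-0000", "movie/titles.txt"]))
```

Passing secure=False yields the same asv:// form as the #lsr command quoted above, while the default produces the asvs:// variant.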

Now you can create Hive tables and run Hive queries, or run your other MapReduce jobs over the data.

Learn More

This post gave you a small taste of using Azure Blob Storage with HDInsight and its benefits. A more detailed tutorial will soon be available. To continue learning about HDInsight, visit our Getting Started page.

We hope you will find HDInsight a valuable new service and are looking forward to your feedback.

The next blog in our 5-part series will cover the Developer experience on HDInsight. Stay tuned!