On-premises and cloud hybrid Hadoop data pipelines with Hortonworks and Cortana Analytics | Azure Blog

The Azure Data Factory and Hortonworks Falcon teams have jointly announced a private preview for building hybrid Hadoop data pipelines that combine on-premises Hortonworks Hadoop clusters with cloud-based Cortana Analytics services such as HDInsight Hadoop clusters and Azure Machine Learning.

Customers maintaining on-premises Hadoop-based data lakes often need to enable hybrid data flows, extending the on-premises data lake into the cloud for various reasons:

Keep PII and other sensitive data on-premises for privacy and compliance reasons, but leverage the cloud for elastic scale workloads that don’t need the sensitive information.
Leverage the cloud for cross-region replication and disaster recovery.
Leverage the cloud for dev and test environments.

In such hybrid scenarios, customers often find themselves stuck in a split world, with two separate ETL data pipeline solutions and no unified view of the data flows. The hybrid pipeline private preview solves this problem by letting you model and visualize your entire data flow and its dependencies, across on-premises and cloud, as a cloud-based data factory. You get streamlined operations by using Data Factory's best-in-class monitoring and management tools, from debugging failures to rerunning failed workflows, regardless of where the job ran.

The hybrid Hadoop pipeline preview lets you add your on-premises Hadoop cluster as a compute target for running jobs in Data Factory, just as you would add other compute targets such as an HDInsight-based Hadoop cluster in the cloud.

[Figure: hybrid pipeline]

You can enable connectivity between your on-premises cluster and the Data Factory service over a secure channel with just a few clicks (GitHub sample link towards the end). Once you do that, as shown above, you can develop a hybrid pipeline that does the following:

Orchestrate Hadoop Hive and Pig jobs on-premises with the new on-premises Hadoop Hive and Pig activities in Data Factory.
Copy data from on-premises HDFS to Azure Blob storage in the cloud with the new on-premises replication activity.
Add more steps to the pipeline and continue big data processing in the cloud, for example with an HDInsight Hadoop activity.
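
The three steps above can be pictured as a small dependency graph. The Python sketch below is purely illustrative: it does not use the Data Factory or Falcon APIs, and the Activity class, its fields, and the activity names are all hypothetical. It only shows how a single pipeline definition can record both the order of the steps and where each one runs.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Activity:
    """One pipeline step. Every name and field here is hypothetical."""
    name: str
    kind: str               # e.g. "hive", "replication"
    runs_on: str            # "on-premises" or "cloud"
    depends_on: tuple = ()  # names of activities that must finish first


def execution_order(activities):
    """Return activity names in an order that respects their dependencies."""
    graph = {a.name: set(a.depends_on) for a in activities}
    return list(TopologicalSorter(graph).static_order())


# A hypothetical pipeline mirroring the three steps described above.
pipeline = [
    Activity("clean_logs", "hive", "on-premises"),
    Activity("copy_to_blob", "replication", "on-premises", ("clean_logs",)),
    Activity("score_in_cloud", "hdinsight-hive", "cloud", ("copy_to_blob",)),
]

order = execution_order(pipeline)
by_name = {a.name: a for a in pipeline}

for name in order:
    step = by_name[name]
    print(f"{step.name}: {step.kind} activity on {step.runs_on}")
```

Because each step names its predecessors, the same structure could also be used to work out which downstream steps need rerunning after a failure, whichever side of the hybrid boundary the failed job ran on.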

The private preview is available to a small set of select customers. If you are interested in this capability, please take this brief survey and we will reach out to you if your use case is a good fit.

If you are already part of the private preview, please refer to our GitHub sample for more details on how hybrid pipelines are enabled and how Data Factory and Falcon communicate with each other, as well as step-by-step instructions on how to set things up.