Extract, transform, and load (ETL) using HDInsight

Azure Data Factory
Azure Data Lake Storage
Azure HDInsight

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more information, such as potential use cases, alternative services, implementation considerations, or pricing guidance, let us know by providing GitHub feedback.

This solution idea illustrates how to extract, transform, and load big data by using on-demand Azure HDInsight clusters with Hadoop MapReduce and Apache Spark.

Architecture

Diagram that shows the dataflow for extracting, transforming, and loading big data by using Azure HDInsight, Hadoop MapReduce, and Apache Spark.

Download a Visio file of this architecture.

Dataflow

The data flows through the architecture as follows:

  1. Use Azure Data Factory to establish linked services to source systems and data stores. Azure Data Factory pipelines support more than 90 connectors, which include generic protocols for data sources that don't have a native connector. (The first sketch after this list shows one way to create these linked services programmatically.)

  2. Load data from the source systems into Azure Data Lake Storage with the Copy Data tool. (The second sketch after this list shows a programmatic equivalent.)

  3. Azure Data Factory can create an on-demand HDInsight cluster. Start by creating an on-demand HDInsight linked service. Then create a pipeline and use the HDInsight activity that matches the Hadoop framework in use (for example, Hive, MapReduce, or Spark). (The third sketch after this list shows this step with a Spark activity.)

  4. Trigger the pipeline in Azure Data Factory. This architecture assumes that the Hadoop script run by the HDInsight activity from step 3 uses Azure Data Lake Storage as its file system. An on-demand HDInsight cluster runs the script and writes the output to a curated area of the data lake. (The trigger call appears at the end of the third sketch.)
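
The sketches below are minimal, hypothetical examples that use the azure-mgmt-datafactory Python SDK; the same objects can be defined in the Data Factory UI or as JSON. In this first sketch (step 1), the subscription, tenant, credentials, account names, and the resource group and factory names ("etl-rg", "etl-adf") are all placeholders.

```python
# Step 1 sketch: create linked services with the azure-mgmt-datafactory SDK.
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSLinkedService,
    AzureStorageLinkedService,
    LinkedServiceResource,
    SecureString,
)

SUBSCRIPTION_ID = "<subscription-id>"
RG, ADF = "etl-rg", "etl-adf"  # hypothetical resource group and data factory

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
adf_client = DataFactoryManagementClient(credential, SUBSCRIPTION_ID)

# Linked service to a source system (here, an Azure Storage account).
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)
adf_client.linked_services.create_or_update(RG, ADF, "StorageLinkedService", storage_ls)

# Linked service to Azure Data Lake Storage Gen2, the destination for the raw
# and curated data in this architecture.
lake_ls = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(
        url="https://<account>.dfs.core.windows.net",
        account_key="<account-key>",
    )
)
adf_client.linked_services.create_or_update(RG, ADF, "DataLakeLinkedService", lake_ls)
```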
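
Continuing the same hypothetical setup, the second sketch (step 2) copies source data into the raw zone of the data lake, which is roughly what the Copy Data tool generates for you. The dataset, container, and pipeline names are assumptions.

```python
# Step 2 sketch: copy source data into the data lake's raw zone.
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    AzureBlobFSDataset,
    AzureBlobFSSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    DatasetResource,
    LinkedServiceReference,
    PipelineResource,
)

source_ref = LinkedServiceReference(
    type="LinkedServiceReference", reference_name="StorageLinkedService")
lake_ref = LinkedServiceReference(
    type="LinkedServiceReference", reference_name="DataLakeLinkedService")

# Input dataset: files in a container on the source storage account.
ds_in = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=source_ref, folder_path="source-container/input"))
adf_client.datasets.create_or_update(RG, ADF, "SourceDataset", ds_in)

# Output dataset: the raw zone of the data lake.
ds_out = DatasetResource(properties=AzureBlobFSDataset(
    linked_service_name=lake_ref, folder_path="datalake/raw"))
adf_client.datasets.create_or_update(RG, ADF, "RawZoneDataset", ds_out)

copy_activity = CopyActivity(
    name="CopyToRawZone",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SourceDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="RawZoneDataset")],
    source=BlobSource(),
    sink=AzureBlobFSSink(),
)
adf_client.pipelines.create_or_update(
    RG, ADF, "IngestPipeline", PipelineResource(activities=[copy_activity]))
```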
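
The third sketch covers steps 3 and 4: an on-demand HDInsight linked service, a pipeline with a Spark activity, and a manual trigger. The cluster settings and script path are illustrative, and transform.py stands in for whatever Spark job writes your curated output.

```python
# Steps 3-4 sketch: on-demand HDInsight cluster running a Spark job.
from azure.mgmt.datafactory.models import (
    HDInsightOnDemandLinkedService,
    HDInsightSparkActivity,
    LinkedServiceReference,
    LinkedServiceResource,
    PipelineResource,
    SecureString,
)

storage_ref = LinkedServiceReference(
    type="LinkedServiceReference", reference_name="StorageLinkedService")

# On-demand linked service: Data Factory provisions the cluster when the
# pipeline runs and deletes it after the time-to-live expires.
hdi_ls = LinkedServiceResource(properties=HDInsightOnDemandLinkedService(
    cluster_type="spark",
    cluster_size=4,
    time_to_live="00:15:00",
    version="4.0",
    linked_service_name=storage_ref,  # primary storage for the transient cluster
    host_subscription_id=SUBSCRIPTION_ID,
    service_principal_id="<client-id>",
    service_principal_key=SecureString(value="<client-secret>"),
    tenant="<tenant-id>",
    cluster_resource_group=RG,
))
adf_client.linked_services.create_or_update(RG, ADF, "OnDemandHDInsight", hdi_ls)

# Spark activity: transform.py is a hypothetical job that reads the raw zone
# and writes curated output to the data lake.
spark_activity = HDInsightSparkActivity(
    name="TransformToCurated",
    root_path="scripts/spark",
    entry_file_path="transform.py",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="OnDemandHDInsight"),
)
adf_client.pipelines.create_or_update(
    RG, ADF, "TransformPipeline", PipelineResource(activities=[spark_activity]))

# Step 4: trigger the pipeline and check the run status.
run = adf_client.pipelines.create_run(RG, ADF, "TransformPipeline", parameters={})
print(adf_client.pipeline_runs.get(RG, ADF, run.run_id).status)
```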

Components

  • Azure Data Factory - Cloud-scale data integration service for orchestrating dataflow.
  • Azure Data Lake Storage - Scalable and cost-effective cloud storage for big data processing.
  • Apache Hadoop - Big data distributed processing framework.
  • Apache Spark - Big data distributed processing framework that supports in-memory processing to boost the performance of big data applications.
  • Azure HDInsight - Cloud distribution of Hadoop components.

Scenario details

This solution idea describes the data flow for an ETL use case.

Potential use cases

You can use Azure HDInsight for various scenarios in big data processing. The data can be historical (data that's already collected and stored) or real-time (data that's streamed directly from the source). For more information about processing such data, see Scenarios for using HDInsight.

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal author:


Next steps

Learn more about the component technologies:

  • Azure Data Factory
  • Azure Data Lake Storage
  • Apache Hadoop
  • Apache Spark
  • Azure HDInsight

Explore related architectures: