
Create big data pipelines using Azure Data Lake and Azure Data Factory

This week, Microsoft announced a new and expanded Azure Data Lake, making big data processing and analytics simpler and more accessible. The expanded Azure Data Lake includes Azure Data Lake Store, Azure Data Lake Analytics, and Azure HDInsight.

The Azure Data Lake Store provides a single repository where you can easily capture data of any size, type, and speed without forcing changes to your application as data scales. Azure Data Lake Analytics is a new service built on Apache YARN that includes U-SQL, a language unifying the benefits of SQL with the expressive power of user code. The service scales dynamically, so you can focus on your business goals, and it lets you run analytics on any kind of data with enterprise-grade security through Azure Active Directory.
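
To give a flavor of the language, here is a minimal sketch of what a U-SQL script might look like. The schema, file paths, and column names are illustrative assumptions, not a finalized API:

    // Illustrative U-SQL sketch: paths, schema, and column names are assumptions.
    @log =
        EXTRACT date string,
                region string,
                eventName string
        FROM "/input/weblog.tsv"
        USING Extractors.Tsv();

    @eventsByRegion =
        SELECT region,
               COUNT(*) AS eventCount
        FROM @log
        WHERE eventName.StartsWith("Click")   // a C# string method used directly in the query
        GROUP BY region;

    OUTPUT @eventsByRegion
    TO "/output/eventsByRegion.tsv"
    USING Outputters.Tsv();

Note how a C# expression (eventName.StartsWith) sits directly inside an otherwise SQL-like query; that is the "user code" part of U-SQL.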

You will be able to create big data pipelines in Azure Data Factory using Azure Data Lake Store and Azure Data Lake Analytics, in addition to using our existing support for Azure HDInsight. This capability will be added to ADF when Azure Data Lake Store and Azure Data Lake Analytics become available in preview later this year. It will allow you to do the following:

Move data to and from Azure Data Lake Store (as source or sink)

You will be able to move data from the following sources to Azure Data Lake Store:

  • Azure Blob
  • Azure SQL Database
  • Azure Table
  • On-premises SQL Server Database
  • Azure DocumentDB
  • Azure SQL DW
  • On-premises File System
  • On-premises Oracle Database
  • On-premises MySQL Database
  • On-premises DB2 Database
  • On-premises Teradata Database
  • On-premises Sybase Database
  • On-premises PostgreSQL Database
  • On-premises HDFS
  • Generic OData
  • Generic ODBC

You will also be able to move data from Azure Data Lake Store to a number of sinks, such as Azure Blob, Azure SQL Database, and on-premises file systems.

We are continuously working to add more data sources and expand the support matrix each month.

For example, the pipeline below showcases data movement from Azure Blob Storage to Azure Data Lake Store using the Copy Activity in Azure Data Factory.

[Figure: Data Movement Pipeline]
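
While the final schema will ship with the preview, a Copy Activity pipeline of this shape might be sketched in Data Factory's JSON authoring format roughly as follows. The dataset names, the AzureDataLakeStoreSink type, and the schedule are assumptions:

    {
      "name": "CopyBlobToDataLakePipeline",
      "properties": {
        "description": "Illustrative sketch: copies log data from Azure Blob Storage to Azure Data Lake Store",
        "activities": [
          {
            "name": "BlobToDataLakeCopy",
            "type": "Copy",
            "inputs": [ { "name": "RawLogsBlobDataset" } ],
            "outputs": [ { "name": "RawLogsDataLakeDataset" } ],
            "typeProperties": {
              "source": { "type": "BlobSource" },
              "sink": { "type": "AzureDataLakeStoreSink" }
            },
            "scheduler": { "frequency": "Hour", "interval": 1 }
          }
        ],
        "start": "2015-10-01T00:00:00Z",
        "end": "2015-10-02T00:00:00Z"
      }
    }

The input and output datasets would point at the Blob container and Data Lake Store folder respectively, with the activity producing one hourly slice per run.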

Create E2E big data ADF pipelines that run U-SQL scripts as a processing step on the Azure Data Lake Analytics service

For example, a very common use case across multiple industry verticals (retail, finance, gaming) is log processing. The E2E big data workflow below showcases the following:

  • An ADF pipeline moves log data from Azure Blob Storage to Azure Data Lake Store.
  • An ADF pipeline consumes the logs copied to the Azure Data Lake Store account in the previous step and processes them by running a U-SQL script on Azure Data Lake Analytics as one of the processing steps (a sketch of this activity follows the figure below). The U-SQL script computes events by region, which can be consumed by downstream processes.

[Figure: ADF Pipeline]
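
A sketch of the U-SQL processing step, again in Data Factory's JSON format. Since the Data Lake Analytics integration is not yet in preview, the activity type name, linked service names, dataset names, and script path here are all assumptions:

    {
      "name": "ComputeEventsByRegionPipeline",
      "properties": {
        "description": "Illustrative sketch: runs a U-SQL script on Azure Data Lake Analytics against the logs landed in the previous step",
        "activities": [
          {
            "name": "EventsByRegionUSql",
            "type": "DataLakeAnalyticsU-SQL",
            "inputs": [ { "name": "RawLogsDataLakeDataset" } ],
            "outputs": [ { "name": "EventsByRegionDataset" } ],
            "linkedServiceName": "AzureDataLakeAnalyticsLinkedService",
            "typeProperties": {
              "scriptPath": "scripts/EventsByRegion.usql",
              "scriptLinkedService": "StorageLinkedService",
              "degreeOfParallelism": 3
            },
            "scheduler": { "frequency": "Day", "interval": 1 }
          }
        ],
        "start": "2015-10-01T00:00:00Z",
        "end": "2015-10-02T00:00:00Z"
      }
    }

The scriptPath would point at a U-SQL script like the events-by-region example sketched earlier, and degreeOfParallelism would let the activity scale out the job across Analytics Units.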

To summarize, you will be able to build E2E big data pipelines using Azure Data Factory that will allow you to move data from a number of sources to Azure Data Lake Store and vice versa. In addition, you will be able to run U-SQL scripts on Azure Data Lake Analytics as one of the processing steps and dynamically scale according to your needs. We will continue to invest in solutions that allow you to operationalize big data processing and analytics workflows.

To learn more about Microsoft Azure Data Lake, see the announcement from the Microsoft Cloud Platform team. If you want to try out Azure Data Factory, visit the Azure Data Factory page and get started building pipelines quickly and easily. If you have any feature requests or want to provide feedback for Data Factory, please visit the Azure Data Factory Forum.