
This week, Microsoft announced a new and expanded Azure Data Lake, making big data processing and analytics simpler and more accessible. The expanded Azure Data Lake includes Azure Data Lake Store, Azure Data Lake Analytics, and Azure HDInsight.

The Azure Data Lake Store provides a single repository where you can easily capture data of any size, type, and speed without forcing changes to your application as data scales. Azure Data Lake Analytics is a new service built on Apache YARN that includes U-SQL, a language that unifies the benefits of SQL with the expressive power of user code. The service scales dynamically so you can focus on your business goals, and it lets you run analytics on any kind of data with enterprise-grade security through Azure Active Directory.
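To give a flavor of the language, here is a minimal U-SQL sketch; the file paths, column names, and schema are illustrative assumptions, not part of the announcement. It reads tab-separated log rows from the store, aggregates them with familiar SQL expressions, and writes the result back out:

    // Minimal U-SQL sketch; paths and schema are illustrative assumptions.
    // Extract raw log rows from a TSV file in Azure Data Lake Store.
    @log =
        EXTRACT eventDate string,
                region    string,
                eventName string
        FROM "/input/weblog.tsv"
        USING Extractors.Tsv();

    // Aggregate with familiar SQL semantics; custom C# extractors,
    // processors, and functions can be plugged in where needed.
    @result =
        SELECT region,
               COUNT(*) AS EventCount
        FROM @log
        GROUP BY region;

    // Write the aggregated result back to the store as CSV.
    OUTPUT @result
    TO "/output/eventsbyregion.csv"
    USING Outputters.Csv();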

You will be able to create big data pipelines in Azure Data Factory using Azure Data Lake Store and Azure Data Lake Analytics, in addition to using our existing support for Azure HDInsight. This capability will be added to ADF when Azure Data Lake Store and Azure Data Lake Analytics are available in preview later this year. It will allow you to do the following:

Move data from (source) and to (sink) Azure Data Lake Store

You will be able to move data from the following sources to Azure Data Lake Store:

  • Azure Blob
  • Azure SQL Database
  • Azure Table
  • On-premises SQL Server Database
  • Azure DocumentDB
  • Azure SQL DW
  • On-premises File System
  • On-premises Oracle Database
  • On-premises MySQL Database
  • On-premises DB2 Database
  • On-premises Teradata Database
  • On-premises Sybase Database
  • On-premises PostgreSQL Database
  • On-premises HDFS
  • Generic OData
  • Generic ODBC

You will also be able to move data from Azure Data Lake Store to a number of sinks, such as Azure Blob, Azure SQL Database, and on-premises file systems.

We are continuously working to add more data sources and expand the support matrix each month.

For example, the pipeline below showcases data movement from Azure Blob Storage to Azure Data Lake Store using the Copy Activity in Azure Data Factory.

Data Movement Pipeline
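As a rough sketch of what such a pipeline definition could look like (the dataset and pipeline names here are hypothetical, and the JSON follows the Data Factory authoring format in use at the time of writing), a copy activity from Blob Storage to Data Lake Store might be declared like this:

    {
      "name": "CopyBlobToDataLakePipeline",
      "properties": {
        "description": "Copy log data from Azure Blob Storage to Azure Data Lake Store",
        "activities": [
          {
            "name": "BlobToDataLakeCopy",
            "type": "Copy",
            "inputs": [ { "name": "RawLogsBlobDataset" } ],
            "outputs": [ { "name": "RawLogsDataLakeDataset" } ],
            "typeProperties": {
              "source": { "type": "BlobSource" },
              "sink": { "type": "AzureDataLakeStoreSink" }
            },
            "scheduler": { "frequency": "Hour", "interval": 1 }
          }
        ],
        "start": "2015-10-01T00:00:00Z",
        "end": "2015-10-02T00:00:00Z"
      }
    }

The input and output datasets (not shown) would point at the Blob container and the Data Lake Store folder through their respective linked services.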

Create E2E big data ADF pipelines that run U-SQL scripts as a processing step on the Azure Data Lake Analytics service

For example, log processing is a very common use case across multiple industry verticals (retail, finance, gaming). The E2E big data workflow below showcases the following:

  • An ADF pipeline that moves log data from Azure Blob Storage to Azure Data Lake Store.
  • An ADF pipeline that consumes the logs copied to the Azure Data Lake Store account in the previous step and processes them by running a U-SQL script on Azure Data Lake Analytics as one of its processing steps. The U-SQL script computes events by region, which downstream processes can consume; a sketch of such a pipeline follows the figure below.

ADF Pipeline
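Continuing the sketch (again, the pipeline and dataset names, script path, and linked services are assumptions for illustration), the processing step could run the events-by-region U-SQL script shown earlier through a Data Lake Analytics activity:

    {
      "name": "ComputeEventsByRegionPipeline",
      "properties": {
        "description": "Run a U-SQL script on Azure Data Lake Analytics over the copied logs",
        "activities": [
          {
            "name": "EventsByRegionUSql",
            "type": "DataLakeAnalyticsU-SQL",
            "linkedServiceName": "AzureDataLakeAnalyticsLinkedService",
            "typeProperties": {
              "scriptPath": "scripts/eventsbyregion.usql",
              "scriptLinkedService": "DataLakeStoreLinkedService",
              "degreeOfParallelism": 3,
              "priority": 100
            },
            "inputs": [ { "name": "RawLogsDataLakeDataset" } ],
            "outputs": [ { "name": "EventsByRegionDataset" } ],
            "scheduler": { "frequency": "Day", "interval": 1 }
          }
        ],
        "start": "2015-10-01T00:00:00Z",
        "end": "2015-10-02T00:00:00Z"
      }
    }

The degreeOfParallelism and priority settings illustrate how each run of the analytics job can be scaled independently.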

To summarize, you will be able to build E2E big data pipelines using Azure Data Factory that allow you to move data from a number of sources to Azure Data Lake Store and vice versa. In addition, you will be able to run U-SQL scripts on Azure Data Lake Analytics as one of the processing steps and dynamically scale according to your needs. We will continue to invest in solutions that allow you to operationalize big data processing and analytics workflows.

Click here to learn more about the Microsoft Azure Data Lake from the Microsoft Cloud Platform team. If you want to try out Azure Data Factory, visit us here and get started building pipelines easily and quickly using Data Factory. If you have any feature requests or want to provide feedback for Data Factory, please visit the Azure Data Factory Forum.
