Data Movement in Big Data space through Azure Data Factory

By Harish Kumar Agarwal, Senior Program Manager

Posted on September 1, 2015
2 min read

Azure Data Factory provides a globally deployed service that supports data movement across a variety of data stores. It also has built-in support for securely moving data between on-premises locations and the cloud. The intent is to meet the data ingestion, movement, and publishing needs of your big data and advanced analytics scenarios.

Azure Data Factory connects to the following data stores:

Azure Blob
Azure Table
Azure SQL Database
Azure DocumentDB
SQL Server on IaaS
On-premises SQL Server
On-premises File System
On-premises Oracle Database
On-premises MySQL Database
On-premises DB2 Database
On-premises Teradata Database
On-premises Sybase Database
On-premises PostgreSQL Database
On-premises HDFS
Generic OData data store
Generic ODBC data store
Generic Web data store

We plan to keep adding connectivity to more data stores at a rapid pace. In the interim, you can use the .NET activity to run your own code in Azure Data Factory and connect to a data store of your choice.

Data movement in Azure Data Factory is surfaced through the Copy activity, which copies data from one data store to another. Copying runs in batches, according to the frequency and schedule you define. The Copy activity runs on a globally deployed footprint in order to copy data efficiently. As a managed service, it safeguards against transient issues across a variety of data sources and ensures that data is moved securely.
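To make the idea of scheduled batch copying concrete, here is a minimal Python sketch of how a schedule might be split into batch windows. It is only an illustration of the concept, not how the service is implemented; the function name `copy_windows` and the hourly interval are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def copy_windows(start, end, interval):
    """Yield (window_start, window_end) pairs that cover [start, end).

    Hypothetical helper for illustration: each pair stands for one
    batch that a scheduled copy would process.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    window_start = start
    while window_start < end:
        # The last window is clipped so it never runs past the end time.
        window_end = min(window_start + interval, end)
        yield window_start, window_end
        window_start = window_end


# Assumed schedule: hourly batches over a three-hour period.
windows = list(
    copy_windows(
        start=datetime(2015, 9, 1, 0),
        end=datetime(2015, 9, 1, 3),
        interval=timedelta(hours=1),
    )
)

for window_start, window_end in windows:
    print(f"copy batch for {window_start:%H:%M} to {window_end:%H:%M}")
```

Each window is independent of the others, which is what makes it possible to retry or re-run a single batch without touching the rest of the schedule.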

When moving data to or from an on-premises data store, a data management gateway is used. The data management gateway is an agent you install on-premises (behind a firewall) to enable hybrid data pipelines. It manages access to on-premises data securely and enables seamless data movement between on-premises and cloud data stores.

A few of the other interesting functionalities include:

Data can be structured, semi-structured, or unstructured.
For file-based data stores:
- A variety of file formats are supported, such as binary, text (CSV/TSV), and Avro
- For the text format, an encoding such as UTF-8, UTF-16, or gb2312 can be specified
- Three compression codecs (GZip, Deflate, and BZip2) can be used to compress data if needed; the source and sink can use different codecs
Columns from the source can be skipped or mapped to specific columns in the sink during data movement.
Type conversions: Different data stores have different native type systems. The Copy activity performs automatic type conversions from source types to sink types. First, it converts the native source type to the corresponding .NET type. Then, it converts the .NET type to the corresponding native sink type. The mapping from each native type system to .NET is listed in the respective data store connector article. You can use these mappings to choose appropriate types when creating your tables, so that the right conversions are performed during data movement.
When populating select relational stores, stored procedures can be invoked to run custom logic during data movement, for example to insert data into multiple tables at once or to overwrite or upsert rows.
Repeatability mechanisms ensure that re-running a copy activity does not produce redundant or incorrect data.
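
The column mapping and two-step type conversion described in the list above can be sketched in a few lines of Python. This is a simplified illustration, not the Copy activity's implementation: the function `plan_sink_schema`, the column names, and the two lookup tables are assumptions made for the example, and the tables hold only a small illustrative subset of types. The authoritative mappings are in the data store connector articles.

```python
# Step 1 lookup: native source type -> interim .NET type.
# Illustrative subset for a SQL-style source; not the full documented mapping.
SOURCE_TO_DOTNET = {
    "int": "Int32",
    "bigint": "Int64",
    "nvarchar": "String",
    "datetime": "DateTime",
    "bit": "Boolean",
}

# Step 2 lookup: interim .NET type -> native sink type.
# Illustrative subset for a sink that uses Edm-style type names.
DOTNET_TO_SINK = {
    "Int32": "Edm.Int32",
    "Int64": "Edm.Int64",
    "String": "Edm.String",
    "DateTime": "Edm.DateTime",
    "Boolean": "Edm.Boolean",
}


def plan_sink_schema(source_schema, column_mapping):
    """Return {sink_column: sink_type} for the mapped columns.

    Hypothetical helper: source columns that do not appear in
    column_mapping are skipped, mirroring the column skipping above.
    """
    sink_schema = {}
    for source_column, sink_column in column_mapping.items():
        native_source_type = source_schema[source_column]
        dotnet_type = SOURCE_TO_DOTNET[native_source_type]   # first conversion
        sink_schema[sink_column] = DOTNET_TO_SINK[dotnet_type]  # second conversion
    return sink_schema


# Assumed example schema; "internal_flag" is deliberately left unmapped.
source_schema = {
    "id": "int",
    "name": "nvarchar",
    "created": "datetime",
    "internal_flag": "bit",
}
column_mapping = {"id": "CustomerId", "name": "CustomerName", "created": "CreatedOn"}

sink_schema = plan_sink_schema(source_schema, column_mapping)
print(sink_schema)
```

Routing every conversion through one interim type system means each data store needs only a mapping to and from .NET types, rather than a separate mapping for every source and sink pair.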

The net result is reliable, efficient, capability-rich, and cost-effective data movement via Azure Data Factory.

You can use this to:

Enable hybrid data movement between on-premises and the cloud, in either direction
Load a data lake
Load a data warehouse
Lift and shift an on-premises data analysis solution to the cloud

Learn more about data movement here. Give data movement in Azure Data Factory a try today, and let us know what you think on the feedback forum.
