Beyond adding connectivity to more data stores and endpoints (current list available here) and supporting more data formats (currently Text, Avro and JSON), Azure Data Factory now also moves data faster. We have done this by increasing the throughput of the data movement performed through Azure Data Factory. The following is possible today using the copy capability in Azure Data Factory:
- Ingest 1 TB of data into Azure Blob Storage from on-premises File System and Azure Blob Storage in about three hours (i.e., at roughly 100 MBps)
- Ingest 1 TB of data into Azure Data Lake Store from on-premises File System and Azure Blob Storage in about three hours (i.e., at roughly 100 MBps)
- Ingest 1 TB of data into Azure SQL Data Warehouse from Azure Blob Storage in about three hours (i.e., at roughly 100 MBps)
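To see where "about three hours" comes from, here is a quick back-of-the-envelope check. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a sustained rate for the whole transfer; the helper name `transfer_hours` is ours, purely for illustration.

```python
def transfer_hours(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Return the hours needed to move size_bytes at a sustained throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_bytes / throughput_bytes_per_s / 3600


TB = 10**12  # decimal terabyte (assumption; the post does not say which unit it uses)
MB = 10**6   # decimal megabyte (same assumption)

hours = transfer_hours(1 * TB, 100 * MB)
print(f"{hours:.2f} hours")  # 2.78 hours
```

Real transfers rarely hold a perfectly steady rate, so treat this as a rough estimate rather than a guarantee.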
These throughput levels have been enabled in the following ways.
Enabling parallelism during movement
One of the ways to enhance the throughput of a copy operation and reduce the overall time for data movement is to read data from source and/or write data to destination in parallel. We have enabled exactly that.
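To make the idea concrete, here is a minimal, generic sketch of a parallel copy using local files and a thread pool. It is not how Azure Data Factory implements this internally; the function name `copy_in_parallel` and its `parallelism` parameter are hypothetical and only illustrate why running several reads and writes at once can shorten the overall copy time.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_in_parallel(sources, dest_dir, parallelism=4):
    """Copy each file in sources into dest_dir using up to `parallelism` threads."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    def copy_one(src):
        # Each worker reads one source file and writes one destination file.
        target = dest / Path(src).name
        shutil.copyfile(src, target)
        return target

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() keeps the input order; a worker's exception is re-raised
        # when its result is read while building the list.
        return list(pool.map(copy_one, sources))
```

Raising `parallelism` helps only until the source, the destination or the network becomes the bottleneck, which is why the right value depends on the stores involved.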
You can now specify the parallelism factor used when reading data from the source store and when writing data to the destination store. You can also leave it unspecified, in which case the service automatically picks a suitable value for you.
When data is being copied in parallel, the copy operation requires more resources in terms of processing power, memory and network allocation. These resources are collectively referred to as a Cloud Data Movement Unit in Azure Data Factory, which applies when performing a cloud-to-cloud copy. You can now also tune the number of cloud data movement units associated with a copy activity.
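As a rough illustration, the sketch below builds a copy activity definition as a Python dictionary and prints it as JSON. It assumes the two settings are exposed as `parallelCopies` and `cloudDataMovementUnits` under `typeProperties`, and that `BlobSource` and `AzureDataLakeStoreSink` are the source and sink type names; please confirm these against the current copy activity documentation. The activity and dataset names, and the helper `build_copy_activity`, are hypothetical.

```python
import json


def build_copy_activity(parallel_copies=None, cloud_units=None):
    """Build a copy activity definition; None leaves a setting for the service to choose."""
    type_properties = {
        "source": {"type": "BlobSource"},            # assumed source type name
        "sink": {"type": "AzureDataLakeStoreSink"},  # assumed sink type name
    }
    if parallel_copies is not None:
        type_properties["parallelCopies"] = parallel_copies      # assumed property name
    if cloud_units is not None:
        type_properties["cloudDataMovementUnits"] = cloud_units  # assumed property name
    return {
        "name": "CopyBlobToDataLake",                   # hypothetical activity name
        "type": "Copy",
        "inputs": [{"name": "InputBlobDataset"}],       # hypothetical dataset name
        "outputs": [{"name": "OutputDataLakeDataset"}], # hypothetical dataset name
        "typeProperties": type_properties,
    }


print(json.dumps(build_copy_activity(parallel_copies=8, cloud_units=4), indent=2))
```

Calling `build_copy_activity()` with no arguments omits both settings, which corresponds to letting the service decide.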
Learn more about how to make the best use of this capability and related guidance. Using this capability should have little to no impact on your ADF billing.
Leveraging PolyBase to load data into Azure SQL Data Warehouse
PolyBase is an efficient way of loading large amounts of data into Azure SQL Data Warehouse. We have observed up to about a 300x performance improvement (from 0.3 MBps to 100 MBps) using PolyBase to perform data movement from Azure Blobs to Azure SQL Data Warehouse when compared to the default BULKINSERT mechanism.
However, PolyBase does not operate with all data stores, formats and types. If your source data store, format or type is not compatible with what PolyBase requires, you could consider copying the data from the source data store to Azure Blob Storage as a staging store first and then use PolyBase to load that data from the staging store into Azure SQL Data Warehouse.
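The choice between a direct PolyBase load and a staged one can be sketched as follows. This assumes the sink exposes an `allowPolyBase` flag on `SqlDWSink` and that staging is configured through `enableStaging` and `stagingSettings` (with `linkedServiceName` and `path`); treat these property names as assumptions to verify against the documentation. The helper `build_sqldw_type_properties`, the linked service name and the staging path are hypothetical.

```python
def build_sqldw_type_properties(source_type, polybase_compatible,
                                staging_linked_service=None, staging_path=None):
    """Return copy activity typeProperties for loading Azure SQL Data Warehouse.

    If the source is not PolyBase-compatible, route the data through a
    staging Blob store first and let PolyBase load from there.
    """
    props = {
        "source": {"type": source_type},
        "sink": {"type": "SqlDWSink", "allowPolyBase": True},  # assumed property names
    }
    if not polybase_compatible:
        if not staging_linked_service or not staging_path:
            raise ValueError("a staging linked service and path are required")
        props["enableStaging"] = True  # assumed property name
        props["stagingSettings"] = {   # assumed property names
            "linkedServiceName": staging_linked_service,
            "path": staging_path,
        }
    return props


# Hypothetical example: an on-premises source that PolyBase cannot read directly.
staged = build_sqldw_type_properties(
    "FileSystemSource", polybase_compatible=False,
    staging_linked_service="StagingBlobStorage",  # hypothetical linked service
    staging_path="stagingcontainer/loads",        # hypothetical path
)
print(staged["enableStaging"])  # True
```

Whether a given source counts as compatible depends on the PolyBase requirements, so the `polybase_compatible` flag here stands in for that check rather than performing it.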
Learn how to leverage PolyBase in Azure Data Factory, along with its requirements and guidelines.
We will continue to surface ways to make data movement to, within and out of Azure faster. If you have any feedback on the above capabilities, please reach out through Azure Data Factory User Voice and/or the MSDN Forums. We are eager to hear from you!