Introducing the #Azure #CosmosDB Bulk Executor library

Published: May 7, 2018

Software Engineer

Azure Cosmos DB is a fast and flexible globally distributed database service that is designed to scale out elastically to support:

  • Large read and write throughputs up to millions of operations per second.
  • Storage of high volumes of transactional and operational data, hundreds of terabytes or more, with predictable millisecond latency.

To help you leverage this massive throughput and storage potential, we are introducing the Bulk Executor library in .NET and Java. The Bulk Executor library allows you to perform bulk operations in Azure Cosmos DB through APIs for bulk import and update. To get started, download the Bulk Executor library for .NET or Java. You can read more about the functionality of the Bulk Executor library in the following sections.

What does the Bulk Executor Library provide?

  • Significantly reduces the client-side compute resources needed to saturate the throughput allocated to a container. Compared to a multi-threaded application that writes data in parallel while saturating the client machine's CPU, a single thread writing data using the bulk import API achieves 10x greater write throughput on the same machine.
  • Abstracts away the tedious tasks of writing application logic to handle throttling, request timeouts, and other transient exceptions by efficiently handling them within the library.
  • Provides a simplified mechanism for applications performing bulk operations to scale out. A single Bulk Executor instance running on an Azure Virtual Machine (VM) can consume more than 500K RU/s, and you can achieve a higher throughput rate by adding Bulk Executor instances on individual client VMs.
  • Imports over a terabyte of data in under an hour using the bulk import API in a scaled-out architecture.
  • Bulk updates existing data in Azure Cosmos DB containers as patches.
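
As a rough planning aid, the figures above can be turned into back-of-the-envelope arithmetic. The sketch below is not part of the library. The function names are hypothetical, and it assumes a decimal terabyte and treats 500K RU/s as a conservative per-instance figure, since the post states that one instance can consume more than that.

```python
import math

# Figures quoted in the post, used here only as rough planning inputs.
RU_PER_INSTANCE = 500_000   # one instance can consume more than this (conservative)
BYTES_PER_TB = 10**12       # decimal terabyte (assumption)


def instances_needed(provisioned_ru_per_sec: int,
                     ru_per_instance: int = RU_PER_INSTANCE) -> int:
    """Conservative count of client instances needed to saturate the throughput."""
    return max(1, math.ceil(provisioned_ru_per_sec / ru_per_instance))


def required_ingest_mb_per_sec(total_bytes: int, seconds: int) -> float:
    """Sustained ingest rate (MB/s) needed to load total_bytes within seconds."""
    return total_bytes / seconds / 10**6


print(instances_needed(2_000_000))                                # 4
print(round(required_ingest_mb_per_sec(BYTES_PER_TB, 3600), 1))   # 277.8
```

In other words, importing one terabyte in an hour means sustaining roughly 278 MB/s across all client VMs, which is why the scaled-out architecture matters.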

How does the Bulk Executor Library maximally consume throughput allocated to a collection?

When a bulk operation to import or update documents is triggered with a batch of entities on the client side, the entities are first shuffled into buckets, one per target Azure Cosmos DB partition key range. Within each bucket, the entities are broken down into mini-batches, and each mini-batch acts as a payload that is committed transactionally on the server side.
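
The bucketing and mini-batching steps can be illustrated with a minimal sketch. This is not the library's implementation: the SHA-256 based `range_index` is a stand-in for Cosmos DB's actual partition key hashing, the fixed number of ranges is an assumption, and mini-batches are capped by a hypothetical document count because the post does not say how they are sized.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def range_index(partition_key: str, range_count: int) -> int:
    """Map a partition key to a bucket (stand-in hash, not Cosmos DB's real scheme)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % range_count


def bucket_by_range(docs: Iterable[dict], pk_field: str,
                    range_count: int) -> Dict[int, List[dict]]:
    """Shuffle documents into one bucket per (simulated) partition key range."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[range_index(str(doc[pk_field]), range_count)].append(doc)
    return dict(buckets)


def mini_batches(bucket: List[dict], max_docs: int) -> List[List[dict]]:
    """Split one bucket into mini-batches of at most max_docs documents."""
    return [bucket[i:i + max_docs] for i in range(0, len(bucket), max_docs)]


docs = [{"id": str(i), "city": f"city-{i % 7}"} for i in range(50)]
plan = {r: mini_batches(b, max_docs=10)
        for r, b in bucket_by_range(docs, "city", range_count=4).items()}
for r, batches in sorted(plan.items()):
    print(r, [len(b) for b in batches])
```

Because every document in a mini-batch targets the same partition key range, each mini-batch can be sent as a single payload to that range.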

We have built in optimizations for the concurrent execution of these mini-batches, both within and across partition key ranges, to maximally utilize the throughput allocated to a collection. We have designed an AIMD-style (additive increase, multiplicative decrease) congestion control mechanism for each Azure Cosmos DB partition key range to efficiently handle throttling and timeouts.
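
To make the AIMD idea concrete, here is a toy sketch of one controller per partition key range. The post does not describe the library's actual parameters, so the class name, the increment of 1, the halving factor, and the throttling rule in `simulate` are all assumptions chosen for illustration.

```python
class AimdController:
    """Additive-increase / multiplicative-decrease concurrency for one range."""

    def __init__(self, initial: int = 1, increase: int = 1,
                 decrease_factor: float = 0.5, maximum: int = 64):
        self.concurrency = initial
        self.increase = increase
        self.decrease_factor = decrease_factor
        self.maximum = maximum

    def on_success(self) -> None:
        # Additive increase: probe for spare throughput one step at a time.
        self.concurrency = min(self.maximum, self.concurrency + self.increase)

    def on_throttle(self) -> None:
        # Multiplicative decrease: back off sharply on throttling or timeout.
        self.concurrency = max(1, int(self.concurrency * self.decrease_factor))


def simulate(capacity: int, rounds: int) -> list:
    """Toy model: a round is throttled when concurrency exceeds capacity."""
    ctl = AimdController()
    history = []
    for _ in range(rounds):
        history.append(ctl.concurrency)
        if ctl.concurrency > capacity:
            ctl.on_throttle()
        else:
            ctl.on_success()
    return history


print(simulate(capacity=8, rounds=16))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4]
```

The sawtooth pattern in the output is characteristic of AIMD: concurrency climbs until the range pushes back, then drops and climbs again, hovering around the available capacity.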

[Figure: Bulk API technical diagram]
These client-side optimizations augment server-side features specific to the Bulk Executor library; together, they maximize consumption of the available throughput.

Next steps

  • Review the documentation for working with the Bulk Executor library in .NET and Java.
  • Try out sample applications consuming the Bulk Executor library in .NET and Java.
  • The library is available for download for .NET via NuGet and for Java via Maven.
  • We have integrated the Bulk Executor library into the Azure Cosmos DB Spark connector as documented on NuGet.
  • We have also integrated the Bulk Executor library into a new version of Azure Cosmos DB connector for Azure Data Factory to copy data.

Stay up-to-date on the latest Azure Cosmos DB news and features by following us on Twitter @AzureCosmosDB and #CosmosDB, and reach out to us on the developer forums on Stack Overflow.