Stream Processing Changes: #Azure #CosmosDB change feed + Apache Spark

Posted on August 29, 2017

Program Manager, CosmosDB

Azure Cosmos DB: Ingestion and storage all-in-one

Azure Cosmos DB is a blazing fast, globally distributed, multi-model database service. Regardless of where your customers are, they can access data stored in Azure Cosmos DB with single-digit latencies at the 99th percentile at a sustained high rate of ingestion. This speed supports using Azure Cosmos DB, not only as a sink for stream processing, but also as a source. In a previous blog, we explored the potential of performing real-time machine learning with Apache Spark and Azure Cosmos DB. In this article, we will further explore stream processing of updates to data with Azure Cosmos DB change feed and Apache Spark.

What is Azure Cosmos DB change feed?

Azure Cosmos DB change feed provides a sorted list of documents within an Azure Cosmos DB collection in the order in which they were modified. This feed can be used to listen for modifications to data within the collection to perform real-time (stream) processing on updates. Changes in Azure Cosmos DB are persisted and can be processed asynchronously, and distributed across one or more consumers for parallel processing. Change feed is enabled at collection creation and is simple to use with the change feed processor library.

Designing your system with Azure Cosmos DB

Traditionally, stream processing implementations first receive a high volume of incoming data into a temporary message queue such as Azure Event Hub or Apache Kafka. After stream processing the data, a materialized view or aggregate is stored into a persistent, query-able database. In this implementation, we can use the Azure Cosmos DB Spark connector to store Spark output into Azure Cosmos DB for document, graph, or table schemas. This design is great for scenarios where only a portion of the incoming data or only an aggregate of incoming data is useful.

traditional-stream-processing-modelFigure 1: Traditional stream processing model

Let’s consider the scenario of credit card fraud detection. All incoming data (new transactions) need to be persisted as soon as they are received. As new data comes in, we want to incrementally apply a machine learning classifier to detect fraudulent behavior.

detecting-credit-card-fraud

Figure 2: Detecting credit card fraud

In this scenario, Azure Cosmos DB is a great choice for directly ingesting all the data from new transactions because of its unique ability to support a sustained high rate of ingestion while durably persisting and synchronously indexing the raw records, enabling these records to be served back out with low latency rich queries. From the Azure Cosmos DB change feed, you can connect compute engines such as Apache Storm, Apache Spark or Apache Hadoop to perform stream or batch processing. Post processing, the materialized aggregates or processed data can be stored back into Azure Cosmos DB permanently for future querying.

azure-cosmosdb-sink-and-source

Figure 3: Azure Cosmos DB sink and source

You can learn more about change feed in the Working with the change feed support in Azure Cosmos DB article, and by trying the change feed + Spark example on GitHub. If you need any help or have questions or feedback, please reach out to us on the developer forums on Stack Overflow. Stay up-to-date on the latest Azure Cosmos DB news and features by following us on Twitter @AzureCosmosDB.