Real-time machine learning on globally-distributed data with Apache Spark and DocumentDB

Published: April 5, 2017

Principal Program Manager, Azure DocumentDB

Note: As of May 10th, 2017, Azure DocumentDB is now Azure Cosmos DB.

Azure Cosmos DB is Microsoft’s globally distributed multi-model database. Azure Cosmos DB was built from the ground up with global distribution and horizontal scale at its core. It offers turnkey global distribution across any number of Azure regions by transparently scaling and replicating your data wherever your users are. Elastically scale throughput and storage worldwide, and pay only for the throughput and storage you need. Azure Cosmos DB guarantees single-digit-millisecond latencies at the 99th percentile anywhere in the world, offers multiple well-defined consistency models to fine-tune performance, and guarantees high availability with multi-homing capabilities—all backed by industry leading service level agreements (SLAs). 

Azure Cosmos DB is truly schema-agnostic; it automatically indexes all the data without requiring you to deal with schema and index management. It’s also multi-model, natively supporting document, key-value, graph, and column-family data models. With Azure Cosmos DB, you can access your data using the API of your choice: DocumentDB SQL (document), MongoDB (document), Azure Table Storage (key-value), and Gremlin (graph) are all natively supported.


At the Strata + Hadoop World 2017 conference in San Jose, we announced the Spark to DocumentDB Connector. It enables real-time data science, machine learning, and exploration over globally distributed data in Azure DocumentDB. Connecting Apache Spark to Azure DocumentDB accelerates our customers’ ability to solve fast-moving data science problems, where data can be quickly persisted and queried using DocumentDB. The connector efficiently exploits DocumentDB’s native managed indexes, supports updateable columns when performing analytics, and pushes predicate filtering down against fast-changing, globally-distributed data, spanning IoT, data science, and analytics scenarios. The Spark to DocumentDB connector uses the Azure DocumentDB Java SDK. You can get started today and download the Spark connector from GitHub!
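To make the connection concrete, here is a minimal sketch of wiring Spark to a DocumentDB collection. The configuration keys and the connector's data source name below follow the conventions of the azure-documentdb-spark project, but treat them as illustrative; the endpoint, key, and database/collection names are placeholders.

```python
# Sketch of loading a DocumentDB collection into a Spark DataFrame.
# Config keys and the data source format string are illustrative of the
# azure-documentdb-spark connector; all values here are placeholders.

def documentdb_read_config(endpoint, master_key, database, collection):
    """Build the connector's read configuration (illustrative keys)."""
    return {
        "Endpoint": endpoint,
        "Masterkey": master_key,
        "Database": database,
        "Collection": collection,
    }

config = documentdb_read_config(
    "https://myaccount.documents.azure.com:443/",  # placeholder account
    "<master-key>",                                # placeholder key
    "flightsdb",
    "flights")

# On a cluster with the connector JAR on the classpath, you would then do
# something like (not executed here):
# df = (spark.read
#           .format("com.microsoft.azure.documentdb.spark")
#           .options(**config)
#           .load())
```

Once loaded, the DataFrame can be queried with Spark SQL exactly like any other Spark data source, while reads are parallelized across DocumentDB's data partitions.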

What is DocumentDB?

Azure DocumentDB is our globally distributed database service designed to enable developers to build planet-scale applications. DocumentDB allows you to elastically scale both throughput and storage across any number of geographical regions. The service offers guaranteed low latency at P99, 99.99% high availability, predictable throughput, and multiple well-defined consistency models, all backed by comprehensive SLAs. By virtue of its schema-agnostic and write-optimized database engine, DocumentDB by default automatically indexes all the data it ingests and serves SQL, MongoDB, and JavaScript language-integrated queries in a scale-independent manner. As a cloud service, DocumentDB is carefully engineered with multi-tenancy and global distribution from the ground up.
These unique benefits make DocumentDB a great fit for both operational and analytical workloads for applications including web, mobile, personalization, gaming, IoT, and many others that need seamless scale and global replication.

What are the benefits of using DocumentDB for machine learning and data science?

DocumentDB is truly schema-free. By virtue of its commitment to the JSON data model directly within the database engine, it provides automatic indexing of JSON documents without requiring explicit schema or creation of secondary indexes. DocumentDB supports querying JSON documents using the familiar SQL language. DocumentDB query is rooted in JavaScript's type system, expression evaluation, and function invocation. This, in turn, provides a natural programming model for relational projections, hierarchical navigation across JSON documents, self joins, spatial queries, and invocation of user-defined functions (UDFs) written entirely in JavaScript, among other features. We have now expanded the SQL grammar to include aggregations, thus enabling globally-distributed aggregations in addition to these capabilities.


Figure 1: With Spark Connector for DocumentDB, data is parallelized between the Spark worker nodes and DocumentDB data partitions

Distributed aggregations and advanced analytics

While Azure DocumentDB has aggregations (SUM, MIN, MAX, COUNT, and AVG, with GROUP BY, DISTINCT, etc. in the works) as noted in Planet scale aggregates with Azure DocumentDB, connecting Apache Spark to DocumentDB allows you to easily and quickly perform an even larger variety of distributed aggregations by leveraging Apache Spark. For example, below is a distributed median calculation using Apache Spark's PERCENTILE_APPROX function via Spark SQL.

select destination, percentile_approx(delay, 0.5) as median_delay
from df
where delay < 0
group by destination
order by percentile_approx(delay, 0.5)


Figure 2: Area visualization for the above distributed median calculation via Jupyter notebook service on Spark on Azure HDInsight.
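To make the query above concrete, here is a pure-Python mirror of its logic over a tiny mock dataset. Spark's percentile_approx returns an approximate quantile over distributed data; statistics.median is exact, which is equivalent at this toy scale. The destinations and delay values are invented for illustration.

```python
# Pure-Python mirror of:
#   select destination, percentile_approx(delay, 0.5) as median_delay
#   from df where delay < 0 group by destination
#   order by percentile_approx(delay, 0.5)
from collections import defaultdict
from statistics import median

rows = [  # (destination, delay in minutes) -- illustrative values only
    ("SEA", -5), ("SEA", -12), ("SEA", 30),
    ("SFO", -3), ("SFO", -9),
]

groups = defaultdict(list)
for destination, delay in rows:
    if delay < 0:                       # WHERE delay < 0
        groups[destination].append(delay)

# GROUP BY destination, then take the median of each group
median_delay = {d: median(v) for d, v in groups.items()}

# ORDER BY the median
for d in sorted(median_delay, key=median_delay.get):
    print(d, median_delay[d])
```

In the real pipeline, Spark performs this grouping and quantile estimation in parallel across its worker nodes, over data read from DocumentDB's partitions.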

Push-down predicate filtering

As shown in the following animation, queries from Apache Spark push predicates down to Azure DocumentDB and take advantage of the fact that DocumentDB indexes every attribute by default. Furthermore, by pushing computation close to where the data lives, we can process it in-situ and reduce the amount of data that needs to be moved. At global scale, this results in tremendous performance speedups for analytical queries.

Figure 3

For example, if you only want to ask for the flights departing from Seattle (SEA), the Spark to DocumentDB connector will:

  • Send the query to Azure DocumentDB.
  • As all attributes within Azure DocumentDB are automatically indexed, only the flights pertaining to Seattle will be returned to the Spark worker nodes quickly.

This way, as you perform your analytics, data science, or ML work, you transfer only the data you need.

Blazing fast IoT scenarios

Azure DocumentDB is designed for high-throughput, low-latency IoT environments. The animated GIF below refers to a flights scenario.

Figure 4

With Spark and DocumentDB together, you can:

  • Handle high throughput of concurrent alerts (e.g., weather, flight information, global safety alerts, etc.)
  • Send this information downstream for device notifications, RESTful services, etc. (e.g., alert on your phone of an impending flight delay) including the use of change feed
  • At the same time, as you are building up ML models against your data, you can also make sense of the latest information
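The alerting flow in the bullets above can be sketched as follows. Here a mock batch of documents stands in for alerts arriving via the change feed, and a callback stands in for the downstream notification service; the field names are illustrative.

```python
# Minimal sketch of the alert-routing flow: scan incoming documents
# (a mock batch standing in for the change feed) and push delay alerts
# to a downstream notifier. Field names are illustrative.

def route_alerts(docs, notify):
    """Invoke notify() for each document reporting a positive delay."""
    sent = 0
    for doc in docs:
        if doc.get("delay_min", 0) > 0:
            notify("Flight {} delayed {} min".format(
                doc["flight"], doc["delay_min"]))
            sent += 1
    return sent

batch = [
    {"flight": "DL2440", "delay_min": 30},  # delayed -> alert
    {"flight": "AS101", "delay_min": 0},    # on time -> no alert
]
messages = []
route_alerts(batch, messages.append)
```

In production, the notifier would be an HTTP call, push notification, or queue write rather than a list append, and the batch would come from the change feed reader.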

Updateable columns

Related to the previously noted blazing fast IoT scenarios, let's dive into updateable columns:

Figure 5

As a new piece of information comes in (e.g., a flight delay changes from 5 minutes to 30 minutes), you want to be able to quickly re-run your machine learning (ML) models to reflect the newest information. For example, you can predict the impact of the 30-minute delay on all the downstream flights. This event can be quickly initiated via the Azure DocumentDB Change Feed to refresh your ML models.
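A minimal sketch of that refresh, assuming a toy stand-in model: when the delay column is updated, the downstream-impact prediction is recomputed. The linear buffer model here is invented purely to show the shape of the flow, not a real delay-propagation model.

```python
# Sketch of re-running a prediction when an updateable column changes
# (the kind of event the Change Feed would surface). The model is a toy:
# a connecting flight absorbs up to a fixed buffer of inbound delay.

def predict_downstream_delay(inbound_delay_min, connection_buffer_min=20):
    """Estimate delay (minutes) propagated to a connecting flight."""
    return max(0, inbound_delay_min - connection_buffer_min)

# The update from a 5-minute to a 30-minute delay changes the prediction:
before = predict_downstream_delay(5)    # absorbed by the 20-min buffer
after = predict_downstream_delay(30)    # 10 minutes now propagate
```

In a real pipeline, the change feed event would trigger scoring against your trained model for every affected downstream flight, rather than this one-line formula.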

Next steps

In this blog post, we’ve looked at the new Spark to DocumentDB Connector. Spark with DocumentDB enables both ad-hoc, interactive queries on big data, as well as advanced analytics, data science, machine learning, and artificial intelligence. DocumentDB can be used for capturing data that is collected incrementally from various sources across the globe. This includes social analytics, time series, game or application telemetry, retail catalogs, up-to-date trends and counters, and audit log systems. Spark can then be used for running advanced analytics and AI algorithms at scale on top of the data coming from DocumentDB.

Companies and developers can employ this scenario in online shopping recommendations, spam classifiers for real time communication applications, predictive analytics for personalization, and fraud detection models for mobile applications that need to make instant decisions to accept or reject a payment. Finally, internet of things scenarios fit in here as well, with the obvious difference that the data represents the actions of machines instead of people.

To get started running queries, create a new DocumentDB account from the Azure Portal and work with the project in our Azure-DocumentDB-Spark GitHub repo. Complete instructions are available in the Connecting Apache Spark to Azure DocumentDB article.

Stay up-to-date on the latest DocumentDB news and features by following us on Twitter @DocumentDB or reach out to us on the developer forums on Stack Overflow.