At the Strata + Hadoop World 2017 Conference in San Jose, we announced the Spark to DocumentDB Connector. It enables real-time data science, machine learning, and exploration over globally distributed data in Azure DocumentDB. Connecting Apache Spark to Azure DocumentDB accelerates our customers' ability to solve fast-moving data science problems, where data can be quickly persisted and queried using DocumentDB. The connector efficiently exploits DocumentDB's native managed indexes, enabling updateable columns during analytics and push-down predicate filtering against fast-changing, globally distributed data across IoT, data science, and analytics scenarios. The Spark to DocumentDB connector uses the Azure DocumentDB Java SDK. You can get started today and download the Spark connector from GitHub!
What is DocumentDB?
These unique benefits make DocumentDB a great fit for both operational and analytical workloads for applications including web, mobile, personalization, gaming, IoT, and many others that need seamless scale and global replication.
What are the benefits of using DocumentDB for machine learning and data science?
Figure 1: With Spark Connector for DocumentDB, data is parallelized between the Spark worker nodes and DocumentDB data partitions
Distributed aggregations and advanced analytics
While Azure DocumentDB has aggregations (SUM, MIN, MAX, COUNT, AVG, with GROUP BY, DISTINCT, etc. in the works) as noted in Planet scale aggregates with Azure DocumentDB, connecting Apache Spark to DocumentDB allows you to easily and quickly perform an even larger variety of distributed aggregations by leveraging Apache Spark. For example, below is a screenshot of a distributed median calculation using Apache Spark's PERCENTILE_APPROX function via Spark SQL.
SELECT destination, PERCENTILE_APPROX(delay, 0.5) AS median_delay
FROM df
WHERE delay < 0
GROUP BY destination
ORDER BY PERCENTILE_APPROX(delay, 0.5)
Figure 2: Area visualization for the above distributed median calculation via Jupyter notebook service on Spark on Azure HDInsight.
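To make the query's semantics concrete, here is a minimal plain-Python sketch of the same group-by-median logic on hypothetical flight-delay records. This is illustrative only; it is not the connector API, and the sample data is invented.

```python
# Sketch of the distributed-median query above in plain Python,
# using hypothetical (destination, delay-in-minutes) records.
from collections import defaultdict
from statistics import median

# Hypothetical sample data; negative delay = early arrival.
flights = [
    ("SEA", -12), ("SEA", -3), ("SEA", -7),
    ("SFO", -20), ("SFO", -5),
]

# WHERE delay < 0 ... GROUP BY destination
by_dest = defaultdict(list)
for dest, delay in flights:
    if delay < 0:
        by_dest[dest].append(delay)

# PERCENTILE_APPROX(delay, 0.5) approximates the median of each group;
# here we compute it exactly since the data fits in memory.
median_delay = {dest: median(delays) for dest, delays in by_dest.items()}

# ORDER BY the median value
print(sorted(median_delay.items(), key=lambda kv: kv[1]))
```

In Spark, PERCENTILE_APPROX trades a small amount of accuracy for the ability to compute the quantile in one distributed pass, which is why it is preferred over an exact median at scale.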
Push-down predicate filtering
As shown in the following animated GIF, queries from Apache Spark push down predicates to Azure DocumentDB, taking advantage of the fact that DocumentDB indexes every attribute by default. Furthermore, by pushing computation close to where the data lives, we can process data in situ and reduce the amount of data that needs to be moved. At global scale, this results in tremendous performance speedups for analytical queries.
For example, if you only want to ask for the flights departing from Seattle (SEA), the Spark to DocumentDB connector will:
- Send the query to Azure DocumentDB.
- As all attributes within Azure DocumentDB are automatically indexed, only the flights pertaining to Seattle will be returned to the Spark worker nodes quickly.
This way as you perform your analytics, data science, or ML work, you will only transfer the data you need.
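The benefit of push-down filtering can be sketched in plain Python, with a hypothetical in-memory list standing in for a DocumentDB partition (this is a conceptual illustration, not the connector's implementation):

```python
# Why predicate push-down reduces data movement: compare filtering after
# transfer (naive) with filtering at the source (push-down).
flights = [
    {"origin": "SEA", "destination": "SFO", "delay": 10},
    {"origin": "JFK", "destination": "LAX", "delay": 45},
    {"origin": "SEA", "destination": "DEN", "delay": -5},
]

def query_without_pushdown(store, predicate):
    # Naive: transfer every document, then filter on the Spark side.
    transferred = list(store)  # all documents cross the wire
    return [d for d in transferred if predicate(d)], len(transferred)

def query_with_pushdown(store, predicate):
    # Push-down: the filter runs at the data source, so only
    # matching documents are transferred to the Spark workers.
    transferred = [d for d in store if predicate(d)]
    return transferred, len(transferred)

departs_sea = lambda d: d["origin"] == "SEA"
results, moved = query_with_pushdown(flights, departs_sea)
print(moved)  # 2 documents moved instead of 3
```

In the real connector, the predicate is translated into a DocumentDB query that is answered from the automatic indexes, so the savings grow with the size and distribution of the data.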
Blazing fast IoT scenarios
Azure DocumentDB is designed for high-throughput, low-latency IoT environments. The animated GIF below refers to a flights scenario.
Together, you can:
- Handle high throughput of concurrent alerts (e.g., weather, flight information, global safety alerts, etc.)
- Send this information downstream for device notifications, RESTful services, etc. (e.g., an alert on your phone about an impending flight delay), including via the change feed
- At the same time, as you are building up ML models against your data, you can also make sense of the latest information
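The alert flow in the list above can be sketched as follows. The handler names and alert shapes are hypothetical; a real pipeline would ingest alerts from DocumentDB (for example, via the change feed) rather than an in-process queue.

```python
# Hedged sketch of the IoT alert flow: ingest concurrent alerts and
# fan them out to a downstream notifier (hypothetical stand-ins).
from queue import Queue

notifications = []

def notify_phone(alert):
    # Stand-in for a push-notification or RESTful downstream service.
    notifications.append(f"{alert['flight']}: {alert['message']}")

alerts = Queue()
for a in [
    {"flight": "SEA123", "message": "delayed 30 min (weather)"},
    {"flight": "JFK001", "message": "gate changed to B7"},
]:
    alerts.put(a)

# Drain the queue, dispatching each alert downstream.
while not alerts.empty():
    notify_phone(alerts.get())

print(len(notifications))  # 2 alerts dispatched
```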
Related to the blazing fast IoT scenarios noted above, let's dive into updateable columns:
As new information comes in (e.g., a flight delay changes from 5 minutes to 30 minutes), you want to quickly re-run your machine learning (ML) models to reflect it. For example, you can predict the impact of the 30-minute delay on all downstream flights. This refresh of your ML models can be quickly initiated via the Azure DocumentDB Change Feed.
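A hedged sketch of this pattern: when an update arrives, re-score a downstream-impact model. The event shape and the toy model below are hypothetical stand-ins, not the DocumentDB Change Feed API.

```python
# Reacting to an update event by re-running a (toy) prediction,
# illustrating the change-feed-triggered refresh described above.
def predict_downstream_impact(delay_minutes, connecting_flights):
    # Toy model: each connecting flight absorbs the delay beyond a 15-min buffer.
    return max(0, delay_minutes - 15) * connecting_flights

# Current known state of a hypothetical flight document.
state = {"flight": "SEA123", "delay": 5, "connections": 4}

def on_change(event, state):
    # A change-feed handler would receive updated documents like this event.
    state.update(event)
    return predict_downstream_impact(state["delay"], state["connections"])

# Delay updated from 5 to 30 minutes, as in the example above.
impact = on_change({"delay": 30}, state)
print(impact)  # (30 - 15) * 4 = 60 minutes of predicted downstream delay
```

The point is that the model refresh is event-driven: only documents that actually changed trigger recomputation, rather than periodically rescanning the whole collection.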
In this blog post, we’ve looked at the new Spark to DocumentDB Connector. Spark with DocumentDB enables both ad-hoc, interactive queries on big data, as well as advanced analytics, data science, machine learning, and artificial intelligence. DocumentDB can be used for capturing data that is collected incrementally from various sources across the globe. This includes social analytics, time series, game or application telemetry, retail catalogs, up-to-date trends and counters, and audit log systems. Spark can then be used for running advanced analytics and AI algorithms at scale on top of the data coming from DocumentDB.
Companies and developers can employ this scenario in online shopping recommendations, spam classifiers for real time communication applications, predictive analytics for personalization, and fraud detection models for mobile applications that need to make instant decisions to accept or reject a payment. Finally, internet of things scenarios fit in here as well, with the obvious difference that the data represents the actions of machines instead of people.
To get started running queries, create a new DocumentDB account from the Azure Portal and work with the project in our Azure-DocumentDB-Spark GitHub repo. Complete instructions are available in the Connecting Apache Spark to Azure DocumentDB article.