How Microsoft builds massively scalable services using Azure DocumentDB

Publikováno dne 20 dubna, 2017

Program Manager, Azure DocumentDB

This week at Microsoft Data Amp we covered how you can harness the incredible power of data using Microsoft’s latest innovations in its Data Platform. One of the key pieces in the Data Platform is Azure DocumentDB, Microsoft’s globally distributed NoSQL database service. Released in 2015, DocumentDB is being used virtually ubiquitously as a backend for first-party Microsoft services for many years.

DocumentDB is Microsoft's multi-tenant, globally distributed database system designed to enable developers to build planet scale applications. DocumentDB allows you to elastically scale both, throughput and storage across any number of geographical regions. The service offers guaranteed low latency at P99, 99.99% high availability, predictable throughput, and multiple well-defined consistency models, all backed by comprehensive SLAs. By virtue of its schema-agnostic and write optimized database engine, by default DocumentDB is capable of automatically indexing all the data it ingests and serve SQL, MongoDB, and JavaScript language-integrated queries in a scale-independent manner. As a cloud service, DocumentDB is carefully engineered with multi-tenancy and global distribution from the ground up.

In this blog, we cover case studies of first-party applications of DocumentDB by the Windows, Universal Store, and Azure IoT Hub teams, and how these teams could harness the scalability, low latency, and flexibility benefits of DocumentDB to innovate and bring business value to their services.

Microsoft DnA: How Microsoft uses error reporting and diagnostics to improve Windows

The Windows Data and Analytics (DnA) team in Microsoft implements the crash reporting technology for Windows. One of their components runs as a Windows Service in every Windows device. Whenever an application stops responding on a user's desktop, Windows collects post-error debug information and prompts the user to ask if they’re interested in finding a solution to the error. If the user accepts, the dump is sent over the Internet to the DnA service. When a dump reaches the service, it is analyzed and a solution is sent back to the user when one is available.

Windows error reporting diagnostic information



Windows' need for fast key-value lookups

In DnA’s terminology, crash reports are organized into “buckets”. Each bucket is used to classify an issue by key attributes such as Application Name, Application Version, Module Name, Module Version, and OS Exception code. Each bucket contains crash reports that are caused by the same bug. With the large ecosystem of hardware and software vendors, and 15 years of collected data about error reports, the DnA service has over 10 billion unique buckets in its database cluster.

One of the DnA team’s requirements was rather simple at face value. Given the hash of a bucket, return the ID corresponding to its bucket/issue if one was available. However, the scale posed interesting technical challenges. There was a lot of data (10 billion buckets, growing at 6 million a day), high volume of requests and global reach (requests from any device running Windows), and low latency requirements (to ensure a good user experience).

To store “Bucket Dimensions”, the DnA team provisioned a single DocumentDB collection with 400,000 request units per second of provisioned throughput. Since all access was by the primary key, they configured the partition key to be the same as the “id”, with a digest of the various attributes as the value. As DocumentDB provided <10 ms read latency and <15ms write latency at p99, DnA could perform fast lookups against buckets and lookup issues even as their data and request volumes continued to grow over time.

Windows cab catalog metadata and query

Aside from fast real-time lookups, the DnA team also wanted to use the data to drive engineering decisions to help improve Microsoft and other vendors’ products by fixing the most impactful issues. For example, the team has observed that addressing the top 1 percent of reliability issues could address 50 percent of customers’ issues. This analysis required storing the crash dump binary files, “cabs”, extracting useful metadata, then running analysis and reports against this data. This presented a number of interesting challenges on its own.

  • The team deals with approximately 600 different types of reliability-incident data. Managing the schema and indexes required a significant engineering and operational overhead on the team.
  • The cab metadata was also a big volume of data. There were about 5 billion cabs, and 30 million new cabs were added every day.

The DnA team could migrate their Bucket Dimension and Cab Catalog stores to DocumentDB from their earlier solution based on an on-premises cluster of SQL Servers. Since shifting the database’s heavy lifting to DocumentDB, DnA benefited from the speed, scale, and flexibility offered by DocumentDB. More importantly, they could focus less on maintenance of their database and more on improving user experience on Windows.


You can read the case study at Microsoft’s DnA team achieves planet-scale big-data collection with Azure DocumentDB.

Microsoft Global Homing Service: How Xbox Live and Universal Store build highly available location services

Microsoft’s Universal Store team implements the e-commerce platform that is used to power Microsoft’s storefronts across Windows Store, Xbox, and a large set of Microsoft services. One of the key internal components in the Universal Store backend is the Global Homing Service (GHS), a highly reliable service that provides its downstream consumers with the ability to quickly retrieve location metadata associated with one to many, arbitrary large number of, IDs.

Global Homing Service (GHS) using Azure DocumentDB across 4 regions


GHS is on a hot path for the majority of its consumer services and receives hundreds of thousands of requests per second. Therefore, the latency and throughput requirements for the service are strict. The service had to maintain 99.99% availability and predictable latencies under 300ms end-to-end at the 99.9th percentile to satisfy requirements of its partner teams. To reduce latencies, the service is geo-distributed so that it is as close as possible to calling partner services.

The initial design of GHS was implemented using a combination of Azure Table Storage and various levels of caches. This solution worked well for the initial set of loads, but given the critical nature of GHS and increased adoption of the service from key partners, it became apparent that the existing SLA was not going to meet their partners’ P99.9 requirements of <300ms with a 99.99% reliability over 1 minute. Partners with a critical dependency on the GHS call path found that even if the overall reliability was high, there were periods of time where the number of timeouts would exceed their tolerances and result in a noticeable degradation of the partner’s own SLA. These periods of increased timeouts were given the name “micro-outages” and key partners started tracking these daily.

After investigating many possible solutions, such as LevelDB, Kafka, MongoDB, and Cassandra, the Universal Store team chose to replace GHS’s Azure Table backend and the original cache in front of it with an Azure DocumentDB backend. GHS deployed a single DocumentDB collection with 600,000 request units per second deployed across four geographic regions where their partner teams had the biggest footprint. As a result of the switch of DocumentDB, GHS customers have seen p50 latencies under 30ms and a huge reduction in the number and scale of micro-outages. GHS’s availability has remained at or above 99.99% since the migration. In addition to the increase in service availability, overall latencies significantly improved as well for most of GHS call patterns.

Number of GHS micro-outages before and after DocumentDB migration


Microsoft Azure IoT Hub: How to handle the firehose from billions of IoT devices

Azure IoT Hub is a fully managed service that allows organizations to connect, monitor, and manage up to billions of IoT devices. IoT Hub provides reliable communication between devices, the a queryable store for device metadata and synchronized state information, and provides extensive monitoring for device connectivity and device identity management events. Since IoT Hub is at the ingestion point for the massive volume of writes coming from IoT devices across all of Azure, they needed a robust and scalable database in their backend.

IoT Hub provides device-related information, “device twins”, as part of its APIs that device and back ends can use to synchronize device conditions and configuration. A device twin is a JSON document that includes tags assigned to the device in the backend, a property bag of “reported properties” which include device configuration or conditions, and a property bag of “desired properties” that can be used to notify the device to perform a configuration change. The IoT Hub team choose Azure DocumentDB over Hbase, Cassandra, and MongoDB because DocumentDB provided functionality that the team needed like guaranteed low latency, elastic scaling of storage and throughput, provide high availability via global distribution, and rich query capabilities via automatic indexing.

IoT Hub stores the device twin data as JSON documents and performs updates based on the latest state reported by devices in near real-time. The architecture uses a partitioned collection that uses a partition key based on the device ID to elastically scale to handle massive volumes of writes. IoT Hub also uses Service Fabric to scale out devices across multiple servers, each server communicating with a 1-N DocumentDB partitions. This topology is replicated across each Azure region that IoT Hub is available.


Next steps

In this blog, we looked at a couple of first-party use cases of DocumentDB and how these Microsoft teams were able to utilize Azure DocumentDB to improve user experience, improve latency, and reliability of their services.