How Microsoft builds massively scalable services using Azure Cosmos DB

Postado em 20 abril, 2017

Principal Program Manager, Azure Cosmos DB

https://docs.microsoft.com/en-us/azure/documentdb/documentdb-nosql-local-emulatorAs of May 10th 2017,

Azure Cosmos DB is Microsoft’s globally distributed multi-model database. Azure Cosmos DB was built from the ground up with global distribution and horizontal scale at its core. It offers turnkey global distribution across any number of Azure regions by transparently scaling and replicating your data wherever your users are. Elastically scale throughput and storage worldwide, and pay only for the throughput and storage you need. Azure Cosmos DB guarantees single-digit-millisecond latencies at the 99th percentile anywhere in the world, offers multiple well-defined consistency models to fine-tune performance, and guarantees high availability with multi-homing capabilities—all backed by industry leading service level agreements (SLAs). 

Azure Cosmos DB is truly schema-agnostic; it automatically indexes all the data without requiring you to deal with schema and index management. It’s also multi-model, natively supporting document, key-value, graph, and column-family data models. With Azure Cosmos DB, you can access your data using APIs of your choice, as DocumentDB SQL (document), MongoDB (document), Azure Table Storage (key-value), and Gremlin (graph) are all natively supported.


This week at Microsoft Data Amp we covered how you can harness the incredible power of data using Microsoft’s latest innovations in its Data Platform. One of the key pieces in the Data Platform is Azure Cosmos DB, Microsoft’s globally distributed multi-model database service. Azure Cosmos DB is used virtually ubiquitously as a backend for first-party Microsoft services for many years.

In this blog, we cover case studies of first-party applications of Azure Cosmos DB by the Windows, Universal Store, and Azure IoT Hub teams, and how these teams could harness the scalability, low latency, and flexibility benefits of Azure Cosmos DB to innovate and bring business value to their services.

Microsoft DnA: How Microsoft uses error reporting and diagnostics to improve Windows

The Windows Data and Analytics (DnA) team in Microsoft implements the crash reporting technology for Windows. One of their components runs as a Windows Service in every Windows device. Whenever an application stops responding on a user's desktop, Windows collects post-error debug information and prompts the user to ask if they’re interested in finding a solution to the error. If the user accepts, the dump is sent over the Internet to the DnA service. When a dump reaches the service, it is analyzed and a solution is sent back to the user when one is available.

Windows error reporting diagnostic information

clip_image002

 

Windows' need for fast key-value lookups

In DnA’s terminology, crash reports are organized into “buckets”. Each bucket is used to classify an issue by key attributes such as Application Name, Application Version, Module Name, Module Version, and OS Exception code. Each bucket contains crash reports that are caused by the same bug. With the large ecosystem of hardware and software vendors, and 15 years of collected data about error reports, the DnA service has over 10 billion unique buckets in its database cluster.

One of the DnA team’s requirements was rather simple at face value. Given the hash of a bucket, return the ID corresponding to its bucket/issue if one was available. However, the scale posed interesting technical challenges. There was a lot of data (10 billion buckets, growing at 6 million a day), high volume of requests and global reach (requests from any device running Windows), and low latency requirements (to ensure a good user experience).

To store “Bucket Dimensions”, the DnA team provisioned a single Cosmos DB container with 400,000 request units per second of provisioned throughput. Since all access was by the primary key, they configured the partition key to be the same as the “id”, with a digest of the various attributes as the value. As Azure Cosmos DB provided <10 ms read latency and <15ms write latency at p99, DnA could perform fast lookups against buckets and lookup issues even as their data and request volumes continued to grow over time.

Windows cab catalog metadata and query

Aside from fast real-time lookups, the DnA team also wanted to use the data to drive engineering decisions to help improve Microsoft and other vendors’ products by fixing the most impactful issues. For example, the team has observed that addressing the top 1 percent of reliability issues could address 50 percent of customers’ issues. This analysis required storing the crash dump binary files, “cabs”, extracting useful metadata, then running analysis and reports against this data. This presented a number of interesting challenges on its own.

  • The team deals with approximately 600 different types of reliability-incident data. Managing the schema and indexes required a significant engineering and operational overhead on the team.
  • The cab metadata was also a big volume of data. There were about 5 billion cabs, and 30 million new cabs were added every day.

The DnA team could migrate their Bucket Dimension and Cab Catalog stores to Azure Cosmos DB from their earlier solution based on an on-premises cluster of SQL Servers. Since shifting the database’s heavy lifting to Cosmos DB, DnA benefited from the speed, scale, and flexibility offered by Cosmos DB. More importantly, they could focus less on maintenance of their database and more on improving user experience on Windows.

clip_image004

You can read the case study at Microsoft’s DnA team achieves planet-scale big-data collection with Azure Cosmos DB.

Microsoft Global Homing Service: How Xbox Live and Universal Store build highly available location services

Microsoft’s Universal Store team implements the e-commerce platform that is used to power Microsoft’s storefronts across Windows Store, Xbox, and a large set of Microsoft services. One of the key internal components in the Universal Store backend is the Global Homing Service (GHS), a highly reliable service that provides its downstream consumers with the ability to quickly retrieve location metadata associated with one to many, arbitrary large number of, IDs.

Global Homing Service (GHS) using Azure Cosmos DB across 4 regions

GLS

GHS is on a hot path for the majority of its consumer services and receives hundreds of thousands of requests per second. Therefore, the latency and throughput requirements for the service are strict. The service had to maintain 99.99% availability and predictable latencies under 300ms end-to-end at the 99.9th percentile to satisfy requirements of its partner teams. To reduce latencies, the service is geo-distributed so that it is as close as possible to calling partner services.

The initial design of GHS was implemented using a combination of Azure Table Storage and various levels of caches. This solution worked well for the initial set of loads, but given the critical nature of GHS and increased adoption of the service from key partners, it became apparent that the existing SLA was not going to meet their partners’ P99.9 requirements of <300ms with a 99.99% reliability over 1 minute. Partners with a critical dependency on the GHS call path found that even if the overall reliability was high, there were periods of time where the number of timeouts would exceed their tolerances and result in a noticeable degradation of the partner’s own SLA. These periods of increased timeouts were given the name “micro-outages” and key partners started tracking these daily.

After investigating many possible solutions, such as LevelDB, Kafka, MongoDB, and Cassandra, the Universal Store team chose to replace GHS’s Azure Table backend and the original cache in front of it with an Azure Cosmos DB backend. GHS deployed a single Cosmos DB collection with 600,000 request units per second deployed across four geographic regions where their partner teams had the biggest footprint. As a result of the switch to Azure Cosmos DB, GHS customers have seen p50 latencies under 30ms and a huge reduction in the number and scale of micro-outages. GHS’s availability has remained at or above 99.99% since the migration. In addition to the increase in service availability, overall latencies significantly improved as well for most of GHS call patterns.

Number of GHS micro-outages before and after Azure Cosmos DB migration

clip_image006

Microsoft Azure IoT Hub: How to handle the firehose from billions of IoT devices

Azure IoT Hub is a fully managed service that allows organizations to connect, monitor, and manage up to billions of IoT devices. IoT Hub provides reliable communication between devices, the a queryable store for device metadata and synchronized state information, and provides extensive monitoring for device connectivity and device identity management events. Since IoT Hub is at the ingestion point for the massive volume of writes coming from IoT devices across all of Azure, they needed a robust and scalable database in their backend.

IoT Hub provides device-related information, “device twins”, as part of its APIs that device and back ends can use to synchronize device conditions and configuration. A device twin is a JSON document that includes tags assigned to the device in the backend, a property bag of “reported properties” which include device configuration or conditions, and a property bag of “desired properties” that can be used to notify the device to perform a configuration change. The IoT Hub team choose Azure Cosmos DB over Hbase, Cassandra, and MongoDB because Cosmos DB provided functionality that the team needed like guaranteed low latency, elastic scaling of storage and throughput, provide high availability via global distribution, and rich query capabilities via automatic indexing.

IoT Hub stores the device twin data as JSON documents and performs updates based on the latest state reported by devices in near real-time. The architecture uses a partitioned collection that uses a partition key based on the device ID to elastically scale to handle massive volumes of writes. IoT Hub also uses Service Fabric to scale out devices across multiple servers, each server communicating with a 1-N Azure Cosmos DB partitions. This topology is replicated across each Azure region that IoT Hub is available.

clip_image001

Next steps

In this blog, we looked at a couple of first-party use cases of Azure Cosmos DB and how these Microsoft teams were able to utilize Azure Cosmos DB to improve user experience, improve latency, and reliability of their services.