Azure #CosmosDB: Case study around RU/min with the Universal Store Team

Published June 8, 2017

Senior Program Manager

Karthik Tunga Gopinath, from the Universal Store Team at Microsoft, leveraged RU/min to optimize provisioning for his team's entire workload on Azure Cosmos DB. He shares his experience in this post.

Azure Cosmos DB Request unit per minute capability

The Universal Store Team (UST) at Microsoft builds the platform powering all “storefronts” of Microsoft assets. These storefronts can be web, rich-client, or brick-and-mortar stores. This includes the commerce, fraud, risk, identity, catalog, and entitlements/license systems.

Our team, within the UST, streams user engagement data for paid applications, such as games, in near real time. Microsoft's fraud systems use this data to determine whether a customer is eligible for a refund upon request. The streaming data is ingested into Azure Cosmos DB. Today, this usage data, along with other insights provided by the Customer Knowledge platform, forms the key factors in the refund decision. Since this infrastructure powers UST self-serve refunds, it is imperative that we drive down operating cost as much as possible.

We chose Azure Cosmos DB primarily for three reasons: guaranteed SLAs, elastic scale, and global distribution. It is crucial for the data store to keep up with the incoming stream of events under guaranteed SLAs, and the storage needed to be replicated in order to serve refund queries faster across multiple regions, with support for disaster recovery.

The problem

Azure Cosmos DB provides predictable, low-latency performance backed by the most comprehensive SLAs in the industry. Such performance requires capacity provisioning at per-second granularity. However, relying only on per-second provisioning made cost a concern for us, because we had to provision for peaks.

For the refund scenario, we need to store 2 TB of usage data. This, coupled with the fact that we cannot fully control the write skew, causes a few problems. The incoming data has temporal bursts for various reasons, such as new game releases, discounts on game purchases, and weekday vs. weekend traffic. These bursts would cause writes to be frequently throttled. To avoid throttling during bursts, we needed to allocate more RUs. This over-allocation proved to be expensive, since we didn't use all the allocated RUs the majority of the time.
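
Throttled requests surface as HTTP 429 responses with an `x-ms-retry-after-ms` header telling the client how long to back off. The sketch below shows the general retry pattern; it is not the team's actual code, and `send_write` is a hypothetical stand-in for the real SDK call.

```python
import time

# Minimal retry-on-throttle sketch (illustrative, not the team's code).
# send_write is a hypothetical callable assumed to return
# (http_status, retry_after_ms) for each attempted write.
def write_with_retry(send_write, doc, max_attempts=5):
    for _ in range(max_attempts):
        status, retry_after_ms = send_write(doc)
        if status != 429:                    # success or a non-throttle error
            return status
        time.sleep(retry_after_ms / 1000.0)  # honor the server's back-off hint
    raise RuntimeError("still throttled after %d attempts" % max_attempts)
```

Retries hide transient throttling but add latency, which is why sustained bursts still push teams toward over-provisioning.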

Another reason we allocate more RUs is to decrease our Mean Time to Repair (MTTR). This matters primarily when catching up to the current stream of metric events after a delay or failure: we need enough capacity to catch up as quickly as possible. Currently, the platform handles between 2,500 and 4,000 writes/sec. In theory, we only need 24K RU/s, since each write costs 6 RUs given our document size. However, because of the skew, it is hard to predict when a write will land on which partition. Also, the partition key we use is designed for extremely fast reads, to provide a good customer experience during self-service refunds.
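
The 24K figure is simple arithmetic on the numbers above; the 6 RU/write cost is specific to this workload's document size and indexing, so your own cost per write will differ:

```python
# Back-of-the-envelope check of the steady-state throughput requirement.
RU_PER_WRITE = 6             # observed cost per write for this document size
peak_writes_per_sec = 4_000  # upper end of the observed 2,500-4,000 writes/sec

required_ru_per_sec = peak_writes_per_sec * RU_PER_WRITE
print(required_ru_per_sec)   # 24000 -> the 24K RU/s figure above
```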

Request Units per Minute (RU/m) Stress Test Experiment

To test RU/m, we designed a catch-up (failure-recovery) test in our dev environment. Previously, we had allocated 300K RU/s for the collection. We enabled RU/m and reduced provisioned throughput from 300K RU/s to 100K RU/s, which gave us an extra 1M RU/m. To push our writes to the limit and test the catch-up scenario, we simulated an upstream failure: we stopped streaming for about 20 hours, then started streaming the backlog and observed whether the application could catch up with the lower RU/s plus the additional RU/m. The dev environment carried the same load we see in production.
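
The 1M RU/m figure follows from the ratio documented for the feature at the time, which granted 1,000 RU/min for every 100 RU/s provisioned:

```python
# RU/m budget arithmetic, assuming the documented 1,000 RU/min
# granted per 100 RU/s of provisioned throughput.
def ru_per_minute_budget(ru_per_sec):
    return ru_per_sec // 100 * 1_000

print(ru_per_minute_budget(100_000))  # 1000000 -> the extra 1M RU/m above
```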

Data Lag Graph


RU/min usage


The catch-up test is our worst-case workload. The first graph shows the lag in streaming data. We see that over time the application catches up and reduces the lag to near zero (real time). The second graph shows that RU/m is consumed as soon as the catch-up test starts, confirming that the load exceeded our allocated RU/s, which is the desired outcome for the test. RU/m usage stays between 35K and 40K until we catch up. This is the expected behavior, since the peak load on Azure Cosmos DB occurs during this period. The gradual drop in RU/m usage reflects the catch-up nearing completion. Once the data is close to real time, the application no longer needs the extra RU/m, since the provisioned RU/s is enough to meet throughput requirements for normal operations most of the time.

RU/m usage during normal conditions

As mentioned above, during normal operations the streaming pipeline requires only 24K RU/s. However, because activity can concentrate on a specific partition (a “hot spot”), each partition can have unexpected capacity needs. Looking at the graph below, you can see sporadic RU/m consumption, as RU/m is still used for non-peak load. Such hot spots can happen for the reasons mentioned above. We also noticed that the application did not experience any throttling during the entire test period. Previously, we allocated 300K RU/s to handle these bursts; now, with RU/m, we only need to provision 100K RU/s. RU/m helped us during normal operation too, not just during peak load.

RU/m usage during normal operations

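
Hot spots hurt because Azure Cosmos DB divides a collection's provisioned throughput evenly across its physical partitions, so a single busy partition can throttle even while aggregate capacity sits idle. The numbers below are made-up illustrations, not measurements from this workload:

```python
# Per-partition throughput ceiling: provisioned RU/s is split evenly
# across physical partitions. Partition count here is hypothetical.
provisioned_ru_per_sec = 100_000
physical_partitions = 50

per_partition_ru = provisioned_ru_per_sec / physical_partitions
print(per_partition_ru)  # 2000.0 RU/s available to any single partition
```

A burst of writes landing on one partition only has that per-partition slice to work with, which is exactly the gap RU/m absorbs.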

Results

The experiments above proved that we could leverage RU/m to lower our RU/s allocation while still handling peak load. By leveraging RU/m, we reduced our operating cost by more than 66%.
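
The headline number falls out of the provisioning change itself (the final bill also depends on the separate RU/m line item, which this sketch ignores):

```python
# Where "more than 66%" comes from: the RU/s plan dropped from 300K to 100K.
savings = 1 - 100_000 / 300_000
print(round(savings * 100, 1))  # 66.7 (percent)
```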

Next steps

The team is actively working on ways to reduce the write skew, and is working with the Azure Cosmos DB team to get the most out of RU/m.

Resources

Our vision is to be the most trusted database service for all modern applications. We want to enable developers to truly transform the world we are living in through the apps they are building, which is even more important than the individual features we are putting into Azure Cosmos DB. We spend countless hours talking to customers every day and adapting Azure Cosmos DB to make the experience truly stellar and fluid. We hope that the RU/m capability will enable you to do more and will make your development and maintenance even easier!

So, what are the next steps you should take?

  • First, understand the core concepts of Azure Cosmos DB.
  • Learn more about RU/m by reading the documentation:
    • How RU/m works
    • Enabling and disabling RU/m
    • Good use cases
    • Optimize your provisioning
    • Specify access to RU/m for specific operations
  • Visit the pricing page to understand billing implications.

If you need any help or have questions or feedback, please reach out to us through askcosmosdb@microsoft.com. Stay up-to-date on the latest Azure Cosmos DB news (#CosmosDB) and features by following us on Twitter @AzureCosmosDB and join our LinkedIn Group.

About Azure Cosmos DB

Azure Cosmos DB started as “Project Florence” in late 2010 to address developer pain points faced by large-scale applications inside Microsoft. Observing that the challenges of building globally distributed apps are not unique to Microsoft, in 2015 we made the first generation of this technology available to Azure developers as Azure DocumentDB. Since then, we've added new features and introduced significant new capabilities; Azure Cosmos DB is the result. It represents the next big leap in globally distributed, at-scale cloud databases.