How Skype modernized its backend infrastructure using Azure Cosmos DB – Part 1

Posted on April 8, 2019

Program Manager, Azure Cosmos DB

This is a three-part blog post series about how organizations are using Azure Cosmos DB to meet real-world needs, and the difference it’s making to them. In this post (part 1 of 3), we explore the challenges Skype faced that led them to take action. In part 2, we’ll examine how Skype implemented Azure Cosmos DB to modernize its backend infrastructure. In part 3, we’ll cover the outcomes resulting from those efforts.

Note: Comments in italics or parentheses are the author's.

Scaling to four billion users isn’t easy

Founded in 2003, Skype has grown to become one of the world’s premier communication services, making it simple to share experiences with others wherever they are. Since its acquisition by Microsoft in 2011, Skype has grown to more than four billion total users, more than 300 million monthly active users, and more than 40 million concurrent users.

People Core Service (PCS), one of the core internal Skype services, is where contacts, groups, and relationships are stored for each Skype user. The service is called when the Skype client launches, is checked for permissions when initiating a conversation, and is updated as the user’s contacts, groups, and relationships are added or otherwise changed. PCS is also used by other, external systems, such as Microsoft Graph, Cortana, bot provisioning, and other third-party services.

Prior to 2017, PCS ran in three datacenters in the United States, with data for one-third of the service’s four billion users held in each datacenter. Each location had a large, monolithic SQL Server relational database. Having been in place for several years, those databases were beginning to show their age. Specific problems and pains included:

  • Maintainability: The databases had a huge, complex, tightly coupled code base, with long stored procedures that were difficult to modify and debug. There were many interdependencies, as the database was owned by a separate team and contained data for more than just Skype, its largest user. And with user data split across three such systems in three different locations, Skype needed to maintain its own routing logic based on which user’s data it needed to retrieve or update.
  • Excessive latency: With all PCS data being served from the United States, Skype clients in other geographies and the local infrastructure that supported them (such as call controllers) experienced unacceptable latency when querying or updating PCS data. For example, Skype has an internal service level agreement (SLA) of less than one second when setting up a call. However, the round-trip times for the permission check performed by a local call controller in Europe, which reads data from PCS to ensure that user A has permission to call user B, made it impossible to set up a call between two users in Europe within the required one-second period.
  • Reliability and data quality: Database deadlocks were a problem—and were exacerbated because data used by PCS was shared with other systems. Data quality was also an issue, with users complaining about missing contacts, incorrect data for contacts, and so on.

All of these problems became worse as usage grew, to the point that, by 2017, the pain had become unacceptable. Deadlocks were becoming more and more common as database traffic increased, which resulted in service outages, and weekly backups were leaving some data unavailable. “We did the best with what we had, coming up with lots of workarounds to deal with all the deadlocks, such as extra code to throttle database requests,” recalls Frantisek Kaduk, Principal .NET Developer on the Skype team. “As the problems continued to get worse, we realized we had to do something different.”

In addition, the existing system didn’t meet General Data Protection Regulation (GDPR) requirements, so the team faced a deadline for shutting down the servers.

The team decided that, to deliver an uncompromised user experience, it needed its own data store. Requirements included high throughput, low latency, and high availability, all of which had to be met regardless of where in the world users were.

An event-driven architecture was a natural fit; however, it would need to be more than just a basic implementation that stored current data. “We needed a better audit trail, which meant also storing all the events leading up to a state change,” explains Kaduk. “For example, to handle misbehaving clients, we need to be able to replay that series of events. Similarly, we need event history to handle cross-service/cross-shard transactions and other post-processing tasks. The events capture the originator of a state change, the intention of that change, and the result of it.”
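
To make the idea of storing events rather than only current state more concrete, here is a minimal Python sketch of an append-only event log with replay. It is an illustration only, not Skype's implementation: the ContactEvent fields, the EventStore class, the intention and result values, and the sample users are all hypothetical, chosen to mirror the three things Kaduk says each event captures (originator, intention, and result).

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class ContactEvent:
    """One immutable record in a user's event history (hypothetical schema)."""
    user_id: str      # whose contact list is affected
    originator: str   # who or what requested the change
    intention: str    # "add_contact" or "remove_contact"
    contact_id: str   # the contact the change targets
    result: str       # "accepted" or "rejected"


@dataclass
class EventStore:
    """In-memory, append-only log; a stand-in for a real data store."""
    _events: List[ContactEvent] = field(default_factory=list)

    def append(self, event: ContactEvent) -> None:
        self._events.append(event)

    def history(self, user_id: str) -> List[ContactEvent]:
        # Events are returned in the order they were appended.
        return [e for e in self._events if e.user_id == user_id]


def replay(events: List[ContactEvent]) -> Set[str]:
    """Rebuild the current contact set by applying events in order."""
    contacts: Set[str] = set()
    for e in events:
        if e.result != "accepted":
            continue  # rejected events stay in the audit trail but change nothing
        if e.intention == "add_contact":
            contacts.add(e.contact_id)
        elif e.intention == "remove_contact":
            contacts.discard(e.contact_id)
    return contacts


store = EventStore()
store.append(ContactEvent("alice", "client:alice-phone", "add_contact", "bob", "accepted"))
store.append(ContactEvent("alice", "client:alice-phone", "add_contact", "carol", "accepted"))
store.append(ContactEvent("alice", "client:buggy-build", "remove_contact", "bob", "rejected"))
store.append(ContactEvent("alice", "client:alice-laptop", "remove_contact", "carol", "accepted"))

current_contacts = replay(store.history("alice"))
print(sorted(current_contacts))  # ['bob']
```

Because the rejected request from the hypothetical misbehaving client remains in the history, the log doubles as an audit trail: the current state can be rebuilt at any time, and the sequence of events that led to it can be inspected.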

Continue on to part 2, which examines how Skype implemented Azure Cosmos DB to modernize its backend infrastructure.