Microsoft Xbox team embraces SRE role to build game streaming

See how the operations team and developers became trusted partners to architect a globally distributed Kubernetes deployment.

The challenge: Creating a process for global scale

Like many operations teams, the Xbox Reliability Engineering and Operations (xREO) team used to spend a lot of time performing repetitive, manual tasks to maintain data centres, deploy new code and react to issues that arose from working in a monolithic, rigid architecture that they didn’t design. Their efforts kept the service running for millions of active monthly subscribers in more than 40 countries and regions. However, when the team was tasked with supporting Project xCloud, a game streaming experience with extremely low latency requirements for gamers around the world, it became clear that they needed to step outside their traditional service engineering role, break down team silos and reinvent the way they worked.

"Even small changes posed a significant risk, which meant we spent a lot of our time firefighting. Our mode of operation was mostly reactive, and we weren't really empowered to do much about it."

James Whitesides, SRE PM, Xbox Reliability and Operations

Solving for scale through collaboration and automation

Early in the project, the development team recognised that they needed to bring in xREO to help design and build a new architecture that would take advantage of the global reach of Azure. The teams started with containers to decouple the service code from the infrastructure, with Kubernetes as the obvious choice for orchestration, and selected the fully managed Azure Kubernetes Service (AKS) to eliminate much of the management complexity.

Yet, even with this streamlined system, the volume of manual tasks required to build each Kubernetes cluster quickly overwhelmed the xREO team. For repeatability and automation, they decided to build a continuous integration/continuous delivery (CI/CD) pipeline with Azure Pipelines, using Azure Resource Manager templates to rapidly provision resources.

"Now, in the SRE role, we build the platform with the devs, and we are part of their deployment process. We're really focused on building and improving rather than burning down checklists."

James Whitesides, SRE PM, Xbox Reliability and Operations

Taking on a new role with a new mission

Today, the CI/CD pipeline deploys more than 35 AKS-based microservices, which rely on more than 100 resources per region, to numerous Azure regions, with more on the way. To deploy a new region, the team adds six lines of code and waits for the resources to spin up.
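
The story does not show what those six lines look like, so the following is only a minimal sketch of the general idea: a short per-region entry that a pipeline could expand into the parameters for an Azure Resource Manager template. The Region class, the REGIONS list, the region names, and the parameter names (location, clusterName, nodeCount) are all hypothetical and are not taken from the xREO team's pipeline.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical per-region settings; field names are illustrative only."""
    name: str        # Azure region name, e.g. "westeurope"
    node_count: int  # assumed size of the AKS node pool in this region


# Hypothetical region list. Onboarding a region means adding one short entry.
REGIONS = [
    Region(name="westus2", node_count=3),
    Region(name="westeurope", node_count=3),
]


def arm_parameters(region: Region) -> dict:
    """Build an ARM deployment-parameters document for one region.

    The parameter names are assumptions; a real template defines its own.
    """
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {
            "location": {"value": region.name},
            "clusterName": {"value": f"aks-{region.name}"},
            "nodeCount": {"value": region.node_count},
        },
    }


def plan(regions: list[Region]) -> dict[str, dict]:
    """Map each region name to the parameters a pipeline stage would pass on."""
    return {r.name: arm_parameters(r) for r in regions}


if __name__ == "__main__":
    # Print the per-region parameter documents as JSON.
    print(json.dumps(plan(REGIONS), indent=2))
```

In a layout like this, the region list is the only thing that changes when a region is added; the template and the pipeline stages stay the same, which is what keeps the change small and repeatable.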

With deployment fully automated, the xREO team has shifted to a site reliability engineering (SRE) role, and they spend most of their time creating new tooling instead of fixing problems. They’re frequently consulted as a trusted partner to the development team, and their focus is on proactive, high-value and highly rewarding work.
