
Microsoft Xbox team adopts the SRE role to build game streaming

Learn how operations teams and developers became trusted partners in building a globally distributed Kubernetes deployment.

Read the full story

Challenge: Building a process for going global

Like many operations teams, the Xbox Reliability Engineering and Operations (xREO) team used to spend a lot of time on repetitive, manual tasks: maintaining data centers, deploying new code, and reacting to issues that arose from working in a monolithic, rigid architecture that they didn’t design. Their efforts kept the service running for millions of active monthly subscribers in more than 40 countries and regions. But when the team was tasked with supporting Project xCloud, a game streaming experience with extremely low latency requirements for gamers around the world, it became clear that they needed to step outside their traditional service engineering role, break down team silos, and reinvent the way they worked.

"Even small changes posed a significant risk, which meant we spent a lot of our time firefighting. Our mode of operation was mostly reactive, and we weren't really empowered to do much about it."

James Whitesides, SRE Program Manager, Xbox Reliability Engineering and Operations

Solving for scale with collaboration and automation

Early in the project, the development team recognized that they needed to bring in xREO to help design and build a new architecture that would take advantage of the global reach of Azure. Starting with containers to decouple the service code from the infrastructure and Kubernetes as the obvious choice for orchestration, the teams selected the fully managed Azure Kubernetes Service (AKS) to eliminate a lot of the management complexity.

Yet even with this streamlined system, the volume of manual tasks required to build each Kubernetes cluster quickly overwhelmed the xREO team. For repeatability and automation, they decided to build a continuous integration/continuous delivery (CI/CD) pipeline with Azure Pipelines, using Azure Resource Manager templates to rapidly provision resources.

"Now, in the SRE role, we build the platform with the devs, and we are part of their deployment process. We're really focused on building and improving rather than burning down checklists."

James Whitesides, SRE Program Manager, Xbox Reliability Engineering and Operations

A new role with a new mission

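Currently, the CI/CD pipeline deploys more than 35 AKS-based microservices to numerous Azure regions (the microservices in each region depend on more than 100 resources), with more deployments underway. To deploy a new region, the team adds six lines of code and waits for the resources to be provisioned.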

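With fully automated deployments, the xREO team has transitioned into a site reliability engineering (SRE) role, spending most of their time creating new tools rather than troubleshooting. They are frequently consulted and are regarded as trusted partners of the development team, and their focus has shifted to forward-looking, high-value, high-return work.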

Learn more about the team's transition to the SRE role.
