
Microsoft Xbox team adopts the SRE role to build game streaming

Learn how operations teams and developers became trusted partners, architecting a globally distributed Kubernetes deployment together.

The challenge: Building processes for global scale

Like many operations teams, the Xbox Reliability Engineering and Operations (xREO) team used to spend a lot of time performing repetitive, manual tasks to maintain data centers, deploy new code, and react to issues that arose from working in a monolithic, rigid architecture that they didn’t design. Their efforts kept the service running for millions of active monthly subscribers in more than 40 countries and regions. But when the team was tasked with supporting Project xCloud, a game streaming experience with extremely low latency requirements for gamers around the world, it became clear that they needed to step outside their traditional service engineering role, break down team silos, and reinvent the way they worked.

"Even small changes posed a significant risk, which meant we spent a lot of our time firefighting. Our mode of operation was mostly reactive, and we weren't really empowered to do much about it."

James Whitesides, SRE PM, Xbox Reliability Engineering and Operations

Solving for scale through collaboration and automation

Early in the project, the development team recognized that they needed to bring in xREO to help design and build a new architecture that would take advantage of the global reach of Azure. Starting with containers to decouple the service code from the infrastructure and Kubernetes as the obvious choice for orchestration, the teams selected the fully managed Azure Kubernetes Service (AKS) to eliminate a lot of the management complexity.

Yet even with this streamlined system, the volume of manual tasks required to build each Kubernetes cluster quickly overwhelmed the xREO team. For repeatability and automation, they decided to build a continuous integration/continuous delivery (CI/CD) pipeline with Azure Pipelines, using Azure Resource Manager templates to rapidly provision resources.

"Now, in the SRE role, we build the platform with the devs, and we are part of their deployment process. We're really focused on building and improving rather than burning down checklists."

James Whitesides, SRE PM, Xbox Reliability Engineering and Operations

Embracing a new role with a new mission

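Today, the CI/CD pipeline deploys more than 35 AKS-based microservices that rely on more than 100 resources in each of several Azure regions, with more to come. To deploy a new region, the team adds six lines of code and waits for the resources to spin up.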
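The story does not show what those six lines look like. As a rough illustration only, the Python sketch below shows one way a region list can drive per-region deployment parameters, so that bringing up a new region amounts to adding one short entry. Every name here (`Region`, `REGIONS`, `deployment_parameters`, the `game-streaming` prefix, the node counts) is hypothetical and is not taken from the xREO team's pipeline or from any Azure API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    """One deployment target. All fields are illustrative assumptions."""
    name: str        # Azure region name, for example "westeurope"
    node_count: int  # hypothetical number of nodes in the regional cluster


# Hypothetical region list: adding a region means appending one entry here.
REGIONS: List[Region] = [
    Region(name="westus2", node_count=3),
    Region(name="westeurope", node_count=3),
]


def deployment_parameters(region: Region, service: str = "game-streaming") -> Dict[str, object]:
    """Derive the parameters a per-region template deployment might take.

    The naming scheme is invented for this sketch; a real pipeline would
    pass values like these to its own templates.
    """
    return {
        "location": region.name,
        "resourceGroupName": f"{service}-{region.name}-rg",
        "clusterName": f"{service}-{region.name}-aks",
        "nodeCount": region.node_count,
    }


def plan(regions: List[Region]) -> List[Dict[str, object]]:
    """Expand the region list into one parameter set per region."""
    return [deployment_parameters(region) for region in regions]


if __name__ == "__main__":
    for params in plan(REGIONS):
        print(params)
```

The point of the design is that everything region-specific is derived from a single small record, so the rest of the pipeline stays unchanged when a region is added.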

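With deployment fully automated, the xREO team transitioned into a site reliability engineering (SRE) role and now spends most of its time building new tools rather than fixing problems. The xREO team regularly consults with the development team as a trusted partner and focuses on proactive, high-value, high-return work.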

Learn more about the team's transition to the SRE role.

Read the full story