Advancing a culture of reliability at the pace of Azure

Posted on August 31, 2020

Chief Technology Officer, Microsoft Azure

“Customers value cloud services because they are agile and adaptable, scaling and transforming to meet the changing needs of business. Since the velocity of change can work against the tenets of reliability, our Azure engineering teams have evolved their culture, processes, and frameworks to balance the pace of innovation with assurance of performance and quality. Today, I asked Principal Program Manager Anne Hamilton to explore the challenges of developing a culture of reliability through Azure engineering onboarding skills training, as part of our Advancing Reliability blog series.” —Mark Russinovich, CTO, Azure


 

Like engineering reliability, Azure culture must balance the speed of the new with the stability of the known in the face of tremendous growth and unknowns. New hires bring new ideas and perspectives while veterans bring experience and institutional knowledge. Both contribute to the team culture, which defines how quality and innovation are valued and implemented.

To evolve the best quality outcomes, the Azure engineering team culture must be a place where ideas are openly shared, rigorously challenged, and effectively implemented. It’s a space where ideation and creativity thrive.

Skills, processes, and frameworks can be taught. But can culture be taught? How do you onboard new hires into a culture that values reliability?

Like so much about Azure, onboarding individuals and developing team culture at the speed of Azure has been fraught with challenges, and rich in learnings.

Onboarding engineers—Azure Engineering Boot Camp

The astounding growth rate of cloud computing has created an unprecedented demand for engineering roles worldwide. Within the Cloud + AI team, this results in hundreds of new hires joining Azure engineering teams each week. Hundreds of people to train on crucial skills, internal tools, and best practices. And hundreds of people to experience their first exposure to culture. How do you preserve institutional knowledge and disseminate culture that values reliability when waves of new people are onboarding constantly?

When a new engineer joins Microsoft, they spend their first day at New Employee Orientation (NEO), their first week getting familiar with their team and the environment, and their second week in the Azure Engineering Boot Camp (ABC).

ABC students delve into hands-on labs, learn Azure tools and services, and participate in lectures and activities to explore the engineering and business strategy. Many reliability principles are explicitly taught, including systems thinking, adaptive leadership, valuing diversity and inclusion, and customer development. For example, the “Systems Thinking” session looks at the tensions between feature velocity and quality, exploring the impact of whether and when adding more developers to a problem enhances feature velocity without compromising quality. But it is the way that these are presented that enables the principles to stick. The immersive week of in-person training enables trainers to model, coach, and help students develop reliability culture attributes along with technical skills in real-time.

A cohort of new engineers at a pre-COVID-19 ABC event.

While it’s not written anywhere in the agenda, ABC instructors place a high value on teaching culture. They understand that Microsoft needs new hires to step into their teams with more than technical skills. New hires need both the confidence to question the status quo, and a caring commitment to the team’s cultural dynamic. So that participants develop this expectation when they first arrive, ABC creates a shared experience by encouraging belonging and valuing diversity, and by inviting participants to challenge the curriculum and share personal experiences that may differ from those presented. ABC is a special place, creating a cohort of learners able to explore themselves in the context of the culture. Participants examine their authentic selves and find balance as they step into Azure.

Like the rest of Microsoft Azure, ABC is maturing and adapting to learner needs to improve its relevance as Azure scales. To offer a deeper sense of the program, here are three problems and the current thinking around them.

Problem 1: Enabling global parity

Some of the tenets that have made ABC in Redmond successful are a challenge for the global organization to replicate. Yet thinking holistically and inclusively are core principles of reliability engineering. In Redmond, ABC’s instructor-led, engaged cohort environment has been an essential aspect of enabling a culture of reliability for new hires, but is culture in Redmond the same as culture in Europe? In India? How can ABC hold to a global standard and achieve local relevance?

The team explored video options, but soon realized that Redmond-built video training would have the opposite of the desired effect on creating a vibrant engineering culture. Instead, the team worked with local stakeholders to set up regional training in the major development centers. The result is a ‘crawl, walk, run’ approach in which the regional centers can mature their Azure engineering education program at their own pace. The resulting global improvements to ABC are multiple, including an improved lab experience, more accurate and timely technical curriculum updates, and a diversity of contributors that improves reliability outcomes for all.

One example is datacenter employee participation in ABC. This group tends to be early in career and/or from community colleges, with expertise in infrastructure engineering and operations as opposed to software engineering. The datacenter staff see ABC as a rare opportunity to connect with HQ teams, and they bring a real working experience to the cohort, most of whom won’t yet have had the chance to visit a datacenter. Having these groups learn together and share real-world datacenter experiences along with HQ insights helps both audiences grow.

Crawl, walk, run levels

At the ‘crawl’ level, events happen quarterly. A regional center hosts the ABC trainers and offers keynote speakers from the local leadership team, lunchtime lectures by local engineering teams describing their latest features and services, and a cohort of local veteran engineers to mentor students through capstone architecture design projects. The upside is that students are steeped in a mix of both local and global Azure culture during an event. The downside is that it happens only during ABC events, which are infrequent. An example deployment is the India Development Center in Hyderabad.

At the ‘walk’ level, a local program manager (PM) oversees a cohort of volunteer trainers to teach sessions and to seed the local flavor of Azure engineering culture. The local PM recruits keynote speakers, lecturers, and mentors. They manage communications, related communities, and marketing to deepen the connections between new hires and the established engineering team. The benefit is that students have a deeper exposure to local Azure culture and a deeper relationship with local leadership. The PM’s work extends cultural exposure beyond ABC and invites alumni into more events and participation. An example deployment is the Israel Development Center in Herzliya.

 

At the ‘run’ level, two local FTE trainers who report into the Redmond team teach ABC, engage local speakers and lecturers to regionalize the program, and serve as ambassadors for Azure Engineering Learning as a whole so that the region is fully engaged in training events beyond ABC. This enables more cross-pollination of culture, with learning programs woven deeply into the fabric of the team. Learning culture becomes less event-based and more of an organic part of each day. An example deployment is the Irish Development Center in Dublin.

Dublin ABC FTE trainers Paul Mooney and Jyoti Gupta ready for their next session in December 2019.

Admittedly, it’s a progressive journey with learnings and improvements possible at all levels. The regions continue to share approaches, incubate, and experiment to continue learning and growth.

Problem 2: Personalizing factory education

By intent, a boot camp is rooted in the utilitarian premise of industrial-age education, in which a large body of students is brought to a specific level of knowledge and skill. In the military context, boot camp enables people from a broad swath of backgrounds, fitness levels, and skills to achieve a standard baseline of learning and physical outcomes. While ideal for teaching factory workers or soldiers how to read an instruction manual, it fails at tailoring education for individual skills development and does little for building the constructivist thinking necessary in a culture of reliability. How can an intensive onboarding program provide a tailored learning experience at scale?

ABC is sensitive to this need, but resolution is a work in progress. The curriculum is intended to provide students with a survey of the tools, technologies, and best practices they will encounter early and often in their Azure career. It focuses on providing an essential overview, a brief practical experience, and self-paced resources for continued learning. After ABC, students are expected to define a personal course of learning that integrates additional training events on offer as well as self-paced resources.

Anecdotally, we hear that ABC graduates are more confident about making suggestions and assertions earlier in role because they experienced that behavior being modeled and valued in the ABC classroom. Trainers use stories from real-world student experiences to make reliability theory more concrete. ABC PM and Trainer Devidas Gupta comments, “The best learning is when I get to stand aside and enable students to share and discuss with each other with minimal guidance or facilitation from me. That’s when I know they’ll walk out of the learning experience and into their team with confidence and willingness to engage with others effectively.”

A two-speed audience

Student critiques of ABC fall roughly into two groups. Some early-in-career types feel the pace is too fast and doesn’t provide enough context and examples, while seasoned engineers say that the curriculum is too basic. They ask for more and deeper coverage of advanced topics.

On the surface, it would seem that dividing the program into basic and advanced cohorts, with appropriate topics and pacing for each, would be the best solution when resources and logistics can support it. Historically, ABC has taken a different approach to build a more resilient growth culture, as both groups benefit from learning together. Early-in-career engineers can learn from those more experienced, and seasoned professionals can get new perspectives and approaches from recent college graduates. The classroom becomes a crucible for team cultural experience.

The instructor-led format requires that instructors seek to understand the makeup of each cohort so that they can adjust the pace, add context, or explore the edges of the curriculum to match the learning experience to the cohort needs. Instructors come early, stay late, and often engage students after ABC completion to mentor them, connect them with subject matter experts (SMEs), or offer guidance and resources. This deep and ongoing engagement is what makes ABC so special.

Piloting new approaches

Recent customer developments are enabling large-scale hiring on the East Coast of the U.S. These new recruits mostly come from early-in-career backgrounds and need a more DevOps-focused curriculum. This gives ABC the opportunity to slow the pace and change topic breadth while keeping culture central to the experience. The intent is to extend the learning time frame to offer the basics in greater context, then connect the cohort with more seasoned new hires to broaden the experience. First piloted this spring and fine-tuned over the summer, the curriculum is now in full rollout.

Problem 3: Creating culture at a distance during COVID-19

Current COVID-19 restrictions have made live instructor-led events impossible. In its current format, a video recording of ABC would be a much-diminished learning experience, and most of the cultural dynamic would be lost. Yet new recruits are joining each week and need to be onboarded and trained. How can we bring culture to new hires during this pandemic? How do we build new capabilities to prepare for the future?

Immediate need

ABC is working closely with the NEO team to make sure that new hires joining have the best possible onboarding experience. To build culture, ABC is exploring assigning learning mentors called “camp counselors” to work with small cohorts of new hires and guide them through available self-paced learning options that roughly align to ABC.

ABC FTE trainer, Tim Colbert, pilots a video-based course.

Short-term solution

ABC has pivoted to a virtual video series for self-paced learning. The videos focus on the technical skills and demos and are supported by Microsoft Teams-based office hour discussions, hosted by instructors and topic SMEs. Camp counselors act as learning mentors to support the experience. This way, students have a live venue for connecting with mentors, instructors, and other students to garner familiarity with the culture and have some personalized support despite the lost classroom experience.

Long-term approach

Based on student response, ABC will learn, adapt, and scale video and Teams-based training approaches, sensitive to how they model, support, and grow Azure’s culture of reliability. ABC will continue to pilot new approaches and build more capacity for additional training experiences. Beyond ABC, reliability is covered in the “Implementing SLOs” course and several “Cloud Talks.”

Looking forward

The core of Azure reliability starts with its engineering team culture. Teams need to be a place where ideation is safe, dissent is explored, and passion for quality is foremost. Just as a high velocity of change can impede system reliability, the pace of new hires can disrupt team cultural development. It is critical for onboarding training to model and develop the desired cultural attributes, yet this becomes increasingly challenging at scale, whether that scale is global, across learning levels, or in the face of a pandemic. ABC training will continue to pilot new approaches and evolve better solutions to ensure all engineers can effectively participate in improving reliability from their earliest days in role.

Lessons learned

  • Reliability engineering is rooted in team cultural dynamics. When the culture of an engineering team values the attributes that enable reliability, reliability is a part of every action.
  • During onboarding, don’t miss the opportunity to steep new hires in the organizational culture. Model reliability culture while teaching technical skills to better prepare new hires for the day-to-day experience.
  • When scaling globally, enable local ownership and contributions so the cultural and technical elements of reliability are locally compelling and relevant.
  • Treat onboarding just like an engineering pipeline with careful customer development, frequent experimentation, and agile iteration to improve relevance and build resiliency.

Read our Advancing Reliability blog posts to learn more.