Over the past few years, SONiC (Software for Open Networking in the Cloud), our open switch OS, has been in the fast lane. A diverse group of community partners has actively engaged with us to contribute to and support the evolution of the software.
SONiC is best thought of as a living organism, always evolving. Microsoft and the community are developing, refining, and making SONiC freely available to anyone who runs global-scale or cloud-type networks, or who simply has a healthy interest in advanced networking.
Being in control of the network fabric, and in particular having a hardware-agnostic approach across large heterogeneous networks, is critical. SONiC was created to provide the foundational attributes we ourselves needed when we set out to build the global network that powers both Azure and our other cloud services.
Recently, SONiC has received several enhancements and updates, along with additions to the ecosystem contributing to SONiC’s success.
Let’s take a look at what is new.
Global support now available
We are excited to see SONiC and its sibling SAI (Switch Abstraction Interface) being adopted by many global network innovators. Recently, both Dell EMC and Mellanox announced that SONiC will feature as a switch OS option for customers using their respective hardware. On top of this, both will offer global support services for this new combination.
“Dell EMC Networking has led the industry to an open networking paradigm, allowing customers to transform their IT operations on their own terms. Having SONiC as a validated option on our Open Networking hardware is a no-brainer as customers increasingly look to open source to gain even more flexibility and the scale to power massive cloud-based services such as Azure,” said Tom Burns, Senior Vice President and General Manager, Dell EMC Networking & Solutions.
With the introduction of enterprise-class support services, SONiC is moving quickly into mainstream territory. Users can now take advantage of SONiC and its open source benefits while also enjoying high-quality global support. We look forward to welcoming more contributors and maintainers offering similar services as the SONiC community continues to grow.
“Mellanox is a leading SONiC contributor offering years of experience in high speed networking, large system validation and Open Source. Our customers are looking for open systems based on Spectrum and Spectrum-2 that are easily tailored to the changing and demanding needs of the cloud. Mellanox is in the unique position to globally support advanced services from networking, scalable high-performance computing, advanced storage transport and Datacenter switching,” said Amir Preacher, Executive Vice President Business Development, Mellanox Technologies.
Our decade-long experience running the global network that powers both Azure and our other cloud services has taught us many things. One area that remains crucial to operations at scale is monitoring and telemetry, which we have built into SONiC and refined over the years.
Pinpoint network issues with Everflow
Everflow is the brainchild of Microsoft researchers and networking engineers. It can be very difficult to diagnose tricky network issues such as packet drops, random latency spikes and network loops, and it’s not unusual for networking engineers to spend hours and even days working on pinpointing and resolving such problems.
Everflow is a packet-level telemetry system with an unprecedented level of detail. Everflow adds a timestamp to packets at each stop and mirrors packets to a centralized collector for deep analysis. For Azure, Everflow is one of the most powerful tools we use to diagnose packet loss and latency issues in our global network.
Figure 1. Everflow concept chart
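To make the idea concrete, here is a minimal Python sketch of what a collector could do with mirrored, timestamped packet copies: group them by packet and compute the delay between consecutive hops. This is an illustration only, not Everflow's actual implementation. The `MirroredPacket` record, its field names, and the helper functions are hypothetical, and the sketch assumes switch clocks are synchronized well enough for timestamps to be compared.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MirroredPacket:
    # All fields are hypothetical, chosen for illustration only.
    packet_id: str     # key identifying the original packet
    switch: str        # switch that timestamped and mirrored the packet
    timestamp_us: int  # timestamp added at that hop, in microseconds


def per_hop_latency(mirrors):
    """Return {packet_id: [(from_switch, to_switch, delay_us), ...]}."""
    by_packet = defaultdict(list)
    for mirror in mirrors:
        by_packet[mirror.packet_id].append(mirror)

    result = {}
    for packet_id, hops in by_packet.items():
        # Order the copies by timestamp to reconstruct the path taken.
        ordered = sorted(hops, key=lambda m: m.timestamp_us)
        result[packet_id] = [
            (a.switch, b.switch, b.timestamp_us - a.timestamp_us)
            for a, b in zip(ordered, ordered[1:])
        ]
    return result


def find_slow_hops(mirrors, threshold_us):
    """List (packet_id, from_switch, to_switch, delay_us) above a threshold."""
    slow = []
    for packet_id, hops in per_hop_latency(mirrors).items():
        for src, dst, delay in hops:
            if delay > threshold_us:
                slow.append((packet_id, src, dst, delay))
    return slow
```

Given copies of one packet seen at three switches, `find_slow_hops` would point at the segment where the latency spike occurred, which is the kind of question that is hard to answer from per-switch counters alone.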
No more resource overflows
A network switch has limited capacity to store the various rules that make up the foundation of routing packets across a network. As the size of a network increases, it is quite common to reconfigure the network to accommodate more rules, routes, and nodes along the way. Through our own experience in building and running cloud scale networks, we have seen critical incidents caused simply by resource usage exceeding the limit of the ASIC, or brain of the switch.
To ease the pain and help address overflows, we have added Critical Resource Monitoring (CRM) to SONiC. It allows your switch to proactively flag and issue alerts when usage of any resource approaches its threshold. Further, it enables a network engineer to query the current state of any critical resource in the network. With CRM, critical configuration changes can be made while keeping close tabs on all resources.
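The core check behind this kind of monitoring can be sketched in a few lines of Python. This is a simplified illustration, not the CRM implementation in SONiC. The `ResourceUsage` record, the resource names, and the 85 percent default threshold are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ResourceUsage:
    # Hypothetical record of one ASIC resource, for illustration only.
    name: str
    used: int
    available: int

    @property
    def percent_used(self) -> float:
        total = self.used + self.available
        return 100.0 * self.used / total if total else 0.0


def check_thresholds(resources, high_percent=85.0):
    """Return an alert string for every resource at or above the threshold."""
    return [
        f"{r.name}: {r.percent_used:.1f}% used (threshold {high_percent:.1f}%)"
        for r in resources
        if r.percent_used >= high_percent
    ]


if __name__ == "__main__":
    usage = [
        ResourceUsage("ipv4_route", used=900, available=100),
        ResourceUsage("acl_entry", used=10, available=90),
    ]
    for alert in check_thresholds(usage):
        print(alert)
```

Running such a check before and after a configuration change gives an early warning when a table is close to full, rather than discovering the limit through an incident.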
Richer telemetry collection
Traditional network management systems are typically based on polling switch equipment to acquire telemetry via SNMP. This type of pull-based telemetry is inefficient and hard to scale.
When running online services such as Azure, the dependency on network health and the ability to resolve issues quickly are paramount. So, naturally, we decided to build a more efficient way to monitor network state.
SONiC provides the data we need using streaming telemetry. In short, this means our hardware proactively pushes real-time, structured, analytics-ready data to the management system. Streaming telemetry greatly enriches the way we collect performance data. It’s a natural fit for SONiC’s architecture, because it only requires adding a containerized module that reads telemetry data directly from the centralized Redis database. The gRPC feature that streaming telemetry is based on was contributed to SONiC by community member Alibaba Group.
Figure 2. SONiC streaming telemetry through gRPC
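The difference between polling and pushing can be shown with a small Python sketch. It is a toy, in-process model of the push pattern only: it does not use gRPC or Redis, and the `TelemetryPublisher` class, the path string, and the message layout are hypothetical names chosen for the example.

```python
import json
import time


class TelemetryPublisher:
    """Toy push model: subscribers are notified only when a value changes."""

    def __init__(self):
        self._state = {}        # last known value per telemetry path
        self._subscribers = []  # callbacks that receive JSON messages

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def update(self, path, value):
        """Record a value and push it to subscribers if it changed."""
        if path in self._state and self._state[path] == value:
            return False  # unchanged, nothing to send
        self._state[path] = value
        message = json.dumps(
            {"path": path, "value": value, "timestamp": time.time()}
        )
        for callback in self._subscribers:
            callback(message)
        return True


if __name__ == "__main__":
    publisher = TelemetryPublisher()
    publisher.subscribe(print)  # stand-in for a management system
    publisher.update("interfaces/Ethernet0/in_errors", 0)
    publisher.update("interfaces/Ethernet0/in_errors", 0)  # no message sent
    publisher.update("interfaces/Ethernet0/in_errors", 3)
```

With polling, the management system would have asked for the counter three times and received the same answer twice. Here it receives two structured messages, one per actual change, at the moment the change happens.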
The features presented in this blog are all great examples of SONiC’s ability to run diagnostics, prevent network failures, and provide fast and flexible telemetry. As online services and application requirements become more and more sophisticated, the SONiC team at Microsoft and our highly committed developer community will continue to build, innovate, and refine the software, making the learnings we gather from building and operating the most reliable network in the public cloud available to you. Please stay tuned for future updates.
The first SONiC workshop in China was hosted on October 19, 2018. Presentations are available at the SONiC GitHub repository.