Microsoft Research is at the forefront of tackling cutting-edge problems with technologies such as Machine Learning and Deep Neural Networks (DNNs). These technologies run on next-generation server infrastructure that spans immense Windows and Linux cluster environments. Additionally, DNN application stacks involve not only traditional system resources (CPUs, memory) but also graphics processing units (GPUs).
With such a nontraditional infrastructure environment, the Microsoft Research Operations team needed a highly flexible, scalable service, compatible with both Windows and Linux, to troubleshoot and determine root causes across the full stack.
Enter Azure Log Analytics
Azure Log Analytics, a component of Microsoft Operations Management Suite, natively supports log search across billions of records, real-time metric collection, and rich custom visualizations across numerous sources. These out-of-the-box features, paired with the flexibility of available data sources, made Log Analytics a great option for gaining visibility and insight by correlating data across DNN clusters and components.
The following diagram illustrates how Log Analytics lets the different hardware and software components within a single Deep Neural Network cluster node send real-time data.
1. Linux Server System Resource Monitoring
Deep Neural Networks traditionally run on Linux, and Log Analytics supports major Linux distributions as first-class citizens. The OMS Agent for Linux, built on the open source log collector FluentD, was also recently made generally available. By leveraging the Linux agent, we were able to easily collect system metrics at 10-second intervals, along with all of our Linux logs, without any customization effort.
2. NVIDIA GPU Information
The Log Analytics platform is also extremely flexible, allowing users to send data via a recently released HTTP POST API. We were able to write a custom Python application that retrieves data from the cluster's NVIDIA GPUs, which unlocked the ability to alert on metrics such as GPU temperature. Additionally, these metrics can be visualized with Custom Views to create rich performance graphs for the team to monitor.
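To make the idea concrete, here is a minimal sketch of what such a collector could look like. It is not the team's actual application (see the walkthrough on the MSOMS blog for that). It assumes GPU readings are taken by shelling out to `nvidia-smi`, and it posts them to the Log Analytics HTTP Data Collector API using the shared-key signature scheme. The function names, the record field names, and the `GpuMetrics` log type mentioned below are all hypothetical choices made for illustration.

```python
import base64
import csv
import hashlib
import hmac
import io
import json
import subprocess
import urllib.request
from email.utils import formatdate

# Fields requested from nvidia-smi. The record keys are our own naming (assumption).
QUERY_FIELDS = "index,name,temperature.gpu,utilization.gpu,memory.used"
RECORD_KEYS = ["GpuIndex", "GpuName", "TemperatureC", "UtilizationPct", "MemoryUsedMiB"]
NUMERIC_KEYS = ("GpuIndex", "TemperatureC", "UtilizationPct", "MemoryUsedMiB")


def read_nvidia_smi():
    """Run nvidia-smi and return its CSV output (one row per GPU) as text."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY_FIELDS, "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )


def _to_int(value):
    """Convert a cell to int, or None if the GPU reports a non-numeric placeholder."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV rows into a list of dicts, one per GPU."""
    records = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue
        record = dict(zip(RECORD_KEYS, (cell.strip() for cell in row)))
        for key in NUMERIC_KEYS:
            record[key] = _to_int(record[key])
        records.append(record)
    return records


def build_signature(workspace_id, shared_key, date, content_length):
    """Build the SharedKey Authorization header value for the Data Collector API."""
    string_to_sign = "POST\n{}\napplication/json\nx-ms-date:{}\n/api/logs".format(
        content_length, date
    )
    digest = hmac.new(
        base64.b64decode(shared_key), string_to_sign.encode("utf-8"), hashlib.sha256
    ).digest()
    return "SharedKey {}:{}".format(workspace_id, base64.b64encode(digest).decode("ascii"))


def post_records(workspace_id, shared_key, log_type, records):
    """POST a list of records as JSON and return the HTTP status code."""
    body = json.dumps(records).encode("utf-8")
    date = formatdate(usegmt=True)  # RFC 1123 date in GMT
    request = urllib.request.Request(
        "https://{}.ods.opinsights.azure.com/api/logs?api-version=2016-04-01".format(
            workspace_id
        ),
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": build_signature(workspace_id, shared_key, date, len(body)),
            "Log-Type": log_type,  # becomes the custom record type in Log Analytics
            "x-ms-date": date,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `post_records(workspace_id, shared_key, "GpuMetrics", parse_gpu_csv(read_nvidia_smi()))` on a timer would push one record per GPU. The workspace ID and key come from your own workspace settings, and `GpuMetrics` is just a placeholder name. Once the records arrive, a field such as the temperature reading can back an alert or a Custom View like the ones described above.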
Whoa, I’d love to learn more
We wrote this post to showcase the flexibility Log Analytics offers customers in the types of data sources that can be onboarded. If you are interested in replicating this type of insight, check out the full walkthrough on the MSOMS blog, which includes Python code examples.
Finally, if you are completely new to Log Analytics, be sure to try our fully hydrated demo environment located here, or sign up for a free Microsoft Operations Management Suite subscription so you can test all of these capabilities in your own environment.