
Microsoft releases SuperBench: A groundbreaking proactive validation system to improve the reliability of cloud AI infrastructure and mitigate hidden performance degradation

Cloud AI infrastructure forms the backbone of modern technology, supporting a wide range of AI workloads and services. Ensuring the reliability of this infrastructure is critical: any failure can cause widespread disruption, especially in large distributed systems where AI workloads are synchronized across numerous nodes. Because of this synchronization, a failure in one node can cascade, amplifying the impact and causing significant downtime or performance degradation. The complexity and scale of these systems make it essential to implement robust mechanisms that maintain smooth operation and minimize incidents affecting the quality of service provided to users.

One of the biggest challenges in maintaining cloud AI infrastructure is troubleshooting hidden performance degradations due to hardware redundancies. These subtle failures, often referred to as “gray failures,” do not cause immediate, catastrophic problems but gradually degrade performance over time. These issues are particularly problematic because they are not easily detectable using traditional monitoring tools that are typically designed to detect obvious binary failure conditions. The insidious nature of gray failures complicates the task of root cause analysis and makes it difficult for cloud providers to identify and fix the underlying issues before they develop into more serious problems that could impact the entire system.

Cloud providers have traditionally relied on hardware redundancies to mitigate these hidden issues and ensure system reliability. Redundant components, such as additional GPU compute units or overprovisioned network connections, are intended to act as failovers. However, these redundancies can inadvertently introduce problems of their own: continuous, repeated reliance on them can mask gradual performance degradation. For example, in Azure A100 clusters where InfiniBand top-of-rack (ToR) switches have multiple redundant uplinks, the loss of some of these links can reduce throughput, especially under certain traffic patterns. This type of gradual degradation often goes unnoticed until it significantly impacts AI workloads, at which point it is much harder to diagnose and fix.
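To make the uplink example concrete, here is a minimal sketch of why losing redundant links silently shrinks capacity. The link counts and per-link bandwidth below are illustrative assumptions, not figures from the paper:

```python
# Hypothetical illustration: a ToR switch with redundant uplinks loses a few
# links. Aggregate uplink bandwidth drops proportionally, which can throttle
# synchronized AI traffic even though the switch still "works".

def aggregate_uplink_gbps(total_links: int, failed_links: int, per_link_gbps: float) -> float:
    """Remaining uplink capacity after some redundant links fail."""
    active = total_links - failed_links
    return active * per_link_gbps

full = aggregate_uplink_gbps(8, 0, 200.0)      # all 8 links up: 1600.0 Gb/s
degraded = aggregate_uplink_gbps(8, 2, 200.0)  # 2 links lost: 1200.0 Gb/s
print(f"throughput retained: {degraded / full:.0%}")  # prints "throughput retained: 75%"
```

Because traffic still flows, binary health checks see nothing wrong; only the 25% capacity loss under heavy collective-communication traffic reveals the failure.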

A team of researchers from Microsoft Research and Microsoft presented SuperBench, a proactive validation system designed to improve the reliability of cloud AI infrastructure by addressing the hidden degradation problem. SuperBench performs a comprehensive evaluation of hardware components under realistic AI workloads. The system includes two main components: a Validator, which learns benchmark criteria to identify defective components, and a Selector, which optimizes the timing and scope of validation to keep it both effective and efficient. SuperBench can run a range of benchmarks representing most real-world AI workloads, detecting subtle performance degradations that might otherwise go unnoticed.

The technology behind SuperBench is tailored to the unique challenges of cloud AI infrastructure. The Validator component runs a series of benchmarks on specific nodes and learns to distinguish normal from faulty performance by analyzing the cumulative distribution of benchmark results, so that even minor deviations indicating a potential problem are detected early. Meanwhile, the Selector component balances the trade-off between validation time and the potential impact of incidents. Using a probabilistic model to predict the likelihood of incidents, the Selector determines the optimal time to run specific benchmarks, ensuring validation happens when it is most likely to prevent incidents.
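The two ideas above can be sketched in a few lines. This is a hedged illustration of the general approach, not the paper's actual implementation: the function names, the quantile-based pass/fail criterion, and the expected-cost rule for the Selector are all simplifying assumptions:

```python
import random

def learn_threshold(healthy_results: list[float], quantile: float = 0.95) -> float:
    """Validator idea: learn a pass/fail bound from the empirical distribution
    of benchmark latencies measured on known-healthy nodes."""
    ordered = sorted(healthy_results)
    idx = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[idx]

def validate_node(result_ms: float, threshold_ms: float) -> bool:
    """A node passes only if its benchmark stays within the learned bound."""
    return result_ms <= threshold_ms

def should_validate(p_incident: float, incident_cost: float, validation_cost: float) -> bool:
    """Selector idea: run validation only when the expected cost of a hidden
    incident outweighs the cost of taking the node offline to test it."""
    return p_incident * incident_cost > validation_cost

# Usage: learn a criterion from healthy runs, then screen a suspect node.
random.seed(0)
healthy = [100 + random.gauss(0, 2) for _ in range(1000)]  # ms per iteration
bound = learn_threshold(healthy)
print(validate_node(101.0, bound))         # within normal variation -> True
print(validate_node(115.0, bound))         # clear slowdown -> False (flagged)
print(should_validate(0.3, 1000.0, 50.0))  # expected loss 300 > cost 50 -> True
```

The distribution-based criterion is what lets this style of check catch gray failures: a 15% slowdown never trips a binary up/down probe, but it falls far outside the learned distribution of healthy runs.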

The effectiveness of SuperBench is demonstrated by its deployment in the Azure production environment, where it was used to validate hundreds of thousands of GPUs. Through rigorous testing, SuperBench was shown to increase mean time between incidents (MTBI) by up to 22.61x. By reducing the time required for validation and focusing on the most critical components, SuperBench reduced the cost of validation time by 92.07% while increasing users' GPU hours by 4.81x. These impressive results underscore the system's ability to detect and prevent performance issues before they impact end-to-end workloads.

In summary, by detecting and resolving hidden performance degradations early, SuperBench provides a robust solution to the complex challenge of ensuring the continuous and reliable operation of large-scale AI services. The system's ability to detect subtle performance degradations and optimize the validation process makes it an indispensable tool for cloud service providers looking to improve the reliability of their AI infrastructures. With SuperBench, Microsoft has set a new standard for cloud infrastructure maintenance, ensuring that AI workloads can run with minimal disruption and maximum efficiency, helping to maintain high performance standards in a rapidly evolving technological landscape.


Check out the Paper. All credit for this research goes to the researchers of this project.



Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary entrepreneur and engineer, Asif strives to harness the potential of artificial intelligence for the greater good. His latest project is the launch of an artificial intelligence media platform, Marktechpost, known for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable for a wide audience. The platform boasts over 2 million monthly views, underscoring its popularity with readers.