Abstract:
The global IT landscape is undergoing a pivotal shift: Gartner forecasts that over half of enterprise IT expenditure will be allocated to cloud technology by 2025, primarily leveraging cloud-native technologies for workload architecture. In tandem, we are witnessing a surge in the use of observability tools that provide deeper insight into system operations and performance metrics, allowing for proactive optimization and management of cloud applications. This paradigm shift presents a substantial opportunity to enhance automation in cloud infrastructure, focusing on performance, resilience, and cost optimization. While AI offers a promising avenue to spearhead this transformation, existing techniques predominantly concentrate on fixed workloads or hyperscalers. The growing enterprise customer base, which may lack extensive SRE/IT expertise, struggles with the complex orchestration and management of cloud applications, as it has neither the means nor the resources to develop or fine-tune these AI-based automation tools.
In this talk, I argue for democratizing such automation solutions so that they “work for everyone”, thereby providing significant value to all cloud customers. I will describe our work on intelligent root cause identification (RCI), which can serve as a blueprint for democratizing other automation efforts. I will first describe the burden that operationalizing RCI capabilities in the wild places on users, by laying out the requirements and constraints of the problem. Keeping those constraints in mind, I will describe our intelligent RCI framework, which discerns the causes behind observed performance degradation and failures. The framework uses causal and active learning, allowing it to function efficiently out-of-the-box while integrating with application-specific technologies and architectures, thereby providing a significant improvement over contemporary approaches. I will also share insights into productizing our research and demonstrating its value with real customers.
Bio: Saurabh Jha's current research interests include the design and assessment of self-driven autonomous systems such as self-driven Cloud/HPC and vehicles. His work is at the intersection of Machine Learning (with a particular interest in causal and generative models) and Systems (focusing on dependability). He is a Research Staff Member at IBM Research TJ Watson. He received his Ph.D. in 2021 and MS in 2016 in Computer Science from the University of Illinois at Urbana-Champaign (UIUC). He received his B.Tech from VIT University. He has received several awards, most notably the IBM Ph.D. fellowship (2020-2021), SAP Industry Ph.D. scholarship (2014), and the VIT University best outgoing student award (2014). His work has resulted in several best paper awards and honorable mentions in prestigious international conferences such as ACM/IEEE SC 2020, IEEE ISSRE 2020, and ACM HPDC 2013.