Machine Learning Seminar: Fangzhou Wu, "Efficient Capability-Aware LLM Systems: Capability Modeling, Routing, and Load Balancing"
Apr 17, 2026 2:00 - 3:15 pm

- Sponsor: Research Area of Artificial Intelligence
- Speaker: Fangzhou Wu
- Contact: Weixin Chen (weixinc2@illinois.edu)
- Originating Calendar: Siebel School Speakers Calendar
- Abstract: Large language models (LLMs) are increasingly deployed in real-world systems, but growing query volume and model diversity make it challenging to deliver high-quality responses under tight serving budgets. This talk presents a system-level perspective on building efficient capability-aware LLM systems. I will first discuss training-free online multi-LLM routing, where the goal is to assign each query to the model that best balances response quality and inference cost. I will present a training-free routing framework that achieves strong empirical performance while also providing theoretical guarantees. I will then turn to KV cache-aware load balancing, where routing decisions must jointly account for cache reuse and system workload. I will present a unified formulation and show how principled routing and randomized cache-eviction algorithms can substantially improve cache hit rate and latency. Finally, I will introduce Evidence-Calibrated Clustering (ECC), a capability modeling framework that combines prior semantics with posterior capability evidence to form capability-aware query clusters, enabling more accurate query-conditioned model capability inference.
- Bio: Fangzhou Wu is a third-year Ph.D. student at the University of Wisconsin–Madison. His research focuses on developing provably efficient algorithms for accelerating training and inference in foundation models and agents. More broadly, he is interested in bridging theoretical insights and practical system design for modern foundation-model-based applications. His recent work studies how to improve the efficiency and scalability of LLM systems through training-free multi-LLM routing and KV cache-aware load balancing.
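
To illustrate the quality-cost tradeoff at the heart of multi-LLM routing, here is a minimal sketch (not the speaker's framework): each candidate model gets a score of estimated quality minus a cost penalty, and the query goes to the highest-scoring model. The model names, quality estimates, and costs below are invented for illustration.

```python
# Hypothetical cost-aware routing sketch. Quality estimates and per-query
# costs are illustrative placeholders, not figures from the talk.
MODELS = {
    # model name: (estimated response quality, cost per query)
    "small-llm": (0.62, 0.1),
    "mid-llm":   (0.78, 0.4),
    "large-llm": (0.90, 1.0),
}

def route(models, lam):
    """Pick the model maximizing quality minus lam * cost."""
    return max(models, key=lambda m: models[m][0] - lam * models[m][1])

# A tight budget (large lam) pushes traffic to cheaper models;
# a loose budget (small lam) lets quality dominate.
print(route(MODELS, lam=0.5))   # -> mid-llm
print(route(MODELS, lam=0.05))  # -> large-llm
```

In a real system the quality estimate would itself be query-conditioned, which is where capability modeling such as the talk's ECC clustering comes in.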
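
The KV cache-aware load balancing problem described above can also be sketched in a few lines: prefer the replica whose cache holds the longest matching prompt prefix (more KV reuse), but penalize replicas with long queues. The replica state and the linear scoring rule are assumptions for illustration, not the unified formulation from the talk.

```python
# Hypothetical sketch of KV cache-aware load balancing: score each replica
# by cached-prefix reuse minus a load penalty, then route to the best one.
def shared_prefix_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_replica(prompt_tokens, replicas, alpha=1.0):
    """Score = reusable cached tokens - alpha * current queue length."""
    def score(r):
        reuse = max((shared_prefix_len(prompt_tokens, p)
                     for p in r["cached_prefixes"]), default=0)
        return reuse - alpha * r["queue_len"]
    return max(replicas, key=score)

replicas = [
    {"name": "gpu-0", "cached_prefixes": [[1, 2, 3, 4]], "queue_len": 5},
    {"name": "gpu-1", "cached_prefixes": [[1, 2]],       "queue_len": 0},
]
# gpu-0 reuses 4 tokens but is busy (score 4 - 5 = -1);
# gpu-1 reuses only 2 but is idle (score 2 - 0 = 2).
print(pick_replica([1, 2, 3, 4, 5], replicas)["name"])  # -> gpu-1
```

The point of the example is the tension the talk highlights: greedy cache affinity alone can overload hot replicas, so routing must weigh reuse against workload jointly.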