Research Seminars @ Illinois

Tailored for undergraduate researchers, this calendar is a curated list of research seminars at the University of Illinois. Explore the diverse world of research and expand your knowledge through engaging sessions designed to inspire and enlighten.

To have your events added to or removed from this calendar, please contact OUR at ugresearch@illinois.edu.

Machine Learning Seminar: Fangzhou Wu, "Efficient Capability-Aware LLM Systems: Capability Modeling, Routing, and Load Balancing"

Apr 17, 2026, 2:00 – 3:15 pm
Sponsor: Research Area of Artificial Intelligence
Speaker: Fangzhou Wu
Contact: Weixin Chen
E-Mail: weixinc2@illinois.edu
Originating Calendar: Siebel School Speakers Calendar
Abstract: Large language models (LLMs) are increasingly deployed in real-world systems, but growing query volume and model diversity make it challenging to deliver high-quality responses under tight serving budgets. This talk presents a system-level perspective on building efficient capability-aware LLM systems. I will first discuss training-free online multi-LLM routing, where the goal is to assign each query to the model that best balances response quality and inference cost. I will present a training-free routing framework that achieves strong empirical performance while also providing theoretical guarantees. I will then turn to KV cache-aware load balancing, where routing decisions must jointly account for cache reuse and system workload. I will present a unified formulation and show how principled routing and randomized cache-eviction algorithms can substantially improve cache hit rate and latency. Finally, I will introduce Evidence-Calibrated Clustering (ECC), a capability modeling framework that combines prior semantics with posterior capability evidence to form capability-aware query clusters, enabling more accurate query-conditioned model capability inference. 
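The quality-versus-cost trade-off behind multi-LLM routing can be sketched with a toy example. The scoring rule below (an estimated quality minus a weighted cost penalty) and the model names are illustrative assumptions, not the framework presented in the talk:

```python
# Toy illustration of cost-aware routing between two hypothetical models.
# The scoring rule and quality/cost numbers are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class Model:
    name: str
    quality: float   # estimated response quality for this query (0-1)
    cost: float      # relative inference cost per query

def route(candidates, cost_weight=0.5):
    """Pick the model with the best quality-minus-cost trade-off."""
    return max(candidates, key=lambda m: m.quality - cost_weight * m.cost)

models = [
    Model("small-llm", quality=0.70, cost=0.1),
    Model("large-llm", quality=0.90, cost=1.0),
]

# With a moderate cost penalty the cheaper model wins:
# 0.70 - 0.5*0.1 = 0.65  vs  0.90 - 0.5*1.0 = 0.40
print(route(models).name)  # -> small-llm
```

Lowering the cost penalty to zero flips the decision to the larger model, which is the kind of budget-dependent behavior an online routing system must manage at scale.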

Bio: Fangzhou Wu is a third-year Ph.D. student at the University of Wisconsin–Madison. His research focuses on developing provably efficient algorithms for accelerating training and inference in foundation models and agents. More broadly, he is interested in bridging theoretical insights and practical system design for modern foundation-model-based applications. His recent work studies how to improve the efficiency and scalability of LLM systems through training-free multi-LLM routing and KV cache-aware load balancing.