COLLOQUIUM: Manling Li, "Mechanistic Science of Multimodal Models"

Feb 18, 2026   3:30 pm  
HYBRID: 2405 Siebel Center for Computer Science or online
Sponsor
Siebel School of Computing and Data Science
Originating Calendar
Siebel School Colloquium Series

Zoom: https://illinois.zoom.us/j/87333811015?pwd=GQaYckAikXihkxhqjLHO9bfA9vlHO8.1

Refreshments Provided.

Abstract: 
Multimodal alignment is often treated as a black-box training recipe. We argue that the key question is mechanistic: how does cross-modal alignment form inside the model, and when does it preserve geometry versus collapse into language-driven shortcuts? We study this as a representation learning problem: understanding how alignment emerges in internal states, shaping it with geometry-structured priors, and controlling it so it remains reliable. First, we open up multimodal models to probe the internal mechanisms of alignment and to intervene when geometry-relevant structure is preserved or lost. Second, we introduce VAGEN, which frames multimodal representation learning as multi-stage prior injection: injecting vision and map/world-model priors so that latent states become structured abstractions that support "think as a map", not just token-level matching. Finally, we present ODE-Steer, which controls alignment by steering internal activations into a safe subregion defined by control barrier functions, within which representations and reasoning remain controllable. True multimodal intelligence requires more than aligning tokens; it requires aligning the internal physics of the model with the geometry of the world.
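The abstract does not spell out how ODE-Steer applies control barrier functions, so the following is only a generic, illustrative sketch of the underlying idea: given an activation vector and a proposed update, adjust the update by the smallest amount needed so that a barrier function h (with h >= 0 defining the safe region) satisfies a discrete-time decay condition. The function name `cbf_safe_step`, the quadratic ball-shaped barrier, and the decay rate `alpha` are all assumptions made for this sketch, not details of the speaker's method.

```python
import numpy as np

def cbf_safe_step(x, dx, h, grad_h, alpha=0.5):
    """Adjust a proposed update dx so the barrier h stays nonnegative.

    Enforces the discrete-time CBF condition h(x + dx) >= (1 - alpha) * h(x).
    If the raw update violates it, shift dx along grad_h(x) by the amount
    given by a first-order (linearized) correction; this is the closed-form
    solution for a single constraint, not a full CBF quadratic program.
    """
    target = (1.0 - alpha) * h(x)
    if h(x + dx) >= target:
        return dx  # already satisfies the condition; leave untouched
    g = grad_h(x)
    # linearize: h(x + dx + lam*g) ~ h(x) + g.(dx) + lam * (g.g), solve for lam
    lam = (target - h(x) - g @ dx) / (g @ g)
    return dx + max(lam, 0.0) * g

# toy barrier: the "safe subregion" is a ball of radius 3 around the origin
r = 3.0
h = lambda x: r**2 - x @ x
grad_h = lambda x: -2.0 * x

x = np.array([2.0, 2.0])    # h(x) = 9 - 8 = 1 > 0: currently safe
dx = np.array([2.0, 2.0])   # raw update would leave the safe ball
dx_safe = cbf_safe_step(x, dx, h, grad_h)
assert h(x + dx_safe) >= 0  # steered update keeps the state inside the barrier
```

Because the correction is linearized, a curved barrier can still be slightly overshot; a practical implementation would iterate the correction or solve the exact quadratic, but one step suffices to show the steering idea.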

Bio:


Part of the Siebel School Speakers Series. Faculty Host: Heng Ji


Meeting ID: 873 3381 1015 
Passcode: csillinois


If accommodation is required, please email <erink@illinois.edu> or <communications@cs.illinois.edu>. Someone from our staff will contact you to discuss your specific needs.
