COLLOQUIUM: Manling Li, "Mechanistic Science of Multimodal Models"

- Sponsor: Siebel School of Computing and Data Science
- Originating Calendar: Siebel School Colloquium Series
Zoom: https://illinois.zoom.us/j/87333811015?pwd=GQaYckAikXihkxhqjLHO9bfA9vlHO8.1
Refreshments Provided.
Abstract:
Multimodal alignment is often treated as a black-box training recipe. We argue that the key question is mechanistic: how does cross-modal alignment form inside the model, and when does it preserve geometry versus collapse into language-driven shortcuts? We study alignment as a representation learning problem: understand how it emerges in internal states, shape it with geometry-structured priors, and control it so that it remains reliable. First, we open up multimodal models to probe the internal mechanisms of alignment and to intervene when geometry-relevant structure is preserved or lost. Second, we introduce VAGEN, which frames multimodal representation learning as multi-stage prior injection: injecting vision and map/world-model priors so that latent states become structured abstractions that support “think as a map” reasoning, not just token-level matching. Finally, we present ODE-Steer, which controls alignment by steering internal activations into a safe subregion defined by control barrier functions, where representations and reasoning remain controllable. True multimodal intelligence requires more than aligning tokens; it requires aligning the internal physics of the model with the geometry of the world.
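The abstract names control barrier functions as the mechanism by which ODE-Steer keeps internal activations inside a safe subregion. The talk's actual formulation is not given here; as a purely illustrative sketch, the Python below assumes a linear barrier h(x) = w·x + b whose nonnegative region plays the role of the safe subregion, and applies the minimal-norm correction that returns a violating activation to that region. The function name cbf_project and the linear-barrier choice are assumptions for illustration, not the speaker's method.

```python
import numpy as np

def cbf_project(x, w, b, margin=0.0):
    """Minimally shift activation x into the half-space {x : w @ x + b >= margin}.

    Illustrative only: with a linear control barrier function h(x) = w @ x + b,
    the safe subregion is a half-space, so the minimal-norm correction is a
    closed-form projection onto its boundary.
    """
    h = float(w @ x + b)
    if h >= margin:
        return x  # already inside the safe subregion; leave it unchanged
    # Move along w just far enough that h(x) = margin.
    return x + (margin - h) * w / float(w @ w)

# Example: a 4-d activation that violates the barrier gets nudged back.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
w = np.array([1.0, 0.0, 0.0, 0.0])
b = -10.0  # h(x) = x[0] - 10, so x is unsafe unless x[0] >= 10
x_safe = cbf_project(x, w, b)
assert float(w @ x_safe + b) >= -1e-9
```

Under this toy assumption the correction is exact in one step; a nonlinear barrier would instead call for iterative steering of the activation trajectory.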
Part of the Siebel School Speakers Series. Faculty Host: Heng Ji
Meeting ID: 873 3381 1015
Passcode: csillinois
If accommodation is required, please email <erink@illinois.edu> or <communications@cs.illinois.edu>. Someone from our staff will contact you to discuss your specific needs.
