Grainger College of Engineering, All Events

Computer Vision Seminar Series: Boqing Gong, "BabyVLM: Democratizing Research on the Pretraining of Vision Large Language Models"

May 1, 2026, 4:00-5:00 pm
0216 Siebel Center
Sponsor: Illinois Computer Vision
Speaker: Dr. Boqing Gong
Contact: Yao Xiao
Email: yaox11@illinois.edu
Originating Calendar: Siebel School Speakers Calendar

Abstract: Pretraining vision foundation models (VFMs) is prohibitively expensive. This makes it a privilege of institutions with abundant resources and leaves independent researchers with downstream tasks such as benchmarking, interpreting, and aligning VFMs. This situation is a crisis for computer vision research. As Richard Feynman wrote, “What I cannot create, I do not understand.” Independent researchers and the public cannot gain true understanding of, trust in, or safe use of VFMs passively, from open weights or APIs alone. Meanwhile, without nurturing from the broader research community, the few privileged VFM creators could soon reach a plateau.

Hence, we propose democratizing VFM pretraining by scaling it down to a developmentally plausible framework that is scientifically reasonable and computationally affordable on university budgets. The aim is to promote exploration rather than exploitation in pretraining and to enable independent researchers to build general-purpose VFMs that approach “baby intelligence,” benefiting efforts toward “grown-up” AI. The framework will closely mimic the minimal yet highly informative sensory experiences of human infants and will encompass 1) pretraining data curated from longitudinal, egocentric audiovisual recordings of babies, 2) a suite of developmentally aligned evaluation benchmarks that assess VFM capabilities against cognitive milestones such as object permanence, social skills, and language acquisition, and 3) a user-friendly pretraining codebase and baseline models.

Speaker Bio: Boqing Gong (https://boqinggong.github.io) is a computer science faculty member at Boston University and a part-time research scientist at Google DeepMind. His research in machine learning and computer vision focuses on visual recognition, video, and the generalization and efficiency of AI models.
