Abstract: This talk introduces recent work on multimodal post-training that propels large models from single-turn, text-only reasoning toward agentic systems capable of solving long-horizon tasks with tool use. We highlight three mutually reinforcing pillars: (1) Curating the right data: leveraging compute-intensive, Monte-Carlo-guided sampling to surface the problems that actually teach the model; (2) Learning perceptual skills with the right tasks: using verifiable reinforcement-learning proxy tasks such as ViCrit to instill visual-perception strategies that transfer beyond the training domain; and (3) Building the right infrastructure: asynchronous trainers, service-wrapped tools, and explicit visual-state modeling that let models reason over multi-turn traces. Together, these ingredients transform vision-language models into tool-augmented agents that solve complex multimodal tasks across extended time horizons.
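To make the first pillar concrete, below is a minimal, illustrative sketch (not taken from the talk) of Monte-Carlo-guided data curation: each candidate problem is rolled out several times under the current policy, and only problems whose empirical pass rate falls in an informative band are kept, since problems the model always or never solves provide little training signal. The functions `sample_answer` and `is_correct` are hypothetical stand-ins for a model call and a verifiable checker.

```python
# Illustrative sketch of Monte-Carlo-guided data curation (assumptions noted above).
import random
from typing import Callable, Iterable


def monte_carlo_filter(
    problems: Iterable[dict],
    sample_answer: Callable[[str], str],      # hypothetical: one stochastic model rollout
    is_correct: Callable[[dict, str], bool],  # hypothetical: verifiable answer checker
    num_rollouts: int = 16,
    keep_band: tuple[float, float] = (0.1, 0.9),
) -> list[dict]:
    """Keep problems whose empirical pass rate lies strictly inside `keep_band`."""
    curated = []
    for problem in problems:
        hits = sum(
            is_correct(problem, sample_answer(problem["question"]))
            for _ in range(num_rollouts)
        )
        pass_rate = hits / num_rollouts
        # Discard problems that are already solved (too easy) or never solved (too hard).
        if keep_band[0] < pass_rate < keep_band[1]:
            curated.append({**problem, "pass_rate": pass_rate})
    return curated


if __name__ == "__main__":
    # Toy demo with a random "model" so the script runs standalone.
    toy_problems = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(5)]
    kept = monte_carlo_filter(
        toy_problems,
        sample_answer=lambda q: q.replace("q", "a") if random.random() < 0.5 else "wrong",
        is_correct=lambda p, ans: ans == p["answer"],
    )
    print(f"kept {len(kept)} of {len(toy_problems)} problems")
```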
Bio: Dr. Zhengyuan Yang is a Principal Researcher at Microsoft. He received his PhD from the University of Rochester. His research interests include multimodal foundation models and post-training for long-horizon reasoning and agentic tasks. He has received awards including the ACM SIGMM Award for Outstanding Ph.D. Thesis and the ICPR Best Industry Related Paper Award. He has served on the Organizing Committee for ICME, as Area Chair for ICLR, EMNLP, NAACL, ACMMM, and AAAI, and as Associate/Guest Editor for IEEE TCSVT and TMM. His homepage: https://zyang-ur.github.io