Abstract: The 2025 releases of OpenAI's GPT-4o image generation and Google's "Nano Banana" (Gemini 2.5 Flash Image) marked a paradigm shift: image generation is moving away from dedicated diffusion models (e.g., Stable Diffusion) and toward "omni models"—general-purpose transformers capable of understanding and generating across multiple modalities. This "LLM-native" approach integrates generation directly into the chat context, enabling superior controllability, robust text rendering, and intuitive language-based editing.
This talk provides a technical survey of this emerging field. We will discuss how these systems are typically built by finetuning existing large language models (LLMs) or vision-language models (VLMs) to acquire generative capabilities. We will analyze the core mechanism, in which a single transformer learns to emit visual tokens, often interleaved with text, via next-token prediction. We will then review recent work in this line of research and systematically compare and contrast the key architectural decisions that affect performance.
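For intuition, the sketch below illustrates this next-token recipe in a minimal form: a single causal transformer trained on one sequence that interleaves text tokens with discrete visual tokens (e.g., codes from an image tokenizer such as a VQ-VAE). All model sizes, vocabulary sizes, and the toy data are illustrative assumptions, not the configuration of any specific system discussed in the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 32000        # assumed text vocabulary size
IMAGE_VOCAB = 8192        # assumed visual codebook size (e.g., VQ-VAE codes)
VOCAB = TEXT_VOCAB + IMAGE_VOCAB  # shared vocabulary; image codes are offset past text ids
D_MODEL, N_HEAD, N_LAYER, MAX_LEN = 256, 8, 4, 1024


class OmniDecoder(nn.Module):
    """One causal transformer that emits both text tokens and visual tokens."""

    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEAD, dim_feedforward=4 * D_MODEL,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYER)
        self.lm_head = nn.Linear(D_MODEL, VOCAB, bias=False)

    def forward(self, tokens):          # tokens: (batch, seq_len)
        T = tokens.shape[1]
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: each position may attend only to itself and earlier positions.
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.lm_head(x)          # logits over the shared text+image vocabulary


# Toy training step: a text prompt followed by an image's discrete tokens, both
# supervised with the same next-token cross-entropy objective.
model = OmniDecoder()
text_ids = torch.randint(0, TEXT_VOCAB, (2, 16))        # stand-in for a tokenized prompt
image_ids = torch.randint(TEXT_VOCAB, VOCAB, (2, 64))   # stand-in for tokenized images
seq = torch.cat([text_ids, image_ids], dim=1)
logits = model(seq[:, :-1])                             # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
print(f"next-token loss over interleaved text+image tokens: {loss.item():.3f}")
```

At inference time, the same model simply continues a text prompt by sampling visual tokens autoregressively, which are then decoded back into pixels by the image tokenizer's decoder; this is what allows generation and editing to live inside the chat context.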
Bio: Haoxiang Wang is a Research Scientist at Luma AI. Prior to joining Luma, he was a Research Scientist at NVIDIA, where he worked on world models and vision-language models. Haoxiang completed his Ph.D. in Electrical and Computer Engineering at UIUC in 2024 under the supervision of Prof. Han Zhao and Prof. Bo Li. His Ph.D. research spanned several areas of machine learning, including RLHF, out-of-distribution (OOD) generalization, and multi-task learning. During his studies, he also interned at Apple, Amazon, and Waymo.