We look forward to seeing you online on Thursday, 9/14.
Abstract: Given only a single picture, people can infer a mental representation that encodes rich information about the underlying 3D scene. We acquire this skill not through massive labeled datasets of 3D scenes, but through self-supervised observation and interaction. Building machines that can infer similarly rich neural scene representations is critical if they are to one day parallel people's ability to understand, navigate, and interact with their surroundings. In my talk, I will discuss how this motivates a 3D approach to self-supervised learning for vision. I will then present recent advances from my research group toward training self-supervised scene representation learning methods at scale, on uncurated video without pre-computed camera poses. I will further present recent advances toward modeling uncertainty in 3D scenes, as well as progress on endowing neural scene representations with more semantic, high-level information.
Bio: Vincent Sitzmann is an Assistant Professor at MIT EECS, where he leads the Scene Representation Group. Previously, he completed his Ph.D. at Stanford University and a postdoc at MIT CSAIL. His research interest lies in neural scene representations - the ways we can let neural networks learn to reconstruct the state of 3D scenes from vision. His goal is to allow independent agents to reason about our world given visual observations, such as inferring a complete model of a scene with information on geometry, materials, lighting, etc. from only a few observations, a task that is simple for humans but currently impossible for AI.