We look forward to seeing you online on Thursday, April 20 for the Vision Speaker Series.
Abstract: Highly structured models of the world are a key characteristic of human perception and reasoning. The prevailing approach for training vision systems has been to distill our visual understanding into models via supervision (e.g., class labels, bounding boxes, segmentation masks). Though this paradigm has been effective, it has limitations. Collecting annotations at scale for visual tasks such as inferring the geometry of objects, predicting object affordances, or estimating depth requires trained annotators and specialized equipment. Ideally, models could learn primarily through observation and interaction, as humans do. In this talk, I will discuss research on world models and my recent ICLR work on learning compositional representations from unlabeled video.
Bio: Matt is a fourth-year PhD student at the University of Washington, advised by Ali Farhadi. He is interested in learning with limited supervision and in understanding human visual representations of the world. Prior to his PhD, he studied math and physics at Cornell, where he worked with Bharath Hariharan.