Today’s perception systems excel at naming things in third-person Internet photos or videos, which purposefully convey a visual scene or moment. In contrast, first-person or “egocentric” perception requires understanding the multi-modal video that streams to a person’s (or robot’s) wearable camera. While video from an always-on wearable camera lacks the curation of an intentional photographer, it does provide a special window into the camera wearer’s attention, goals, and interactions with people and objects in her environment. These factors make first-person video an exciting avenue for the future of perception in augmented reality and robot learning. Motivated by this setting, I will present our recent work on first-person video. First, we explore learning visual affordances to anticipate how objects and spaces can be used. We show how to transform egocentric video into a human-centric topological map of a physical space (such as a kitchen) that captures its primary zones of interaction and the activities they support. Moving down to the object level, we develop video anticipation models that localize interaction “hotspots” indicating how and where an object can be manipulated (e.g., pressable or toggleable). Towards translating these affordances into robot action, we prime reinforcement learning agents to prefer human-like interactions, thereby accelerating their task learning. Turning to audio-visual sensing, we attempt to extract a conversation partner’s speech from competing background sounds or other human speakers. Finally, I will briefly preview the multi-institution Ego4D project, which later this year will release a massive egocentric video dataset with more than 1,000 hours of unscripted daily-life video captured in eight countries around the world.
Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin and a Research Scientist at Facebook AI Research (FAIR). Her research in computer vision and machine learning focuses on visual recognition, video, and embodied perception. Before joining UT-Austin in 2007, she received her Ph.D. from MIT. She is an IEEE Fellow, AAAI Fellow, Sloan Fellow, and recipient of the 2013 Computers and Thought Award. She and her collaborators have been recognized with several Best Paper awards in computer vision, including a 2011 Marr Prize and a 2017 Helmholtz Prize (test of time award). She currently serves as an Associate Editor-in-Chief for PAMI and previously served as a Program Chair of CVPR 2015 and NeurIPS 2018.