Meeting ID: 862 3823 3298
Title: Adapting CLIP For Phrase Localization Without Further Training
Abstract: Supervised or weakly supervised methods for phrase localization (textual grounding) rely either on human annotations or on other supervised models, e.g., object detectors. Obtaining these annotations is labor-intensive and difficult to scale in practice. In this talk, we will discuss my recent work on adapting CLIP for phrase localization. In its original form, CLIP outputs only an image-level embedding without any spatial resolution. To address this, we developed a method to extract spatial features from CLIP and utilize these features for zero-shot phrase localization.
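To give a rough sense of the zero-shot recipe described above (this is an illustrative sketch with placeholder shapes and random stand-in features, not the talk's actual implementation): once per-patch spatial features have been extracted from CLIP, localization reduces to scoring each spatial position against the phrase's text embedding, with no further training.

```python
import numpy as np

# Placeholder dimensions (assumptions): a ViT-style backbone on a 224x224 image
# might yield a 7x7 grid of patch features, each 512-d in CLIP's joint space.
GRID, DIM = 7, 512
rng = np.random.default_rng(0)

# Random stand-ins for what CLIP would provide: spatial (per-patch) image
# features and a single text embedding for the query phrase.
patch_feats = rng.standard_normal((GRID, GRID, DIM))
text_emb = rng.standard_normal(DIM)

def localization_heatmap(patch_feats, text_emb):
    """Cosine similarity between each spatial feature and the phrase embedding."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return p @ t  # (GRID, GRID) score map in [-1, 1]

heatmap = localization_heatmap(patch_feats, text_emb)

# The highest-scoring patch is the coarse zero-shot localization; in practice
# the score map would be upsampled to image resolution.
row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
```

Because both modalities live in CLIP's shared embedding space, cosine similarity is a natural scoring function; the hard part, which the talk addresses, is obtaining spatial features from a model that natively produces only one image-level vector.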
Bio: Raymond Yeh is a Research Assistant Professor at the Toyota Technological Institute at Chicago (TTIC). He received his PhD in 2021 from the University of Illinois at Urbana-Champaign (UIUC), advised by Prof. Alexander Schwing, Prof. Minh Do, and Prof. Mark Hasegawa-Johnson. Previously, he completed his B.S. and M.S. degrees in Electrical Engineering, also at UIUC. He has also interned at Google AI and Johns Hopkins University. He is a recipient of the Google PhD Fellowship, the Mavis Future Faculty Fellowship, and the Henry Ford II Scholarship.
Raymond’s research is at the intersection of machine learning and computer vision. Specifically, he focuses on developing algorithms to learn effective and explainable models across several domains, including audio, vision, language, and multi-agent systems. In Fall 2022, he will join Purdue University as an Assistant Professor in Computer Science.