Link to Talk Video: https://mediaspace.illinois.edu/media/t/1_fpdaxvsj
Abstract: Recently, the Vision Transformer (ViT) delivered a compelling message that "large scale (supervised) training trumps inductive bias." However, the supervised training of ViT is problematic for at least two reasons. First, it rules out the unlabeled portion of visual data, which dwarfs the labeled portion in almost every sense. As a result, the supervised training strategy can produce biased systems that require even more labeled data to correct their biases and match the underlying distribution of the labeled and unlabeled visual data combined. Second, this strategy fundamentally limits ViT's application scope because collecting and labeling massive data is costly and time-consuming in many domains.
This talk will present our recent works that reduce ViT's dependency on massive labeled data. We first study self-supervised, multimodal pre-training of Transformers that take as input raw RGB frames of internet videos, audio waveforms, and text transcripts of the speech audio. After watching 15 years of unlabeled video, the Transformers achieve new records on human activity recognition benchmarks and competitive results on ImageNet. Moreover, we investigate ViT through the lens of loss geometry and find extremely sharp local minima in converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve ViT's accuracy and robustness. The resulting ViT outperforms ResNets of similar size and throughput when trained from scratch on ImageNet, without large-scale pre-training or strong data augmentations.
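For readers unfamiliar with the sharpness-aware optimizer mentioned above, it follows a simple two-step update: perturb the weights toward the locally sharpest direction, then descend using the gradient computed at the perturbed point. Below is a minimal NumPy sketch on a toy quadratic loss; the function names, hyperparameters, and loss are illustrative assumptions, not the speaker's actual training setup.

```python
import numpy as np

def loss(w):
    # Toy quadratic loss used only to illustrate the update rule.
    return 0.5 * np.sum(w ** 2)

def grad(w):
    # Gradient of the quadratic loss above.
    return w

def sam_step(w, lr=0.1, rho=0.05):
    g = grad(w)
    # Step 1: perturb weights toward the locally "sharpest" direction,
    # scaled to a ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Step 2: compute the gradient at the perturbed point and descend
    # from the original weights with it.
    g_adv = grad(w + eps)
    return w - lr * g_adv

w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w)
```

Because the second gradient is taken at the worst-case nearby point, the update penalizes sharp minima and steers the weights toward flatter regions of the loss surface, which is the smoothness the abstract refers to.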
Bio: Boqing Gong is a staff research scientist at Google, Seattle. His research in machine learning and computer vision focuses on label-efficient learning and the visual analytics of objects, scenes, human activities, and their attributes. Before joining Google in 2019, he worked at Tencent and was a tenure-track Assistant Professor at the University of Central Florida (UCF). He received an NSF CRII award in 2016 and an NSF BIGDATA award in 2017, both of which were the first of their kind ever granted to UCF. He has served as a (senior) area chair for CVPR, ICCV, ECCV, NeurIPS, ICML, ICLR, AISTATS, and AAAI. He is also a tutorial co-chair for CVPR 2022 and a program co-chair for WACV 2023. Boqing earned a Ph.D. degree in 2015 at the University of Southern California, where the Viterbi Fellowship partially supported his work.
Part of the Illinois Computer Science Speakers Series. Faculty Host: Bo Li