Abstract: The big data boom in recent years covers a wide spectrum of heterogeneous data types, from text to image, video, speech, and multimedia. Most of the valuable information in such "big data" is encoded in natural language, which makes it accessible to some people -- for example, those who can read that particular language -- but much less amenable to computer processing beyond a simple keyword search.
Information Extraction (IE) on a massive scale aims to extract structured facts from a wide spectrum of heterogeneous unstructured data types. Traditional IE techniques are limited to a certain source X (X = a particular language, domain, limited number of pre-defined fact types, single data modality, ...). When moving from X to a new source Y, we need to start from scratch again by annotating a substantial amount of training data and developing Y-specific extraction capabilities. In this talk, I will present a new Universal IE paradigm, based on share-and-transfer, that combines the merits of traditional IE (high quality and fine granularity) and Open IE (high scalability). This framework is able to discover schemas and extract facts from any input data in any domain, without any annotated training data, by integrating distributional semantics and symbolic semantics. It can also be extended to hundreds of languages, thousands of fact types, and multiple data modalities by constructing a multilingual, multimedia, multi-task common semantic space and then performing zero-shot transfer learning across sources.
Bio: Heng Ji is a professor in the Computer Science Department of the University of Illinois at Urbana-Champaign. She received her B.A. and M.A. in Computational Linguistics from Tsinghua University, and her M.S. and Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially Information Extraction and Knowledge Base Population. She was selected as a "Young Scientist" and as a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. Her awards include the "AI's 10 to Watch" Award from IEEE Intelligent Systems in 2013 and the NSF CAREER award in 2009. She has coordinated the NIST TAC Knowledge Base Population task since 2010. She is an associate editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing, and has served as Program Committee Co-Chair of many conferences, including NAACL-HLT 2018.