Abstract: The big data boom of recent years covers a wide spectrum of heterogeneous data types, from text to images, video, speech, and multimedia. Most of the valuable information in such "big data" is encoded in natural language, which makes it accessible to people who can read that particular language, but much less amenable to computer processing beyond simple keyword search. My focused research area, cross-source Information Extraction (IE) on a massive scale, aims to extract accurate, concise, and trustworthy information embedded in big data from heterogeneous sources, and thus to create the next generation of information access, in which humans can communicate with computers in any natural language.
Traditional IE approaches relied heavily on pre-defined schemas and substantial amounts of clean manual annotations for training, and thus were limited to a particular domain, genre, language, and data modality. In this talk, I will present a new Universal IE paradigm that combines the merits of traditional IE (high quality and fine granularity) and Open IE (high scalability). By combining distributional semantics and symbolic semantics, this framework is able to discover schemas and extract facts from any input data in any domain, without any annotated training data or predefined schema. I will describe how to construct a common semantic space that scales massively across thousands of languages, multiple data modalities (text, images, and videos), and sources. This common space is capable of representing knowledge elements at all scales, from atomic entities to structured relations and events, conducting inference, and transferring knowledge across sources. With this new paradigm, IE techniques can, for the first time, be extended from a dozen knowledge types to thousands of types, from a few dominant languages to thousands of low-resource languages, and from text to multiple data modalities including images and videos, with higher quality and lower cost. The resulting systems have achieved top performance at various NIST international research evaluations for a decade and have been selected for DARPA, ARL, AFRL, DTRA, and FBI demos and transitions.
Bio: Heng Ji is the Edward P. Hamilton Development Chair Professor in Computer Science at Rensselaer Polytechnic Institute. She received her Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially Information Extraction and Knowledge Base Population. She was selected as a "Young Scientist" and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. She received the "AI's 10 to Watch" Award from IEEE Intelligent Systems in 2013, an NSF CAREER Award in 2009, Google Research Awards in 2009 and 2014, IBM Watson Faculty Awards in 2012 and 2014, and Bosch Research Awards in 2015, 2016, and 2017. She has coordinated the NIST TAC Knowledge Base Population task since 2010, led the DARPA DEFT TinkerBell team consisting of seven universities (Columbia, Cornell, JHU, RPI, Stanford, UIUC, and UPenn), and led the ARL knowledge network construction task performed by RPI, UIUC, and USC. She is currently serving as a Program Committee Co-Chair of NAACL 2018. Her research has been widely supported by the U.S. government (NSF, DARPA, ARL, IARPA, AFRL, and DHS) and by industry (Bosch, Google, IBM, and Disney). http://nlp.cs.rpi.edu/hengji.html