Computer Science Speaker Series Master Calendar


Dong Deng: "Data Curation at Scale"

Event Type
Department of Computer Science
2405 Thomas M. Siebel Center for Computer Science
Mar 8, 2018   10:00 am  
Dr. Dong Deng, Postdoctoral Fellow, MIT CSAIL
Lisa Yanello
Originating Calendar
Computer Science Speakers Calendar

Abstract: Data curation (ingest, transformation, cleaning, schema mapping, deduplication, and consolidation) of raw data sets consumes up to 80% of a data scientist’s time. Integrating silos of enterprise data is also a major challenge for business users. To address these issues, we have built an end-to-end data curation system, Data Civilizer, in cooperation with the Qatar Computing Research Institute.

In this talk, I will briefly introduce the Data Civilizer system. Then I will discuss two of the components that I have constructed. First, I will discuss entity consolidation in Data Civilizer. This module accepts a collection of clusters of records thought to represent the same entity (i.e., duplicates) and merges each cluster into a single “golden” record. Next, I will show how to address the key challenges in enabling scalable entity matching in Data Civilizer. Finally, I will conclude with my vision for data curation for end users and for massive data lake management.
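
To make the input and output of entity consolidation concrete, the following is a minimal sketch of merging clusters of duplicate records into golden records. It assumes a simple per-attribute majority vote, which is a common baseline and not necessarily the method used in Data Civilizer. The function names (golden_record, consolidate) and the sample records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def golden_record(cluster):
    """Merge one cluster of duplicate records (dicts) into a single record.

    Assumption: the most frequent non-empty value wins for each attribute.
    This is an illustrative baseline, not Data Civilizer's algorithm.
    """
    # Collect attribute names in the order they first appear in the cluster.
    attributes = []
    for record in cluster:
        for key in record:
            if key not in attributes:
                attributes.append(key)

    merged = {}
    for key in attributes:
        # Ignore records where the attribute is missing or empty.
        values = [r[key] for r in cluster if r.get(key) not in (None, "")]
        merged[key] = Counter(values).most_common(1)[0][0] if values else None
    return merged


def consolidate(clusters):
    """Produce one golden record per cluster."""
    return [golden_record(cluster) for cluster in clusters]


# Hypothetical sample data: one cluster of three records for the same entity.
sample_clusters = [
    [
        {"name": "J. Smith", "city": "Boston"},
        {"name": "John Smith", "city": "Boston"},
        {"name": "John Smith", "city": ""},
    ]
]
golden = consolidate(sample_clusters)
```

Majority voting fails when the correct value appears in a minority of records or in several formats, which is one reason consolidation is harder in practice than this sketch suggests.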

Bio: Dong Deng is a postdoctoral associate in the Computer Science and Artificial Intelligence Lab (CSAIL) at MIT, where he works with Prof. Michael Stonebraker and Prof. Samuel Madden. He is interested in data management and data science, with a special focus on tackling the theoretical and system-building challenges in data curation. Dong obtained his PhD from Tsinghua University in 2016 with the highest doctoral dissertation award. He has also received scholarships from the Siebel Foundation, Google, Microsoft, Intel, and the Boeing Company, and he publishes regularly in top venues including SIGMOD, PVLDB, and ICDE.
