With the enormous amount of data on the Web and in other information repositories, multiple data sources often provide conflicting information about the same entity. Consequently, a major challenge for data quality control is to derive the most complete and accurate integrated information from diverse and sometimes conflicting sources. We call this challenge the truth finding problem. We observe that some sources are generally more reliable than others, and therefore a good model of source quality is the key to solving the truth finding problem.
In this talk, we give an overview of recent work on truth finding, with an emphasis on two probabilistic models we recently proposed: LTM (Latent Truth Model) and GTM (Gaussian Truth Model). LTM is a probabilistic graphical model that automatically infers true records and source quality without any supervision. In contrast to previous methods, LTM handles multi-valued attribute types and uses a generative process with two types of errors (false positives and false negatives), which lets it model two different aspects of source quality. The method is also scalable, owing to an efficient sampling-based inference algorithm that needs very few iterations in practice and has linear time complexity, with an even faster incremental variant. Experiments on two real-world datasets show that LTM outperforms existing state-of-the-art approaches to the truth finding problem. GTM addresses truth finding over numerical data such as prices, weather, census, polls, and economic statistics. It is built upon a Bayesian probabilistic model and exploits the characteristics of numerical data in a principled way when modeling the dependencies among source quality, truth, and claimed values. Experiments on two real-world datasets show that GTM also outperforms existing state-of-the-art approaches.
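To make the idea of two-sided source quality concrete, the following is a minimal sketch of a simplified, EM-style iteration over multi-valued claims. It is not LTM itself, whose inference is sampling-based over a graphical model; it only illustrates how separate estimates of a source's sensitivity (few false negatives) and false positive rate can be traded off against each other. The function name `find_truth`, the claim format, the smoothing constant, the assumption that a source covering an entity implicitly denies any candidate value it omits, and the toy book/author data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def find_truth(claims, iterations=20, smooth=1.0):
    """Estimate fact beliefs and two-sided source quality (illustrative only).

    claims maps source -> entity -> set of asserted values (hypothetical format).
    A source that covers an entity but omits a candidate value is treated
    as implicitly denying that value.
    """
    # Candidate facts: every (entity, value) pair asserted by any source.
    candidates = defaultdict(set)
    for by_entity in claims.values():
        for entity, values in by_entity.items():
            candidates[entity] |= set(values)

    # For each fact, record (source, asserted?) for sources covering the entity.
    observations = {
        (entity, value): [
            (source, value in by_entity[entity])
            for source, by_entity in claims.items()
            if entity in by_entity
        ]
        for entity, values in candidates.items()
        for value in values
    }

    # Start from plain voting: the share of covering sources asserting the fact.
    belief = {
        fact: sum(asserted for _, asserted in obs) / len(obs)
        for fact, obs in observations.items()
    }

    sensitivity, false_positive_rate = {}, {}
    for _ in range(iterations):
        # Re-estimate each source's quality from the current soft truth labels.
        hit = defaultdict(float)          # asserted facts, weighted by P(true)
        true_mass = defaultdict(float)    # covered facts, weighted by P(true)
        false_alarm = defaultdict(float)  # asserted facts, weighted by P(false)
        false_mass = defaultdict(float)   # covered facts, weighted by P(false)
        for fact, obs in observations.items():
            p = belief[fact]
            for source, asserted in obs:
                true_mass[source] += p
                false_mass[source] += 1.0 - p
                if asserted:
                    hit[source] += p
                    false_alarm[source] += 1.0 - p
        for source in claims:
            # Additive smoothing keeps both rates strictly inside (0, 1).
            sensitivity[source] = (hit[source] + smooth) / (
                true_mass[source] + 2 * smooth)
            false_positive_rate[source] = (false_alarm[source] + smooth) / (
                false_mass[source] + 2 * smooth)

        # Re-estimate each fact's belief from source quality (prior odds 1:1).
        for fact, obs in observations.items():
            log_odds = 0.0
            for source, asserted in obs:
                sens, fpr = sensitivity[source], false_positive_rate[source]
                if asserted:
                    log_odds += math.log(sens / fpr)
                else:
                    log_odds += math.log((1.0 - sens) / (1.0 - fpr))
            belief[fact] = 1.0 / (1.0 + math.exp(-log_odds))

    return belief, sensitivity, false_positive_rate


# Hypothetical toy data: two complete sources, one that lists only the
# first author (false negatives), and one that adds spurious authors
# (false positives).
claims = {
    "complete_a": {"book1": {"Ann", "Bob"}, "book2": {"Cy"}, "book3": {"Dee", "Eve"}},
    "complete_b": {"book1": {"Ann", "Bob"}, "book2": {"Cy"}, "book3": {"Dee", "Eve"}},
    "first_author_only": {"book1": {"Ann"}, "book2": {"Cy"}, "book3": {"Dee"}},
    "noisy": {"book1": {"Ann", "Quinn"}, "book2": {"Cy", "Rae"}, "book3": {"Dee", "Tom"}},
}
belief, sensitivity, false_positive_rate = find_truth(claims)
for fact in sorted(belief):
    print(fact, round(belief[fact], 3))
```

In this toy setting, plain voting leaves second authors such as Bob tied at 50%. Tracking the two error types separately lets the iteration discount the omissions of the first-author-only source without also trusting the extra names supplied by the noisy source.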
Jiawei Han is Abel Bliss Professor of Computer Science at the University of Illinois at Urbana-Champaign. His research covers data mining, information network analysis, database systems, and data warehousing, with over 600 journal and conference publications. He has chaired or served on many program committees of international conferences, including as PC co-chair for the KDD, SDM, and ICDM conferences and as Americas Coordinator for the VLDB conferences. He also served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data and is serving as the Director of the Information Network Academic Research Center supported by the U.S. Army Research Lab. He is a Fellow of ACM and IEEE, and received the 2004 ACM SIGKDD Innovations Award, the 2005 IEEE Computer Society Technical Achievement Award, the 2009 IEEE Computer Society Wallace McDowell Award, and the 2011 Daniel C. Drucker Eminent Faculty Award at UIUC. His book "Data Mining: Concepts and Techniques" is used as a textbook worldwide.
ITI is a campus-wide interdisciplinary unit of the University of Illinois at Urbana-Champaign, led by the College of Engineering, that is fostering excellence in information trust and security. Participating units include, among others, the College of Applied Health Sciences; the College of Business; the College of Engineering; the College of Law; the College of Liberal Arts and Sciences; the Department of Aerospace Engineering; the Department of Agricultural and Biological Engineering; the Department of Computer Science; the Coordinated Science Laboratory; the Department of Electrical and Computer Engineering; the Department of Industrial & Enterprise Systems Engineering; and the National Center for Supercomputing Applications.
FOR MORE INFORMATION: http://www.iti.illinois.edu/