Abstract: Joins are expensive, especially on large data and/or multiple relations. One promising approach in mitigating their high costs is to just return a simple random sample of the full join results, which is sufficient for many tasks. Indeed, in as early as 1999, Chaudhuri et al. posed the problem of sampling over joins as a fundamental challenge in large database systems. They also pointed out a fundamental barrier for this problem, that the sampling operator cannot be pushed through a join, i.e., sample(R) ▷◁sample(S) does not yield sample(R ▷◁ S). To overcome this barrier, they used precomputed statistics to guide the sampling process, but only showed how this works for two-relation joins.
This talk revisits this classic problem for both acyclic and cyclic multi-way joins. We demonstrate how to perform sampling over joins effectively and efficiently for different use-case scenarios. We start with a novel technique that is able to produce independent but non-uniform samples, and show its usefulness for the classic problem of online aggregation. We then extend this technique and we propose a general framework for random sampling over multi-way joins (a random sample consists of samples that are both independent and uniform). We demonstrate the usefulness of the proposed technique for different applications, including interactive analytics and learning over joins. We analyze the properties of different instantiations of our framework and evaluate them against the baseline methods.
Bio: Feifei Li is currently an associate professor at the School of Computing, University of Utah. He obtained his Bachelor's degree from Nanyang Technological University transferred from Tsinghua University) in 2001 and PhD from Boston University in 2007. His research focuses on improving the scalability, efficiency, and effectiveness of data analytics and large-scale data management systems. He also works on data security problems in these systems. He was a recipient for a NSF career award in 2011, two HP IRP awards in 2011 and 2012 respectively, a Google App Engine award in 2013, IEEE ICDE best paper award in 2004, IEEE ICDE 10+ Years Most Influential Paper Award in 2014, a Google Faculty award in 2015, SIGMOD Best Demonstration Award in SIGMOD 2015, SIGMOD 2016 Best Paper Award, SIGMOD Research Highlight Award in 2017, and a VISA research faculty award in 2017. He is a member of the SIGMOD Jim Gray Dissertation Award selection committee in 2017, a member of the CIKM 2017 best paper award selection committee, a PC area chair for SIGMOD 2018, SIGMOD 2015, and ICDE 2014, the demo PC co-chair for SIGMOD 2018, VLDB 2014, and the general co-chair for SIGMOD 2014. He currently serves as an associate editor for both ACM TODS and IEEE TKDE.