Abstract: Synthetic data produced by various (generative) machine learning algorithms is increasingly being peddled as a cutting-edge technology that enables the release of "truly anonymous" datasets (cit.), while preserving the interesting statistical properties of real data. If you're planning to attend this talk, you probably already know (or have guessed) that's not quite right. On the other hand, training generative models that satisfy Differential Privacy (DP) can provably bound the privacy leakage from the models and from the synthetic datasets they produce.
In this talk, we'll attempt to provide a realistic overview of the status quo in synthetic data (mainly tabular data), focusing on:
- Privacy attacks against synthetic data
- Utility-privacy tradeoffs in DP synthetic data
- The disproportionate impact of DP techniques on underrepresented groups
- Why and how to audit DP synthetic data
- How (not) to build end-to-end DP pipelines for synthetic data
- Why you should not provide privacy guarantees using similarity-based metrics (even though lots of companies do)
We'll end with a discussion of open research problems and (hopefully) exciting items for future work. P.S. This talk is almost entirely based on work led by my students Georgi, Sundar, Bristena, and Luca.
Bio: Emiliano De Cristofaro is a Professor at University of California, Riverside. He's affiliated with the Department of Computer Science and Engineering and the RAISE Institute. From 2013 to 2023, he was at UCL, where he served as Director of the Academic Center of Excellence in Cyber Security Research (ACE-CSR) and Head of the Information Security Research Group (ISRG). Emiliano earned his PhD from UC Irvine in 2011 and worked as a Research Scientist at Xerox PARC from 2011 to 2013. His research has been published at top-tier conferences in security, machine learning, measurement, the web, and computational social science.
With his co-authors, he has received distinguished paper awards from IEEE S&P, NDSS, CCS, IMC, CSCW, ICWSM, and WebSci, as well as the Data Protection by Design Award from the Catalan Data Protection Authority, and was a runner-up for the CSAW Applied Research Competitions and the INRIA-CNIL Privacy Protection Award. In 2022 and 2024, he achieved the top-4 security "grand slam" (ikr?).