Baharan Mirzasoleiman Electrical and Computer Engineering Seminar
- Event Type
- Seminar/Symposium
- Sponsor
- Electrical and Computer Engineering
- Location
- B02 CSL Auditorium & Zoom
- Date
- Mar 24, 2025 10:00 - 11:00 am
- Speaker
- Dr. Baharan Mirzasoleiman, University of California, Los Angeles
- Contact
- Angie Ellis
- amellis@illinois.edu
- Phone
- 217-300-1910
- Originating Calendar
- Illinois ECE Calendar
Electrical and Computer Engineering Seminar
Baharan Mirzasoleiman
Assistant Professor, University of California, Los Angeles
Monday, March 24, 2025, 10:00-11:00 am
B02 CSL Auditorium or Online via Zoom
Title: Data-efficient Training of Foundation Machine Learning Models
Abstract: Large datasets have been crucial to the success of foundation machine learning models. However, training on massive data has two major limitations. First, it requires exceptionally large and expensive computational resources and incurs substantial costs from energy consumption. Second, because real-world datasets are highly imbalanced and noisy, training on the entire dataset does not yield optimal performance.
In this talk, I will argue that we can address the above limitations by developing techniques that identify and extract representative subsets from massive datasets. Training on representative subsets not only reduces the substantial costs of learning from big data, but also improves model accuracy and robustness. I will present two theoretically rigorous approaches to finding smaller subsets of examples that improve the performance and efficiency of training foundation models, such as Vision-Language Models (VLMs) and Large Language Models (LLMs). First, I will discuss how we can formulate an optimization problem to find smaller subsets of large image-text data to efficiently pretrain VLMs such as CLIP. Then, I'll discuss how we can formulate and extract smaller subsets of language data that considerably improve the performance and efficiency of fine-tuning and pretraining LLMs. I'll conclude each part by showing empirical results confirming the effectiveness of the above data selection strategies.
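To make the idea of data selection concrete, the toy sketch below picks a subset of image-text pairs by their embedding alignment scores. This is an illustrative simplification, not the speaker's actual method: the approaches in the talk solve richer optimization problems (e.g., over coverage and diversity of the subset), and the random embeddings here stand in for real CLIP-style encoders.

```python
import numpy as np

def select_subset(image_emb, text_emb, k):
    """Pick the k image-text pairs with the highest cosine alignment.

    Illustrative only: principled subset-selection methods optimize
    objectives beyond per-example scores (coverage, diversity, etc.).
    """
    # Normalize rows so the dot product of each pair equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    scores = np.sum(img * txt, axis=1)      # per-pair alignment score
    return np.argsort(scores)[::-1][:k]     # indices of the top-k pairs

# Synthetic stand-in for encoder outputs: text embeddings are noisy copies
# of image embeddings, with noise growing along the dataset, so earlier
# pairs are better aligned than later ones.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(100, 32))
noise = rng.normal(size=(100, 32)) * np.linspace(0.1, 5.0, 100)[:, None]
text_emb = image_emb + noise

subset = select_subset(image_emb, text_emb, k=10)
```

Training then proceeds on only the selected `subset` rather than all 100 pairs, trading a small selection cost for a much smaller training set.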
Baharan Mirzasoleiman is an Assistant Professor in the Computer Science Department at UCLA, where she leads the BigML research group. Her research aims to improve the sustainability, reliability, and efficiency of machine learning. Before joining UCLA, Baharan was a postdoctoral research fellow in Computer Science at Stanford University. She received her Ph.D. in Computer Science from ETH Zurich, where she received an ETH medal for Outstanding Doctoral Thesis. She has received an NSF CAREER Award, an Okawa Research Award, a UCLA Hellman Fellows Award, and multiple faculty awards from Amazon, Optum AI, and Cisco. She was also named a Rising Star in EECS by MIT.