Baharan Mirzasoleiman Electrical and Computer Engineering Seminar
- Event Type
- Seminar/Symposium
- Sponsor
- Electrical and Computer Engineering
- Location
- B02 CSL Auditorium & Zoom
- Date
- Mar 24, 2025 10:00 - 11:00 am
- Speaker
- Dr. Baharan Mirzasoleiman, University of California, Los Angeles
- Contact
- Angie Ellis
- amellis@illinois.edu
- Phone
- 217-300-1910
- Originating Calendar
- Illinois ECE Calendar
Electrical and Computer Engineering Seminar
Baharan Mirzasoleiman
Assistant Professor, University of California, Los Angeles
Monday, March 24, 2025, 10:00-11:00 am
B02 CSL Auditorium or Online via Zoom
Title: Data-efficient Training of Foundation Machine Learning Models
Abstract: Large datasets have been crucial to the success of foundation machine learning models. However, training on massive data has two major limitations. First, it requires exceptionally large and expensive computational resources and incurs substantial costs from energy consumption. Second, because real-world datasets are highly imbalanced and noisy, training on the entire dataset does not yield optimal performance.
In this talk, I will argue that we can address the above limitations by developing techniques that identify and extract representative subsets from massive datasets. Training on representative subsets not only reduces the substantial costs of learning from big data, but also improves model accuracy and robustness. I will present two theoretically rigorous approaches to finding smaller subsets of examples that improve the performance and efficiency of training foundation models, such as Vision-Language Models (VLMs) and Large Language Models (LLMs). First, I will discuss how we can formulate an optimization problem to find smaller subsets of large image-text data to efficiently pretrain VLMs such as CLIP. Then, I'll discuss how we can formulate and extract smaller subsets of language data that considerably improve the performance and efficiency of fine-tuning and pretraining LLMs. I'll conclude each part by showing empirical results confirming the effectiveness of the above data selection strategies.
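To make the idea of data selection concrete, the toy sketch below picks a subset of image-text pairs by their embedding alignment scores. This is an illustrative simplification, not the speaker's actual method: the approaches in the talk solve richer optimization problems (e.g., over coverage and diversity of the subset), and the random embeddings here stand in for real CLIP-style encoders.

```python
import numpy as np

def select_subset(image_emb, text_emb, k):
    """Pick the k image-text pairs with the highest cosine alignment.

    Illustrative only: principled subset-selection methods optimize
    objectives beyond per-example scores (coverage, diversity, etc.).
    """
    # Normalize rows so the dot product of each pair equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    scores = np.sum(img * txt, axis=1)      # per-pair alignment score
    return np.argsort(scores)[::-1][:k]     # indices of the top-k pairs

# Synthetic stand-in for encoder outputs: text embeddings are noisy copies
# of image embeddings, with noise growing along the dataset, so earlier
# pairs are better aligned than later ones.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(100, 32))
noise = rng.normal(size=(100, 32)) * np.linspace(0.1, 5.0, 100)[:, None]
text_emb = image_emb + noise

subset = select_subset(image_emb, text_emb, k=10)
```

Training then proceeds on only the selected `subset` rather than all 100 pairs, trading a small selection cost for a much smaller training set.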
Baharan Mirzasoleiman is an Assistant Professor in the Computer Science Department at UCLA, where she leads the BigML research group. Her research aims to improve the sustainability, reliability, and efficiency of machine learning. Before joining UCLA, Baharan was a postdoctoral research fellow in Computer Science at Stanford University. She received her Ph.D. in Computer Science from ETH Zurich, where she received an ETH medal for Outstanding Doctoral Thesis. She has received an NSF CAREER Award, an Okawa Research Award, a UCLA Hellman Fellows Award, and multiple faculty awards from Amazon, Optum AI, and Cisco. She was also named a Rising Star in EECS by MIT.