The Center for Artificial Intelligence Innovation (CAII) at NCSA is organizing online training sessions throughout the Spring 2023 semester to help users get started with deep learning projects on HAL. These sessions are designed for novice users who want to learn about the system and start building deep neural network models. To sign up, request a HAL account before the training session and mention "spring training" when describing how the system will be used in your project. Trainings take place every Wednesday during the spring semester from 3 to 5 pm via Zoom.
Training Link: https://go.ncsa.illinois.edu/CAIIHALTraining
February 15, 2023: Distributed Data Parallel Model Training in PyTorch - Shirui Luo
Training Overview:
This tutorial walks through distributed data parallel training in PyTorch using DistributedDataParallel (DDP). We will start with a simple non-distributed training job and end by deploying a training job across several GPUs in a single HAL node. Along the way, you will learn how to use DDP to accelerate your model training. You will also learn how to monitor GPU status so that you can profile your code and make full use of the available GPU computing power.
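For readers new to the idea, the following is a minimal conceptual sketch of what data parallel training does, written in plain Python rather than PyTorch. It is not the tutorial's code and does not use the DDP API: the function names (`shard_indices`, `local_gradient`, `all_reduce_mean`, `train`) and the toy model y = w * x are assumptions made purely for illustration. Each simulated worker takes a disjoint shard of the data, computes a gradient on its shard, and the gradients are averaged so that every replica applies the same update. In PyTorch, these roles are played roughly by `DistributedSampler` and `DistributedDataParallel`.

```python
# Conceptual sketch of data parallel training in plain Python (no PyTorch).
# All names below are hypothetical and exist only for this illustration.

def shard_indices(num_samples, world_size, rank):
    """Give each worker (rank) a disjoint, strided slice of the dataset."""
    return list(range(rank, num_samples, world_size))

def local_gradient(w, xs, ys, indices):
    """Gradient of the mean squared error for y = w * x on one shard."""
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)

def all_reduce_mean(values):
    """Stand-in for the collective that averages gradients across workers."""
    return sum(values) / len(values)

def train(xs, ys, world_size, lr=0.01, steps=200):
    """Simulate synchronous data parallel SGD with `world_size` workers."""
    w = 0.0  # every replica starts from the same weight
    for _ in range(steps):
        # Each worker computes a gradient on its own shard only.
        grads = [
            local_gradient(w, xs, ys, shard_indices(len(xs), world_size, rank))
            for rank in range(world_size)
        ]
        # Averaging keeps all replicas identical after the update.
        w -= lr * all_reduce_mean(grads)
    return w

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]  # data generated with true weight 3.0

w_single = train(xs, ys, world_size=1)  # non-distributed baseline
w_multi = train(xs, ys, world_size=4)   # four simulated workers
print(w_single, w_multi)
```

Because the shards here are equal in size, the average of the per-worker gradients matches the gradient over the full dataset, so the simulated four-worker run follows the same trajectory as the single-worker run up to floating-point rounding. The speedup in real DDP comes from the workers computing their shard gradients at the same time on separate GPUs.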
Sessions will be recorded and made available on the CAII website after the training.