
- Sponsor
- NCSA - Center for Artificial Intelligence Innovation
- Speaker
- Priyam Mazumdar
- Contact
- Shannon Bradley
- sbrad77@illinois.edu
- Phone
- 217-778-0887
- Originating Calendar
- Center for Artificial Intelligence Innovation
The Center for Artificial Intelligence Innovation is hosting a new hands‑on training series this Spring, “GPU Programming with Triton: From NumPy to Flash Attention.” This multi‑week workshop introduces participants to Triton, an open‑source language and compiler that makes it possible to write custom GPU kernels with a Pythonic feel. Triton bridges the gap between familiar NumPy‑style operations and the high‑performance kernels used in modern deep learning systems, including techniques like Flash Attention.
Across the series, attendees will learn how Triton enables fine‑grained control over GPU execution while remaining far more approachable than CUDA. By the end, participants will understand how to move from simple array operations to optimized kernels that power state‑of‑the‑art transformer models.
Don’t worry about needing lots of prerequisites: each kernel we write in Triton will be developed in two steps. First we will write the pseudocode in NumPy, then translate it into the Triton kernel. That way, the only prerequisite for anyone interested is NumPy!
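To give a flavor of that two-step workflow, here is a minimal sketch of a blockwise vector sum, first as NumPy pseudocode and then as a Triton kernel. It is illustrative only, not taken from the workshop materials: the block size, function names, and the atomic-add accumulation strategy are assumptions made for this example.

```python
import numpy as np
import torch
import triton
import triton.language as tl

# Step 1: NumPy "pseudocode" -- sum a vector block by block.
def vector_sum_numpy(x: np.ndarray, block_size: int = 1024) -> float:
    total = 0.0
    for start in range(0, x.size, block_size):
        total += np.sum(x[start:start + block_size])
    return total

# Step 2: the same idea as a Triton kernel -- one program per block; each program
# loads its block, sums it, and atomically accumulates into the output.
@triton.jit
def vector_sum_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which block this program owns
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    tl.atomic_add(out_ptr, tl.sum(x, axis=0))          # add this block's partial sum

def vector_sum(x: torch.Tensor) -> torch.Tensor:
    # Assumes a contiguous float32 CUDA tensor so the atomic add is supported.
    out = torch.zeros(1, device=x.device, dtype=torch.float32)
    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
    vector_sum_kernel[grid](x, out, x.numel(), BLOCK_SIZE=1024)
    return out
```

On a CUDA device, `vector_sum(torch.rand(1_000_000, device="cuda"))` should agree with `torch.sum` up to floating-point accumulation order.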
The seminar will be taught by Priyam Mazumdar, a PhD student in Electrical and Computer Engineering and a researcher at the National Center for Supercomputing Applications (NCSA) at the University of Illinois.
Schedule & Format
Tuesdays, 3–5 PM, beginning February 17, 2026, and running through April 7, 2026
Location: University of Illinois Urbana‑Champaign, ECE Building (room TBA)
Format: Hybrid, in person and via Zoom (link forthcoming)
Lesson Outline for 8 Sessions
- GPU Programming Fundamentals
- An overview of GPU vs. CPU architectures, why Triton exists, and how it compares to CUDA, NumPy, and Numba.
- Writing our First Kernel: Vector Sum
- Introduction to the CUDA execution model, including grid and block structure, pointer-based memory access, and a simple vector summation kernel.
- Transitioning from elementwise kernels to blockwise computation, with a focus on performance implications and scheduling costs.
- Matrix Multiplication
- Matrix multiplication as the core operation underlying most deep learning workloads, implemented step by step in Triton (a NumPy sketch of the blockwise idea appears after this outline).
- Matrix Multiplication with Cache Optimizations
- Techniques for improving performance on large matrix multiplications, including cache-aware strategies to approach cuBLAS-level efficiency.
- Fused Softmax
- Understanding GPU memory overhead and how kernel fusion can significantly improve performance.
- Fused Online Softmax
- Computing softmax iteratively in a single streaming pass (online softmax), a technique commonly used when the data is too large to process at once; see the NumPy sketch after this outline.
- Flash Attention
- A deep dive into FlashAttention, which underpins most modern LLMs. We will reinterpret the attention mechanism through the lens of online softmax and complete a full implementation while covering all critical details.
- Flash Attention (continued)
- The end goal is to match the performance of PyTorch’s scaled dot-product attention (SDPA) with our own custom kernel!
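For the matrix-multiplication sessions, the blockwise idea referenced above can be previewed in plain NumPy before any Triton is written. The tile sizes and function name below are illustrative choices for this sketch, not the course's actual parameters:

```python
import numpy as np

def blocked_matmul(A: np.ndarray, B: np.ndarray,
                   BLOCK_M: int = 64, BLOCK_N: int = 64, BLOCK_K: int = 32) -> np.ndarray:
    """NumPy pseudocode for a tiled matmul: each (i, j) output tile would be owned
    by one Triton program, which loops over K in BLOCK_K chunks and keeps its
    accumulator in fast on-chip memory instead of re-reading global memory."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, BLOCK_M):
        for j in range(0, N, BLOCK_N):
            acc = np.zeros_like(C[i:i + BLOCK_M, j:j + BLOCK_N])
            for k in range(0, K, BLOCK_K):
                acc += A[i:i + BLOCK_M, k:k + BLOCK_K] @ B[k:k + BLOCK_K, j:j + BLOCK_N]
            C[i:i + BLOCK_M, j:j + BLOCK_N] = acc
    return C

A, B = np.random.rand(256, 128), np.random.rand(128, 192)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

Roughly speaking, in the Triton version each (i, j) tile becomes one program, and the cache-oriented session is largely about how those tiles are ordered and loaded.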
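Similarly, the online-softmax trick behind the fused-online-softmax and Flash Attention sessions can be sketched in NumPy: keep a running maximum and a running sum of exponentials, rescaling the sum whenever the maximum changes. This is only an illustration of the idea, not the workshop's implementation:

```python
import numpy as np

def online_softmax(x: np.ndarray) -> np.ndarray:
    """Single-pass softmax: maintain a running max and a running sum of
    exponentials, rescaling the sum whenever the max changes. Flash Attention
    applies the same update blockwise, so the full attention matrix is never
    materialized at once."""
    running_max, running_sum = -np.inf, 0.0
    for v in x:
        new_max = max(running_max, float(v))
        running_sum = running_sum * np.exp(running_max - new_max) + np.exp(v - new_max)
        running_max = new_max
    return np.exp(x - running_max) / running_sum

x = np.random.randn(16)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
```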
