Deep Learning at Scale Training, March 3-4, 2025
NERSC and NVIDIA are hosting a hybrid, hands-on Deep Learning at Scale training event on March 3-4 in Berkeley, CA. This training will help users explore distributed training for deep learning models on high-performance computing systems, specifically Perlmutter. It focuses on building a large-scale deep learning model for a real scientific application (transformers for weather forecasting) and walks users through three stages: profiling tools and performance optimization on a single GPU; scaling to multiple GPUs and nodes through distributed training with data parallelism, along with tips and techniques for scaling; and advanced parallelization for very large models with model parallelism.
We will provide example code and datasets so that attendees can experiment hands-on with optimized and scalable distributed training of our scientific deep learning model on Perlmutter. Because the event includes hands-on experiments on Perlmutter, attendance will be capped. However, all training material and lecture recordings will be made available after the event. OLCF and ALCF users are welcome to attend. Training accounts will be provided if needed.
Logistics
This event will be hybrid. The onsite location (in B59; see visitor information for more details) and Zoom link details will be shared soon.
Agenda
All times are in Pacific Time. The agenda below is tentative.
Day 1: March 3 | ||
Time | Topic | Presenter |
09:00 - 10:00 | Introduction + Perlmutter Setup | Shashank Subramanian (NERSC) and Steven Farrell (NERSC) |
10:00 - 10:15 | Break | |
10:15 - 11:00 | Deep Learning Performance on a GPU | Josh Romero (NVIDIA) |
11:00 - 12:00 | Hands-on: Profiling and Optimizing GPU Training | Josh Romero (NVIDIA) and NERSC |
12:00 - 13:00 | Discussions | NERSC |
Day 2: March 4 | ||
Time | Topic | Presenter |
09:00 - 09:30 | Scaling with Data Parallelism | Steven Farrell (NERSC) |
09:30 - 10:30 | Hands-on: Data Parallelism | NERSC |
10:30 - 10:45 | Break | |
10:45 - 11:15 | Scaling with Model Parallelism | Shashank Subramanian (NERSC) |
11:15 - 12:15 | Hands-on: Model Parallelism | NERSC |
12:15 - 13:00 | Discussions | NERSC |
Registration
Registration will be on a first-come, first-served basis and will be capped due to the hands-on nature of the event. To attend, please fill out the registration form.