OLCF AI Training Series: AI for Science at Scale - Part 3, Jul 11, 2024


Introduction

Held on July 11, 2024, this session is the third part of the OLCF’s AI for Science at Scale training series and is open to NERSC users.

Training large deep learning models, including large language models, is resource-intensive and requires innovative parallelization and distribution strategies. In earlier workshops, we demonstrated on Frontier how to train a deep learning model in a distributed fashion across multiple GPUs at "small" and "intermediate" scales. In this final part of the training series, we scale up further and demonstrate how to fine-tune pre-trained networks at a larger scale on Frontier. Registered Frontier users can use a system reservation to participate in the hands-on portion of the event.
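As a quick illustration of the idea behind data-parallel distributed training (this sketch is not from the event materials, and the function names are hypothetical): each worker computes gradients on its own shard of the batch, the per-worker gradients are averaged (the "all-reduce" step), and every replica then takes an identical optimizer step.

```python
# Illustrative sketch of data-parallel gradient averaging, simulated on one
# process. In a real job (e.g., PyTorch DDP on Frontier), the averaging is
# performed by an all-reduce collective across GPUs.

def grad(w, x, y):
    # Gradient of the squared error 0.5*(w*x - y)**2 with respect to w.
    return (w * x - y) * x

def data_parallel_step(w, batch, n_workers, lr=0.1):
    # Shard the batch across hypothetical workers.
    shards = [batch[i::n_workers] for i in range(n_workers)]
    # Each worker averages gradients over its local shard.
    local_grads = [
        sum(grad(w, x, y) for x, y in shard) / len(shard)
        for shard in shards if shard
    ]
    # "All-reduce": average the per-worker gradients so all replicas agree.
    g = sum(local_grads) / len(local_grads)
    return w - lr * g

# Toy data following y = 2x, so w should converge toward 2.0.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, batch, n_workers=2)
print(round(w, 3))  # converges toward 2.0
```

The same pattern underlies the frameworks covered in the session; at scale, the interesting engineering is in how the gradient exchange and model state are partitioned across thousands of GPUs.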

Although this training is intended for current Frontier users, all are welcome to register and view the presentation. Additionally, no prior knowledge of Part 1 or 2 is necessary — you are encouraged to register even if you did not attend previous iterations of this series.

How to Apply

Please visit the training event page for registration information.

Agenda

Time (PDT)            Topic                                          Speaker
10:00 am - 10:20 am   Introduction to distributed training of LLMs   Sajal Dash (OLCF, Analytics & AI Methods at Scale)
10:20 am - 10:50 am   Finding the best training strategies for large models   Sajal Dash
10:50 am - 11:20 am   Fine-tuning a pre-trained model                Sajal Dash
11:30 am - 12:00 pm   Hands-on demo using Frontier                   Sajal Dash

Training Materials

Training materials are available on the series GitHub page: https://github.com/olcf/ai-training-series.