Performance Modeling of Foundation AI Models in Science
NERSC engineers published a performance model that lets researchers explore the complex design space of optimizing large transformers in scientific AI. With it, they can analyze how performance responds to the transformer model type, the parallelization strategy used for scaling, and the features of the underlying HPC system, including accelerator and interconnect characteristics (for example, the size of the NVLink or other high-bandwidth domain).
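To give a feel for this kind of sensitivity analysis, here is a minimal Python sketch. It is not the NERSC model: it applies the textbook ring all-reduce cost estimate to a single GPU group and assumes the group runs at NVLink speed only when it fits entirely inside one NVLink domain. The function name `allreduce_time` and every bandwidth and latency value are illustrative assumptions.

```python
def allreduce_time(message_bytes, group_size, nvlink_domain_size,
                   nvlink_bw=300e9, network_bw=25e9, latency=5e-6):
    """Estimate ring all-reduce time in seconds for one GPU group.

    Assumption (not from the paper): if the whole group fits inside one
    NVLink domain the ring runs at NVLink bandwidth; otherwise the slower
    network link bounds every step of the ring.
    All default bandwidths (bytes/s) and the latency (s) are made-up values.
    """
    if group_size < 1 or nvlink_domain_size < 1:
        raise ValueError("sizes must be positive")
    if group_size == 1:
        return 0.0  # a single GPU has nothing to communicate
    bandwidth = nvlink_bw if group_size <= nvlink_domain_size else network_bw
    steps = 2 * (group_size - 1)        # reduce-scatter plus all-gather
    chunk = message_bytes / group_size  # bytes sent per step
    return steps * (latency + chunk / bandwidth)


if __name__ == "__main__":
    # Sweep the NVLink domain size for a fixed 8-GPU group and 1 GB message.
    for domain in (4, 8, 64):
        t = allreduce_time(1e9, group_size=8, nvlink_domain_size=domain)
        print(f"NVLink domain {domain:>2}: {t * 1e3:.2f} ms")
```

Under these assumptions, the estimate changes sharply once the domain is large enough to hold the whole group, which is the sort of system-level effect a fuller model can quantify.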
Using the performance model, the authors identified typical training bottlenecks for transformers at various system and AI model scales. They also showed how subtle changes in the parallelization configuration, such as the order in which GPU groups for different strategies are placed within the NVLink domain, can affect performance. By solving a combinatorial optimization problem, the performance model also reveals the parallelization strategy that minimizes training time. Furthermore, the paper illustrates how transformers used in language modeling and in scientific applications have very different requirements, from needing higher dimensions of parallelization to stressing distinct aspects of the HPC system during pretraining and fine-tuning.
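The search for a best strategy can be pictured with a small brute-force sketch in Python. This is an illustration under stated assumptions, not the published method: it enumerates every way to split a GPU count into data, tensor, and pipeline parallel degrees and picks the one with the lowest cost under a made-up cost function. The names `candidate_configs`, `toy_iteration_time`, and `best_config`, and all constants, are hypothetical, and the sketch ignores the placement order of groups that the paper examines.

```python
def divisors(n):
    """Return all positive divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]


def candidate_configs(num_gpus):
    """Yield every (data, tensor, pipeline) split whose product is num_gpus."""
    for tp in divisors(num_gpus):
        for pp in divisors(num_gpus // tp):
            dp = num_gpus // (tp * pp)
            yield dp, tp, pp


def toy_iteration_time(dp, tp, pp, nvlink_domain=4, microbatches=16):
    """Hypothetical cost in arbitrary units; NOT the published model."""
    compute = 1000.0 / (dp * tp * pp)            # work split across all GPUs
    bubble = compute * (pp - 1) / microbatches   # idle time from pipelining
    # Assumed penalty when a tensor-parallel group spills out of NVLink.
    tp_penalty = 1.0 if tp <= nvlink_domain else 10.0
    tp_comm = 2.0 * tp_penalty * (tp - 1) / tp
    # Gradient all-reduce shrinks as the model is sharded over tp * pp.
    dp_comm = 20.0 * (dp - 1) / dp / (tp * pp)
    return compute + bubble + tp_comm + dp_comm


def best_config(num_gpus, **kwargs):
    """Exhaustively pick the split with the lowest toy cost."""
    return min(candidate_configs(num_gpus),
               key=lambda c: toy_iteration_time(*c, **kwargs))


if __name__ == "__main__":
    for domain in (4, 8):
        dp, tp, pp = best_config(64, nvlink_domain=domain)
        print(f"NVLink domain {domain}: dp={dp}, tp={tp}, pp={pp}")
```

Exhaustive enumeration is only practical here because the toy search space is tiny; a real design space with more parallelization dimensions and placement choices grows much faster.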
To learn more, read the PMBS paper from SC 2024 or see the open-source code for the mechanics of the performance model.
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, NERSC serves almost 10,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.