Data Analytics
Overview
NERSC engages in research and development to ensure that highly scalable, productive data, AI, and analytics tools are available and deployed at NERSC, and to promote the use of cutting-edge data and analytics approaches and technologies in science. Recent activity is outlined below.
Python and Jupyter
Python continues to be popular with users: we measured more than 2000 unique users in 2020. We have also deployed new monitoring at NERSC that captures Python library imports for each user, and we are building infrastructure to analyze this large volume of data and gain insight into how people use Python at NERSC. An additional R&D thrust is Python GPU preparedness: we are working to understand and test various Python GPU frameworks to help our users transition their Python code to Perlmutter. Through this work, we lead the Python HPC community, for example by co-chairing the HPC track at SciPy. Related publications:
- Rollin Thomas, Laurie Stephey, Annette Greiner, and Brandon Cook, 2021, “Monitoring Scientific Python Usage on a Supercomputer,” Proceedings of the 20th Python in Science Conference (SciPy 2021), (https://doi.org/10.25080/majora-1b6fd038-010).
- Daniel Margala, Laurie Stephey, Rollin Thomas, and Stephen Bailey, 2021, “Accelerating Spectroscopic Data Processing Using Python and GPUs on NERSC Supercomputers,” Proceedings of the 20th Python in Science Conference (SciPy 2021), (https://doi.org/10.25080/majora-1b6fd038-004).
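The import monitoring described above can be illustrated with a minimal sketch. It assumes one plausible mechanism: at interpreter exit, check which packages from a watch list ended up in `sys.modules` and emit a record for the current user. The watch list, the record layout, and the function names (`imported_monitored_packages`, `build_record`, `install`) are hypothetical and chosen for illustration; they are not NERSC's actual implementation.

```python
import atexit
import os
import sys

# Hypothetical watch list; the packages actually monitored are an assumption.
MONITORED = frozenset({"numpy", "scipy", "pandas", "mpi4py", "json"})


def imported_monitored_packages(modules=None, monitored=MONITORED):
    """Return the sorted monitored top-level packages found in `modules`.

    `modules` defaults to sys.modules; submodules such as "numpy.linalg"
    are reduced to their top-level package name before matching.
    """
    if modules is None:
        modules = sys.modules
    top_level = {name.split(".")[0] for name in list(modules)}
    return sorted(top_level & set(monitored))


def build_record(packages):
    """Build one usage record for the current user (layout is illustrative)."""
    return {"user": os.environ.get("USER", "unknown"), "packages": list(packages)}


def install(sink=print):
    """Register an exit hook that reports the monitored imports to `sink`."""
    atexit.register(lambda: sink(build_record(imported_monitored_packages())))
```

Calling `install()` early in a session (for example from a site-wide startup file) would report once per interpreter at exit; a real deployment would replace `print` with a sink that forwards records to a collection service.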
Jupyter
The Jupyter interactive environment is enabling a new mode of computing for scientists at NERSC. Scientists love Jupyter because it combines documentation, visualization, data analytics, and code into a document they can share, modify, and even publish.
DAS and the Usable Software Systems group in CRD are partnering to enhance the Jupyter framework for high-performance scientific computing environments and to develop new capabilities in Jupyter for the next generation of scientific analysis. NERSC and CRD develop Jupyter-centric tools collaboratively: we work with scientific partners to build tools that are useful to the broader community and then deploy them for all users at NERSC.
Use cases from scientific collaborations such as NCEM, LCLS, and others provide us with insights into how Jupyter Notebooks can help make HPC more accessible to scientists. These collaborations also identify general patterns for enhancements and tools that the entirety of the NERSC user base can capitalize on. Other topics of the research collaboration include finding ways to leverage Jupyter notebooks, containers, and best practices for software development to address reproducibility in science and HPC. Related publications:
- Rollin Thomas and Shreyas Cholia, 2021, “Interactive Supercomputing with Jupyter,” Computing in Science and Engineering, (https://www.authorea.com/doi/full/10.22541/au.161230518.84458221/v1).
- Shreyas Cholia, Lindsey Heagy, Matthew Henderson, Drew Paine, Jonathan Hays, Ludovico Bianchi, Devarshi Ghoshal, Fernando Pérez, and Lavanya Ramakrishnan, 2020, “Towards Interactive, Reproducible Analytics at Scale on HPC Systems,” IEEE/ACM HPC for Urgent Decision Making, UrgentHPC, (https://doi.ieeecomputersociety.org/10.1109/UrgentHPC51945.2020.00011).
- Matthew Henderson, William Krinsman, Shreyas Cholia, Rollin Thomas, and Trevor Slaton, 2020, “Accelerating Experimental Science Using Jupyter and NERSC HPC,” in Communications in Computer and Information Science, (https://dx.doi.org/10.1007/978-3-030-44728-1_9).
- Dilworth Parkinson, Harinarayan Krishnan, Daniela Ushizima, Matthew Henderson, and Shreyas Cholia, 2020, “Interactive Parallel Workflows for Synchrotron Tomography,” XLOOP 2020, (https://doi.ieeecomputersociety.org/10.1109/XLOOP51963.2020.00010).
Climate Analytics
NERSC has been involved in several climate analytics applications, including “Exascale Deep Learning for Climate Analytics,” which was the first deep learning application to achieve 1 exaflop (FP16) with TensorFlow and Horovod and won the Gordon Bell Prize (Kurth et al., SC18).
ClimateNet
Recently, ClimateNet was launched to bring the power of deep learning to the climate community by creating community-sourced, open-access, expert-labeled datasets and architectures for improved accuracy and performance. Details were presented at ICLR and other venues.
I/O and Data Management
ExaHDF5
The ExaHDF5 project will productize features and techniques prototyped in earlier projects, explore optimization strategies on upcoming architectures, maintain and optimize existing HDF5 features for ECP applications, and release these new features in HDF5 for broad deployment on HPC systems. Focusing on the challenges of exascale I/O, we will develop technologies based on the massively parallel storage hierarchies that are being built into pre-exascale systems. We will enhance HDF5 software to achieve efficient parallel I/O on exascale systems in ways that will impact a large number of DOE science applications.
Proactive Data Containers (PDC)
Moving toward new paradigms for storage systems and I/O (SSIO) in the extreme-scale era, the Proactive Data Containers (PDC) project investigates novel object-based data abstractions and storage mechanisms that take advantage of the deep storage hierarchy and enable proactive, automated performance tuning. To achieve these overarching goals, we propose a fundamentally new data abstraction, the Proactive Data Container. A PDC is a container within a locus of storage (memory, NVRAM, disk, etc.) that stores science data in an object-oriented manner. Managing data as objects enables powerful optimization opportunities for data movement and transformations. In this project, we will research: 1) formulation of object-oriented PDCs and their mapping in different levels of the exascale storage hierarchy; 2) efficient strategies for moving data in deep storage hierarchies using PDCs; 3) techniques for transforming and reorganizing data based on application requirements; and 4) novel analysis paradigms for enabling data transformations and user-defined analysis on data in PDCs. The intent of our research is to move the field of HPC SSIO in a direction where it may ultimately be possible to develop scientific applications without the cumbersome and inefficient tuning now needed to optimize data movement on every system the application runs on.
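To make the container abstraction more concrete, the following toy sketch models data objects that live in a named container, carry metadata, and can be moved between storage tiers. It is an in-memory illustration only: the class names (`DataObject`, `Container`), the tier list, and every method are hypothetical and do not reflect the PDC project's actual API or design.

```python
from dataclasses import dataclass, field

# Tiers taken from the examples in the text, ordered fastest to slowest.
TIERS = ("memory", "nvram", "disk")


@dataclass
class DataObject:
    """A science data object with a payload, a storage tier, and metadata."""
    name: str
    payload: bytes
    tier: str = "disk"
    metadata: dict = field(default_factory=dict)


class Container:
    """Toy container: groups objects and tracks which tier each one lives in."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    @staticmethod
    def _check_tier(tier):
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier!r}")

    def put(self, name, payload, tier="disk", **metadata):
        """Store an object in the given tier, with optional metadata."""
        self._check_tier(tier)
        obj = DataObject(name, payload, tier, dict(metadata))
        self._objects[name] = obj
        return obj

    def get(self, name):
        """Look up an object by name."""
        return self._objects[name]

    def move(self, name, tier):
        """Relocate an object to another tier (here only the label changes)."""
        self._check_tier(tier)
        self._objects[name].tier = tier

    def objects_in(self, tier):
        """List the objects currently placed in a tier."""
        self._check_tier(tier)
        return [o for o in self._objects.values() if o.tier == tier]
```

Because placement is a property of the object rather than of application code, a runtime could in principle call something like `move` on the application's behalf, which is the kind of automated tuning the project description aims for.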