Wahid Bhimji
Biographical Sketch
Wahid Bhimji leads NERSC’s Data and AI Services Group. His interests include machine learning and data management. Recently he has led several projects applying AI for science, including deep learning at scale, generative models, and probabilistic programming. He has coordinated aspects of machine learning deployment for the Lab’s Computing Sciences Area and NERSC, including for the Perlmutter HPC system and plans for future NERSC machines. Previously he was user lead for the commissioning of Cori Phase 1, with particular responsibility for data services and the Burst Buffer. Wahid has worked for many years in scientific computing and data analysis in academia and the UK government, and holds a Ph.D. in high-energy particle physics.
Recent relevant presentations
“Deep learning for fundamental sciences using high-performance computing,” O’Reilly AI Conference, September 7, 2018 (website and talk)
“Deep learning for HEP/NP at NERSC,” Machine Learning Seminar, Jefferson Lab, November 6, 2018 (talk)
“Enabling production HEP workflows on Supercomputers at NERSC,” Computing in High-Energy Physics (CHEP 2018), Sofia, July 2018 (talk, conference website)
“Interactive Distributed Deep Learning with Jupyter Notebooks,” Workshop on Interactive High-Performance Computing, ISC 2018, Frankfurt, June 2018 (talk, workshop website)
“Adversarial Neural Networks for Science,” HPC User Forum, Tucson, Arizona, April 2018 (talk, workshop agenda)
“Using MongoDB with supercomputers at NERSC,” Federal MongoDB Briefing, Palo Alto, March 2018 (talk)
“Deep learning for HEP and Cosmology,” Kavli IPMU-Berkeley Symposium, Tokyo, Japan, January 2018 (talk, workshop website)
Recent relevant publications
“Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale,” Baydin, Shao, Bhimji, et al. Accepted at SC19 (Best Paper award finalist). arXiv:1907.03382
“Efficient Probabilistic Inference in the Quest for Physics Beyond the Standard Model,” Baydin et al. Accepted at NeurIPS 2019. arXiv:1807.07706
“Machine Learning Templates for QCD Factorization in the Search for Physics Beyond the Standard Model,” Lin, Bhimji, and Nachman. Published in JHEP 1905 (2019) 181. DOI: 10.1007/JHEP05(2019)181
“Graph Neural Networks for IceCube Signal Classification,” Choma et al. Presented at ICMLA 2018 (Best Paper award). DOI: 10.1109/ICMLA.2018.00064
“CosmoGAN: creating high-fidelity weak lensing convergence maps using Generative Adversarial Networks,” Mustafa et al. Published in Computational Astrophysics and Cosmology 2019, 6:1. DOI: 10.1186/s40668-019-0029-9. arXiv:1706.02390
“Next Generation Generative Neural Networks for HEP,” Farrell and Bhimji. Accepted to EPJ Web of Conferences 214, 09005 (2019). DOI: 10.1051/epjconf/201921409005
“Deep Neural Networks for Physics Analysis on low-level whole-detector data at the LHC,” Bhimji et al. Presented at ACAT. arXiv:1711.03573
Journal Articles
Mustafa Mustafa, Deborah Bard, Wahid Bhimji, Rami Al-Rfou, Zarija Lukić, "Creating Virtual Universes Using Generative Adversarial Networks", submitted to Sci. Rep., June 1, 2017.
Debbie Bard, Wahid Bhimji, David Paul, Glenn K Lockwood, Nicholas J Wright, Katie Antypas, Prabhat, Steve Farrell, Andrey Ovsyannikov, Melissa Romanus, et al., "Experiences with the Burst Buffer at NERSC", Supercomputing Conference, November 16, 2016, LBNL Report LBNL-1007120.
Wahid Bhimji, Debbie Bard, Melissa Romanus, David Paul, Andrey Ovsyannikov, Brian Friesen, Matt Bryson, Joaquin Correa, Glenn K Lockwood, Vakho Tsulaia, et al., "Accelerating science with the NERSC burst buffer early user program", Cray User Group, May 11, 2016, LBNL Report LBNL-1005736.
NVRAM-based Burst Buffers are an important part of the emerging HPC storage landscape. The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory recently installed one of the first Burst Buffer systems as part of its new Cori supercomputer, collaborating with Cray on the development of the DataWarp software. NERSC has a diverse user base of over 6,500 users in 700 different projects spanning a wide variety of scientific computing applications, so the use cases for the Burst Buffer at NERSC are correspondingly numerous and diverse. We describe here performance measurements and lessons learned from the Burst Buffer Early User Program at NERSC, which selected a number of research projects to gain early access to the Burst Buffer and exercise its capability to enable new scientific advances. To the best of our knowledge, this is the first time a Burst Buffer has been stressed at scale by diverse, real user workloads; these lessons will therefore be of considerable benefit in shaping the developing use of Burst Buffers at HPC centers.
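For readers unfamiliar with DataWarp, Burst Buffer space on Cori is requested through #DW directives in a Slurm batch script. The following is a minimal sketch of that pattern, driven from Python; the capacity, time limit, and application name are illustrative assumptions rather than a configuration from the paper.

```python
# Sketch: generate and submit a Slurm batch script that requests a
# per-job DataWarp Burst Buffer allocation on Cori. The capacity,
# time limit, and application name below are illustrative placeholders.
import subprocess

batch_script = """#!/bin/bash
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:30:00
#DW jobdw capacity=200GB access_mode=striped type=scratch

# DataWarp sets $DW_JOB_STRIPED to the job's Burst Buffer mount point.
srun ./my_io_app --output-dir "$DW_JOB_STRIPED"
"""

with open("bb_job.sh", "w") as f:
    f.write(batch_script)

# Submit the job; requires a system with Slurm and DataWarp available.
subprocess.run(["sbatch", "bb_job.sh"], check=True)
```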
Georges Aad et al. (ATLAS Collaboration), "Identification of Boosted, Hadronically Decaying W Bosons and Comparisons with ATLAS Data Taken at √s = 8 TeV", submitted to Eur. Phys. J. C, 2015.
Georges Aad et al. (ATLAS and CMS Collaborations), "Combined Measurement of the Higgs Boson Mass in pp collisions at √s = 7 and 8 TeV with the ATLAS and CMS Experiments", Phys. Rev. Lett., 2015, 114:191803, doi: 10.1103/PhysRevLett.114.191803.
Michela Massimi, Wahid Bhimji, "Computer simulations and experiments: The case of the Higgs boson", Stud. Hist. Philos. Mod. Phys., 2015, 51:71-81, doi: 10.1016/j.shpsb.2015.06.003.
T. Maier, D. Benjamin, W. Bhimji, J. Elmsheuser, P. van Gemmeren, D. Malon, N. Krumnack, "ATLAS I/O performance optimization in as-deployed environments", J. Phys. Conf. Ser., 2015, 664:042033, doi: 10.1088/1742-6596/664/4/042033.
Conference Papers
Lisa Gerhardt, Stephen Simms, David Fox, Kirill Lozinskiy, Wahid Bhimji, Ershaad Basheer, Michael Moore, "Nine Months in the Life of an All-flash File System", Proceedings of the 2024 Cray User Group, May 8, 2024.
NERSC’s Perlmutter scratch file system, an all-flash Lustre storage system running on HPE (Cray) ClusterStor E1000 Storage Systems, has a capacity of 36 petabytes and a theoretical peak performance exceeding 7 terabytes per second across HPE’s Slingshot network fabric. Deploying an all-flash Lustre file system was a leap forward in meeting the diverse I/O needs of NERSC: with over 10,000 users representing over 1,000 different projects spanning multiple disciplines, a file system that could overcome the performance limitations of spinning disk and reduce performance variation was highly desirable. While solid state provided excellent performance gains, challenges remained that required observation and tuning. Working with HPE’s storage team, NERSC staff engaged in an iterative process that increased performance and produced more predictable outcomes. Through the use of IOR and OBDfilter tests, NERSC staff closely monitored the performance of the file system at regular intervals to inform the process and chart progress. This paper documents the results of, and insights derived from, over nine months of NERSC’s continuous performance testing, and provides a comprehensive discussion of the tuning and adjustments made to improve performance.
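As a rough illustration of the continuous testing described in the abstract, the sketch below runs a small IOR write/read probe and extracts the bandwidth figures IOR reports; the task count, transfer and block sizes, and target path are assumptions for illustration, not NERSC's actual test parameters.

```python
# Sketch: run a small IOR write/read probe under Slurm and parse the
# "Max Write"/"Max Read" bandwidth lines from IOR's summary output.
# Task count, sizes, and the target path are illustrative assumptions.
import re
import subprocess

cmd = [
    "srun", "-n", "32",
    "ior",
    "-w", "-r",            # write phase, then read phase
    "-t", "1m",            # transfer size
    "-b", "4g",            # block size per task
    "-F",                  # file-per-process layout
    "-o", "/pscratch/ior_probe/testfile",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# IOR prints summary lines such as "Max Write: 12345.67 MiB/sec ...".
for op in ("Write", "Read"):
    match = re.search(rf"Max {op}:\s+([\d.]+)\s+MiB/sec", out)
    if match:
        print(f"{op} bandwidth: {float(match.group(1)):.1f} MiB/sec")
```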
K. Z. Ibrahim, T. Nguyen, H. Nam, W. Bhimji, S. Farrell, L. Oliker, M. Rowan, N. J. Wright, S. Williams, "Architectural Requirements for Deep Learning Workloads in HPC Environments", 2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), IEEE, November 2021, doi: 10.1109/PMBS54543.2021.00007.
Wahid Bhimji, Debbie Bard, Kaylan Burleigh, Chris Daley, Steve Farrell, Markus Fasel, Brian Friesen, Lisa Gerhardt, Jialin Liu, Peter Nugent, Dave Paul, Jeff Porter, Vakho Tsulaia, "Extreme I/O on HPC for HEP using the Burst Buffer at NERSC", Journal of Physics: Conference Series, December 1, 2017, 898:082015.
Jialin Liu, Quincey Koziol, Houjun Tang, François Tessier, Wahid Bhimji, Brandon Cook, Brian Austin, Suren Byna, Bhupender Thakur, Glenn K. Lockwood, Jack Deslippe, Prabhat, "Understanding the IO Performance Gap Between Cori KNL and Haswell", Proceedings of the 2017 Cray User Group, Redmond, WA, May 10, 2017.
The Cori system at NERSC has two compute partitions with different CPU architectures: a 2,004-node Haswell partition and a 9,688-node KNL partition, which ranked 5th on the November 2016 Top 500 list. The compute partitions share a common storage configuration, and understanding the IO performance gap between them is important not only to NERSC/LBNL users and other national labs, but also to the relevant hardware vendors and software developers. In this paper, we comprehensively analyze single-core and single-node IO performance on the Haswell and KNL partitions and identify the major bottlenecks, which include CPU frequencies and memory copy performance. We also extend our performance tests to multi-node IO and reveal the IO cost differences caused by network latency, buffer size, and communication cost. Overall, we develop a strong understanding of the IO gap between Haswell and KNL nodes, and the lessons learned from this exploration will guide us in designing optimal IO solutions in the many-core era.
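A minimal sketch of the comparison methodology, under the assumption that one reruns an identical single-process IO test while switching the Slurm architecture constraint between the two partitions; the dd-based probe and output path below are illustrative stand-ins for the benchmarks actually used in the paper.

```python
# Sketch: run the same single-process write test on each Cori partition
# by toggling the Slurm -C (constraint) flag. The dd command and the
# scratch path are illustrative stand-ins for the paper's benchmarks.
import subprocess

for arch in ("haswell", "knl"):
    cmd = [
        "srun", "-N", "1", "-n", "1", "-C", arch,
        "dd", "if=/dev/zero", f"of=/global/cscratch1/iotest_{arch}",
        "bs=1M", "count=1024",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # dd prints its throughput summary (e.g. "... 895 MB/s") on stderr.
    lines = result.stderr.strip().splitlines()
    print(arch, lines[-1] if lines else "(no dd output)")
```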
Evan Racah, Seyoon Ko, Peter Sadowski, Wahid Bhimji, Craig Tull, Sang-Yun Oh, Pierre Baldi, Prabhat, "Revealing Fundamental Physics from the Daya Bay Neutrino Experiment using Deep Neural Networks", ICMLA 2016.
Tina Declerck, Katie Antypas, Deborah Bard, Wahid Bhimji, Shane Canon, Shreyas Cholia, Helen (Yun) He, Douglas Jacobsen, Prabhat, Nicholas J. Wright, "Cori - A System to Support Data-Intensive Computing", Cray User Group Meeting 2016, London, England, May 2016.
Mostofa Patwary, Nadathur Satish, Narayanan Sundaram, Jialin Liu, Peter Sadowski, Evan Racah, Suren Byna, Craig Tull, Wahid Bhimji, Prabhat, Pradeep Dubey, "PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures", IPDPS 2016, April 5, 2016.
Presentations/Talks
Tina Declerck, Katie Antypas, Deborah Bard, Wahid Bhimji, Shane Canon, Shreyas Cholia, Helen (Yun) He, Douglas Jacobsen, Prabhat, Nicholas J. Wright, Cori - A System to Support Data-Intensive Computing, Cray User Group Meeting 2016, London, England, May 12, 2016.
Yun (Helen) He, Wahid Bhimji, Cori: User Update, NERSC User Group Meeting, March 24, 2016.
Reports
GK Lockwood, D Hazen, Q Koziol, RS Canon, K Antypas, J Balewski, N Balthaser, W Bhimji, J Botts, J Broughton, TL Butler, GF Butler, R Cheema, C Daley, T Declerck, L Gerhardt, WE Hurlbert, KA Kallback-Rose, S Leak, J Lee, R Lee, J Liu, K Lozinskiy, D Paul, Prabhat, C Snavely, J Srinivasan, T Stone Gibbins, NJ Wright, "Storage 2020: A Vision for the Future of HPC Storage", October 20, 2017, LBNL Report LBNL-2001072.
Posters
Annette Greiner, Evan Racah, Shane Canon, Jialin Liu, Yunjie Liu, Debbie Bard, Lisa Gerhardt, Rollin Thomas, Shreyas Cholia, Jeff Porter, Wahid Bhimji, Quincey Koziol, Prabhat, "Data-Intensive Supercomputing for Science", Berkeley Institute for Data Science (BIDS) Data Science Faire, May 3, 2016.
A review of current DAS activities for a non-NERSC audience.