Stephen Simms
Biographical Sketch
Stephen Simms is a member of the Advanced Technologies Group at NERSC. His work focuses on the exploration and evaluation of computational storage solutions. Before joining NERSC, Simms spent over 20 years working in High Performance Computing at Indiana University, where he founded and managed the High Performance File Systems group. He led two successful bandwidth challenge teams and pioneered the use of the Lustre file system across wide area networks. During that time, he served as project manager for, and co-investigator on, the NSF-funded Data Capacitor project. Later, he served as TeraGrid site lead for Indiana University. Simms has been an active member of the Lustre community, serving as the OpenSFS community board member for several years and eventually helping rewrite the organization’s bylaws to better reflect the needs of the broader user community.
Journal Articles
Michael Kluge, Stephen Simms, Thomas William, Robert Henschel, Andy Georgi, Christian Meyer, Matthias S. Mueller, Craig A. Stewart, Wolfgang Wünsch, Wolfgang E. Nagel, "Performance and quality of service of data and video movement over a 100 Gbps testbed", Future Generation Computer Systems, 2013, 29:230-240, doi: 10.1016/j.future.2012.05.028
Conference Papers
Lisa Gerhardt, Stephen Simms, David Fox, Kirill Lozinskiy, Wahid Bhimji, Ershaad Basheer, Michael Moore, "Nine Months in the Life of an All-flash File System", Proceedings of the 2024 Cray User Group, May 8, 2024.
NERSC’s Perlmutter scratch file system, an all-flash Lustre storage system running on HPE (Cray) ClusterStor E1000 Storage Systems, has a capacity of 36 petabytes and a theoretical peak performance exceeding 7 terabytes per second across HPE’s Slingshot network fabric. Deploying an all-flash Lustre file system was a leap forward in meeting the diverse I/O needs of NERSC. With over 10,000 users representing over 1,000 different projects that span multiple disciplines, a file system that could overcome the performance limitations of spinning disk and reduce performance variation was highly desirable. While solid state provided excellent performance gains, there were still challenges that required observation and tuning. Working with HPE’s storage team, NERSC staff engaged in an iterative process that increased performance and provided more predictable outcomes. Through the use of IOR and OBDfilter tests, NERSC staff were able to closely monitor the performance of the file system at regular intervals to inform the process and chart progress. This paper documents the results of, and reports insights derived from, over nine months of NERSC’s continuous performance testing, and provides a comprehensive discussion of the tuning and adjustments that were made to improve performance.
Francieli Boito, Jim Brandt, Valeria Cardellini, Philip Carns, Florina M. Ciorba, Hilary Egan, Ahmed Eleliemy, Ann Gentile, Thomas Gruber, Jeff Hanson, Utz-Uwe Haus, Kevin Huck, Thomas Ilsche, Thomas Jakobsche, Terry Jones, Sven Karlsson, Abdullah Mueen, Michael Ott, Tapasya Patki, Ivy Peng, Krishnan Raghavan, Stephen Simms, Kathleen Shoga, Michael Showerman, Devesh Tiwari, Torsten Wilde, Keiji Yamamoto, "Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations", Proceedings of the 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops), October 31, 2023, 37-43, doi: 10.1109/CLUSTERWorkshops61457.2023.00016
Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches, which are laborious and error-prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomic computing and self-adaptive systems. By defining and developing such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.