Stephen Simms

Biographical Sketch

Stephen Simms is a member of the Advanced Technologies Group at NERSC. His work focuses on exploring and evaluating computational storage solutions. Before joining NERSC, Simms spent over 20 years working in High Performance Computing at Indiana University, where he founded and managed the High Performance File Systems group. He led two successful bandwidth challenge teams and pioneered the use of the Lustre file system across wide area networks. During that time, he served as project manager for and co-investigator on the NSF-funded Data Capacitor project. Later, he served as TeraGrid site lead for Indiana University. Simms has been an active member of the Lustre community, serving as the OpenSFS community board member for several years and eventually helping rewrite the organization’s bylaws to better reflect the needs of the broader user community.

Journal Articles

Michael Kluge, Stephen Simms, Thomas William, Robert Henschel, Andy Georgi, Christian Meyer, Matthias S. Mueller, Craig A. Stewart, Wolfgang Wünsch, Wolfgang E. Nagel, "Performance and quality of service of data and video movement over a 100 Gbps testbed", Future Generation Computer Systems, 2013, 29:230--240, doi: 10.1016/j.future.2012.05.028

Conference Papers

Lisa Gerhardt, Stephen Simms, David Fox, Kirill Lozinskiy, Wahid Bhimji, Ershaad Basheer, Michael Moore, "Nine Months in the Life of an All-flash File System", Proceedings of the 2024 Cray User Group, May 8, 2024

NERSC’s Perlmutter scratch file system, an all-flash Lustre storage system running on HPE (Cray) ClusterStor E1000 Storage Systems, has a capacity of 36 PetaBytes and a theoretical peak performance exceeding 7 TeraBytes per second across HPE’s Slingshot network fabric. Deploying an all-flash Lustre file system was a leap forward in an attempt to meet the diverse I/O needs of NERSC. With over 10,000 users representing over 1,000 different projects that span multiple disciplines, a file system that could overcome the performance limitations of spinning disk and reduce performance variation was very desirable. While solid state provided excellent performance gains, there were still challenges that required observation and tuning. Working with HPE’s storage team, NERSC staff engaged in an iterative process that increased performance and provided more predictable outcomes. Through the use of IOR and OBDfilter tests, NERSC staff were able to closely monitor the performance of the file system at regular intervals to inform the process and chart progress. This paper will document the results of and report insights derived from over 9 months of NERSC’s continuous performance testing, and provide a comprehensive discussion of the tuning and adjustments that were made to improve performance.

Francieli Boito, Jim Brandt, Valeria Cardellini, Philip Carns, Florina M. Ciorba, Hilary Egan, Ahmed Eleliemy, Ann Gentile, Thomas Gruber, Jeff Hanson, Utz-Uwe Haus, Kevin Huck, Thomas Ilsche, Thomas Jakobsche, Terry Jones, Sven Karlsson, Abdullah Mueen, Michael Ott, Tapasya Patki, Ivy Peng, Krishnan Raghavan, Stephen Simms, Kathleen Shoga, Michael Showerman, Devesh Tiwari, Torsten Wilde, Keiji Yamamoto, "Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations", Proceedings of 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops), October 31, 2023, 37--43, doi: 10.1109/CLUSTERWorkshops61457.2023.00016

Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches which are laborious and error prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomous computing and self-adaptive systems. By defining and developing such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.

Harold E.B. Dennis, Adam S. Ward, Tyler Balson, Yuwei Li, Robert Henschel, Shawn Slavin, Stephen Simms, Holger Brunst, "High Performance Computing Enabled Simulation of the Food-Water-Energy System: Simulation of Intensively Managed Landscapes", PEARC17, New York, NY, USA, Association for Computing Machinery, 2017, 1--10, doi: 10.1145/3093338.3093381

Robert Henschel, Stephen Simms, David Hancock, Scott Michael, Tom Johnson, Nathan Heald, Thomas William, Donald Berry, Matt Allen, Richard Knepper, Matthew Davy, Matthew Link, Craig A. Stewart, "Demonstrating Lustre over a 100Gbps wide area network of 3,500km", SC '12, Washington, DC, USA, IEEE Computer Society Press, 2012, 1--8, doi: 10.1109/SC.2012.43

Scott Michael, Liang Zhen, Robert Henschel, Stephen Simms, Eric Barton, Matthew Link, "A study of Lustre networking over a 100 gigabit wide area network with 50 milliseconds of latency", DIDC '12, New York, NY, USA, Association for Computing Machinery, 2012, 43--52, doi: 10.1145/2286996.2287005

Joshua Walgenbach, Stephen C. Simms, Kit Westneat, Justin P. Miller, "Enabling Lustre WAN for production use on the TeraGrid: a lightweight UID mapping scheme", TG '10, New York, NY, USA, Association for Computing Machinery, 2010, 1--6, doi: 10.1145/1838574.1838593

Scott Michael, Stephen Simms, W. B. Breckenridge, Roger Smith, Matthew Link, "A compelling case for a centralized filesystem on the TeraGrid: enhancing an astrophysical workflow with the Data Capacitor WAN as a test case", TG '10, New York, NY, USA, Association for Computing Machinery, 2010, 1--7, doi: 10.1145/1838574.1838587

Stephen C. Simms, Gregory G. Pike, S. Teige, Bret Hammond, Yu Ma, Larry L. Simms, C. Westneat, Douglas A. Balog, "Empowering distributed workflow with the Data Capacitor: maximizing Lustre performance across the wide area network", SOCP '07, New York, NY, USA, Association for Computing Machinery, 2007, 53--58, doi: 10.1145/1272457.1272465

I. Foster, J. Gieraltowski, S. Gose, N. Maltsev, E. May, A. Rodriguez, D. Sulakhe, A. Vaniachine, J. Shank, S. Youssef, D. Adams, R. Baker, W. Deng, J. Smith, D. Yu, I. Legrand, S. Singh, C. Steenberg, Y. Xia, A. Afaq, E. Berman, J. Annis, L. A. T. Bauerdick, M. Ernst, I. Fisk, L. Giacchetti, G. Graham, A. Heavey, J. Kaiser, N. Kuropatkin, R. Pordes, V. Sekhri, J. Weigand, Y. Wu, K. Baker, L. Sorrillo, J. Huth, M. Allen, L. Grundhoefer, J. Hicks, F. Luehring, S. Peck, R. Quick, S. Simms, G. Fekete, J. vandenBerg, K. Cho, K. Kwon, D. Son, H. Park, S. Canon, K. Jackson, D. E. Konerding, J. Lee, D. Olson, I. Sakrejda, B. Tierney, M. Green, R. Miller, J. Letts, T. Martin, D. Bury, C. Dumitrescu, D. Engh, R. Gardner, M. Mambelli, Y. Smirnov, J. Voeckler, M. Wilde, Y. Zhao, X. Zhao, P. Avery, R. Cavanaugh, B. Kim, C. Prescott, J. Rodriguez, A. Zahn, S. McKee, C. Jordan, J. Prewett, T. Thomas, H. Severini, B. Clifford, E. Deelman, L. Flon, C. Kesselman, G. Mehta, N. Olomu, K. Vahi, K. De, P. McGuigan, M. Sosebee, D. Bradley, P. Couvares, A. De Smet, C. Kireyev, E. Paulson, A. Roy, S. Koranda, B. Moe, B. Brown, P. Sheldon, "The Grid2003 Production Grid: Principles and Practice", IEEE Computer Society, 2004, 236--245, doi: 10.1109/HPDC.2004.36

Peng Wang, George Turner, Daniel A. Lauer, Matthew Allen, Stephen Simms, David Hart, Mary Papakhian, Craig A. Stewart, "LINPACK Performance on a Geographically Distributed Linux Cluster", IPDPS '04, IEEE Computer Society, 2004, 245b--245b, doi: 10.1109/IPDPS.2004.1303301

Craig A. Stewart, Christopher S. Peebles, Mary Papakhian, John Samuel, David Hart, Stephen Simms, "High performance computing: delivering valuable and valued services at colleges and universities", SIGUCCS '01, New York, NY, USA, Association for Computing Machinery, 2001, 266--269, doi: 10.1145/500956.501026

Presentation/Talks

Stephen C. Simms, Matt Davy, Bret Hammond, Matt Link, Craig Stewart, Randall Bramley, Beth Plale, Dennis Gannon, Mu-Hyun Baik, Scott Teige, John Huffman, Rick McMullen, Doug Balog, Greg Pike, "All in a day's work: advancing data-intensive research with the Data Capacitor", SC '06, 2006, 244--es, doi: 10.1145/1188455.1188711

Posters

Stephen C. Simms, Craig A. Stewart, Scott D. McCaulay, "Cyberinfrastructure resources for U.S. Scholarship: the TeraGrid", SIGUCCS '08, 2008, 341--344, doi: 10.1145/1449956.1450057