Connecting Experimental Facilities to HPC
Introduction
NERSC is the production HPC and data user facility for DOE Office of Science research. Many NERSC users also leverage experimental facilities and have built interfaces to facility resources into their workflows. These inter-facility linkages tend to be very project specific: each science team has its own goals and methods. Where possible, this page shares inter-facility methods that have demonstrated success and may be reusable. Teams are encouraged to share best practices for experimental and observational workflows that require HPC for data analysis or simulation. Many factors, including software complexity, team size, and the production demands of the workflow, shape the best choices; some methods to consider are below.
Why and when to use HPC?
Parallel computing can bring computational intensity to problems that can be decomposed into many concurrent compute tasks. When the size or number of compute tasks becomes a matter of scale that limits progress, consider HPC. Parallel computers are not a solution to every problem, but over many decades NERSC has seen them become useful in a widening array of areas.
The total data analytics demand from an instrument (telescope, microscope, beam line, etc.) may at times exceed the computing capabilities near the instrument. Likewise, running a simulation concurrently with an experiment may provide unique opportunities for model comparison. Shortening time to answer can often improve science workflows. HPC requires attention to many factors, some of which generalize and some of which don't.
Whatever the motivation, it is important to identify the I/O and computing rates demanded to see if the problem is suitable for HPC. Small-scale problems may not run faster on a supercomputer. If you are the owner/operator of the instrument in question, consider the operational modes of its users to inform, e.g., how much computing can be done locally versus remotely. The cadence of the computing need and its schedulability are often worth comparing against already established scientific workflows. Building HPC into new or existing scientific workflows is often not easy, but sharing what works can make it easier.
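A quick back-of-envelope estimate of these rates can bound whether HPC is warranted. The sketch below illustrates the arithmetic for a hypothetical detector; all numbers are placeholders, not measurements from any real instrument.

```python
# Back-of-envelope "speeds and feeds" for a hypothetical detector.
# All values are illustrative placeholders; substitute your instrument's numbers.

data_per_shot_gb = 2.0        # GB produced per acquisition
shots_per_hour = 360          # acquisition cadence
core_hours_per_shot = 8.0     # estimated compute per acquisition
turnaround_hours = 1.0        # how quickly results must come back

# Sustained network/filesystem ingest rate in Gb/s
ingest_rate_gbps = data_per_shot_gb * 8 * shots_per_hour / 3600
# Cores required to keep analysis from falling behind the instrument
cores_needed = core_hours_per_shot * shots_per_hour / turnaround_hours

print(f"Sustained ingest rate: {ingest_rate_gbps:.2f} Gb/s")
print(f"Cores needed to keep up: {cores_needed:.0f}")
```

If the resulting rates fit comfortably on local resources, HPC may not be needed; if they exceed them by orders of magnitude, that is a strong signal to start the conversation.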
Think about algorithms and their limits. If your workflow has a provable (or obvious) bottleneck, can you identify it? The venerable matrix-matrix multiply has brought many communities to parallel computing. Well-established and scalable solutions exist for some HPC problems as ready-made software. Knowing what the algorithmic bottleneck is will help establish the size and direction of your next steps. This applies equally to computing and data: scalable molecular dynamics codes and Globus are examples of each that are not a "heavy lift" to operationalize on HPC. If your algorithm is new, that is another sort of interesting discussion. At NERSC we encourage re-use of successful methods and shared effort between facilities in building new infrastructures.
Fig. 1: Some examples of experimental science facilities that already plug into NERSC and why. Widely different science campaigns are integrating HPC methods into their workflows to bring speed and/or increased model fidelity. Repurposing previously successful methods where possible can lower the cost of HPC integration into new science campaigns.
Best Practices in Planning
- Bound whether HPC is a good solution to your problem, or not. Get to know the basic "speeds and feeds" of your particular challenge. Data challenges vary widely in their nature, but some basic questions cut across them: How much data? How fast is the answer needed?
- Evaluate software choices via a process, not by "gut feel". Choosing software is a consequential aspect of workflow design. Software can "go away"; what will you do if it does? How much reliability do you need? How many 9s? Who will run, maintain, and support the software? Who are the other stakeholders in your software with whom you can share costs and burdens over time? Building elaborate software that does not require elaborate maintenance is an enduring challenge.
- Draft a Data Management Plan (DMP) for your project early, make it integral to your project, keep it active, and notify partners and stakeholders when it changes. Many existing DMPs can inspire your own; benefit from what works. Consider the broader data lifecycle.
- Measure the value of your data if you can. Metrics of data utility can be useful within a project to inform, e.g., storage choices, and also in reporting the success of your DMP.
- Cost the stakeholder/participant roles in your DMP. Build (only) software that matches your team's size, skill, and timeframe (how many years will this work last?). Borrowing working software methods is encouraged, but some factors require new effort or cost. Storage, software integration, and algorithm development all have costs.
- When data challenges emerge on your R&D horizon, let people know. NERSC plans new HPC infrastructure years ahead. If you have a long-term workflow with long planning horizons, let us know so we can incorporate your requirements.
Best Practices in Resource Identification
NERSC is not a monolithic resource. Identifying the set of NERSC services, filesystems, queues, etc. best suited to the needs of the workflow can be done in dialogue with NERSC staff. Some NERSC data and computing resources are tiered with respect to capacity or capability. Choosing a filesystem or Slurm QOS from those tiers can be done with composability in mind, allowing more flexible reconfiguration when resources change. A small service that is needed persistently may be better suited to run on Spin, but it can still talk to jobs running in the batch pool. Running in batch is very different from running on a DTN or science gateway node. A detailed examination of, e.g., "where should my data go?" is available to inform tiered filesystem choices. Identifying the right resource is often shaped by questions of "how much data/compute/I/O?" See below.
When resources can be identified systematically and their interfaces programmed, APIs (application programming interfaces) can formalize these interactions and aid automation. NERSC has exposed APIs as well as APIs under development. We'd like to learn about your API and how we can interface resources. RESTful APIs provide a common template for resource exposure.
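As a minimal sketch of this kind of RESTful interaction, the snippet below polls a NERSC system-status endpoint before kicking off downstream work. The base URL, API version, and response fields are assumptions based on the publicly documented Superfacility API; confirm them against the current NERSC API documentation before relying on them.

```python
# Minimal sketch: poll NERSC system status over a RESTful API before launching
# work from an experimental facility. The base URL, API version, and response
# fields below are assumptions; check the current NERSC API documentation.
import requests

API_BASE = "https://api.nersc.gov/api/v1.2"  # assumed base URL/version

def system_is_up(system: str = "perlmutter") -> bool:
    resp = requests.get(f"{API_BASE}/status/{system}", timeout=10)
    resp.raise_for_status()
    return resp.json().get("status") == "active"

if __name__ == "__main__":
    if system_is_up():
        print("Compute system reported active; safe to submit work.")
    else:
        print("Compute system not active; defer or reroute the workflow.")
```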
Best Practices in Data Movement
The end-to-end bandwidth between an experimental facility and NERSC is shaped by many factors. ESnet provides a network backbone capable of high-bandwidth transfers. Linkages between where the data is generated and ESnet's border are valuable areas to study. Data transfer nodes (DTNs) are a templated solution for capable endpoints and are deployed at NERSC and at many experimental facilities. At NERSC, Globus is a widely shared, performant solution for many teams. Best practices in Globus@NERSC are described in technical detail here.
The portal at http://fasterdata.es.net/ provides additional starting points.
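For teams that want to script transfers, the sketch below uses the globus-sdk Python package. The endpoint UUIDs, paths, and token handling are placeholders; a production workflow would obtain tokens through a proper Globus Auth flow (e.g., a registered application with refresh tokens).

```python
# Minimal sketch of a scripted Globus transfer to a NERSC DTN endpoint using
# the globus-sdk package (pip install globus-sdk). Endpoint UUIDs, paths, and
# the access token are placeholders; a production workflow would use a proper
# Globus Auth flow rather than a hard-coded token.
import globus_sdk

ACCESS_TOKEN = "REPLACE_WITH_TRANSFER_ACCESS_TOKEN"
SOURCE_ENDPOINT = "REPLACE-WITH-FACILITY-ENDPOINT-UUID"
DEST_ENDPOINT = "REPLACE-WITH-NERSC-ENDPOINT-UUID"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(ACCESS_TOKEN)
)

tdata = globus_sdk.TransferData(
    source_endpoint=SOURCE_ENDPOINT,
    destination_endpoint=DEST_ENDPOINT,
    label="detector run 1234",
)
# Placeholder source and destination paths
tdata.add_item("/detector/run1234/", "/global/cfs/cdirs/myproject/run1234/",
               recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted Globus transfer, task id:", task["task_id"])
```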
If you are transferring data to NERSC, consider the target filesystem as well. The end-to-end performance will be bounded by the sending and receiving filesystems' capacities to handle the bandwidth. At NERSC we provide tiers of storage suited to different needs. NERSC provides $HOME, $SCRATCH, and other filesystems which have their own inherent bandwidth and IOPS capabilities. Higher-bandwidth resources are more heavily time-shared, and their purge policies reflect this. The "Where should my data go?" doc has more detailed guidance.
Streaming data directly to node memory is possible, in which case filesystems are avoided altogether. Many science campaigns have improved the end-to-end performance of their workflow by minimizing the number of "hops" or copies made of the data as it transits the workflow.
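A minimal sketch of the streaming idea is below: a hypothetical receiver running inside a batch job accepts data over a TCP socket directly into memory. The port and framing are arbitrary illustrative choices; real deployments typically use higher-level messaging tools and must respect NERSC network access policies.

```python
# Minimal sketch of a receiver that streams data directly into the memory of a
# compute node, bypassing the filesystem. The bind address, port, and framing
# are arbitrary choices for illustration; a production workflow would add
# authentication, error handling, and a real processing step.
import socket

HOST, PORT = "0.0.0.0", 5555   # placeholder bind address/port
CHUNK = 1 << 20                # read 1 MiB at a time

def receive_into_memory() -> bytes:
    buf = bytearray()
    with socket.create_server((HOST, PORT)) as srv:
        conn, addr = srv.accept()
        with conn:
            while True:
                chunk = conn.recv(CHUNK)
                if not chunk:          # sender closed the stream
                    break
                buf.extend(chunk)      # data lands in node memory, not on disk
    return bytes(buf)

if __name__ == "__main__":
    data = receive_into_memory()
    print(f"Received {len(data)} bytes directly into memory")
```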
If retrieval latency is not a concern, HPSS provides archival storage in large volume. More information is available at https://docs.nersc.gov/filesystems/.
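HPSS is typically driven with the hsi and htar command-line tools. The snippet below is a sketch that assumes htar is available on the node where it runs (e.g., a login or DTN node) and simply wraps it from Python to bundle a results directory into the archive; the paths are placeholders.

```python
# Minimal sketch: archive a results directory to HPSS by wrapping the htar
# command-line tool. Paths below are placeholders for illustration.
import subprocess

def archive_to_hpss(local_dir: str, hpss_tar_path: str) -> None:
    """Bundle local_dir into a tar file stored in HPSS via htar."""
    subprocess.run(
        ["htar", "-cvf", hpss_tar_path, local_dir],
        check=True,   # raise if htar reports an error
    )

if __name__ == "__main__":
    archive_to_hpss("run1234_results/", "experiments/run1234.tar")
```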
If you are publishing data from NERSC, many of the same principles apply. Additionally, web-based access methods are popular. NERSC provides web services and supports many science gateways for this purpose. https://www.nersc.gov/systems/spin/
Ultimately, the network, hardware, and software "on both sides" will determine how data can move. NERSC and ESnet can help you understand the end-to-end bandwidth in a workflow to identify data bottlenecks between facilities. Software-defined networking options are available to shape the end-to-end path between DTNs.
Best Practices in Queueing
NERSC is a shared resource that runs at high utilization. Queue wait is a common concern. Queues control access to batch nodes and are divided into different qualities of service (QOSs). Understanding the steps in a workflow can help select the QOS that best suits the needs of workflow users/stakeholders. Realtime and interactive options are available, but must be scaled to fit the available resources. A technical survey of queueing methods related to facility workflows is also available.
Consult https://docs.nersc.gov/jobs/policy/ for more information.
Workflows that require an answer from HPC within a given time window can schedule reservations within Slurm to coordinate timings. Get to know the queues best suited to your project's needs and use the longest-latency queue possible. A detailed view of queue wait times by QOS is available at https://my.nersc.gov/.
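As a hypothetical example, a workflow that must return results during a scheduled shift might submit its analysis jobs into a pre-arranged Slurm reservation and fall back to a regular QOS otherwise. The account, reservation, QOS, and script names below are placeholders.

```python
# Minimal sketch: submit an analysis job into a pre-arranged Slurm reservation,
# falling back to a regular QOS when no reservation is active. All names
# (account, reservation, QOS, script) are placeholders.
import subprocess
from typing import Optional

def submit_analysis(reservation: Optional[str] = None) -> str:
    cmd = ["sbatch", "--account=myproject", "--time=00:30:00"]
    if reservation:
        cmd.append(f"--reservation={reservation}")   # time window arranged in advance
    else:
        cmd.append("--qos=regular")                  # default batch QOS
    cmd.append("analyze_shot.sh")
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()   # e.g. "Submitted batch job 1234567"

if __name__ == "__main__":
    print(submit_analysis(reservation="beamtime_2024_06_01"))
```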
Experimental facilities also have queues. Matching the timing of actionable data analysis or simulation to the experimental schedule is often needed. Telescopes that collect data nightly, for example, may coordinate their analysis through a QOS with turnaround under 24 hours. Beamlines that require data analysis "on shift" may select QOSs that serve hour-to-minute job turnover. The key aspect is to marshal the right amount of computing at the right time for the end-to-end research goal.
Outages occur. NERSC is committed to high availability, but, like all facilities, it does take downtimes. Scheduled and unscheduled downtimes on the NERSC side and on the experimental facility side are a reality that can be planned for. Resilient workflows sometimes federate their resources to allow partial operation while one facility is down. Note that when NERSC's batch compute pool is down, the peripheral data services and supports are often kept in operation. This allows, e.g., access to login nodes, Spin, or DTNs despite the main compute resource being unavailable. See resource identification practices above.