CS Chang
FES Requirements Worksheet
1.1. Project Information - Center for Plasma Edge Simulation
Document Prepared By | CS Chang
Project Title | Center for Plasma Edge Simulation
Principal Investigator | CS Chang
Participating Organizations | New York University, ORNL, PPPL, LBNL, MIT, Columbia U., Rutgers U., Lehigh U., Georgia Tech, Auburn U., U. Colorado, U. California at Irvine, Caltech, Hinton Associates
Funding Agencies | DOE SC, DOE NNSA, NSF, NOAA, NIH, Other:
2. Project Summary & Scientific Objectives for the Next 5 Years
Please give a brief description of your project - highlighting its computational aspect - and outline its scientific objectives for the next 3-5 years. Please list one or two specific goals you hope to reach in 5 years.
Further develop the XGC large-scale edge kinetic codes for higher-fidelity simulation of the electromagnetic, multiscale edge physics in ITER. Using the kinetic codes, perform integrated simulations that couple kinetic, MHD, neutral-particle, and atomic physics for a higher-fidelity understanding of the multiscale edge physics and the wall heat load.
3. Current HPC Usage and Methods
3a. Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)
Our current primary codes are the XGC0 and XGC1 particle-in-cell codes. Particle motions are described by Lagrangian equations of motion in a 3D cylindrical coordinate system, in realistic toroidal geometry with a magnetic separatrix. Either Runge-Kutta or predictor-corrector methods are used to advance the particles. The field quantities are evaluated on a grid mesh using linear multigrid PETSc solvers. The scale of a simulation is characterized by the grid size, which then determines the number of particles. Parallelism is expressed as an MPI/OpenMP hybrid.
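To make the time-stepping choice above concrete, the sketch below advances a single charged particle with a classical fourth-order Runge-Kutta step under the Lorentz force in prescribed fields. It is only a minimal stand-in, not the XGC push itself: the actual codes advance gyrokinetic equations of motion and gather fields from the mesh, and the field functions and normalized parameters here are placeholders.

```c
/* Minimal RK4 particle push in prescribed E and B fields (illustrative only).
 * The actual XGC codes advance gyrokinetic equations of motion and gather E
 * from the PETSc field solve on the mesh; the fields below are placeholders. */
#include <stdio.h>

typedef struct { double x[3], v[3]; } Particle;

/* Placeholder fields: uniform B along z, no E. */
static void fieldE(const double x[3], double E[3]) { (void)x; E[0]=E[1]=E[2]=0.0; }
static void fieldB(const double x[3], double B[3]) { (void)x; B[0]=B[1]=0.0; B[2]=1.0; }

/* Phase-space derivative: dx/dt = v, dv/dt = (q/m)(E + v x B). */
static void deriv(const Particle *p, double qm, double d[6])
{
    double E[3], B[3];
    fieldE(p->x, E);  fieldB(p->x, B);
    d[0]=p->v[0]; d[1]=p->v[1]; d[2]=p->v[2];
    d[3]=qm*(E[0] + p->v[1]*B[2] - p->v[2]*B[1]);
    d[4]=qm*(E[1] + p->v[2]*B[0] - p->v[0]*B[2]);
    d[5]=qm*(E[2] + p->v[0]*B[1] - p->v[1]*B[0]);
}

/* One classical RK4 step of size dt. */
static void rk4_push(Particle *p, double qm, double dt)
{
    double k1[6], k2[6], k3[6], k4[6];
    Particle tmp;
    int i;
    deriv(p, qm, k1);
    for (i=0;i<3;i++){ tmp.x[i]=p->x[i]+0.5*dt*k1[i]; tmp.v[i]=p->v[i]+0.5*dt*k1[i+3]; }
    deriv(&tmp, qm, k2);
    for (i=0;i<3;i++){ tmp.x[i]=p->x[i]+0.5*dt*k2[i]; tmp.v[i]=p->v[i]+0.5*dt*k2[i+3]; }
    deriv(&tmp, qm, k3);
    for (i=0;i<3;i++){ tmp.x[i]=p->x[i]+dt*k3[i];     tmp.v[i]=p->v[i]+dt*k3[i+3]; }
    deriv(&tmp, qm, k4);
    for (i=0;i<3;i++){
        p->x[i] += dt/6.0*(k1[i]  +2.0*k2[i]  +2.0*k3[i]  +k4[i]);
        p->v[i] += dt/6.0*(k1[i+3]+2.0*k2[i+3]+2.0*k3[i+3]+k4[i+3]);
    }
}

int main(void)
{
    Particle p = { {1.0, 0.0, 0.0}, {0.0, 1.0, 0.0} };
    for (int n = 0; n < 1000; n++) rk4_push(&p, 1.0, 0.01);  /* q/m = 1, normalized units */
    printf("x = (%g, %g, %g)\n", p.x[0], p.x[1], p.x[2]);
    return 0;
}
```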
3b. Please list known limitations, obstacles, and/or bottlenecks that currently limit your ability to perform simulations you would like to run. Is there anything specific to NERSC?
The current limitation we are trying to overcome is algorithmic: enabling fully electromagnetic turbulence simulation within the 5D gyrokinetic formalism. This is tied to the time-step resolution of the fast electron motions, and hence to the available computing power. Presently, a 5D ion full-f simulation of the DIII-D edge plasma requires 20 hours on 100,000 processor cores; other machines can require more computing power. A fluid-kinetic or split-weight electron simplification technique can significantly reduce the computing-power requirement, demanding only about 2 times more processor cores instead of a factor of about 60 (for full electron kinetics in deuteron plasmas). The electromagnetic simulation of the DIII-D edge plasma thus requires about 200K processor cores on Franklin for one-day completion. With higher computing power, the electrons could be simulated with the full velocity distribution function instead of the simplified velocity-space function. An ITER simulation will require about 1 million processor cores for a 20-hour run, assuming that the current linear scalability holds.
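For reference, the factor of about 60 quoted above is consistent with the electron-to-deuteron thermal-speed ratio at equal temperature, which sets how much smaller the electron time step must be; the arithmetic below is a rough back-of-the-envelope check, not the actual XGC cost model.

```latex
% Back-of-the-envelope check of the ~60x cost factor for fully kinetic
% electrons in a deuteron plasma: at equal temperature the electron thermal
% speed exceeds the deuteron thermal speed by the square root of the mass
% ratio, so the electron time step must be correspondingly smaller.
\[
  \frac{v_{\mathrm{th},e}}{v_{\mathrm{th},D}}
  = \sqrt{\frac{m_D}{m_e}} \approx \sqrt{3670} \approx 61 .
\]
% With the fluid-kinetic / split-weight electron treatment the cost grows by
% only ~2x, i.e. from the 100,000-core ion-only run to the ~200,000-core
% electromagnetic run quoted above, rather than by ~60x for fully kinetic
% electrons.
```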
3c. Please fill out the following table to the best of your ability. This table provides baseline data to help extrapolate to requirements for future years. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions.
Facilities Used or Using | NERSC, OLCF, ALCF, NSF Centers, Other:
Architectures Used | Cray XT, IBM Power, BlueGene, Linux Cluster, Other:
Total Computational Hours Used per Year | 65,000,000 core-hours
NERSC Hours Used in 2009 | 8,000,000 core-hours
Number of Cores Used in Typical Production Run | 15,000 - 170,000
Wallclock Hours of Single Typical Production Run | 20 - 100
Total Memory Used per Run | 40 GB
Minimum Memory Required per Core | 0.3 GB
Total Data Read & Written per Run | 5,000 GB
Size of Checkpoint File(s) | 1,000 GB
Amount of Data Moved In/Out of NERSC | 10 GB per day
On-Line File Storage Required (For I/O from a Running Job) | 4 TB and 3,000 files
Off-Line Archival Storage Required | 1 TB and 30 files
Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.
As systems become larger, early detection of processor failures is an important issue for an HPC code such as XGC1. Detection software that can be compiled and run together with an HPC code would be quite helpful. If the software could replace the faulty node with another one, that would be even better; otherwise, the run can be stopped for a restart. A sketch of the application-level fallback pattern we rely on today follows.
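The usual application-level pattern today is to switch MPI to returning error codes and to fall back to the latest checkpoint when a communication call fails. The sketch below illustrates only that pattern and is not the CPES fault-handling code; the checkpoint routines named here are placeholders.

```c
/* Sketch of application-level fault handling with MPI: return error codes
 * instead of aborting, checkpoint periodically, and resubmit from the last
 * checkpoint after a failure.  write_checkpoint()/load_checkpoint() stand in
 * for the application's own restart I/O and are not real CPES routines. */
#include <mpi.h>
#include <stdio.h>

static void write_checkpoint(int step) { (void)step; /* placeholder: dump particles + fields */ }
static int  load_checkpoint(void)      { return 0;   /* placeholder: read last restart file  */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* Ask MPI to return error codes to the caller rather than aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int step = load_checkpoint();
    const int nsteps = 1000, ckpt_interval = 100;

    for (; step < nsteps; step++) {
        /* ... particle push and field solve would go here ... */

        double local = 1.0, global = 0.0;
        int rc = MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            /* A rank (or node) appears to have failed: stop cleanly so the
             * job can be resubmitted from the most recent checkpoint. */
            fprintf(stderr, "MPI error at step %d; stopping for restart\n", step);
            MPI_Abort(MPI_COMM_WORLD, rc);
        }
        if (step % ckpt_interval == 0) write_checkpoint(step);
    }

    MPI_Finalize();
    return 0;
}
```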
4. HPC Requirements in 5 Years
4a. We are formulating the requirements for NERSC that will enable you to meet the goals you outlined in Section 2 above. Please fill out the following table to the best of your ability. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions at the workshop.
Computational Hours Required per Year | 500,000,000
Anticipated Number of Cores to be Used in a Typical Production Run | 1,000,000
Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above | 20 - 100 hours
Anticipated Total Memory Used per Run | 100,000 GB
Anticipated Minimum Memory Required per Core | 0.1 GB
Anticipated Total Data Read & Written per Run | 25,000 GB
Anticipated Size of Checkpoint File(s) | 5,000 GB
Anticipated Amount of Data Moved In/Out of NERSC | 50 GB per day
Anticipated On-Line File Storage Required (For I/O from a Running Job) | 5 TB and 3,000 files
Anticipated Off-Line Archival Storage Required | 10 TB and 100 files
4b. What changes to codes, mathematical methods and/or algorithms do you anticipate will be needed to achieve this project's scientific objectives over the next 5 years.
A new physics algorithm, which uses full-f ions and delta-f electrons, is needed and is currently under development. Fully parallelized particle and grid data structures are needed for enhanced data locality and a reduced memory requirement.
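For context, the delta-f treatment referred to above splits the electron distribution about a known background and evolves only the perturbation through particle weights. The equations below are the generic textbook form, shown as a reminder; the specific hybrid full-f/delta-f scheme under development for XGC may differ in its details.

```latex
% Generic delta-f decomposition: evolve only the perturbation delta-f about a
% known background f_0, carried by per-marker weights w_j.
\[
  f_e = f_0 + \delta f, \qquad
  w_j \equiv \left.\frac{\delta f}{f}\right|_{\mathbf{Z}_j(t)},
\]
\[
  \frac{dw_j}{dt}
  = -\,(1 - w_j)\,\left.\frac{1}{f_0}\frac{df_0}{dt}\right|_{\mathbf{Z}_j(t)},
\]
% while the ions retain the full distribution function (full-f), keeping
% large-amplitude edge dynamics for the ions and a reduced-variance treatment
% for the noise-sensitive electron response.
```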
4c. Please list any known or anticipated architectural requirements (e.g., 2 GB memory/core, interconnect latency < 1 μs).
0.1 GB memory/core
4d. Please list any new software, services, or infrastructure support you will need over the next 5 years.
Since XGC1 is anticipated to be running near the maximal capacity of the new machines over the next 5 years, efficient fault-tolerance services will be needed.
4e. It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores, perhaps GPUs or other accelerators. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.
We have been quite successful in adapting to the current level of multi-core architecture by using the MPI/OpenMP hybrid mode. We find that all-OpenMP operation per node is not the optimal solution on the 12-core XT5; instead, the best configuration was two OpenMP processes per node (see the sketch after this paragraph). As the number of cores per processing element increases, our low-level strategy is to find the highest-performance mix of OpenMP and MPI per node. At a higher level, our 3-5 year strategy is to develop asynchronous algorithms for effective utilization of many heterogeneous cores. A run-time scheduler such as StarPU will be used to coordinate and map threads to computational resources. Another approach is the incorporation of partitioned global address space (PGAS) languages, which offer a means of expressing data locality.
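The hybrid layout described above, a few MPI processes per node each spawning OpenMP threads over its share of the particles, can be sketched as below. The process placement itself (e.g. two ranks per 12-core node) is chosen at job launch through the batch/launcher options, and the loop body here is a placeholder rather than the XGC push.

```c
/* Skeleton of the MPI/OpenMP hybrid layout: a few MPI ranks per node
 * (placement chosen at launch time), each threading over its local
 * particles with OpenMP.  The loop body is a placeholder, not the
 * actual XGC particle push. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* FUNNELED is sufficient when only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long nlocal = 1000000;               /* particles owned by this rank */
    double *w = malloc(nlocal * sizeof *w);    /* placeholder per-particle data */
    for (long i = 0; i < nlocal; i++) w[i] = 1.0;

    /* Thread the particle loop inside each MPI rank. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < nlocal; i++) {
        w[i] *= 0.5;                           /* stands in for the push/weight update */
    }

    /* Master thread performs the inter-rank reduction (field solve, diagnostics, ...). */
    double local = 0.0, global = 0.0;
    for (long i = 0; i < nlocal; i++) local += w[i];
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%g\n", nranks, omp_get_max_threads(), global);

    free(w);
    MPI_Finalize();
    return 0;
}
```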
We are also looking into GPGPUs. Sparse matrix-vector multiply has been demonstrated on GPGPUs and is supported by an optimized library from NVIDIA. Similarly, multigrid has been demonstrated to have an efficient implementation on GPGPUs. These are expected to be useful for the XGC particle-in-cell codes.
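To make the kernel in question concrete, below is a plain CSR sparse matrix-vector multiply in C. This is the CPU reference form of the operation that would be offloaded to a GPU (for example through a vendor sparse-matrix library); the small matrix is purely illustrative.

```c
/* Plain CSR sparse matrix-vector multiply, y = A*x: the CPU reference form
 * of the kernel that would be offloaded to a GPU.  Illustrative only. */
#include <stdio.h>

static void spmv_csr(int nrows, const int *rowptr, const int *col,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* Illustrative 3x3 matrix [[4,1,0],[1,4,1],[0,1,4]] in CSR form. */
    int    rowptr[] = {0, 2, 5, 7};
    int    col[]    = {0, 1, 0, 1, 2, 1, 2};
    double val[]    = {4, 1, 1, 4, 1, 1, 4};
    double x[]      = {1, 2, 3};
    double y[3];

    spmv_csr(3, rowptr, col, val, x, y);
    printf("y = (%g, %g, %g)\n", y[0], y[1], y[2]);  /* expected: (6, 12, 14) */
    return 0;
}
```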
New Science With New Resources
To help us get a better understanding of the quantitative requirements we've asked for above, please tell us: What significant scientific progress could you achieve over the next 5 years with access to 50X the HPC resources you currently have access to at NERSC? What would be the benefits to your research field if you were given access to these kinds of resources?
Please explain what aspects of "expanded HPC resources" are important for your project (e.g., more CPU hours, more memory, more storage, more throughput for small jobs, ability to handle very large jobs).
We will be able to perform higher-fidelity 5D gyrokinetic simulations of the ITER plasma, including kinetic electrons and electromagnetic turbulence, in realistic diverted magnetic field geometry. This will allow us to understand many unknown, but necessary, physical phenomena for successful ITER research that cannot be studied with the present reduced models.
The ability to handle very large jobs and more CPU hours are the most important expanded resources for our project.