Minutes of the June 1997 ERSUG Meeting
An ERSUG meeting was held at the Princeton Plasma Physics Laboratory on June 5-6, 1997. Here is the summary by the ExERSUG secretary.
The meeting opened with a welcoming word from Dale Meade, associate director of PPPL, and a brief description of the PPPL local computing facilities by Charles Karney. PPPL has a cluster of workstations and X terminals, with a DEC 2100/4 and several Sun Ultra 2/2200s at the high end. These systems share filesystems with NERSC through AFS and through NFS mounts of the PPPL disks. PPPL wants to expand its local capabilities to complement the MPP system at NERSC.
Bill Kramer, Head of High Performance Computing at LBL (i.e., Head of NERSC), provided an overview of the NERSC status and the plans for FY98. Summary for 1997:
• Decided upon HPSS for future archival storage.
• Accepted the 160-node T3E-600.
• Ordered T3E-900/512 for July 15 delivery.
• All J90's upgraded to J90se.
• HPSS testbed installed.
• Met with all Grand Challenge Application projects.
• Collaborated with RHIC/STAR experiment for proposal.
• Developed an LDRD for PDSF/HENP/DPSS (high energy physics computing).
Staffing is now at 45 or 46, to increase to 62. There are open positions across the board, particularly in Systems. The T3E has had 93 percent availability; the other systems have been above 98 percent.
Continuing with Bill Kramer's presentation: archival storage is to evolve from CFS and Unitree towards HPSS (a joint project of IBM and several DOE laboratories), and global storage is to move from AFS, NFS, and CSH towards NFS/3 and DFS; at some point the DFS and HPSS environments should merge into one. HPSS should go into production in Fall '97. A LAN upgrade to 100Mb/1000Mb Ethernet or ATM is planned for Fall '98. The next major procurement should get underway by Fall '97, with the Request for Proposals released to vendors by Summer '98 and delivery as early as January 1999. A 64-node Origin-2000 will be on loan to NERSC for a year starting in Summer '98.
Tom Kitchens presented the View from Washington. There is a $100M Next Generation Internet / Internet-2 (NGI/I2) initiative, with $35M in the DOE budget, but the split between funds for research and funds for deployment, as well as between national laboratories and universities, remains quite unclear. The DOE2000 initiative is moving along, with two main thrusts: the National Collaboratories (contact person: Mary Ann Scott) and the Advanced Computation and Simulation Toolkit (ACTS) (contact person: Rod Oldehoeft). Within ACTS there is a solicitation for work on a Scientific Template Library (SciTL), closing July 15. There are two pilot collaboratories, a Materials Micro Characterization Collaboratory and a Diesel Combustion Collaboratory, and there are seven prioritized R&D projects.
Jim Craw described the status and plans for the Parallel Vector Processor (PVP) systems -- the J90s and the C90. We now have four J90 machines with 24 CPUs each, all of the J90se (scalar enhanced) variety. The NQS batch system will be replaced by NQE, described as an intelligent front end to NQS. NQE will be a requirements-based scheduler: users specify their requirements (memory, number of PEs, disk, time) via 'cqsub', as sketched below. Jim Craw's presentation occasioned a discussion of problems that PPPL users are experiencing because the traditional Cray Fortran 77 compiler is absent from the J90s. The C90 will stay at least until February 1998 and will not disappear before delivery of a J90++. The J90++ will have a processor speed equivalent to that of the C90, and will have 24 CPUs and a larger memory.
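For readers unfamiliar with the term, the following toy Python sketch shows the basic idea of requirements-based scheduling: a job states what it needs, and the scheduler places it only on a host whose resources cover every stated requirement. The job attributes mirror those listed above, but the host names, sizes, and matching rule here are invented for illustration and say nothing about how NQE actually works.

```python
# Toy sketch of requirements-based scheduling (illustrative only; not NQE).
# A job is placed only on a host whose resources cover all its requirements.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    memory_mb: int   # requested memory
    pes: int         # requested processing elements
    disk_gb: int     # requested scratch disk
    hours: float     # requested time (would drive queue selection; not modeled)

@dataclass
class Host:
    name: str
    memory_mb: int
    pes: int
    disk_gb: int

def schedule(jobs, hosts):
    """Assign each job to the first host that satisfies its requirements."""
    for job in jobs:
        target = next((h for h in hosts
                       if h.memory_mb >= job.memory_mb
                       and h.pes >= job.pes
                       and h.disk_gb >= job.disk_gb), None)
        yield job.name, (target.name if target else "queued (no host fits)")

jobs = [Job("qcd_run", 2048, 16, 10, 4.0), Job("huge_job", 65536, 64, 100, 12.0)]
hosts = [Host("j90-1", 4096, 24, 50), Host("j90-2", 8192, 24, 200)]
for name, placement in schedule(jobs, hosts):
    print(f"{name} -> {placement}")
```

The point of the approach is that users describe resources rather than naming a specific queue or machine, which lets the scheduler balance work across the whole cluster.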
Jim Craw continued with the plans for the Common Super Home (CSH), which is already available on the J90 cluster. The intent is to expand CSH throughout the NERSC compute environment. NERSC plans to eliminate the chronically full /wrk filesystem on the C90 and replace it with temporary storage (/big) and separate near-term storage (/home). The system will use NFS/3 initially, with future migration to DFS/HPSS. The new super home should be implemented across the NERSC platforms by October 1997.
Tammy Welcome, speaking on behalf of Francesca Verdier, presented a summary of early user experiences on the PEP (the 160-node T3E). There are now 227 users in 73 repositories. The four repositories that have made the most use of the machine to date are Kilcup (HNP) for Structure of Subatomic Particles, Louie (BES) for Materials Properties, Toussaint (HNP) for Quark-Gluon Plasma, and Soni (HNP) for Hadronic Matrix Elements. The next five repositories by total time used are Ryne (HNP), Rokhsar (HER), Carlson (HNP), Majer (BES), and Dixon (CTR). Users seem to agree that the T3E has good scalability and good uptime, and they approve of the large memory compared with the T3D. Outstanding problems are a slow compiler, a need for queues that admit longer jobs, bugs in f90, and low memory bandwidth.
Jim Craw presented the T3E upgrade plans. A T3E-900 with 512 application nodes will be delivered by July 15, and the previous T3E-600 PEP system will remain available throughout the testing and acceptance period. The streams problem on the T3E-600 is no longer present on the T3E-900. The final acceptance test on the T3E-900 will require the system to have two FDDI interfaces, two HiPPI interfaces, 170 Fibre Channel disks, and 96 SCSI disks. If all goes well, the acceptance test will be over in September or October.
Tammy Welcome described NERSC support for the Grand Challenge (GC) projects. They are:
• Materials, Methods, Microstructure and Magnetism. Malcolm Stocks, ORNL.
• Numerical Tokamak Turbulence Project. Bruce Cohen, LLNL.
• Particle Physics Phenomenology from Lattice QCD. Rajan Gupta, LANL.
• Relativistic Quantum Chemistry of Actinides. Robert Harrison, PNNL.
• Computational Accelerator Physics. Robert Ryne, LANL.
• Analysis of High Energy Nuclear Physics Data. Doug Olsen, LBNL.
• Computational Engine for Analysis of Genomes. Ed Uberbacher, ORNL.
The GC projects receive about half of the T3E resources in FY97, but they no longer receive special time on the C90. Each GC project has an assigned point of contact (POC) at NERSC, and NERSC has made site visits to all but one of the GC projects. NERSC makes special efforts to assist the GC projects with their needs for mathematical and other software, with performance tuning, and with the use of tools.
An agenda item on plans for the Origin-2000 was dropped, as this machine is only planned to come online in Summer 1998. See, however, Bill Kramer's State of NERSC opening presentation.
Bill Kramer presented NERSC's plans for the future of the SAS. The present SAS (a four-processor Sun SPARC system) received logins from only about 6 percent of active NERSC users over the past few months. All NERSC users were asked for input regarding the SAS future via a Web survey, and 24 responses were received. The main requests were for visualization and symbolic manipulation software, and NERSC proposes to concentrate the SAS replacement on these two areas. It may be cost-effective to make the J90 system a symbolic manipulation server, i.e., to install Matlab, Mathematica, Maple, and other such packages there. A remote visualization server is planned for the Fall of 1997; this system would be integrated into the NERSC storage environment and would support remote visualization applications. Direct ATM links are being considered.
Nancy Meyer presented the plans for the transition to the High Performance Storage System (HPSS). HPSS can employ parallel network transfers and parallel striping to both disk and tape; a conceptual sketch of striping follows below. HPSS requires the Distributed Computing Environment (DCE), and an interface from the Distributed File System (DFS) to HPSS is in development. An early implementation is already available at NERSC, and a production environment is to come online in the Fall of 1997, with an expansion capacity of about 1000 TB. This system will replace both CFS and NSL/Unitree.
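As a rough illustration of the striping idea, the following minimal Python sketch cuts a byte stream into fixed-size blocks and deals them round-robin across several devices, so that N devices can each stream roughly 1/N of a file concurrently. The block size and reassembly scheme are invented for illustration; HPSS's actual movers, block sizes, and protocols are far more involved.

```python
# Minimal sketch of parallel striping (conceptual; not the HPSS implementation).

BLOCK_SIZE = 4  # bytes per block; real systems use much larger blocks

def stripe(data: bytes, n_devices: int):
    """Return a list of per-device byte strings, dealt round-robin by block."""
    stripes = [bytearray() for _ in range(n_devices)]
    for i in range(0, len(data), BLOCK_SIZE):
        stripes[(i // BLOCK_SIZE) % n_devices] += data[i:i + BLOCK_SIZE]
    return [bytes(s) for s in stripes]

def unstripe(stripes, total_len: int):
    """Reassemble the original byte string from the per-device stripes."""
    out = bytearray()
    cursors = [0] * len(stripes)
    block = 0
    while len(out) < total_len:
        d = block % len(stripes)
        out += stripes[d][cursors[d]:cursors[d] + BLOCK_SIZE]
        cursors[d] += BLOCK_SIZE
        block += 1
    return bytes(out)

data = b"The quick brown fox jumps over the lazy dog"
parts = stripe(data, 3)
assert unstripe(parts, len(data)) == data
```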
Keith Fitzgerald discussed other aspects of the evolution of mass storage at NERSC. AFS is online and stable, while DFS is in development and should be available to first users in July 1997; general user migration may start around December 1997. There are two archival systems, CFS and Unitree. CFS is slow but stable and currently stores 17 TB in 2.8 million files. Unitree is a hierarchical storage system (disk and tape) and now holds 7 TB in 369,000 files. Transfer rates to Unitree vary a great deal, from a peak of 47 MB/s from the C90 down to peak rates of 2 MB/s from the T3E and the batch J90s; the latter machines lack an IPI3 HiPPI driver. NERSC was planning to start charging for storage after the present ERSUG meeting, but these plans were put on hold for another half year: the storage group decided there is not currently a problem, and CRU-based charging needs further consideration. There was some discussion of this issue at the meeting, and a user group consisting of Bas Braams, Jean-Noel Leboeuf, and Jerry Potter will work with NERSC to try to define a policy that is acceptable to all.
Bruce Ross, Assistant Director of the Geophysical Fluid Dynamics Laboratory (GFDL), a neighbor of PPPL, discussed GFDL's experience with both a T3E and Cray's T90. Their T3E-900 is a relatively small system with 40 application nodes, each with 128 MB of memory, and a total of 150 GB of disk. The T90 has 26 IEEE CPUs, 4 GB of central memory, a 32 GB SSD, and 470 GB of disk storage. They also have an STK Redwood/Timberline tape archive with up to 240 TB capacity. They have about 50-70 heavy users, 6-8 major models (geophysical codes), 15-20 other important models, and a host of supporting analysis codes. There have been some problems with both the T3E and the T90, but both were installed only recently. The T90 is the production platform. GFDL recognizes the need to rewrite their major models for modularity, documentation, and clean style. The size of recent applications requires multitasking on the T90.
Jerry Potter followed with a short description of climate modeling at NERSC by his group. Their focus is to test all the major models against observed weather and climate patterns. One issue is to assess confidence bands for predictions, which requires many runs with slightly perturbed initial conditions; a toy illustration follows below. They have early and favorable experience with the NERSC T3E for this application.
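As a schematic of the ensemble approach, the following toy Python example uses a chaotic logistic map as a stand-in for a climate model: many runs from slightly perturbed initial conditions give a spread of outcomes from which confidence bands can be read off. The model, perturbation size, and band levels are all invented for illustration and are not Jerry Potter's actual methodology.

```python
# Toy ensemble-forecast sketch: perturbed initial conditions -> spread of
# outcomes -> empirical confidence band. The logistic map stands in for a
# real climate model.
import random
import statistics

def model(x0: float, steps: int):
    """Iterate the chaotic logistic map x -> 3.9 x (1 - x)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(3.9 * xs[-1] * (1.0 - xs[-1]))
    return xs

def ensemble(x0: float, n_runs: int, steps: int, eps: float = 1e-4):
    """Run the model n_runs times from slightly perturbed initial conditions."""
    return [model(x0 + random.uniform(-eps, eps), steps) for _ in range(n_runs)]

runs = ensemble(x0=0.5, n_runs=200, steps=30)
final = sorted(r[-1] for r in runs)
lo, hi = final[int(0.05 * len(final))], final[int(0.95 * len(final))]
print(f"mean = {statistics.mean(final):.3f}, 90% band = [{lo:.3f}, {hi:.3f}]")
```

Because the underlying dynamics are chaotic, even tiny perturbations spread the ensemble widely after a few dozen steps, which is exactly why many independent runs are needed on a machine like the T3E.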
Bill Saphir gave an overview of the activities of the Future Technology Group, which are guided by the premise that NERSC must be actively involved in new technology to remain at the forefront of high-performance computing. Among these activities are:
• Evaluation of technology produced by the UCB NOW (Network of Workstations) project.
• The COMPS (Cluster of Multiprocessor Systems) project, which is seen as a prototype for the next generation of parallel supercomputers. Among the activities here are the development of communication infrastructure for COMPS and evaluation of the proposed VIA (Virtual Interface Architecture) standard.
• Participation in the UCB Intel Millennium project, which will establish a number of PC clusters on the UC campus and a small cluster at NERSC.
• Work on the MPI-2 standard, which is now almost finished.
• Writing a proposal for DOE 2000 ACTS/SciTL tools. Bill requests input from users who have used these tools or are thinking of using these tools.
• Using tools such as VXtreme, RealVideo, and the MBone to make NERSC seminars and classes available over the Web.
Sandy Merola led a discussion on the evolution of ESnet. The NGI offers opportunities for the advancement of networking infrastructure, research, and applications; however, the NGI may result in a smaller number of direct connections from the university community to ESnet. Sandy asked for input regarding NERSC requirements for additional network connectivity, noting the already planned upgrades to ANL, ORNL, LBNL, LANL, and MIT. NERSC PIs are requested to provide requirements for improved ESnet connectivity or services so that they may be included in the ESnet Program Plan, currently in draft.
Ricky Kendall reported on two action items from the January 1997 ERSUG meeting. The final editing of the Greenbook has been delayed; it will be done over the summer, in time for the next procurement at NERSC. Ricky calls on all interested users to read the draft on the ERSUG pages at www.nersc.gov and send in comments. The Users Helping Users group still needs a chairperson. It could be valuable for this group to bring the Grand Challenge code authors and other principal T3E users together for a workshop in which they describe their projects to each other.
Finally, the ExERSUG meeting was merged into the tail end of the ERSUG meeting. We have the following action items (Bas Braams):
• Finish the Greenbook. The group that was asked at the previous ERSUG meeting to do this consists of Ricky Kendall, Bas Braams, Mike Minkoff, and Maureen McCarthy, with Phil Colella added by later invitation.
• Find a chairperson for Users Helping Users to get this activity going. Nominations, anyone?
• A user group consisting of Bas Braams, Jean-Noel Leboeuf, and Jerry Potter will work with NERSC to try to resolve storage policies and charging and quota issues.
• Plan the next ERSUG meeting. With video conferences held every month, we think that ERSUG meetings could be held less frequently than twice a year, and we tentatively want to hold the next one in March 1998. This will fit in well with the schedule for the next procurement. The location could be Berkeley, but Argonne was also mentioned.
• Find out just who is on ExERSUG. Bas Braams will contact the SAC, tell them who we think are their representatives, and remind them that these representatives serve at the pleasure of the SAC members.
The meeting closed shortly before noon on Friday.
Minutes written by Bas Braams, ExERSUG secretary, braams@cims.nyu.edu.