
2009/2010 User Survey Results

Comments

Responses to each question have been grouped for display as follows:


What does NERSC do well?

73: ease of use, good consulting, good staff support and communications
65: computational resources or HPC resources for science
22: good software support
19: overall satisfaction with NERSC
14: good documentation and web services
10: queue management or job turnaround
 8: data services (HPSS, large disk space, data management)
 4: good networking, access and security



What can NERSC do to make you more productive?

27: Improve turnaround time / get more computational resources
26: Implement different queue policies
15: Provide more software / better software support
11: Provide more consulting / training / visualization
11: Things are currently good / not sure
10: Additional / different data storage services
10: Improve stability/reliability
 6: Provide more memory
 3: Better network performance
 3: Other comments

If there is anything important to you that is not covered in this survey, please tell us about it

4: Software comments
4: Storage comments
3: Job and Queue comments
2: Performance and Reliability comments
3: Other comments

 


What does NERSC do well?   132 respondents

  NERSC's hardware and services are good / NERSC is overall a good center

Everything. It is a pleasure to work with NERSC.

User support is very good. Diversity of computational resources.

User support is fantastic - timely and knowledgeable, including follow-up service. New machines are installed often, and they are state-of-the-art. The queues are crowded but fairly managed.

Everything that is important for me. This is a model for how a computer user facility should operate.

Provides state-of-the-art parallel computing resources to DOE scientists, with an emphasis on usability and scientific productivity.

NERSC handles computer problems so I do not have to and I can focus on chemistry.

Everything. Best in the DOE. I have used them all. NERSC is #1.

Everything, I'm extremely satisfied

Provide high quality computing resources with an excellent technical support team.

NERSC has good machines and a good service

Communication and reliability.

NERSC is one of the largest and most reliable computing resources for our research group. The hardware performance and technical support staff are very exceptional.

Very well

Provides a robust computational facility and overall I am very satisfied with the facility.

NERSC is the exemplar supercomputing center in my opinion! In basically every category, NERSC gets it right. The computational performance of the systems is extremely high, the consulting help is immediate and helpful, and as a user with a special project I have had a great and very helpful relationship with developers at NERSC to help me build the project.

consulting is great, having lots of systems is great and they are reliable

It has been only one month since I got my account, but so far I am satisfied.

Website is first class, especially the clear instructions for compiling and running jobs.
The blend of systems available is very nice: I can run big jobs on Franklin and then run various post-processing operations on Euclid etc.

The NERSC user experience is great. I think NERSC is performing well, esp. for a government organization. The phone support is of high quality and very reassuring to have. The computers are well run.

Almost in every respect (in comparison with others).

I think NERSC is doing extremely well. Nearly every staff member, except one person, is very quick and responsible. One thing that I love about NERSC is that they think the way a researcher does, not the way a system administrator does. I think every national lab must learn from NERSC. NERSC is a role model and a leader in real life, not just on the web site.

Consulting. Having accessible computers with reasonable security. Fair sharing of time for a wide-range of job sizes, especially the "medium" scale jobs.

From my own experience, NERSC provides desired computational resources for research involving large scale scientific computation. The computational systems are very well-maintained and stable, which allows hassle-free scientific investigations. My feeling is that NERSC systems are just like a reliable long-standing desktop with a magic power of computation. NERSC also provides a huge collection of software very useful for data analysis and visualization which I will explore more in the near future. The supporting services (both account and technical support) are very friendly, prompt and professional. The website is well-organized and informative. I like the News Center very much.

Access to world-class supercomputing facilities at NERSC is a rare privilege for me as a scientist from Canada. I think serving the world-wide community of research scientists from diverse fields is what NERSC does best. This in itself is a monumental task, as thousands of researchers have been using NERSC facilities over the last decade. Congratulations to all at NERSC for the best work done.
I would like to express my sincerest thanks to all at NERSC and in particular Dr. Ted Barnes, Francesca Verdier and David Turner for their unfailing help and advice, which were the sine qua non for progress in our theoretical and computational research in the area of the Physics and Chemistry of systems of Superheavy elements.

NERSC is excellent in terms of serving as a simulation facility. It is somewhat less useful as an analysis / processing facility, partially because of the "exploratory" nature of most of my data analysis. In general, I am very satisfied with performance of NERSC clusters and assistance provided by NERSC staff.

Everything we need. Ms. Verdier is amazing.

Give information about how to use the systems. Also data storage and file sharing through the /project directory system has been extremely helpful. Consulting is always very friendly and helpful as well.

* Good updates on the status of NERSC's machines in general.
* NERSC offers good remote access to the machines.
* In general all of NERSC's machines are good for performing different kinds of simulations.
* Good support for the users.

Good performance, overall good maintenance of quantum chemistry software (although initially some performance problems with 'molpro' occurred on Carver and Hopper), and a superior consulting team that is very helpful and fast.

Very good support, very good documentation for using systems and software, and reasonable queue waiting times.

software, consulting, queue times

I'm very impressed with the number of new systems that NERSC is bringing online this year. At the moment Franklin appears over-committed, however I suspect this will change when Hopper is upgraded to the XE6 configuration. I'm also hopeful that we can get NAMD running efficiently on Dirac as it has shown impressive GPU benchmarks in the past. Also the support is very good, I usually get answers to my questions within a couple of hours.

* keep systems stable
* very good account management tools (nim), good software support

I shall repeat myself from previous years. NERSC provides a robust, large scale computing platform with a fair queuing system. Recently, it has facilitated code development to the point where you can 'make the leap' to the larger platforms such as Franklin or JaguarPF. With the introduction of Carver it has addressed the problems of the small memory per process that the XT machines had.

I am impressed with the service both in terms of software maintenance and in assistance with problems. The resources are current and powerful. I am surprised that users are not making more use of the carver/magellan machines.

Just about everything. The machines are easy to use, they are fast, they are always accessible, etc. The people I've contacted by either phone or email have always been very helpful.

Keep the machines up and running. Service and consulting. Create a computational environment that is easy to use.

Overall very good.

I find that NERSC works very well for the calculations that we are doing. I find this to be an excellent facility and operation in general. We have been able to accomplish some major work (with ease) on Franklin.

Provides stable and efficient access to HPC computing resources with a good selection of software packages and tools for most scientific computing activities.

NERSC is simply the best managed SC center in the world

Stability, state-of-the-art computers, experience in running a supercomputer center.
Friendly consultants. And the nicest thing is that we got lots of free time this year during the testing period
(Carver, Hopper... nice! :))

HPSS is fast.
Online service is good

Good and varied machines are available. Not too many downtimes. Plenty of software/libraries.

It provides excellent computing resources and support. One thing that is useful, and that I have not experienced when using other computing resources, is when individuals at NERSC take the time to inform you that a particular machine has a low load and that if you submitted your jobs there they would run quickly.

NERSC makes abundant high-performance computing resources available, providing excellent throughput for applications in which many modestly parallel (a few hundred cores each) jobs must run concurrently. HPSS works fast and flawlessly, Franklin uptime is (increasingly) good, and the large variety of available software libraries and tools is welcome. NERSC is also unmatched in the responsiveness and knowledgeability of its consulting staff.

practically everything

I am satisfied with everything about NERSC except for the long batch queue wait times.

NERSC provides stable, well documented computing environments along with a group of well trained and responsive people to help with problems, answer questions, and give guidance. They also continue to upgrade and improve these environments.

I have been and remain very pleased with all aspects of NERSC. The computers are great and the staff are always very helpful. Thanks!!!

NERSC is doing great in many aspects, which makes it a user friendly and efficient platform for scientific computation.

1. Computing resources are quite good.
2. Consultants are so nice.
3. Information is accurate.

There is a great range of machines, with very good software availability and support, and short queue times. The global home area has greatly simplified using all machines, and the /project directory makes maintaining software for use within our group very easy. The allocations process is very fair, and our needs are consistently met.

  Provides good machines and cycles

It provides a state-of-the-art parallel computing environment that is crucial for modern computational studies.

It provides a good production level computational environment that is relatively easy to use and relatively stable. I find it easier to do productive work at NERSC than at the other DOE computational resources.

Very good machines and good accessibility

Keeps machines running well. Franklin has better performance than other XT4 systems I have used.

HPSS is an outstanding resource.

I really like the new machine Carver. It is efficient.

Provides excellent machines to run our calculations on!

high throughput and reliable.

Providing computing resources

NERSC maintains a computer environment that is advanced and reliable.

NERSC systems are consistently stable and reliable, so are great for most development and simulations.

The computational resources are very good.

Provides computational capability and storage space.

HPSS storage and transfer to and from

The waiting time for each job is short, I love it.

Availability of computational resources is impressive

Very high performance

Provides extensive computational resources with easy access, and frequent maintenance/upgrade of all such resources.

Providing computing resources.

Extremely well with upgrades and transitions to new, improved, faster and better HPC systems.

Maintaining such large systems.

Computer is VERY reliable.
Downtimes are minimal
scratch area is not swept too soon

NERSC is quite powerful in parallel calculation, which makes it possible to run large jobs. The debug setup is almost perfect, which allows one to debug quite quickly.

Providing and maintaining HPC facilities.

We run data intensive jobs, and the network to batch nodes is great! The access to processors is also great.

Uptime

NERSC provides about 50% of the CPU time that I need. The Cray XT4/5 systems are very good platforms, and the code I am using (QUANTUM Espresso) performs well.
The I/O speed for writing restart files is amazing.

Provides great stable resources, fast queue times on large jobs.

Provide a variety of computing platforms which support a wide range of high performance computing requirements and job sizes.

NERSC has proven extremely effective for running high resolution models of the earth's climate that require a large number of processors. Without NERSC I never would have been able to run these simulations at such a high resolution to predict future climate. Many thanks.

  Good support services and staff

Any time I am having trouble with logging in or resetting my password, I can always call and get immediate, helpful, and courteous assistance from the person that answers the phone.

The account allocation process is very fast and efficient.

Tech support is phenomenal.

NERSC user services has been the best of any of the centers on which I compute. I can not say anything but positive things about the help desk, response times to my requests, allocation help, software help, etc. On those topics, NERSC stands out well above the rest.

NERSC has always been user centered - I have consistently been impressed by this.

consulting

Easy access to the documents/website and consulting services.

I am very happy with the support and service by Francesca Verdier. Every time I call or write email, she always responds promptly and accurately.

New user accounts!

Keep users informed of the status of the machines.

My interactions with consulting and other NERSC support staff are always useful and productive.

Keep users updated about status. Supply information for new users. Have someone available over the phone for account problems.

NERSC supports users better than other computing facilities

Consulting support is great.

In my opinion it is technical support, namely : Woo-Sun Yang and Helen He. Without their help I would not be able to install the models we use for our research.

Overall support from NERSC is encouraging. The technical support and user information are excellent and helpful. We used to enjoy support for model installation and debugging.

very good user support.

People from support staff at accounts and helpdesk are very helpful and quick to respond.

User Services is excellent.

User services continue to perform very, very well!!

NERSC responds to my needs and questions well. Also, it is pretty straightforward to figure out how to submit and run jobs using the website.

Quick response on technical support requests or bug reports.

NERSC is generally able to meet all my computing needs, but what has always seemed really outstanding to me at NERSC is the consulting help service. It is highly accessible, responsive, and has always resolved my issues -- and the people seem very friendly. Also, the staff that manages accounts seems very easy to work with, and has always helped me to maintain sufficient account status.

Support users

I really like the NERSC Consulting team. They have been very helpful, and responded to my questions and solved my problem in a timely way.

Communicating changes, rapid response to questions

I am really pleased with the NERSC support team. Typically, I can get a response within 20 minutes.

I don't really know that much about NERSC's structure, as I am just a graduate student. My experience with NERSC is limited to running noninteractive lattice QCD jobs on franklin.
NERSC's help desk has been very helpful when I had problems getting my account set up (my password was mailed to the wrong place).

Support team

- very helpful, efficient support staff
- keeps us informed of developments at NERSC
- great details on website for programming environments, software, etc

The staff is very helpful on account issues. Very responsive.

In general NERSC has been more responsive to the needs of the scientific community than other supercomputer centres.

Provide reliable service

The information on the web site is very easy to find and well organised. Response times to problems and downtimes are very short.

Compared to other HPC facilities I have used, NERSC provides superior support/consulting for the user.

Responsiveness, by telephone, to user questions and needs, password problems and so on, is excellent. It is a pleasure to work with NERSC consultants.

I have been very pleased with the helpdesk response in both time and service.

Good user support and website

NERSC keeps me informed on system changes and responds quickly and helpfully to questions and requests.

  Good software / easy to use environment

NERSC provides the systems and environments necessary for tool development/testing and makes it fairly easy to provide the environments required by production software for performance testing and analysis.

Customer support; software/OS updates

Consultant
Software

NERSC does a very good job at providing a collection of application software, developer tools (compilers, api's, debugging), web pages, monitoring services and tutorials for their large and diverse user base.
Also in recent years NERSC has done a much better job in setting up accounts for new users.

Precompiled codes are a lifesaver. One machine in the system always has what you need. I get very good performance from everything I use and it all scales wonderfully over the number of processors I use (up to 2k).

  Other comments

In old times (i.e. early 1980's), on line consulting help was readily available, and was very useful for the productivity. I hope I could get that kind of help now.

Though I understand the need for enforcing security with periodic password changes, it is annoying, especially since at the moment on my project I am the only one using the system and so only I know my password.

Making resources hard to get access to

They have a good [comment not finished]

Queue waiting time is too long on hopper.

 


What can NERSC do to make you more productive?   105 respondents

  Improve turnaround time / get more computational resources:   27 comments

The waiting time for very large parallel jobs is prohibitive; I run mainly scalability studies and I require a large number of processors for a short amount of time (max 10-15 mins); I sometimes have to wait about a week when asking for 16k cores.

Shorter queue times

Shorter queue waits or longer runtimes would be helpful. I run a lot of time-stepping code.

Make the queues shorter.

There are a lot of users (it is good to be useful and popular), and the price of that success is long queues that lead to slower turn-around. A long-standing problem with no easy answer.

Improve batch turnaround time. The amount of time currently needed for a job to start running is long. Specifically, for jobs that involve dependencies, it would help if jobs in the hold state would accumulate priority so that when they're released to the queue they don't have to start from scratch (in terms of waiting time.)
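
For chained restart workflows like the one described above, PBS job dependencies are the usual mechanism; the following is a minimal sketch, assuming Torque/Moab-style qsub and a hypothetical batch script named run_step.pbs:

    # Submit a chain of restart jobs; each step starts only after the previous
    # one finishes successfully ("afterok"). As the comment above points out,
    # held/dependent jobs may not accumulate queue priority while they wait.
    JOB1=$(qsub run_step.pbs)
    JOB2=$(qsub -W depend=afterok:$JOB1 run_step.pbs)
    JOB3=$(qsub -W depend=afterok:$JOB2 run_step.pbs)
    echo "Submitted chain: $JOB1 -> $JOB2 -> $JOB3"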

Improve turnaround times on Franklin (somehow)

Queueing time for long jobs, e.g. 24 hrs, can be quite long, even (or especially) with relatively few CPUs (e.g. 100). I understand that this encourages the development of parallelization; however, some problems are notoriously difficult to parallelize, and one often must run for a long time with few CPUs. The ability to run shorter calculations between restarts is a partial solution; the problem is that the queueing times for each calculation may add up to days.

Add more processors to Carver. The long queue time makes progress slow. Carver clobbers Hopper and Franklin with the performance increase of my code. Also, recompiling the code is much faster on Carver. Yet because I have to wait longer to start and restart the simulations, it doesn't get me results faster overall.

Shorten the queue waiting time and speed up the processors. ...

Shorten the computational queue time.

Shorter queue time ...

I think NERSC should have another Franklin. Current waiting time is too long, at least for my jobs. ...

Computational resources are limited, thus resulting in long queues on the larger machines I need to access for my simulations. However, I understand fixing this problem is difficult and bound by economic constraints.

Turn around time is always an issue! More resources would be great!

I feel like queue times are extremely long, especially on franklin. I use the "showstart" command and it gives me one time, then 3 days later my job will run. I do not understand why my jobs continually get pushed back from the original estimates.

Some of the machines have a long waiting time for regular batch jobs. It would be more productive if this waiting time could somehow be reduced.

Make the wait time for queued jobs shorter.

Queue times are a bit long especially for low priority jobs.

throughput is bad -- need to wait >24h, sometimes days for running 24h.
Limit users?

less wait time ...

... and faster queue turnaround time: I know everyone wants this, but it's really the only issue I have had. ...

faster and more CPUs

More cpu cores.

Reduce job queue times. On Franklin one often waits ~1 week to use 16k procs. This is too long to wait. ...

Batch turn around time on Franklin excessive. (Maybe I should be checking out other NERSC machines?)

The only thing that would substantially improve my productivity would be shorter batch queue wait times.

  Implement different queue policies:   26 comments

Longer wall time limits:

NERSC response: In 2010 NERSC implemented queues with longer wall-time limits (a sample batch-script sketch follows the list below):

  • On Carver the reg_long queue has a wall limit of 168 hours (7 days) and reg_xlong's is 504 hours (21 days).
  • On Hopper Phase 1 the reg_long queue has a wall limit of 72 hours (3 days).
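
A minimal batch-script sketch of how a job might request one of these longer Carver queues (the walltime, node count, and executable name below are placeholders, and the syntax assumes Carver's Torque/Moab batch system):

      #!/bin/bash -l
      #PBS -q reg_long              # Carver queue with the 168-hour wall limit noted above
      #PBS -l walltime=168:00:00
      #PBS -l nodes=4:ppn=8         # illustrative size; Carver nodes had 8 cores each
      #PBS -N long_run
      cd $PBS_O_WORKDIR
      mpirun -np 32 ./my_code       # hypothetical executable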

    The max walltime for an interactive job on the GPU cluster is 30 minutes. I find it too short; I needed to resubmit my job constantly as I was debugging my code. It would be nice to make it longer.

    Have some increased time limits for job runs.

    ... Adding more queues to Franklin and Hopper with a larger wall clock time (maybe with a smaller number of nodes) could be very helpful.

    I would prefer an increased wall clock time, especially on the Franklin machine. The Franklin queue is more crowded, and with a short wall time it takes more time to finish the scheduled years of model runs. ...

    Some codes, due to the scaling limitations, perform better on smaller number of processors. Therefore, I will be glad to have ability to run such codes longer than 24 hours.

    Add a machine with longer wall-clock limit and less core/node (to save allocation). Not all users have to chase ultra massive parallelization for their research.

    ... Fourth, allow longer jobs (such as one month or half a year) on Carver and Hopper. Let science run its course.

    Allowed length of job is short (24-48 hours or so). I hope that users can make requests for jobs that can't be done in two pieces. Occasionally, one large phonon spectrum calculation can take something like 200 hours on 64-256 processors.

    Wall clock time limits need to be relaxed. Queuing systems need to be revised in terms of their priorities. In many cases 72 hr wall clock time is relatively short to obtain meaningful results. There should be a category that requires only a modest number of cores (say less than 256) but long wall clock time (up to 100 hrs).

    Please make the low queue on Carver at least 24 hrs.

    NERSC response: Thank you for your feedback regarding NERSC queues. We've increased the maximum run times of the Carver queues; there are now queues which run for longer than a week. Also, the Carver system does not favor jobs based on size. Small jobs have the same priority as larger jobs.

    Better treatment of smaller core size jobs:

    Stop giving preferential treatment to users who can effectively use large numbers of cores for their jobs. This could be done by giving jobs using small numbers of cores the same or higher priority as those using large numbers and increasing the number of 'small' jobs that can be run concurrently.

    I will be very happy if the queue policy at NERSC is more favorable to jobs requesting fewer than 200 CPUs, and if the queue policy favors jobs with 1/2 days time requests. The current queue really favors extremely short jobs, at most 6 hours.

    I often have a spectrum of job sizes to run (e.g., scaling studies, debugging and production runs) and the queuing structure/algorithms seem to be preferential to large runs. It would improve my productivity if this was more balanced or if there were nearly identical systems which had complementary queuing policies.

    Change their priorities for batch jobs so that it is possible to run jobs on 16-64 nodes for 48-72 hours. Currently such jobs have extremely long wait times, which, combined with a low cap on the total number of jobs in the queue, limits the possibility to run such jobs. These job sizes are typical for density functional theory calculations, one of the main workhorses in chemistry and materials science, and it makes no sense that NERSC disfavors these.

    Provide better understanding of how jobs are scheduled / more flexibility in scheduling options:

    ... Second minor complaint:
    Other people's jobs, submitted after mine and asking for the same number of CPUs and length of time in the same queue, somehow miraculously move ahead of mine due to some low-level hidden priority system. Extremely frustrating. I recognize that they think their science is of higher priority than mine and have somehow convinced NERSC that this is the case, but it gives the appearance that my science is of lower priority and valued less. It is my opinion that this practice should be terminated.

    NERSC response: NERSC recognizes that the tools available from our batch software provider do not allow for a good understanding of why jobs "move around" in the queues. As users release jobs put on user hold and as NERSC releases jobs put on system hold, and even as running jobs finish, the priority order in which jobs are listed changes. NERSC is working with Moab to obtain a better "eligible to run" time stamp which would help clarify this. Only rarely are these changes in queue placement due to users who have been "favored".

    Tell me how best to submit the kind of jobs I do. They do a good job with the heads up: this machine is empty, submit now.

    Provide an interface which can help the user determine which of the NERSC machines is more appropriate at a given time for running a job based on the number of processors and runtime that are requested.
    Alternatively, or in addition, provide the user with the ability to submit a job to a selection of NERSC machines, eventually asking for different # of processors and runtime for each machine based on their respective specifications, and let the "meta-scheduler" run the job on the first available machine.
    In addition, it would be nice to have more flexibility in the scheduling options. For example, being able to give a range of # of processors (and a range of runtimes) would be helpful (e.g., the user sometimes does not care whether the job will be performed using 1024 processors in 1 hour or 256 processors in 4 hours).
    These would help overall productivity and reduce the "need" for some users to submit multiple job requests (sometimes to multiple machines) and kill all but one when the first started.

    Higher run limits:

    Allow me to run more jobs at once, but then that wouldn't be fair to others.

    ... able to run more small jobs simultaneously

    More resources for interactive and debug jobs:

    Increasing the number of available processors for the debug queue on Carver will help me to shorten the debug cycle with more than 512 processors. Currently it is 256.

    During the working day, I would always encourage the availability of more development nodes over production ones.

    Other comments:

    Would be great to get an email sent out when your jobs are done running.
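
    In principle, the standard PBS mail directives provide this kind of notification (whether outgoing mail is enabled on a given system is a separate question); a minimal sketch with a placeholder address:

        #PBS -m ae                  # send mail when the job aborts (a) or ends (e)
        #PBS -M user@example.gov    # placeholder address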

    Flexibility of creating special queues for short term intensive use without extended waiting time. Hopefully it will not be too expensive, either.

    Better ability to manage group permissions and priorities for jobs, files, etc. The functionality of the idea of project accounts is still relevant.

    The wait time on a "low" priority setting can get somewhat ridiculous. I realize that the wait has to be sufficiently long, or else EVERYONE would use it. However, it seems that something could be done to ensure that X amount of progress is made in Y days for low priority jobs. Sometimes, given traffic these jobs may not progress in the queue at all over weeks and weeks. So .... some tweaking on that could be nice.

    Improve the queue structure so that jobs are checked for their ability to run before sitting in the queue for long periods of time. I have had a job sit for a day only to find out I had a path wrong and a file could not be found, and then had to wait another day to run. If a job has waited its turn but then is unable to execute, the space should be reserved for a replacement job by the same user if submitted in a timely fashion, or be given to another job from that user which is still waiting to be run.
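
    A user-side mitigation for part of this scenario is a small wrapper that refuses to submit until the required input files exist; a minimal sketch with hypothetical file and script names:

        #!/bin/bash
        # check_and_submit.sh -- hypothetical wrapper: verify inputs before queueing
        for f in input.nml restart_0042.chk; do    # placeholder file names
            if [ ! -e "$f" ]; then
                echo "Missing required file: $f -- not submitting." >&2
                exit 1
            fi
        done
        qsub run_job.pbs                           # hypothetical batch script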

      Provide more software / better software support:   15 comments

    Continuous support for the NAG math library.

    Find a way to allow shared libraries.

    The group I am in actively uses VASP. The Paul Kent version of VASP 4.6 at NERSC is extremely parallel allowing for more processors than atoms to be run and still scale well. Optimizing VASP 5.2 equally well would make me more productive. Although I am not sure NERSC has control over this.

    Debugging with totalview is difficult as the X window forwarding is slow for remote users. However, no easy solution comes to mind. (Other than using gdb, which isn't as nice)

    It would help developers if a more robust suite of profiling tools were available. For example there are some very good profiling tools for franklin, but they are not robust enough to analyze a very large program.

    Could you install gvim (vim -g) on Franklin?
    Is there any way jobs in the queue could have any estimated wait time attached to them based on the number of jobs in front of them?

    The tool showstart for predicting the time a job will spend on the queue before running starts is very inaccurate (so inaccurate it is useless).

    To provide support for code implementation on its platforms. It is really painful to go through the process of trying to implement new versions of codes when the computer environment and libraries are changed by NERSC. There is little care at NERSC about incompatibilities produced by changes in hardware, operating system, and libraries. NERSC should provide tools or support to analyze errors coming either from NERSC computer environment changes or from new versions of the codes. A simple tool to check differences among versions of codes or to correlate errors with OS/library changes should be available. It seems that this has to be done on a trial/error basis.

    ... Meteorological data analysis tools like GrADS need to be installed and distributed. This helps for a quick analysis of the desired model output. Now we are taking data to a local machine for analysis. I hope the installation of this software will save both computational time and manpower.

    Syncing the STAR library to RCF at BNL more frequently would be useful for data analysis as well as simulation.

    An explanation of the environment for STAR, though I am not sure it is your job.

    I use PDSF to analyze STAR data. There are many important STAR datasets that are not available on PDSF. That significantly impacts my use of PDSF. Not sure this is exactly a NERSC issue...

    Sometimes we don't get batch nodes and have to talk to Eric Hjort to figure out why we're stuck. Sometimes this has been because the IDL licenses or resource names have changed and we just don't know when stuff like that will happen.
    We are most productive when the software/configuration is stable, but of course upgrading things is a necessary evil so we totally understand.

    Allow subproject management and a helpful issue tracker (wikis, as well) a la github.com or bitbucket.org.

    Make it easier to add users and collaborate on projects.
    Hosting projects in a distributed repository system like Git or Mercurial in addition to SVN would greatly help code development.
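
    Where a project already lives in SVN, a one-way Git mirror is a low-effort way to get distributed workflows without waiting for hosting changes; a minimal sketch using git-svn (the repository URL is a placeholder, and this assumes git-svn is installed):

        # Clone an existing SVN trunk into a local Git repository (one-way mirror).
        git svn clone https://svn.example.org/myproject/trunk myproject-git
        cd myproject-git
        git log --oneline | head    # the full SVN history is now available locally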

      Provide more consulting / training / visualization:   11 comments

    Provide more introductory training or links to training media (e.g. video, etc) on high performance computing in general.

    ... I could possibly use some more web-based tutorials on various topics:
    MPI programming, data analysis with NERSC tools, a tutorial on getting Visit (visualization tool) to work on my Linux machine.

    A discussion of the queue structure, suggestions for getting the best turnaround, strategies for optimizing code for NERSC machines would be great.
    Add (current web content is v. nice, I will consume all you can produce) answers to questions like the following:
    what exactly is ftn doing??
    do you know of any 'gossip' regarding which compiler flags are effective on the various NERSC platforms? what are you guys using and why?
    how do I determine the 'correct' node count for a given calculation?
    on which platform should I run code x?
    as computer guys, what advice do you have for practicing (computational) scientists?
    do you feel like the NERSC computers are being used in the manner in which they were intended to be used?

    Improve my ability to learn how to best use NERSC resources: mostly in the form of easy to find and up to date documentation on how to use the systems and tools effectively.

    There seems to be a strong push for concurrency in the jobs on large core-count machines, especially Hopper, and often time to solution is ignored or not given enough weight when considering whether to install a particular application. Given the scarcity of resources, this policy seems to force researchers to resort to less efficient codes for a particular purpose.
    Hence, if NERSC can provide some benchmark calculations and perhaps rate the software in particular categories, e.g. computational chemistry, solid state chemistry, etc., it can be a tremendous help when deciding which software to use on a particular platform. ...

    I've had trouble with large jobs (4000+ cores) on Franklin, where I get various forms of errors due to the values of default MPI parameters (unex_buffer_size and many others). I find that I have to play around with these values to get the jobs to run -- this can be very frustrating given the semi-long queue times. This frustration has caused me to both run jobs at smaller core-counts (for longer time) and to run my largest jobs on the BG/P at ANL instead (although I would prefer to keep all my data at NERSC). While the helpdesk has been helpful in explaining which MPI parameters I should change etc, I still have not found any setting that removes all these issues. Alternatively, if I could learn how to change the code such that these problems don't occur I would do that -- but I don't know how. If the limitations on message/buffer sizes can't be removed, then maybe add a section on the website explaining what types of MPI calls might cause problems?
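
    For reference, the Cray MPI parameters this comment refers to are normally set as environment variables in the batch script; a minimal sketch (the values are illustrative only, not recommendations, and the right sizes depend on the code's message pattern):

        #!/bin/bash -l
        #PBS -l mppwidth=4096
        #PBS -l walltime=06:00:00
        cd $PBS_O_WORKDIR
        export MPICH_UNEX_BUFFER_SIZE=120M       # size of the unexpected-message buffer
        export MPICH_MAX_SHORT_MSG_SIZE=8000     # eager/rendezvous threshold, in bytes
        aprun -n 4096 ./my_app                   # hypothetical executable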

    Consulting should be easily available, see my comment above. [In old times (i.e. early 1980's), on line consulting help was readily available, and was very useful for the productivity. I hope I could get that kind of help now.]

    ... Faster response times from the help-desk on simple technical questions.

    Besides more time, the only thing I can think about is visualization, and we (as a collaboration) have not made a real effort to integrate visualization into our program, but this is something that must happen.

    ... I also need a better way to perform remote visualization on Euclid with Visit.

    I am really happy with the level of support and availability of computing resources. There are little things that could be adjusted, mostly related to the character of our project. For example, one of our students had trouble executing VisIt on Franklin using the batch system. I do not think the process is documented and explained. Ultimately, Hank Childs' assistance proved instrumental in resolving the problem. Also, our data sets are quite big and having twice as large a scratch space to start with would make production easier. Again, this is quite specific to our application profile. ...

      Things are currently good / not sure:   11 comments

    the machines are really good

    Maintain the same level of overall high quality as so far.

    The level is already so satisfactory.

    I am not sure about this.

    ... When I have time to be productive, NERSC systems and consultants are there to help. Unless you can give me a couple more days in the week, I can't think of anything NERSC could do.

    Keep it up.

    Obtaining the complete Hopper will be a great improvement on an already fantastic national resource.

    By doing as now keeping abreast of new visualization and analysis technologies and transferring these capabilities to users.

    Can't think of anything.

    at this point, the limit of my productivity is me, not NERSC

    Nothing that I can think of.

      Additional / different data storage services:   10 comments

    ... Increase home and scratch area quotas in general. A lot of time is wasted in managing the scratch space and archiving and storing the data.

    Treat batch jobs and visualization queues differently upon quota limit issue. Copying files to HPSS is quick enough, but it's pretty slow to transfer file back from HPSS to Franklin.

    ... Second, the files in the scratch directory should not be deleted without permission from the users. If the files exceed the limit, NERSC should send a list of users who exceed the limit and block submission of new jobs until the user's quota is back to normal. One could send two warnings beforehand. At present, everybody suffers, since just a few users exceed their limits.

    Improve the performance and scaling of file I/O, preferably via HDF5. Make the long term file storage HPSS interface more like a normal Unix file system. Provide larger allocations of scratch and project space.

    ... Third, HPSS should allow users to view file contents. Add a "less"/"more" there. At present, I have to transfer files back to Franklin and view them to see whether those are the files that I need. ...

    Make hsi/htar software available on a wider variety of Linux distributions.
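
    For the HPSS workflows mentioned in these comments, htar can bundle many files into one archive and list an archive's contents without retrieving it, while hsi browses what is stored; a minimal sketch with placeholder names:

        # Bundle a results directory into HPSS as a single archive (placeholder names).
        htar -cvf results_run42.tar run42/
        # List the archive's contents in HPSS without pulling the data back.
        htar -tvf results_run42.tar
        # Browse the HPSS home directory.
        hsi ls -l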

    If there were a quicker way to access stored data that would be nice.

    More scratch space on franklin ... The scratch space thing was an issue for me recently; it's sort of sporadic getting gauge configuration ensembles copied from NCSA's mass storage system, so it's helpful if I can move over a large number at once and leave them on your storage. I don't use most of the bells and whistles, so I have no comment about them.

    Give more disk space for data analysis.

    Access to data during downtimes is very useful. Also access to error/output files while a job is running is a useful way to help reduce wasted compute hours when a job isn't doing what it should. I believe NERSC is already working on increasing the number of machines with these features though.

      Improve stability/reliability:   10 comments

    Number one frustration:
    I wait in the franklin queue for four to six days. I finally get the cores I need to run (4096-8192), and the computation begins. Sometimes things hang together perfectly. However about one in three or four times, a single node will drop out mid computation and will bring the entire calculation to a screeching halt with everything lost since the last checkpoint file was written. I'm then stuck waiting in the queue for another four to six days only to have the same happen again.
    Reliability is absolutely essential. ...

    Keep the supercomputers up more. Make them more stable. Reduce the variability in the wallclock times for identical jobs.

    less down time.

    ... Make machines like Franklin go down less often.

    ... Minimize downtime.

    Less HPC machine downtime

    ... Finally, back in April/May, I believe, I/O on Carver appeared to suffer erratic changes in I/O rates for our code (FLASH + parallel HDF5). Perhaps this issue is resolved now.

    Keep higher uptime. ...

    My own productivity would benefit most from a reduction in machine downtimes and the "word too long" errors that occasionally kill batch jobs on startup. ...

    Many job crashes due to node failure and other unknown reasons have significant impact on my simulations on Franklin.

      Provide more memory:   6 comments

    The computational resources of NERSC are somehow still limited. My case is using VASP, which requires quite a lot of RAM, whereas the RAM of an individual Franklin node is too small. Also, the Cray XT4 processors might be obsolete.

    1)More memory per node
    2)More processors per node

    If in the future one could run jobs with more memory than is available at present, researchers in general would benefit tremendously.

    There is a need for machines with large shared memory. Not all tasks perform well on distributed memory, or alternatively they are hard to parallelize. As it is now, I don't use NERSC HPC resources due to there being too little memory on the nodes of all machines. Something with 32-64 GB of addressable memory per node would fill a very real niche.

    Most of my jobs require large memory per core. The application software is both CPU and memory intensive. The memory is usually the limit of the problem size I can solve. Wish the memory per core could be larger.
    The Carver machine is a better machine, but it has too few cores compared to Franklin. The queue time on Carver was usually very long.

    ... Memory is a constraint.

      Better network performance:   3 comments

    ... and more bandwidth when downloading files to my local machine

    Better bandwidth to small institutions

    Faster connection to external computers/networks

      Other comments

    Allow more than 3 password tries before locking someone out

    NERSC response: NERSC's password policies must adhere to Berkeley Lab's Operational Procedures for Computing and Communications, which specifies that "three failed attempts to provide a legitimate password for an access request will result in an access lockout".

    If scheduled maintenance was at the weekend that would make my work more productive.

    My only complaint is with the allocation process. While I understand that not all allocations can (or should) be fully awarded, it would be helpful to get some explanation as to what the reasoning was behind the amount that was awarded. Even just a couple sentences would be great, this would let the users know what it is we are doing right and what we can improve upon for the next round. We do spend a lot of time on these applications, so feedback is always welcome.

 

If there is anything important to you that is not covered in this survey, please tell us about it.   15 respondents

  Software comments

I encountered some strange behaviour with some installation code. When logged in from a CentOS Linux distribution it did not work. However, using another Linux machine it worked just fine. I am not sure what the problem could be, but some information about this issue posted on your webpage could be useful for new NERSC users.

If I haven't made it abundantly clear, start by making mercurial or git the default repository for hosting projects. Subversion is as outdated as AOL

Sometimes, software being kept updated has caused some trouble compiling models.

It is important for my legacy code that NERSC continue to provide access to older versions of certain scientific computing libraries, such as PETSc 2.3.x and HDF5 1.6.x. I have been happy with the availability of these libraries on Franklin and Hopper thus far, and hope it will continue.
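
Where several library versions are installed, environment modules are the usual way to pin a legacy version; a minimal sketch (the module names and version strings below are illustrative, not a statement of what is actually installed):

    module avail hdf5                # list the HDF5 versions that are installed
    module load hdf5/1.6.10          # illustrative legacy version string
    module load petsc/2.3.3          # illustrative legacy version string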

  Storage comments

I mostly run DFT code VASP on Hopper and Carver. My disk quota is only 40 GB which is definitely not enough for me. The volume of output files from one VASP job is typically around 1.2 GB. It means that I can only store 30 complete jobs on NERSC and I have to delete the rest of output files.
This is the only inconvenience that I felt for NERSC.
It would be perfect if I have a larger disk quota, for example, 400 GB.

... We could use a lot more disk space so that we wouldn't have to use the HPSS as a file server.

It would be great if NERSC could share its expertise with GPFS with other LBNL groups such as SCS, so that LBNL clusters could also use it.

The common home directory is a bad choice, since Carver and Hopper have different architectures from Franklin. My codes and job scripts are built on Franklin. When I transfer them to Carver and Hopper, they are not compatible. And even worse, Carver and Hopper do not have the same capacity as Franklin to run jobs. After I change them on Carver and Hopper, I cannot run the codes. I think either one should be allowed to run longer jobs on Carver/Hopper, or more nodes should be added. I understand Carver/Hopper replaces Bassi, but I think Carver/Hopper should do a better job than Bassi.
I have more to add, but I have to check my jobs.

  Job and Queue comments

I'd like to name the log file and output file myself when submitting a PBS job, instead of using the system-distributed XXXXXX.out and XXXXXX.err.
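
PBS does allow a job to name its own output and error files through the -N, -o, and -e directives; a minimal sketch (the file and executable names are placeholders):

    #PBS -N my_analysis
    #PBS -o my_analysis.log     # stdout goes here instead of the system-assigned name
    #PBS -e my_analysis.err     # stderr goes here instead of the system-assigned name
    cd $PBS_O_WORKDIR
    ./run_analysis              # hypothetical executable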

Improve walltime run limits. It is very inefficient to place 24-hour, or even 48-hour, time limits on quantum chemistry jobs. Simply computing start-up information such as wavefunctions or correlation requires a significant amount of time. If one wants to do much more than that, they get cut off at the time limit and then have to start over, thus having to re-do the expensive parts of the calculation. Couldn't there be a long queue in place? Even if this were limited to a very small number of processors it would still be useful.
I currently only use NERSC as a backup because of the short walltime limits.

I am disappointed that the number of nodes required for large jobs and the large jobs discount has increased. This change has dramatically increased my waiting times.

  Performance and Reliability comments

We don't use the NERSC systems much because it is too difficult to do simple things such as run our models. The machines are crashing every few days and for one or two days before a crash, things slow down dramatically.
Davinci is sometimes so slow as to be nearly useless.

Execution often stops due to node failure when using more than 1024 nodes for >10 hours. I am not sure what can be done about this, but it would be good to improve the reliability of each node.

  Other comments

I would hope that in the future, there would be more timely monitoring and notification of shutdowns affecting HPC assets such as Franklin (which goes down very frequently). I wish the main home page would be modified to have a real time status and MOTD display somewhere prominent.

I have trouble connecting with multiple SSH shells into Franklin.

Thank you all for such a good job.