Survey Response
Many thanks to the 421 users who responded to this year's User Survey. The response rate is comparable to last year's, and both are significantly higher than in previous years:
- 77.4 percent of users who had used more than 250,000 XT4-based hours by the time the survey opened responded.
- 36.6 percent of users who had used between 10,000 and 250,000 XT4-based hours responded.
- The overall response rate for the 3,134 authorized users during the survey period was 13.4 percent.
- The MPP hours used by the survey respondents represent 70.2 percent of total NERSC MPP usage as of the end of the survey period.
- The PDSF hours used by the PDSF survey respondents represent 36.8 percent of total NERSC PDSF usage as of the end of the survey period.
The respondents represent all six DOE Science Offices and a variety of home institutions: see Respondent Demographics.
The survey responses provide feedback about every aspect of NERSC's operation, help us judge the quality of our services, give DOE information on how well NERSC is doing, and point us to areas we can improve. The survey results are listed below.
You can see the 2008/2009 User Survey text, in which users rated us on a 7-point satisfaction scale. Some areas were also rated on a 3-point importance scale or a 3-point usefulness scale.
The average satisfaction scores from this year's survey ranged from a high of 6.68 (very satisfied) to a low of 4.71 (somewhat satisfied). Across 94 questions, users chose the Very Satisfied rating 8,060 times, and the Very Dissatisfied rating 90 times. The scores for all questions averaged 6.15, and the average score for overall satisfaction with NERSC was 6.21. See All Satisfaction Ratings.
For questions that spanned previous surveys, the change in rating was tested for significance (using the t test at the 90% confidence level). Significant increases in satisfaction are shown in blue; significant decreases in satisfaction are shown in red.
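As an illustration only (this is not the actual analysis code used for the survey), the following minimal Python sketch shows how a change in the mean rating for one question could be tested with a two-sample t-test at the 90% confidence level; the rating lists are made-up data.

```python
# Hypothetical sketch of the significance test described above: compare this
# year's ratings with last year's for a single question and flag a change
# at the 90% confidence level. The data below are illustrative only.
from scipy import stats

ratings_2008 = [6, 7, 5, 6, 7, 6, 4, 7, 6, 5]   # made-up 2007/2008 scores
ratings_2009 = [7, 6, 7, 7, 6, 7, 5, 7, 6, 7]   # made-up 2008/2009 scores

t_stat, p_value = stats.ttest_ind(ratings_2009, ratings_2008)
if p_value < 0.10:                               # 90% confidence level
    direction = "increase" if t_stat > 0 else "decrease"
    print(f"Significant {direction} in satisfaction (p = {p_value:.3f})")
else:
    print(f"No significant change (p = {p_value:.3f})")
```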
Highlights of the 2009 user survey responses include:
- Areas with Highest User Satisfaction
- Areas with Lowest User Satisfaction
- Largest Increases in Satisfaction
- Largest Decreases in Satisfaction
- Satisfaction Patterns for Different MPP Respondents
- Changes in Satisfaction for Active MPP Respondents
- Changes in Satisfaction for PDSF Respondents
- Survey Results Lead to Changes at NERSC
- Users Provide Overall Comments about NERSC
The complete survey results are listed below and are also available from the left hand navigation column.
Areas with Highest User Satisfaction
Areas with the highest user satisfaction are HPSS reliability and uptime, account and consulting support, grid job monitoring, NERSC Global Filesystem uptime and reliability, and network performance within the NERSC center.
7=Very satisfied, 6=Mostly satisfied, 5=Somewhat satisfied, 4=Neutral, 3=Somewhat dissatisfied, 2=Mostly dissatisfied, 1=Very dissatisfied
Item | Num who rated this item as: | Total Responses | Average Score | Std. Dev. | Change from 2007 | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 | |||||
HPSS: Reliability (data integrity) | 1 | 3 | 1 | 36 | 116 | 157 | 6.68 | 0.65 | 0.01 | ||
SERVICES: Account support | 1 | 3 | 8 | 87 | 248 | 347 | 6.66 | 0.64 | -0.05 | ||
HPSS: Uptime (Availability) | 2 | 4 | 44 | 107 | 157 | 6.63 | 0.60 | 0.09 | |||
CONSULT: Timely initial response to consulting questions | 2 | 4 | 10 | 89 | 221 | 326 | 6.60 | 0.67 | 0.05 | ||
GRID: Job Monitoring | 2 | 2 | 17 | 41 | 62 | 6.56 | 0.72 | 0.48 | |||
OVERALL: Consulting and Support Services | 1 | 3 | 7 | 11 | 108 | 256 | 386 | 6.56 | 0.76 | -0.07 | |
NGF: Uptime | 4 | 19 | 46 | 69 | 6.55 | 0.78 | -0.12 | ||||
NGF: Reliability | 1 | 2 | 1 | 19 | 46 | 69 | 6.55 | 0.80 | -0.13 | ||
CONSULT: Overall | 1 | 2 | 7 | 13 | 98 | 212 | 333 | 6.53 | 0.77 | -0.04 | |
NETWORK: Network performance within NERSC (e.g. Seaborg to HPSS) | 1 | 1 | 5 | 5 | 56 | 117 | 185 | 6.51 | 0.83 | -0.08 |
Areas with Lowest User Satisfaction
Areas with the lowest user satisfaction are Bassi batch wait times and Franklin uptime. This year only two questions received average scores lower than 5.5, and no average score was lower than 4.5. Last year, one average score was lower than 4.5 (Bassi batch wait time) and nine were between 4.5 and 5.5.
7=Very satisfied, 6=Mostly satisfied, 5=Somewhat satisfied, 4=Neutral, 3=Somewhat dissatisfied, 2=Mostly dissatisfied, 1=Very dissatisfied
Item | Num who rated this item as: | Total Responses | Average Score | Std. Dev. | Change from 2007 | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 | |||||
Franklin: Uptime (Availability) | 11 | 15 | 46 | 25 | 71 | 89 | 45 | 302 | 4.91 | 1.62 | -0.13 |
Bassi: Batch wait time | 7 | 9 | 21 | 11 | 27 | 38 | 16 | 129 | 4.71 | 1.72 | 0.25 |
Largest Increases in Satisfaction
The largest increases in satisfaction over last year's survey are for PDSF interactive services, grid job monitoring, Franklin I/O performance, the PDSF and Jacquard batch queue structures, and network connectivity.
7=Very satisfied, 6=Mostly satisfied, 5=Somewhat satisfied, 4=Neutral, 3=Somewhat dissatisfied, 2=Mostly dissatisfied, 1=Very dissatisfied
Item | Num who rated this item as: | Total Responses | Average Score | Std. Dev. | Change from 2007 | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 | |||||
PDSF: Ability to run interactively | 1 | 1 | 2 | 3 | 23 | 23 | 53 | 6.15 | 1.13 | 0.60 | |
GRID: Job Monitoring | 2 | 2 | 17 | 41 | 62 | 6.56 | 0.72 | 0.48 | |||
Franklin: Disk configuration and I/O performance | 7 | 5 | 13 | 35 | 29 | 112 | 81 | 282 | 5.60 | 1.43 | 0.46 |
PDSF: Batch queue structure | 1 | 1 | 7 | 20 | 23 | 52 | 6.21 | 0.89 | 0.33 | ||
Jacquard: Batch queue structure | 1 | 3 | 11 | 44 | 36 | 95 | 6.17 | 0.83 | 0.25 | ||
OVERALL: Network connectivity | 1 | 1 | 10 | 13 | 30 | 135 | 205 | 395 | 6.28 | 0.99 | 0.15 |
Largest Decreases in Satisfaction
The largest decreases in satisfaction from last year's survey are for Franklin batch wait time, 24x7 computer and network operations support, and the NERSC web site.
7=Very satisfied, 6=Mostly satisfied, 5=Somewhat satisfied, 4=Neutral, 3=Somewhat dissatisfied, 2=Mostly dissatisfied, 1=Very dissatisfied
Item | Num who rated this item as: | Total Responses | Average Score | Std. Dev. | Change from 2007 | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 | |||||
Franklin: Batch wait time | 4 | 5 | 20 | 24 | 57 | 119 | 70 | 299 | 5.55 | 1.32 | -0.30 |
SERVICES: Computer and network operations support (24x7) | 3 | 10 | 20 | 30 | 111 | 172 | 346 | 6.17 | 1.09 | -0.17 | |
WEB SERVICES: www.nersc.gov overall | 1 | 3 | 9 | 20 | 167 | 148 | 348 | 6.28 | 0.80 | -0.10 |
Satisfaction Patterns for Different MPP Respondents
The MPP respondents were classified as "large" (usage over 250,000 hours), "medium" (usage between 10,000 and 250,000 hours), or "small" (usage under 10,000 hours). Satisfaction differences between these three groups are shown in the table below. Comparing their scores with those of all the 2007/2008 respondents, this year's smaller users were the most satisfied and the larger users the least satisfied.
Item | Large MPP Users: Num Resp | Large MPP Users: Avg Score | Large MPP Users: Change from 2007 | Medium MPP Users: Num Resp | Medium MPP Users: Avg Score | Medium MPP Users: Change from 2007 | Small MPP Users: Num Resp | Small MPP Users: Avg Score | Small MPP Users: Change from 2007
---|---|---|---|---|---|---|---|---|---
GRID: Job Monitoring | 13 | 6.54 | -0.04 | 26 | 6.54 | 0.46 | 11 | 6.64 | 0.56 |
SERVICES: Account support | 67 | 6.54 | -0.17 | 130 | 6.63 | -0.07 | 77 | 6.79 | 0.09 |
OVERALL: Security | 72 | 6.12 | -0.23 | 145 | 6.44 | 0.07 | 82 | 6.55 | 0.19 |
WEB SERVICES: NIM web interface | 71 | 6.35 | 0.07 | 135 | 6.44 | 0.16 | 76 | 6.49 | 0.21 |
OVERALL: Network connectivity | 74 | 6.08 | -0.05 | 147 | 6.35 | 0.22 | 84 | 6.40 | 0.28 |
SERVICES: Computer and network operations support (24x7) | 67 | 5.96 | -0.39 | 128 | 6.14 | -0.21 | 68 | 6.37 | 0.02 |
Jacquard: Batch queue structure | 14 | 5.50 | -0.42 | 36 | 6.17 | 0.25 | 31 | 6.39 | 0.47 |
NETWORK: Remote network performance to/from NERSC | 67 | 5.94 | -0.12 | 90 | 6.19 | 0.13 | 51 | 6.37 | 0.32 |
Jacquard: Disk configuration and I/O performance | 13 | 5.31 | -0.67 | 33 | 6.30 | 0.32 | 31 | 5.97 | -0.01 |
HPSS: User interface | 44 | 5.82 | -0.14 | 53 | 6.02 | 0.06 | 29 | 6.38 | 0.42 |
OVERALL: Available Computing Hardware | 73 | 5.62 | -0.51 | 151 | 5.98 | -0.14 | 86 | 6.20 | 0.07 |
OVERALL: Hardware management and configuration | 72 | 5.64 | -0.34 | 142 | 5.75 | -0.23 | 79 | 5.89 | -0.09 |
Franklin: Ability to run interactively | 56 | 5.75 | 0.17 | 108 | 5.67 | 0.09 | 46 | 5.93 | 0.36 |
Bassi: Batch queue structure | 18 | 5.17 | -0.40 | 58 | 5.53 | -0.03 | 33 | 5.94 | 0.37 |
OVERALL: Data analysis and visualization facilities | 42 | 5.40 | -0.08 | 75 | 5.51 | 0.03 | 43 | 6.00 | 0.50 |
Franklin: Disk configuration and I/O performance | 70 | 5.41 | 0.27 | 133 | 5.60 | 0.46 | 56 | 5.71 | 0.57 |
Jacquard: Batch wait time | 15 | 4.60 | -0.87 | 38 | 5.37 | -0.10 | 33 | 5.91 | 0.44 |
Franklin: Batch wait time | 73 | 5.45 | -0.40 | 142 | 5.49 | -0.36 | 61 | 5.70 | -0.14 |
Bassi: Batch wait time | 18 | 3.61 | -0.85 | 64 | 4.48 | 0.03 | 35 | 5.23 | 0.77 |
Changes in Satisfaction for Active MPP Respondents
The table below includes only those users who have run batch jobs on the MPP systems. It does not include interactive-only users or project managers who do not compute. This group of users showed an increase in satisfaction for the NERSC Information Management (NIM) web interface, which did not show up in the pool of all respondents. This group also showed a decrease in satisfaction for available computing hardware and hardware management and for two of the Jacquard questions.
Item | Num who rated this item as: | Total Responses | Average Score | Std. Dev. | Change from 2007 | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 | |||||
GRID: Job Monitoring | 2 | 2 | 12 | 34 | 62 | 6.56 | 0.76 | 0.48 | |||
Franklin: Disk configuration and I/O performance | 7 | 5 | 13 | 31 | 26 | 105 | 72 | 259 | 5.58 | 1.46 | 0.43 |
OVERALL: Network connectivity | 1 | 6 | 11 | 23 | 105 | 159 | 305 | 6.30 | 0.94 | 0.17 | |
WEB SERVICES: NIM web interface | 3 | 4 | 15 | 106 | 154 | 282 | 6.43 | 0.75 | 0.15 | ||
OVERALL: Available Computing Hardware | 2 | 4 | 7 | 17 | 42 | 129 | 109 | 310 | 5.95 | 1.13 | -0.17 |
SERVICES: Computer and network operations support (24x7) | 3 | 10 | 14 | 21 | 84 | 131 | 263 | 6.15 | 1.14 | -0.20 | |
Jacquard: Uptime (availability) | 1 | 1 | 2 | 4 | 36 | 45 | 89 | 6.28 | 0.86 | -0.21 | |
OVERALL: Hardware management and configuration | 3 | 1 | 11 | 18 | 57 | 129 | 74 | 294 | 5.76 | 1.13 | -0.22 |
Jacquard: Overall | 1 | 1 | 2 | 6 | 5 | 46 | 29 | 90 | 5.97 | 1.15 | -0.29 |
Franklin: Batch wait time | 4 | 5 | 18 | 21 | 55 | 112 | 61 | 276 | 5.53 | 1.32 | -0.32 |
Changes in Satisfaction for PDSF Respondents
The PDSF users are clearly less satisfied with web services at NERSC compared with the MPP users.
Item | Num who rated this item as: | Total Responses | Average Score | Std. Dev. | Change from 2007 | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 | |||||
PDSF: Ability to run interactively | 1 | 1 | 1 | 3 | 19 | 15 | 39 | 6.08 | 1.20 | 0.53 | |
PDSF: Batch queue structure | 1 | 5 | 14 | 17 | 37 | 6.24 | 0.89 | 0.36 | |||
WEB SERVICES: NIM web interface | 2 | 4 | 2 | 17 | 14 | 39 | 5.95 | 1.15 | -0.33 | ||
WEB SERVICES: www.nersc.gov overall | 1 | 1 | 7 | 15 | 10 | 34 | 5.94 | 0.95 | -0.44 | ||
WEB SERVICES: Ease of finding information | 3 | 2 | 7 | 14 | 6 | 32 | 5.56 | 1.16 | -0.49 | ||
SERVICES: Allocations process | 4 | 1 | 4 | 11 | 8 | 28 | 5.64 | 1.34 | -0.53 | ||
TRAINING: Web tutorials | 1 | 4 | 3 | 7 | 4 | 19 | 5.47 | 1.22 | -0.67 |
Survey Results Lead to Changes at NERSC
Every year we institute changes based on the previous year's survey. In 2008 and early 2009, NERSC took a number of actions in response to suggestions from the 2007/2008 user survey.
- 2007/2008 user survey: On the 2007/2008 survey Franklin's Disk configuration and I/O performance received the third lowest average score (5.15).
NERSC response: In the past year NERSC and Cray staff worked extensively on benchmarking and profiling collective I/O performance on Franklin, conducting a detailed exploration into the source of the low performance (less than 1 GB/s write bandwidth) reported by several individual researchers.
A number of issues were explored at multiple levels of the system and software stack: high-level NetCDF calls, MPI-IO optimizations and hints, block and buffer size allocation on individual nodes, Lustre striping parameters, and the underlying I/O hardware. (A brief sketch of this kind of MPI-IO and Lustre tuning appears after this list.)
The resulting measurements were instrumental in making the case for additional I/O hardware and for software and configuration changes. Once these changes were implemented, the cumulative effect of the hardware, software, and middleware improvements was that a class of applications can now achieve I/O bandwidths in the 6 GB/s range.
On the 2009 survey Franklin's Disk configuration and I/O performance received an average score of 5.60, a statistically significant increase of 0.46 points over the previous year.
- 2007/2008 user survey: On the 2007/2008 survey Franklin uptime received the second lowest average score (5.04).
NERSC response: In the past year NERSC and Cray assembled a team of about 20 people to thoroughly analyze system component layouts, cross interactions and settings; to review and analyze past causes of failures; and to propose and test software and hardware changes. Intense stabilization efforts took place between March and May, with improvements implemented throughout April and May.
As a result of these efforts, Franklin's overall availability went from an average of 87.6 percent in the six months prior to April to an average of 94.97 percent in the April through July 2009 period. Over the same period, the Mean Time Between Interrupts improved from an average of 1 day 22 hours 39 minutes to 3 days 20 hours 36 minutes.
The Franklin uptime score in the 2009 survey (which opened in May) did not reflect these improvements. NERSC anticipates an improved score on next year's survey.
- 2007/2008 user survey: On the 2007/2008 survey the two lowest PDSF scores were "Ability to run interactively" and "Disk configuration and I/O performance".
NERSC response: In 2008 NERSC upgraded the interactive PDSF nodes to more powerful, larger-memory nodes. In early 2009, we reorganized the user file systems on PDSF to allow for failover, reducing the impact of hardware failures on the system. We also upgraded the network connectivity to the file system server nodes to provide greater bandwidth. In addition, NERSC added a queue for short debug jobs.
On the 2009 survey the PDSF "Ability to run interactively" score increased significantly by 0.60 points and moved into the "mostly satisfied - high" range. The PDSF "Disk configuration and I/O performance" score increased by 0.41 points, but this increase was not statistically significant (at the 90 percent confidence level).
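The Franklin I/O work described in the first item above involved tuning MPI-IO hints and Lustre striping parameters. As a rough, hypothetical illustration (not NERSC's or Cray's actual code), the sketch below shows how an application might set such hints through mpi4py; the hint values and file path are placeholders.

```python
# Hypothetical sketch: setting the kind of MPI-IO / Lustre striping hints
# explored in the Franklin I/O work, via mpi4py. Hint names are standard
# ROMIO hints; the values and output path are illustrative placeholders.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# MPI-IO hints: enable collective buffering and request wide Lustre striping.
info = MPI.Info.Create()
info.Set("romio_cb_write", "enable")    # aggregate writes on a few nodes
info.Set("striping_factor", "48")       # number of Lustre OSTs (illustrative)
info.Set("striping_unit", "1048576")    # 1 MiB stripe size (illustrative)

# Each rank writes a contiguous 1 MiB block at its own offset, collectively.
block = np.full(131072, rank, dtype=np.float64)   # 131072 * 8 B = 1 MiB
fh = MPI.File.Open(comm, "/scratch/example_output.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY, info)
fh.Write_at_all(rank * block.nbytes, block)       # collective write
fh.Close()
info.Free()
```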
Users Provide Overall Comments about NERSC
130 users answered the question "What does NERSC do best? How does NERSC distinguish itself from other computing centers you have used?"
- 65 respondents mentioned good consulting, staff support and communications;
- 50 users mentioned computational resources or HPC resources for science;
- 20 highlighted good software support;
- 15 pointed to good queue management or job turnaround;
- 15 were generally happy;
- 14 mentioned good documentation and web services;
- 9 were pleased with data services (HPSS, large disk space, data management);
- 8 complimented good networking, access and security.
Some representative comments are:
Organization is top notch. Queuing is excellent.
Nersc is good at communicating with its users, provides large amounts of resources, and is generally one of the most professional centers I've used.
EVERYTHING !!! From the computing centers that I have used NERSC is clearly a leader.
Very easy to use. The excellent website is very helpful as a new user. Ability to run different jobsizes, not only 2048*x as on the BG/P. In an ideal world I'd only run at NERSC!
NERSC tends to be more attuned to the scientific community than other computer centers. Although it has taken years of complaining to achieve, NERSC is better at providing 'permanent' disk storage on its systems than other places.
NERSC's documentation is very good and the consultants are very helpful. A nice thing about NERSC is that they provide a number of machines of different scale with a relatively uniform environment which can be accessed from a global allocation. This gives NERSC a large degree of flexibility compared to other computational facilities.
As a user of PDSF, I have at NERSC all the resources to analyze the STAR data in a speedy and reliable way, knowing that NERSC keep the latest version of data analysis software like ROOT. Thank you for the support.
NERSC has very reliable hardware, excellent administration, and a high throughput. Consultants there have helped me very much with projects and problems and responded with thoughtful messages for me and my problem, as opposed to terse or cryptic pointers to information elsewhere. The HPSS staff helped me set up one of the earliest data sharing archives in 1998, now part of a larger national effort toward Science Gateways. (see: http://www.lbl.gov/cs/Archive/news052609b.html) This archive has a venerable place in the lattice community and is known throughout the community as "The NERSC Archive". In fact until recently, the lingua franca for exchanging lattice QCD data was "NERSC format", a protocol developed for the archive at NERSC.
The quality of the technical staff is outstanding. They are competent, professional, and they can answer questions ranging from the trivial to the complex.
Getting users started! it can take months on other systems.
Very good documentation of systems and available software. Important information readily available on single web page that also contains links to the original documentation.
113 users responded to the question "What can NERSC do to make you more productive?"
The top two areas of concern were Franklin stability and performance, and the need for more computing resources. Users made suggestions in the areas of data storage, job scheduling, software and allocations support, services, PDSF support and networking.
Some of the comments from this section are:
A few months ago I would have said "Fix Franklin please!!" but this has been done since then and Franklin is a LOT more stable. Thanks...
For any users needing multiple processors, Franklin is the only system. The instability, both planned and unplanned downtimes, of Franklin is *incredibly* frustrating. Add in the 24 hour run time limit, it is amazing that anyone can get any work done.
have more machines of different specialties to reduce the queue (waiting) time
Highly reliable, very stable, high performance architectures like Bassi and Jacquard.
When purchasing new systems, there are obviously many factors to consider. I believe that more weight should be given to continuity of architecture and OS. For example, the transition from Seaborg to Bassi was almost seemless for me, whereas the transition from Bassi to Franklin is causing a large drop in productivity, ie porting codes and learning how to work with the new system. I estimate my productivity has dropped by 50% for 6 months. To be clear, this is NOT a problem with Franklin, but rather the cost of porting, and learning how to work on a different architecture.
Put more memory per core on large-scale machines (>8 GB/core). Increase allowed wall clock times to 48 or 96 hours.
Enhance the computing power to meet the constrained the needs of high performance computation.
Save scratch files still longer
Get the compute nodes on Franklin to see NGF or get a new box.
Make more permanent disk space available on Franklin. It needs something line the project disk space to be visible to the compute nodes. The policies need to be changed to be more friendly to the user whose jobs use 10's or 100's pf processors, and stop making those of us who can't allocate 1000's of processors to a single job feel like second-class users. It should be at least as easy to run 100 50 CPU jobs as one 5000 CPU job. The current queue structure makes it difficult if not impossible for some of us to use our allocations.
Enable longer interactive jobs on Franklin login nodes. Some compile jobs require more than 60 minutes, making building a large code base -- or diagnosing problems with the build process -- difficult. Also, it would be useful to be able to run longer visualization jobs without copying large data sets from one systems /scratch to another. This would be for running visualization code that can't be run on compute nodes; for instance, some python packages require shared libraries.
it would be useful if it was easier to see why a job crashed. I find the output tends to be a little terse.
NERSC does an excellent job in adding new software as it becomes available. It is important to continue doing so.
Allocate more time!
We can always use man power to improve the performance and scaling of our codes.
Keep doing what you are doing. I'm particularly interested in the development of the Science Gateways.
23 users responded to the question "If there is anything important to you that is not covered in this survey, please tell us about it."