NERSC First to Reach Decade-Long Goal of Seamless Shutdown, Restart on Massively Parallel Processing System
October 21, 1997
BERKELEY, CA - The National Energy Research Scientific Computing Center (NERSC) today announced a milestone in high-performance computing: successfully stopping and restarting a number of scientific computing jobs on a CRAY T3E supercomputer without any data processing loss or discontinuity.
The stop/restart procedure, called "checkpointing," was carried out twice in one week at NERSC and is believed to be the first of its kind accomplished on a massively parallel processing (MPP) supercomputer. Checkpointing has been a major goal in the MPP community since the first parallel machine was plugged in 10 years ago. C. William McCurdy, head of the Computing Sciences organization at Ernest Orlando Lawrence Berkeley National Laboratory, called NERSC's checkpointing milestone "a remarkable achievement."
Checkpointing maximizes system availability for users and minimizes wasted compute cycles because no recomputation is necessary after restarting. The process brings all of the programs running on the computer to the same stage (the checkpoint) and stops them, records all of their information, and transfers that information out of the machine. The information is later loaded back in and the programs are set running again with no loss of processing time or data. Each unfinished application resumes from the point at which it was interrupted.
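The stop, record, restore and resume sequence described above can be illustrated with a small Python sketch. This is a toy, application-level analogy only: the checkpointing described in this release is performed by the system, without changes to user programs, and nothing below reflects the Cray software. The function names (`run`, `checkpoint`, `restart`) and the job-state layout are hypothetical, chosen for illustration.

```python
import pickle

def run(state, steps):
    """Advance a toy job: accumulate a running sum of squares."""
    for _ in range(steps):
        state["total"] += state["step"] ** 2
        state["step"] += 1
    return state

def checkpoint(jobs):
    """Record the state of every job so it can be moved off the machine."""
    return pickle.dumps(jobs)

def restart(blob):
    """Load the saved states back in; each job resumes where it stopped."""
    return pickle.loads(blob)

# Reference run: two jobs complete 10 steps each without interruption.
uninterrupted = [
    run({"step": 0, "total": 0}, 10),
    run({"step": 5, "total": 0}, 10),
]

# Same jobs, stopped after 4 steps, saved, restored, then run 6 more steps.
jobs = [
    run({"step": 0, "total": 0}, 4),
    run({"step": 5, "total": 0}, 4),
]
saved = checkpoint(jobs)
jobs = None  # simulate the shutdown: in-memory state is gone
resumed = [run(job, 6) for job in restart(saved)]
```

In this sketch the restored jobs finish in the same state as the uninterrupted ones, which is the property that makes recomputation unnecessary.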
Although this feature has been available on Cray's vector systems for more than 12 years, the company made it available on the T3E system only in 1997. Checkpointing on MPP systems is significantly more difficult because of the complexity of synchronizing up to 2,000 processors.
Although being able to stop and restart a computer system without data loss is important for any system, the value grows with the size of the system. For example, without checkpointing, when a single-processor system runs 12 hours of computing work, is interrupted and cannot be restarted, it loses 12 hours of work. By contrast, when a 2,000-processor system runs for 12 hours and cannot be restarted, it loses 24,000 processor-hours of computing work.
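The arithmetic behind this comparison is simply the number of processors multiplied by the hours run. A minimal Python sketch, using a hypothetical helper name and assuming every processor is busy for the whole run:

```python
def lost_processor_hours(processors, hours_run):
    """Work lost when a run cannot be restarted, in processor-hours.

    Assumes all processors were in use for the entire run.
    """
    return processors * hours_run

single = lost_processor_hours(1, 12)      # single-processor system
mpp = lost_processor_hours(2000, 12)      # 2,000-processor MPP system
```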
"As far as I know, no other MPP system vendor is planning to have system-wide checkpoint/restart features without having to reprogram applications," said Bill Kramer, deputy director of NERSC. "Therefore, this is really a momentous step for those of us in the high-performance computing community."
Successful checkpointing will allow the NERSC staff to suspend system work with minimal disruption and downtime for the hundreds of T3E users around the country, making NERSC an even more valuable computational science resource.
"This signifies a major milestone in Cray's and NERSC's commitment to provide robust, reliable MPP computing cycles to DOE's unclassified energy research community," said Michael Declerck of the NERSC Systems Group. Declerck is the computer scientist charged with putting the CRAY T3E-900 through its month-long acceptance tests. The machine was delivered in mid-July. "We think this is the first practical demonstration of checkpointing in a working, MPP production environment. This functionality allows NERSC to minimize disruption to scientific computing and provides the center with the capability to run large and extremely long-running jobs."
In addition to allowing "transparent" maintenance and upgrades, the checkpointing software tool will allow NERSC to efficiently move jobs between processors or make larger pools of processors available for bigger jobs. The center will be able to efficiently manage the transition from running large workloads with many different applications to dedicating the system to one single, complex problem that spans the full 512-processor system.
"MPP started as an experiment in a small niche of the scientific research computing environment," said Steve Reinhardt, CRAY T3E project director at Cray Research Inc. "This achievement with the CRAY T3E illustrates Cray's commitment to production-quality, highly scalable computing. This expands the applicability of MPP to a much wider range of industries and uses." The successful checkpointing was made possible by software developed by Cray Research Inc. in close collaboration with NERSC. The procedure was successfully demonstrated on both of NERSC's CRAY T3E supercomputers, the 512-processor and the 160-processor units. The checkpointing was performed once to allow scheduled maintenance and a second time to test advanced operating system features. The restarted jobs were running on clusters ranging from 16 to 256 processors.
"After we completed the downtimes, all of the user jobs on the machine were successfully restarted and the machines were put back on line," said James Craw, head of the NERSC Systems Group. "It's interesting that we achieved this major milestone and none of our users noticed--which was our objective."
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, NERSC serves almost 10,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.