Resiliency Planning
This page provides information and resources to help NERSC users plan for and work around downtime. NERSC consistently outperforms its uptime targets; however, occasional outages are necessary and unavoidable. These may be planned (e.g. regular required system maintenance or power work) or unplanned/emergency (e.g. in the event of a power outage or a critical security vulnerability). We recommend that users create a resiliency plan to make it easy to run their workload elsewhere during these outages, if required.
Points to consider when writing a resiliency plan:
- Outages can impact any one of NERSC's systems, sometimes individually and sometimes all together (for example in the case of a power outage). Consider which part of NERSC's resources your workflow relies upon - for example, you may still be able to archive data via Globus on the DTN nodes even if Cori compute nodes are down.
- Upcoming scheduled maintenances are listed on the outage calendar. You can use this to plan your time-sensitive work.
- Our policy is to give 7 days' notice for planned system outages, and 1 month's notice for planned center-wide outages.
- Live system status (including scheduled and unscheduled outages, and issues that are degrading system performance) can be found on the live status page.
- NERSC has different levels of support for the services we offer. For example, issues with the services at the 8x5 support level will not be fixed over the weekend. These are listed on the Service Levels page.
- Data in NERSC home directories and project directories are backed up daily, and those backups are stored for 7 days; data on the Burst Buffer and Scratch filesystems are not. Please see this page for the full NERSC data policy, including access, backup and retention policies.
- We recommend using Globus to transfer data in and out of NERSC. This page describes the NERSC endpoints and gives advice on troubleshooting your Globus connection.
NERSC is working on a number of downtime mitigation efforts, which are designed to reduce the disruption that maintenances and outages cause to users:
- NERSC's configuration management process, SMWflow, was put into practice in 2018 and has reduced the duration of maintenances by 25%.
- The NERSC Global Filesystem (NGF, which includes user home directories and the project filesystems) supports rolling upgrades so that the filesystem can stay up while performing software upgrades. This helped NGF achieve 100% scheduled availability in 2018. NGF is architected to be available when the compute systems are down, which can help with data movement in and out of NERSC during outages.
- With the launch of the Spin service, we are able to provide a more reliable and dynamic platform for science gateways and workflow services.
- Docker images can be used to create portable applications that can run across multiple compute platforms. At NERSC, we have developed Shifter to deploy Docker containers on HPC.
- c2d is a NERSC utility that allows you to create a Docker container from your Conda environment and Jupyter notebook. You can use this tool to transfer your Jupyter-based workflow to another computing resource.
- Checkpointing code (i.e. the act of saving the state of a running process to a checkpoint image file) with the ability to restart it later is critical to fault-tolerant computing. NERSC recommends users incorporate checkpointing into long-running jobs for resilience against node failures or unexpected outages, for example using DMTCP. See these pages for more information.
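
To illustrate the checkpointing idea in the last point, here is a minimal sketch of application-level checkpointing in Python. Note that this is not DMTCP, which checkpoints a running process transparently; instead, the program saves and reloads its own state. The file name `checkpoint.json`, the function names, the `stop_after` parameter (used only to simulate an interruption) and the toy "work" of summing integers are all hypothetical and chosen for illustration.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical file name


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write the state to a temporary file, then rename it over the target.

    The rename is atomic on POSIX filesystems, so an interruption during
    the write does not leave a half-written checkpoint in place.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(n_steps, checkpoint_every=10, path=CHECKPOINT_PATH, stop_after=None):
    """Resume from the last checkpoint and run until n_steps are done.

    stop_after is a hypothetical knob that simulates an outage by
    returning early without saving a final checkpoint.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated interruption: unsaved progress is lost
        state["total"] += state["step"]  # placeholder for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state
```

If a run is interrupted, calling `run` again with the same checkpoint path resumes from the most recent saved step rather than from the beginning, so at most `checkpoint_every` steps of work are repeated. Writing the checkpoint to a filesystem that is not purged and remains reachable during compute outages makes it easier to restart the job elsewhere.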