System Status and Outage Notification Policy
System Status Definitions
Down (System Wide Outage, or SWO)
Any situation where one of the following is true:
- Utilization drops below 50% for more than 15 minutes due to a malfunction of hardware, software, or other infrastructure.
- The system is still up according to the utilization criterion above, but users are unable to log in to the system and perform normal functions for more than 60 minutes. Normal functions include compiling codes, submitting batch jobs or querying the batch system, transferring data into and out of the system, and using the primary scratch file system.
- NERSC management determines that the system fails to adequately fulfill its intended function (e.g., experiences a high job failure rate).
Degraded
Any situation where one of the following is true:
- System utilization is less than 90% and greater than 50% due to a malfunction of hardware, software, or other infrastructure.
- Any mounted file system is experiencing degraded performance.
- The Community File System (CFS) is not available.
- Wide area network performance is degraded.
- Partitions are down for more than 15 minutes.
- NERSC management determines that impaired system functionality significantly impacts user productivity, but the system is not experiencing a system wide outage.
Up
Any situation not meeting the above criteria for being down or degraded.
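The definitions above can be read as an ordered set of checks: test the Down criteria first, then the Degraded criteria, and otherwise report Up. The following Python is a minimal sketch of that reading, not a NERSC tool. The Observation record, its field names, and the classify function are hypothetical, and management determinations are modeled as a simple override value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Hypothetical snapshot of a system; all field names are illustrative."""
    utilization: float                    # fraction of the system in use, 0.0 to 1.0
    caused_by_malfunction: bool = False   # reduced utilization is due to a malfunction
    minutes_below_half: float = 0.0       # how long utilization has been below 50%
    login_blocked_minutes: float = 0.0    # how long users could not log in and work
    partition_down_minutes: float = 0.0   # how long partitions have been down
    filesystem_degraded: bool = False     # any mounted file system is slow
    cfs_available: bool = True            # Community File System is reachable
    wan_degraded: bool = False            # wide area network performance is degraded
    management_override: Optional[str] = None  # "down", "degraded", or None


def classify(obs: Observation) -> str:
    """Return "down", "degraded", or "up" under a literal reading of the policy."""
    # Down criteria are checked first so they take precedence over Degraded.
    if obs.management_override == "down":
        return "down"
    if (obs.caused_by_malfunction and obs.utilization < 0.50
            and obs.minutes_below_half > 15):
        return "down"
    if obs.login_blocked_minutes > 60:
        return "down"

    # Degraded criteria: any single one is sufficient.
    if obs.management_override == "degraded":
        return "degraded"
    if obs.caused_by_malfunction and 0.50 < obs.utilization < 0.90:
        return "degraded"
    if obs.filesystem_degraded or not obs.cfs_available or obs.wan_degraded:
        return "degraded"
    if obs.partition_down_minutes > 15:
        return "degraded"

    # Anything that meets none of the criteria above is Up.
    return "up"


if __name__ == "__main__":
    print(classify(Observation(utilization=0.40, caused_by_malfunction=True,
                               minutes_below_half=20)))
    print(classify(Observation(utilization=0.95, cfs_available=False)))
```

Under this literal reading, a utilization drop below 50% that has lasted 15 minutes or less matches neither the Down nor the Degraded utilization criterion. The sketch leaves that case to the other checks rather than guessing at the intended handling.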
Investigating a reported issue
When a system is Up, users may still encounter issues that may or may not turn out to be the result of a malfunction. If user reports suggest a possible malfunction, NERSC engineers will look into the problem to determine whether that is the case. During the investigative period, this additional status may be added as a note along with the status of Up. It applies to any situation where the system is up and the following are both true:
- NERSC engineers are aware of reports from users that could be the result of a malfunction of hardware, software, or other infrastructure affecting utilization of a system.
- Engineers are actively investigating such an issue but have not yet determined whether it meets the criteria for being down or degraded.
Scheduled Outages
For an outage to be considered a scheduled outage, the user community must be notified of the need for a maintenance event window no less than 24 hours in advance of the outage (the minimum notice, as used for emergency fixes). Users will be notified of regularly scheduled maintenance (i.e., scheduled outages that repeat at relatively consistent time intervals) no less than 72 hours prior to the event and preferably as much as seven calendar days prior. If a regularly scheduled maintenance is not needed, users will be informed of the cancellation of that maintenance event in a timely manner. Any interruption of service that does not meet the minimum notification window is categorized as an unscheduled outage.
Outages that extend past a scheduled maintenance window by 4 hours or less are considered part of the original scheduled maintenance. After 4 hours, the downtime becomes a new event, an unscheduled outage.
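The notification and overrun rules above reduce to two time comparisons. The Python below is a minimal sketch of them under stated assumptions. The function and constant names are hypothetical. Treating the 72-hour notice for regularly scheduled maintenance as a requirement for scheduled status is one possible reading of the policy, not something it states outright.

```python
from datetime import datetime, timedelta

MIN_NOTICE = timedelta(hours=24)          # minimum notice for any scheduled outage
MIN_NOTICE_REGULAR = timedelta(hours=72)  # assumed minimum for regular maintenance
OVERRUN_GRACE = timedelta(hours=4)        # overrun still counted as scheduled


def is_scheduled(notified_at: datetime, outage_start: datetime,
                 regular: bool = False) -> bool:
    """True if users were notified early enough for the outage to count as scheduled."""
    required = MIN_NOTICE_REGULAR if regular else MIN_NOTICE
    return outage_start - notified_at >= required


def overrun_is_unscheduled(window_end: datetime, actual_end: datetime) -> bool:
    """True if downtime ran more than 4 hours past the scheduled window."""
    return actual_end - window_end > OVERRUN_GRACE


if __name__ == "__main__":
    start = datetime(2024, 1, 10, 8, 0)   # illustrative dates only
    window_end = datetime(2024, 1, 10, 16, 0)
    print(is_scheduled(start - timedelta(hours=30), start))
    print(overrun_is_unscheduled(window_end, window_end + timedelta(hours=5)))
```

The sketch only reports whether an overrun exceeds 4 hours. It does not decide where the new unscheduled event begins, because the policy text does not specify that.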