Introduction to Scientific I/O
Scientific applications commonly perform I/O for purposes such as:
- storing numerical output from simulations for later analysis;
- implementing 'out-of-core' techniques for algorithms that process more data than can fit in system memory and must page data in from disk;
- and writing checkpoint files that save the state of the application in case of system failure.
In most cases, scientific applications write large amounts of data in a structured or sequential 'append-only' way that does not overwrite previously written data or require random seeks throughout the file.
Most HPC systems are equipped with a parallel file system such as Lustre or GPFS that abstracts away spinning disks, RAID arrays, and I/O subservers to present the user with a simplified view of a single address space for reading and writing to files. Three common methods for an application to interact with the parallel file system are shown in Figure 1.1.
File-per-processor
File-per-processor access is perhaps the simplest to implement since each processor maintains its own filehandle to a unique file and can write independently without any coordination with other processors. Parallel file systems often perform well with this type of access up to several thousand files, but synchronizing metadata for a large collection of files introduces a bottleneck. One way to mitigate this is to use a 'square-root' file layout, for example by dividing 10,000 files into 100 subdirectories of 100 files.
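As a concrete illustration, the sketch below shows file-per-processor output with MPI, where each rank simply opens and writes its own file; the file name pattern and the data written are placeholders chosen for this example.

    /* file_per_process.c: each MPI rank writes its own file independently.
     * Illustrative sketch; the file name pattern and data are placeholders. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank generates some local data. */
        const int n = 1024;
        double *data = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) data[i] = rank + 0.001 * i;

        /* Open a unique file per rank; no coordination with other ranks is needed. */
        char fname[256];
        snprintf(fname, sizeof(fname), "checkpoint.%06d.dat", rank);
        FILE *fp = fopen(fname, "wb");
        if (fp) {
            fwrite(data, sizeof(double), n, fp);
            fclose(fp);
        }

        free(data);
        MPI_Finalize();
        return 0;
    }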
Even so, large collections of files can be unwieldy to manage, and simple utilities like ls can slow to a crawl or become unusable when a directory contains thousands of files. This may not be a concern for checkpointing, where the files are typically discarded except in the infrequent case of a system failure. A further disadvantage of file-per-processor access is that later restarts require the same number and layout of processors, because these are implicit in the structure of the file hierarchy.
Shared-file (independent)
Shared file access allows many processors to share a common filehandle but write independently to exclusive regions of the file. This coordination can take place at the parallel file system level. However, there can be significant overhead for write patterns in which two or more processors contend for the same region of the file. In these cases, the file system uses a 'lock manager' to serialize access to the contested region and guarantee file consistency. Especially at high concurrency, such lock mechanisms can degrade performance by orders of magnitude. Even in the ideal case where processors are guaranteed to write to exclusive regions, shared-file performance can still be lower than that of file-per-processor access.
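As a rough sketch of this pattern with MPI-IO (the file name, element count, and offsets here are illustrative), each rank below writes its block to a disjoint byte range of a single shared file using an independent write call:

    /* shared_independent.c: independent writes to exclusive regions of one shared file.
     * Illustrative sketch; sizes and the file name are placeholders. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1024;                       /* doubles written per rank */
        double *data = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) data[i] = rank;

        /* All ranks open the same file and share a common (MPI-level) filehandle. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes to a disjoint byte range determined by its rank,
         * so no two ranks contend for the same region of the file. */
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
        MPI_File_write_at(fh, offset, data, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(data);
        MPI_Finalize();
        return 0;
    }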
The advantage of shared file access lies in data management and portability, especially when a higher-level I/O format such as HDF5 or netCDF is used to encapsulate the data in the file. These libraries provide database-like functionality, cross-platform compatibility, and other features that are helpful or even essential when storing or archiving simulation output.
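For example, with parallel HDF5 (a minimal sketch assuming an HDF5 build with MPI support; the file name, dataset name, and sizes are arbitrary), each rank writes its own hyperslab of a single dataset in one shared, self-describing file, with transfers defaulting to independent I/O:

    /* h5_shared.c: each rank writes its slice of one HDF5 dataset in a shared file.
     * Sketch only; assumes HDF5 built with parallel (MPI) support. */
    #include <hdf5.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const hsize_t n = 1024;                   /* elements per rank */
        double data[1024];
        for (hsize_t i = 0; i < n; i++) data[i] = rank;

        /* Open one shared file through the MPI-IO driver. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One global 1-D dataset; each rank selects the hyperslab it owns. */
        hsize_t dims[1] = { n * (hsize_t)nprocs };
        hid_t filespace = H5Screate_simple(1, dims, NULL);
        hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        hsize_t start[1] = { n * (hsize_t)rank }, count[1] = { n };
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t memspace = H5Screate_simple(1, count, NULL);

        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, data);

        H5Sclose(memspace);
        H5Sclose(filespace);
        H5Dclose(dset);
        H5Pclose(fapl);
        H5Fclose(file);
        MPI_Finalize();
        return 0;
    }

Because the file carries the dataset's name, shape, and type with it, it can later be read on a different machine or with a different number of processors.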
Shared-file with collective buffering
Collective buffering is a technique used to improve the performance of shared-file access by offloading some of the coordination work from the file system to the application. A subset of the processors is chosen to act as 'aggregators', which collect the data from the other processors and pack it into contiguous buffers in memory that are then written to the file system. Reducing the number of processors that interact with the I/O subservers is often beneficial, because it reduces contention.
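MPI-IO exposes collective buffering through its collective write routines. In the sketch below (the file name and sizes are again placeholders), every rank calls MPI_File_write_at_all together, and the MPI library chooses the aggregator ranks and buffers the data before it reaches the file system:

    /* shared_collective.c: shared-file writes using MPI-IO's collective call,
     * which lets the library apply collective buffering. Illustrative sketch. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1024;
        double *data = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) data[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared_collective.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* All ranks participate in the collective write; aggregator ranks gather
         * the pieces into large contiguous buffers before writing them out. */
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
        MPI_File_write_at_all(fh, offset, data, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(data);
        MPI_Finalize();
        return 0;
    }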
Originally, collective buffering was developed to reduce the number of small, noncontiguous writes, but another benefit that is important for file systems such as Lustre is that the buffer size can be set to a multiple of the ideal transfer size preferred by the file system. This is explained in the next section on Lustre striping.
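On ROMIO-based MPI-IO implementations, the collective buffer size and the number of aggregators can typically be requested through hints passed at file-open time. The hint names below are standard ROMIO hints, but the values are only examples, and support and defaults vary by MPI implementation and file system.

    /* cb_hints.c: requesting a collective-buffer size and aggregator count via
     * MPI-IO hints (ROMIO-style names; values are examples, not tuning advice). */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");   /* use collective buffering on writes */
        MPI_Info_set(info, "cb_nodes", "8");              /* example: 8 aggregator processes */
        MPI_Info_set(info, "cb_buffer_size", "4194304");  /* example: 4 MiB buffer, ideally a
                                                             multiple of the file system's
                                                             preferred transfer size */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared_collective.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);

        /* ... collective writes as in the previous sketch ... */

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }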