
Lustre File Systems at ERDC DSRC

1. Introduction

HPC systems at ERDC employ the Lustre file system for their home directories as well as their work directories. Lustre is a high-performance parallel file system. The work directory, in particular, is dedicated to the temporary storage of large data sets produced during execution of user batch jobs. It is this high-performance purpose for which Lustre is especially well suited.

2. Lustre Basics

Lustre is a robust file system which consists of servers and storage. A Metadata Server (MDS) makes metadata (for example, ownership and permissions of a file or directory) available to Lustre clients. Object Storage Servers (OSSs) provide file I/O service for Object Storage Targets (OSTs) which provide the data storage. A Lustre parallel file system achieves its performance by automatically partitioning data into chunks and writing the chunks in round-robin fashion across multiple OSTs. This process, called "striping," plays a vital role in writing or reading very large files because it can significantly improve file I/O speed by eliminating single-disk I/O bottlenecks.

Striping offers two benefits for large files: 1) increased bandwidth, because multiple processes can access the same file simultaneously, and 2) more balanced usage across the pool of OSTs. However, striping also has disadvantages: 1) increased overhead from network operations and server contention, 2) the potential for degraded I/O performance when striping settings are inappropriate, and 3) wasted disk space when stripes are much larger than the data written to them. Users can configure the size and number of stripes used for any file, but determining the best settings sometimes requires experimentation.

The term "stripe count" refers to the number of stripes into which a file is divided when written, in other words, the number of OSTs that are used to store the file. Thus, each stripe of the file will reside on a different OST. "Stripe size" refers to the size of the data blocks written to an OST.

Suppose, for example, that 200 MBytes are to be written to a file that was created with a stripe count of 10 and a stripe size of 1 MByte. When writing begins, ten 1-MByte blocks are written simultaneously to 10 different OSTs. Once those 10 blocks have been filled, Lustre writes another ten 1-MByte blocks to the same 10 OSTs. This process continues until the entire file has been written. Upon completion, the file will exist as 20 1-MByte blocks of data on each of 10 separate OSTs.
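
The round-robin placement in this example can be sketched in a few lines of Python. This is a simplified model for illustration only, not Lustre's actual implementation. The function names are hypothetical, and the OST numbers it returns are positions within the file's own set of OSTs rather than real OST identifiers.

```python
MBYTE = 2**20  # 1 MByte = 1048576 bytes


def ost_for_offset(offset, stripe_count, stripe_size):
    """Return which of the file's OSTs (0-based) holds the byte at `offset`."""
    block = offset // stripe_size   # index of the stripe-size block
    return block % stripe_count     # blocks are dealt out round-robin


def blocks_per_ost(file_size, stripe_count, stripe_size):
    """Return how many blocks (full or partial) land on each OST."""
    n_blocks = -(-file_size // stripe_size)       # ceiling division
    base, extra = divmod(n_blocks, stripe_count)
    # The first `extra` OSTs receive one block more than the rest.
    return [base + (1 if i < extra else 0) for i in range(stripe_count)]


# The 200-MByte example: stripe count 10, stripe size 1 MByte.
print(blocks_per_ost(200 * MBYTE, 10, MBYTE))
```

For the 200-MByte case the model reports 20 blocks on each of the 10 OSTs, which agrees with the description above.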

The following table lists technical specifications for the Lustre file systems on ERDC HPC systems.

Specification of ERDC Lustre Work File Systems

System | File System | Maximum Capacity | Number of OSTs | OST Capacity | Default Stripe Count | Default Stripe Size
Onyx   | /p/work     | 12.8 PBytes      | 78             | 168 TBytes   | 1                    | 1 MByte

3. Lustre Stripe Guidance

As mentioned above, one of the primary benefits to striping large files is the increased I/O performance with reading and writing. A secondary benefit is that spreading large files over multiple OSTs helps prevent individual OSTs from filling.

The default stripe counts and stripe sizes have been chosen to balance the needs of I/O performance and efficient use of disk space. On the one hand, a high stripe count can waste space, because the stripe size multiplied by the stripe count is the minimum amount of space that will be allocated for any file. For example, a file of only 10 KBytes of actual data will still be allocated 4 MBytes of space if its stripe count is 4 and the stripe size is 1 MByte. For this reason, files smaller than 100 MBytes should be striped with a count of 1 (the default). On the other hand, setting the stripe count too high can also degrade I/O performance. Therefore, you are urged to carefully match stripe specifications to your data.

Striping must also be compatible with the application's I/O strategy. Increasing the stripe count and/or stripe size should be done as needed for multi-node I/O, and it is strongly advised when creating files that are larger than 100 GBytes. As a rule, when writing a large volume of data, an application should try to use all the OSTs. If writing a single file, set the stripe count to the number of OSTs. When writing more files than there are OSTs, set the stripe count to 1. If the number of files being written is fewer than the number of OSTs, set the stripe count so that all OSTs will be used. The following table offers some striping guidelines.

Recommended File Striping on Onyx

File Size, Per File       | Stripe Count | Stripe Size ‡
<= 10 MBytes              | 1            | 1 MByte
10 MBytes to 100 MBytes   | 1            | 1 MByte
100 MBytes to 1 GByte     | 2            | 1 MByte
1 GByte to 10 GBytes      | 4            | 1 MByte
10 GBytes to 100 GBytes   | 8            | 1 MByte
100 GBytes to 512 GBytes  | 16           | 1 MByte
512 GBytes to 1 TByte     | 16           | 1 MByte
1 TByte to 2 TBytes       | 32           | 1 MByte
2 TBytes to 4 TBytes      | 32           | 1 MByte
4 TBytes to 10 TBytes *   | 32           | 1 MByte
>= 10 TBytes *            | 64           | 1 MByte

* When writing very large files, note that the tape archive system cannot archive files larger than 7 TBytes.
‡ Onyx storage is configured with a stripe size of 1 MByte. Experimentation may in some cases show improved performance by using stripe sizes of 2 or 4 MBytes.
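
The guidance in this section can be expressed as two small lookup functions, sketched below in Python. The function names are hypothetical. Two details are assumptions not stated in the text: a size that falls exactly on a boundary between two rows is assigned to the smaller row (except that 10 TBytes and above map to 64, as the last row states), and "so that all OSTs will be used" is interpreted as the smallest stripe count for which the files together can cover every OST.

```python
MB, GB, TB = 2**20, 2**30, 2**40

# (upper bound in bytes, stripe count), copied from the table above.
_STRIPE_TABLE = [
    (10 * MB, 1),
    (100 * MB, 1),
    (1 * GB, 2),
    (10 * GB, 4),
    (100 * GB, 8),
    (512 * GB, 16),
    (1 * TB, 16),
    (2 * TB, 32),
    (4 * TB, 32),
]


def recommended_stripe_count(file_size_bytes):
    """Look up the recommended stripe count for a single file of this size."""
    if file_size_bytes >= 10 * TB:
        return 64
    for upper_bound, count in _STRIPE_TABLE:
        if file_size_bytes <= upper_bound:
            return count
    return 32  # between 4 TBytes and 10 TBytes


def stripe_count_for_files(num_files, num_osts):
    """Stripe count for a job writing `num_files` large files at once."""
    if num_files < 1 or num_osts < 1:
        raise ValueError("num_files and num_osts must be positive")
    if num_files >= num_osts:
        return 1                          # enough files to cover the OSTs
    return -(-num_osts // num_files)      # ceiling of num_osts / num_files


print(recommended_stripe_count(50 * GB))   # a 50-GByte file
print(stripe_count_for_files(4, 78))       # 4 files on Onyx's 78 OSTs
```

These functions only encode the rules of thumb given here; the best setting for a particular application may still require experimentation.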

In addition to striping considerations, good Lustre performance depends on avoiding small I/O requests and avoiding the writing of many files. It is better to gather small requests into a buffer and write the buffer when it is full. On Cray machines, the iobuf facility is recommended for this kind of I/O aggregation. (For more information, load the iobuf module and see "man iobuf".) Application-level I/O facilities that may offer improved performance include MPI-IO, ADIOS, NetCDF, and HDF5.

4. Lustre Striping Commands

As implied by the above table, the file system stripe count and size have default settings. Stripe parameters can be set for individual files and set or changed for directories. Directories can be given a stripe setting so that all new files created in that directory (and under any sub-directory) share that setting. Utilities and application libraries are provided to control the striping of an individual file at creation time. However, changing the stripe parameters on an existing file has no effect. You must first create an empty file with the desired striping characteristics and then write your data to it. Likewise, changing the stripe parameters on a directory does not change the striping of files already existing in that directory. Only new files created in the modified directory will inherit the changed striping.

4.1. The lfs getstripe Command

The characteristics of a file or a directory can be found by using "lfs getstripe".

$ lfs getstripe MyDir
stripe_count:  1 stripe_size:   1048576 stripe_offset:  -1

The output shows that files created in the directory MyDir will be stored in one stripe of 1048576 bytes (1 MByte) unless explicitly striped otherwise before writing. The stripe_offset of -1 means that each file will have an OST offset determined by Lustre (see "lfs setstripe", next).
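
When striping information is needed inside a script, the summary line shown above can be parsed into named values. The following Python sketch assumes the output has exactly the "name: integer" form shown in this section. Actual "lfs getstripe" output differs between Lustre versions and contains additional lines for regular files, so treat the function (a hypothetical helper, not part of Lustre) as a starting point only.

```python
import re


def parse_getstripe(text):
    """Parse 'name: integer' pairs from lfs getstripe output into a dict."""
    pairs = re.findall(r"(\w+):\s*(-?\d+)", text)
    return {name: int(value) for name, value in pairs}


# Sample text copied from the directory example above.
sample = "stripe_count:  1 stripe_size:   1048576 stripe_offset:  -1"
info = parse_getstripe(sample)
print(info["stripe_count"], info["stripe_size"], info["stripe_offset"])
```

On a Lustre system, the text could be obtained by running "lfs getstripe" through Python's subprocess module instead of using the hard-coded sample.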

4.2. The lfs setstripe Command

To set the striping for a file or directory, use the "lfs setstripe" command:

lfs setstripe --size stripe-size --index OST-start-index --count stripe-count file-or-directory

size - # of bytes written to one OST before cycling to the next
index - starting OST (default of -1 is round robin, highly recommended)
count - # of OSTs (default is 1)

The "lfs setstripe" command has an option for changing the stripe size, but the default stripe size is recommended. The command also has an option (--index) for setting the position of the first stripe among the OSTs, called the offset. Users should not specify an offset; instead, allow the Lustre file system to choose one.

The following creates an empty file named LargeFile with a stripe count of 8.

$ lfs setstripe LargeFile -c 8

Next, set the stripe count to 16 for a new directory named LargeDir. Note that any subdirectories created under LargeDir will inherit its new stripe characteristics.

$ mkdir LargeDir
$ lfs setstripe LargeDir -c 16
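
Commands like the two above can also be assembled by a script, for example when the stripe count is computed from the expected file size. The Python sketch below only builds the argument list and does not run anything. The helper name is hypothetical, it uses the long option names from the synopsis in this section, and it deliberately offers no way to set the starting OST, in line with the advice above.

```python
def setstripe_command(path, count=None, size=None):
    """Build an 'lfs setstripe' argument list for a file or directory."""
    cmd = ["lfs", "setstripe"]
    if count is not None:
        cmd += ["--count", str(count)]   # number of OSTs
    if size is not None:
        cmd += ["--size", str(size)]     # bytes per OST before cycling
    cmd.append(path)
    return cmd


print(" ".join(setstripe_command("LargeFile", count=8)))
print(" ".join(setstripe_command("LargeDir", count=16)))
```

On a system with Lustre, the returned list could be passed to subprocess.run(cmd, check=True) to execute the command.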

4.3. Striping in Practice

The "lfs setstripe" command can be placed in the PBS batch script or executed interactively before job submission. Files copied from one directory to another, such as with cp, cat, scp, or tar, inherit the striping of the new directory. Again, note that changing the striping parameters for a directory does not change the striping characteristics for files already in that directory. Only new files written into that directory will inherit the revised characteristics. Likewise, using the mv command will not change the striping characteristics of a file, but files created during program execution will inherit the characteristics of the directory into which they are written.
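
Because the striping of an existing file cannot be changed in place, restriping means writing the data into a new file that was created with the desired settings. The Python sketch below lists one possible command sequence for doing so. The helper name and the ".restripe" suffix are hypothetical, and the sequence assumes that there is enough free space for a temporary second copy, that both paths are on the same Lustre file system, and that no other process is writing to the file.

```python
def restripe_commands(path, stripe_count, suffix=".restripe"):
    """Return the commands that rewrite `path` with a new stripe count."""
    tmp = path + suffix
    return [
        # 1. Create an empty file with the desired striping.
        ["lfs", "setstripe", "--count", str(stripe_count), tmp],
        # 2. Write the data into it; the new file keeps its own striping.
        ["cp", path, tmp],
        # 3. Replace the original; mv does not change striping.
        ["mv", tmp, path],
    ]


for cmd in restripe_commands("LargeFile", 16):
    print(" ".join(cmd))
```

The commands are returned rather than executed so that they can be reviewed first; each one could then be run in order, for example with subprocess.run(cmd, check=True).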

Additional information can be found by viewing the lfs man page on any HPC system.