
Introduction and Policy Guide

Table of Contents

1. Introduction

1.1. Document Scope and Assumptions

This document provides an introduction, listing of policies, and guidance on the use of High Performance Computers and Servers at the Department of Defense Supercomputing Resource Center (DSRC) at the U.S. Army Corps of Engineers Research and Development Center (ERDC) in Vicksburg, Mississippi.

1.2. ERDC DSRC - Who We Are

The ERDC DSRC is one of five Department of Defense (DoD) High Performance Computing (HPC) Modernization Program sites providing HPC access to DoD users and contractors. Access to all ERDC systems and servers is subject to DoD regulations and access controls. Users are required to sign a user agreement acknowledging applicable policies and detailing acceptable use of the systems.

1.3. What We Offer

Currently, we have two High Performance Computers (HPCs) and an archive server.

Current Systems
System          Description
Onyx            Cray XC40/50 with 126,632 compute cores on 2,894 compute nodes and 64 Xeon Phi (KNL) nodes
Archive Server  Mass Storage System with a petascale tape file system

We are also home to the Data Analysis and Assessment Center (DAAC) for the HPCMP. The DAAC supports EnSight, ParaView, and VisIt. Other analysis tools, such as Tecplot and MATLAB, are indirectly supported through the Secure Remote Desktop (SRD) on HPC systems that have graphics cards.

Users may also request custom-made scientific visualization images and animations created by DAAC personnel from user data. For more information about visualization tools and the Secure Remote Desktop, see the DAAC website.

1.4. HPC System Configurations

1.4.1. Onyx

Onyx is a Cray XC40/50. It has 12 login nodes, each with two 2.8-GHz Intel Xeon Broadwell 22-core processors and 256 GBytes of memory. It has 2,858 standard compute nodes, 4 large-memory compute nodes, 32 GPU compute nodes, and 544 Knights Landing (Phi) compute nodes (a total of 3,438 compute nodes or 161,448 compute cores). Each standard compute node has two 2.8-GHz Intel Xeon Broadwell 22-core processors (44 cores) and 128 GBytes of DDR4 memory. Each large-memory compute node has two 2.8-GHz Intel Xeon Broadwell 22-core processors (44 cores) and one TByte of DDR4 memory. Each GPU compute node has one 2.8-GHz Intel Xeon Broadwell 22-core processor, an NVIDIA Pascal GPU (Tesla P100), and 256 GBytes of DDR4 memory. Each Phi compute node has one 1.3-GHz Intel Knights Landing 64-core processor and 96 GBytes of DDR4 memory.

Compute nodes are interconnected by the Cray Aries high-speed network, and have Intel's Turbo Boost and Hyper-Threading Technology enabled. Memory is shared by cores on each node, but not between nodes.

Onyx uses Lustre to manage its high-speed, parallel, Sonexion file system, which has 15 PBytes (formatted) of disk storage. For additional information, see the System Configuration section of the Onyx User Guide.

1.4.2. Archive Server

All of our HPC systems have access to an online archival mass storage system, Gold, that provides long-term storage for users' files on a petascale tape file system residing on a robotic tape library. A 1.2-PByte disk cache sits in front of the tape file system and temporarily holds files while they are being transferred to or from tape.

For additional information on using the archive server, see the Archive Guide.

1.5. Requesting Assistance

The HPC Help Desk is available to help users with unclassified problems, issues, or questions. Analysts are on duty 7:00 a.m. - 7:00 p.m. Central, Monday - Friday (excluding Federal holidays).

For after-hours support, you can contact the ERDC DSRC directly in any of the following ways:

  • E-mail: dsrchelp@erdc.hpc.mil
  • Phone: 1-800-500-4722 or (601) 634-4400
  • Fax: (601) 634-5126
  • U.S. Mail:
    U.S. Army Engineer Research and Development Center
    ATTN: CEERD-IH-D HPC Service Center
    3909 Halls Ferry Road
    Vicksburg, MS 39180-6199

For more detailed contact information, see our Contact Page.

2. Policies

2.1. Interactive Use

On all ERDC HPC systems, interactive executions on login nodes are restricted to 15 minutes of CPU time per process. Login nodes are shared by all users; if you need to run a computationally intensive application or executable for more than 15 minutes, use PBS to schedule access to a compute node.

Onyx allows you to submit a PBS interactive job to get an interactive login session on a compute node. To do this, submit an interactive job with the "-I" flag on the qsub command line, then use the ccmlogin command to log in to a compute node. (See the Interactive Batch section in the Onyx User Guide.)
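
A minimal sketch of such a request is shown below; the project ID, queue name, and resource values are placeholders you would replace with values appropriate to your allocation.

> qsub -I -A Project_ID -q debug -l select=1:ncpus=44:mpiprocs=44 -l walltime=01:00:00
> ccmlogin

Once the interactive job starts on the batch service node, ccmlogin places you on one of the job's compute nodes.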

The HIE queue is available for interactive work, including debugging, pre/post-processing, and remote visualization. To use the HIE, specify the queue name HIE when you submit your job, as in the sketch below. Read more about the HIE in the HIE User Guide.
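
For example, a job could be sent to the HIE queue directly from the qsub command line; the project ID, resource values, and script name below are illustrative placeholders.

> qsub -q HIE -A Project_ID -l select=1:ncpus=44:mpiprocs=44 -l walltime=00:30:00 myscript.pbs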

2.2. Session Lifetime

To provide users with a more secure high performance computing environment, we have implemented a lifetime limit on all terminal/window sessions. Regardless of activity, any terminal or window session connected to the ERDC DSRC will automatically terminate after 24 hours.

2.3. Purge Policy

All files in $WORKDIR are subject to purges if they have not been accessed in more than 30 days, or if space in $WORKDIR becomes critically low. Using the touch command (or similar) to prevent files from being purged is prohibited.

Note! If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, you WILL NOT be notified prior to deletion. You are responsible for monitoring your workspace to prevent data loss.

2.4. Special Requests

All special requests for allocated HPC resources, including increased priority within queues, increased queue limits on the maximum number of CPUs and wall time, and dedicated use, should be directed to the HPC Help Desk.

3. Using HPC Resources

3.1. File Systems

3.1.1. $WORKDIR

$WORKDIR is the local temporary file system (i.e., local high-speed disk) available on all ERDC DSRC HPC systems and is available to all users. All files in $WORKDIR that have not been accessed in more than 30 days are subject to purges.

$WORKDIR is not intended to be used as a permanent file storage area by users. This file system is periodically purged of old files based on file access time. Users are responsible for saving their files to archive or transferring them to their own systems in a timely manner. Disk space limits, or quotas, may be imposed on users who continually store files on $WORKDIR for more than 30 days.

The $WORKDIR file system is NOT backed up or exported to any other system. In the event of accidental deletion or a catastrophic disk failure, files and directory structures in $WORKDIR are permanently lost.

REMEMBER: It is your responsibility to transfer files that need to be saved to a location that allows for long-term storage, such as your archival ($ARCHIVE_HOME) or, for smaller files, home ($HOME) directory locations. Please note that your archival storage area has no quota, but your home directory does!

3.1.2. $HOME

When you log on, you will be placed in your home directory. The environment variable $HOME is automatically set for you and refers to this directory. $HOME is visible to both the login and compute nodes and may be used to store small user files, but it has limited capacity and is not backed up; therefore, it should not be used for long-term storage.

IMPORTANT: A hard limit of 30 GBytes is imposed on content in $HOME.

3.1.3. $CENTER

The Center-Wide File System (CWFS) provides file storage that is accessible from the Onyx login nodes and from the HPC Portal. The CWFS allows file transfers and other file and directory operations from the HPC systems using simple Linux commands. Each user has their own directory in the CWFS. The name of your CWFS directory may vary between machines and between centers, but the environment variable $CENTER will always refer to this directory.

The example below shows how to copy a file from your work directory on an HPC system to the CWFS ($CENTER).

While logged into Onyx, copy your file from your HPC work directory to the CWFS.

> cp $WORKDIR/filename $CENTER

3.1.4. Archival File System

The archival file system consists of a tape library and a 1.2-PByte disk cache. Files transferred to the archive server must be temporarily stored on the disk cache before they can be written to tape. Similarly, files being retrieved from tape are temporarily stored on the disk cache before being transferred to the destination system. The system has only a few tape drives, and these are shared by all users and by ongoing tape file system maintenance. As with all tape file systems, writing one large tarball containing hundreds of files is quicker and easier on the tape drives than writing hundreds of smaller files. This is because writing only one file requires only one tape mount, one seek, and one write, but writing hundreds of small files may require many tape mounts, many seeks and many writes. For the same reason, retrieving one tarball from tape is much quicker and easier than retrieving hundreds of individual files. The recommended maximum size of a single tarball is 200 to 300 GBytes. Tarballs larger than 300 GBytes will take an inordinate amount of time to transfer to or from archive, and larger files can prevent efficient packing of files on tapes.
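
As an illustrative sketch (the directory and tarball names below are placeholders), a results directory can be bundled into a single tarball in your workspace before it is sent to the archive, and its contents can be listed later without retrieving every file.

> cd $WORKDIR
> tar -cvf results_2024.tar results_2024
> tar -tvf results_2024.tar

The tarball can then be transferred to $ARCHIVE_HOME as described in the Archive Guide.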

For additional information on using the archive server, see the Archive Guide.

3.2. Network File Transfers

The preferred method for transferring files over the network is to use the encrypted (Kerberos) file transfer programs scp or sftp. In cases where large numbers of files (> 1,000) and large amounts of data (> 100 GBytes) must be transferred, contact the HPC Help Desk for assistance with the process. Depending on the nature of the transfer, transfer time may be improved by reordering the data retrieval from tapes, taking advantage of available bandwidth to/from the Center, or dividing the transfer into smaller parts; the ERDC DSRC staff will assist you to the extent that they are able. Limitations such as available resources and network problems outside the Center can be expected, so allow sufficient time for transfers.
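
For example, a single file or an entire directory could be pulled from Onyx to a local workstation with scp; the username, host name, and paths below are placeholders.

> scp username@onyx.erdc.hpc.mil:/path/to/results.tar .
> scp -r username@onyx.erdc.hpc.mil:/path/to/results_dir .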

For additional information on file transfers, see the "File Transfers" sections of the Onyx User Guide.

3.3. Cray XC40/50 (Onyx)

Onyx has a full Linux OS on its login nodes, but the compute nodes run the Cray Linux Environment (CLE), a lightweight kernel designed for computational performance. The nodes are interconnected via the Cray Aries high-speed network.

PBS batch scripts on Onyx run on service nodes with a full Linux OS. These service nodes are shared by all running batch jobs. You must use the aprun command to send your parallel executable to your job's assigned compute nodes.

By default on Onyx, codes are compiled statically using the compile commands ftn, cc, or CC. These commands invoke the corresponding compilers from the currently loaded environment module. On Onyx, the Cray (CCE) compiler is loaded by default. You can use the "module swap" command to switch to another version of CCE or to an entirely different compiler suite.
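
As a sketch, swapping from the default Cray programming environment to the GNU suite might look like the following; the module names shown are typical of the Cray programming environment but may differ by release.

> module swap PrgEnv-cray PrgEnv-gnu
> cc -o hello hello.c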

Dynamically linked executables can also be compiled and run on Onyx compute nodes. Compiling dynamically allows you to define static arrays larger than 2 GBytes.
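
One common Cray mechanism for requesting dynamic linking is the CRAYPE_LINK_TYPE environment variable; treat the sketch below as an assumption and consult the Onyx User Guide for the method supported on Onyx. The executable and source names are placeholders.

> export CRAYPE_LINK_TYPE=dynamic
> ftn -o mysolver mysolver.f90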

On Onyx, using its Cluster Compatibility Mode (CCM), you can log directly into a compute node from an interactive PBS batch job. To do this, use the ccmlogin command once the job starts on the batch service node. Also under CCM, you can run UNIX commands, scripts, or serial executables within batch jobs using the ccmrun command.

IMPORTANT! On Onyx, PBS batch scripts run on a batch service node shared by other users' PBS batch scripts. Commands like aprun, ccmlogin, and ccmrun are the only way to send work to your job's assigned compute nodes. Do not perform computationally intensive work directly in your batch script; send such tasks to the compute nodes with these commands.
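
A minimal batch script sketch follows; the project ID, queue name, resource values, and executable name are placeholders, and the aprun options assume two 44-core standard compute nodes.

#!/bin/bash
#PBS -A Project_ID
#PBS -q standard
#PBS -l select=2:ncpus=44:mpiprocs=44
#PBS -l walltime=02:00:00
#PBS -j oe

cd $WORKDIR
# Launch the MPI executable on the job's compute nodes (88 ranks across 2 nodes)
aprun -n 88 ./my_mpi_app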

For additional information on using Onyx, see the Onyx User Guide.