1. Introduction

1.1. Document Scope and Assumptions

This document provides an overview and introduction to the use of the SuperMicro SuperServer MLA V100 (Vulcanite) located at the ERDC DSRC, along with a description of the specific computing environment on Vulcanite. The intent of this guide is to provide information that will enable the average user to perform computational tasks on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:

  • Use of the UNIX operating system
  • Use of an editor (e.g., vi or emacs)
  • Remote usage of computer systems via network or modem access
  • A selected programming language and its related tools and libraries

1.2. Obtaining an Account

To get an account on Vulcanite, you must first submit a Vulcanite Project Proposal. You will also require an account on the HPCMP Portal to the Information Environment, commonly called a "pIE User Account." Once you have submitted your proposal, if you do not yet have a pIE User Account, please visit HPC Centers: Obtaining An Account and follow the instructions there. If you need assistance with any part of this process, please contact the HPC Help Desk at accounts@helpdesk.hpc.mil.

1.3. Requesting Assistance

The ERDC DSRC HPC Service Center is available to help users with problems, issues, or questions. Analysts are on duty 8:00 a.m. - 5:00 p.m. Central, Monday - Friday (excluding Federal holidays).

To request assistance, contact the ERDC DSRC directly. For detailed contact information, please see our Contact Page.

2. System Configuration

2.1. System Summary

Vulcanite is an exploratory system meant to provide users access to a variety of high-density GPU node configurations. Each node type has a different number of processors, amount of memory, number of GPUs, amount of SSD storage, and number of network interfaces. Because of this, users should take care when migrating between node types.

Node Configuration
                      | Login               | 2-GPU               | 4-GPU               | 8-GPU
Total Nodes           | 2                   | 26                  | 8                   | 5
Processor             | Intel 6126T Skylake | Intel 6126T Skylake | Intel 6136 Skylake  | Intel 8160 Skylake
Processor Speed       | 2.6 GHz             | 2.6 GHz             | 3.0 GHz             | 2.1 GHz
Sockets / Node        | 1                   | 1                   | 2                   | 2
Cores / Node          | 12                  | 12                  | 24                  | 48
Total CPU Cores       | 24                  | 312                 | 192                 | 240
Usable Memory / Node  | 206 GB              | 206 GB              | 384 GB              | 764 GB
Accelerators / Node   | None                | 2                   | 4                   | 8
Accelerator           | n/a                 | NVIDIA V100 PCIe    | NVIDIA V100 SXM2    | NVIDIA V100 SXM2
Memory / Accelerator  | n/a                 | 32 GB               | 32 GB               | 32 GB
Storage on Node       | None                | 2 TB NVMe           | 4 TB NVMe           | 8 TB NVMe
Interconnect          | EDR InfiniBand 1x   | EDR InfiniBand 1x   | EDR InfiniBand 2x   | EDR InfiniBand 4x

File Systems on Vulcanite
Path                 | Formatted Capacity | File System Type | Storage Type | User Quota | Minimum File Retention
/home ($HOME)        | 4 TB               | XFS              | SATA SSD     | 30 GB      | None
/gpfs/cwfs ($CENTER) | 3 PB               | GPFS             | HDD          | 100 TB     | 120 Days

2.2. Operating System

Vulcanite's operating system is Red Hat Enterprise Linux 7.

2.3. File Systems

Vulcanite has the following file systems available for user storage:

2.3.1. /home

/home is a locally mounted SSD with an unformatted capacity of 4 TB. All users have a home directory located on this file system, which can be referenced by the environment variable $HOME. /home has a 30 GB quota.

2.3.2. /gpfs/cwfs

The Center-Wide File System (CWFS) provides file storage that is accessible from all Vulcanite's nodes. The environment variable $CENTER refers to this directory.

2.3.3. /scratch

/scratch is a file system for temporary, compute node SSD storage. The environment variable $TMPDIR will point to a sub-directory created under /scratch for user access for the duration of a job. This sub-directory will exist on all of the job's compute nodes.

The size of on-node storage in /scratch is approximately:

  • 2-GPU nodes: 1.3 TB
  • 4-GPU nodes: 3.2 TB
  • 8-GPU nodes: 6.8 TB
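A typical pattern is to stage input onto the node-local SSD at the start of a job and copy results back before it ends. The sketch below uses $TMPDIR as described above; the /tmp fallback and all file names are illustrative only:

```shell
# Sketch: stage data through node-local scratch storage.
# Inside a PBS job, $TMPDIR points at per-job space on the node's SSD;
# the /tmp fallback here is only so the sketch runs outside a job.
SCRATCH=${TMPDIR:-/tmp}

echo "input data" > input.dat          # stand-in for real input
cp input.dat "$SCRATCH"/               # stage input onto the fast SSD

# ... the application would read and write under $SCRATCH here ...

cp "$SCRATCH"/input.dat ./result.dat   # copy results back before the job ends
```

Note that the $TMPDIR sub-directory exists only for the duration of the job, so anything not copied back is lost when the job ends.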

3. Accessing the System

3.1. Kerberos

A Kerberos client kit must be installed on your desktop to enable you to get a Kerberos ticket. Kerberos is a network authentication tool that provides secure communication by using secret cryptographic keys. Only users with a valid HPCMP Kerberos authentication can gain access to Vulcanite. More information about installing Kerberos clients on your desktop can be found at HPC Centers: Kerberos & Authentication.

3.2. Logging In

The login nodes for the Vulcanite cluster are vulcanite01 and vulcanite02.

The preferred way to log in to Vulcanite is via ssh, as follows:

% ssh vulcanite.erdc.hpc.mil

4. User Environment

4.1. Modules

A number of modules are loaded automatically as soon as you log in. To see the modules that are currently loaded, use the "module list" command. To see the entire list of available modules, use the "module avail" command. You can modify the configuration of your environment by loading and unloading modules. For complete information on how to do this, see the Modules User Guide.

4.2. Archive Usage

All of our HPC systems have access to an online archival mass storage system that provides long-term storage for users' files on a petascale tape file system that resides on a robotic tape library system. A 672-TB disk cache sits in front of the tape file system and temporarily holds files while they are being transferred to or from tape.

Tape file systems have very slow access times. The tapes must be robotically pulled from the tape library, mounted in one of the limited number of tape drives, and wound into position for file archival or retrieval. For this reason, users should always tar up their small files in a large tarball when archiving a significant number of files. A good maximum target size for tarballs is about 200 GB or less. At that size, the time required for file transfer and tape I/O is reasonable. Files larger than 1 TB will span more than one tape, which will greatly increase the time required for both archival and retrieval.
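For example, many small output files can be combined into a single tarball before archiving (the directory and file names here are illustrative):

```shell
# Bundle many small files into one tarball before sending it to the archive.
mkdir -p results
echo "sample output" > results/run1.out
tar -cf results.tar results/    # keep tarballs at roughly 200 GB or less
tar -tf results.tar             # verify the contents before archiving
```

The resulting tarball can then be copied to mass storage with the archive put command described in the synopsis below.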

The environment variable $ARCHIVE_HOME is automatically set for you and can be used to reference your mass storage archive directory when using archive commands. The command getarchome can be used to display the value of $ARCHIVE_HOME.

4.2.1. Archive Command Synopsis

A synopsis of the archive utility is listed below. For information on additional capabilities, see the Archive User Guide or read the online man page that is available on each system. This command is non-Kerberized and can be used in batch submission scripts if desired.

Copy one or more files from the archive system:

archive get [-C path] [-s] file1 [file2...]

List files and directory contents on the archive system:

archive ls [lsopts] [file/dir ...]

Create directories on the archive system:

archive mkdir [-C path] [-m mode] [-p] [-s] dir1 [dir2 ...]

Copy one or more files to the archive system:

archive put [-C path] [-D] [-s] file1 [file2 ...]

Move or rename files and directories on the archive server:

archive mv [-C path] [-s] file1 [file2 ...] target

Remove files and directories from the archive server:

archive rm [-C path] [-r] [-s] file1 [file2 ...]

Check and report the status of the archive server:

archive stat [-s]

Remove empty directories from the archive server:

archive rmdir [-C path] [-p] [-s] dir1 [dir2 ...]

Change permissions of files and directories on the archive server:

archive chmod [-C path] [-R] [-s] mode file1 [file2 ...]

Change the group of files and directories on the archive server:

archive chgrp [-C path] [-R] [-h] [-s] group file1 [file2 ...]

4.3. Available Compilers

Vulcanite has the GNU and Intel compilers.

Vulcanite has multiple MPI suites, including:

  • OpenMPI (GCC)

4.4. Programming Models

Vulcanite supports two base programming models: Message Passing Interface (MPI) and Open Multi-Processing (OpenMP). A Hybrid MPI/OpenMP programming model is also supported.

5. Batch Scheduling

5.1. Scheduler

The Portable Batch System (PBS) is currently running on Vulcanite.

5.2. Queue Information

Vulcanite only has the Standard queue. The maximum wall clock time is 168 hours.

5.3. Interactive Batch Sessions

To get an interactive batch session, you must first submit an interactive batch job through PBS. This is done by executing a qsub command with the "-I" option from within the interactive login environment. For example:

qsub -l select=N1:ncpus=12:mpiprocs=N2:ngpus=2 -A Project_ID -q standard -l walltime=HHH:MM:SS -I

You must specify the number of nodes requested (N1), the number of processes per node (N2), the desired maximum walltime, your project ID, and a job queue.

Your interactive batch session will be scheduled just as normal batch jobs are, so depending on the other queued batch jobs, it may take some time to start. Once your interactive batch shell starts, you can run or debug interactive applications, post-process data, etc., and launch parallel applications on your assigned set of compute nodes.

5.4. Batch Resource Directives

Batch resource directives allow you to specify to PBS how your batch jobs should be run and what resources your job requires. Although PBS has many directives, you only need to know a few to run most jobs.

The basic syntax of PBS directives is as follows:

#PBS option[[=]value]

where some options may require values to be included. For example, to start an 8-process job, you would request one node of 12 cores and specify that you will be running 8 processes per node:

#PBS -l select=1:ncpus=12:mpiprocs=8:ngpus=2

The following directives are required for all jobs:

Required PBS Directives
Directive | Value                                  | Description
-A        | Project_ID                             | Name of the project
-q        | queue_name                             | Name of the queue
-l        | select=N1:ncpus=12:mpiprocs=N2:ngpus=2 | For 2-GPU nodes: N1 = number of nodes; N2 = MPI processes per node
-l        | select=N1:ncpus=24:mpiprocs=N2:ngpus=4 | For 4-GPU nodes: N1 = number of nodes; N2 = MPI processes per node
-l        | select=N1:ncpus=48:mpiprocs=N2:ngpus=8 | For 8-GPU nodes: N1 = number of nodes; N2 = MPI processes per node
-l        | walltime=HHH:MM:SS                     | Maximum wall time

5.5. Launch Commands

To launch an MPI executable, use mpiexec. For example:

mpiexec -n #_of_MPI_tasks ./mpijob.exe

For OpenMP executables, no launch command is needed.
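For example, an OpenMP executable is run directly after setting the thread count (the executable name below is illustrative):

```shell
# Run an OpenMP program directly; control parallelism via OMP_NUM_THREADS.
export OMP_NUM_THREADS=12   # e.g., one thread per core on a 2-GPU node
./openmp_job.exe            # hypothetical OpenMP executable name
```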

5.6. Sample Script

While it is possible to include all PBS directives at the qsub command line, the preferred method is to embed the PBS directives within the batch request script using "#PBS". The following is a sample batch script:


# Declare the project under which this job run will be charged. (required)
# Users can find eligible projects by typing "show_usage" on the command line.
#PBS -A Project_ID

# Request 1 hour of wallclock time for execution.
#PBS -l walltime=01:00:00

# Request nodes.
#PBS -l select=1:ncpus=12:mpiprocs=12:ngpus=2

# Submit job to standard queue.
#PBS -q standard

# Declare a jobname.
#PBS -N myjob

# Send standard output (stdout) and error (stderr) to the same file.
#PBS -j oe

# Change to the directory from which the job was submitted.
cd $PBS_O_WORKDIR

# Execute a parallel program (mpijob.exe as in Section 5.5).
mpiexec -n 12 ./mpijob.exe

5.7. PBS Commands

The following commands provide the basic functionality for using the PBS batch system:

qsub: Used to submit jobs for batch processing.
qsub [ options ] my_job_script

qstat: Used to check the status of submitted jobs.
qstat PBS_JOBID ## check one job
qstat -u my_user_name ## check all of user's jobs

qdel: Used to kill queued or running jobs.
qdel PBS_JOBID