SuperMicro SuperServer MLA V100 (Vulcanite)
Table of Contents
- 1. Introduction
- 1.1. Document Scope and Assumptions
- 1.2. Obtaining an Account
- 1.3. Requesting Assistance
- 2. System Configuration
- 2.1. System Summary
- 2.2. Operating System
- 2.3. File Systems
- 2.3.1. /home
- 2.3.2. /gpfs/cwfs
- 2.3.3. /scratch
- 3. Accessing the System
- 3.1. Kerberos
- 3.2. Logging In
- 4. User Environment
- 4.1. Modules
- 4.2. Archive Usage
- 4.2.1. Archive Command Synopsis
- 4.3. Available Compilers
- 4.4. Programming Models
- 5. Batch Scheduling
- 5.1. Scheduler
- 5.2. Queue Information
- 5.3. Interactive Batch Sessions
- 5.4. Batch Resource Directives
- 5.5. Launch Commands
- 5.6. Sample Script
- 5.7. PBS Commands
1. Introduction
1.1. Document Scope and Assumptions
This document provides an overview and introduction to the use of the SuperMicro SuperServer MLA V100 (Vulcanite) located at the ERDC DSRC, along with a description of the specific computing environment on Vulcanite. The intent of this guide is to provide information that will enable the average user to perform computational tasks on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:
- Use of the UNIX operating system
- Use of an editor (e.g., vi or emacs)
- Remote usage of computer systems via network or modem access
- A selected programming language and its related tools and libraries
1.2. Obtaining an Account
To get an account on Vulcanite, you must first submit a Vulcanite Project Proposal. You will also require an account on the HPCMP Portal to the Information Environment, commonly called a "pIE User Account." Once you have submitted your proposal, if you do not yet have a pIE User Account, please visit HPC Centers: Obtaining An Account and follow the instructions there. If you need assistance with any part of this process, please contact the HPC Help Desk at email@example.com.
1.3. Requesting Assistance
The ERDC DSRC HPC Service Center is available to help users with problems, issues, or questions. Analysts are on duty 8:00 a.m. - 5:00 p.m. Central, Monday - Friday (excluding Federal holidays).
To request assistance, contact the ERDC DSRC directly in any of the following ways:
- E-mail: firstname.lastname@example.org
- Phone: 1-800-500-4722 or (601) 634-4400
For more detailed contact information, please see our Contact Page.
2. System Configuration
2.1. System Summary
Vulcanite is an exploratory system meant to provide users access to a variety of high-density GPU node configurations. Each node type has a different number of processors, amount of memory, number of GPUs, amount of SSD storage, and number of network interfaces. Because of this, users should take care when migrating between node types.
| | Login Nodes | 2-GPU Nodes | 4-GPU Nodes | 8-GPU Nodes |
|---|---|---|---|---|
| Processor | Intel 6126T Skylake | Intel 6126T Skylake | Intel 6136 Skylake | Intel 8160 Skylake |
| Processor Speed | 2.6 GHz | 2.6 GHz | 3.0 GHz | 2.1 GHz |
| Sockets / Node | 1 | 1 | 2 | 2 |
| Cores / Node | 12 | 12 | 24 | 48 |
| Total CPU Cores | 24 | 312 | 192 | 240 |
| Usable Memory / Node | 206 GB | 206 GB | 384 GB | 764 GB |
| Accelerators / Node | None | 2 | 4 | 8 |
| Accelerator | n/a | NVIDIA V100 PCIe | NVIDIA V100 SXM2 | NVIDIA V100 SXM2 |
| Memory / Accelerator | n/a | 32 GB | 32 GB | 32 GB |
| Storage on Node | None | 2 TB NVMe | 4 TB NVMe | 8 TB NVMe |
| Interconnect | EDR InfiniBand 1x | EDR InfiniBand 1x | EDR InfiniBand 2x | EDR InfiniBand 4x |
| Path | Formatted Capacity | File System Type | Storage Type | User Quota | Minimum File Retention |
|---|---|---|---|---|---|
| /home ($HOME) | 4 TB | XFS | SATA SSD | 30 GB | None |
| /gpfs/cwfs ($CENTER) | 3 PB | GPFS | HDD | 100 TB | 120 Days |
2.2. Operating System
Vulcanite's operating system is Red Hat Enterprise Linux 7.
2.3. File Systems
Vulcanite has the following file systems available for user storage:
2.3.1. /home
/home is a locally mounted SSD with an unformatted capacity of 4 TB. All users have a home directory located on this file system, which can be referenced by the environment variable $HOME. /home has a 30 GB quota.
2.3.2. /gpfs/cwfs
The Center-Wide File System (CWFS) provides file storage that is accessible from all of Vulcanite's nodes. Your directory on this file system can be referenced by the environment variable $CENTER.
2.3.3. /scratch
/scratch is a file system for temporary storage on each compute node's local SSD. The environment variable $TMPDIR will point to a sub-directory created under /scratch for user access for the duration of a job. This sub-directory will exist on all of the job's compute nodes.
The size of on-node storage in /scratch is approximately:
- 2-GPU nodes: 1.3 TB
- 4-GPU nodes: 3.2 TB
- 8-GPU nodes: 6.8 TB
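Within a batch job, a common pattern is to stage input onto the node-local SSD, run against it, and copy results back before the job ends (when $TMPDIR is removed). A minimal sketch; the application and file names are placeholders, and $PBS_O_WORKDIR is the directory from which the job was submitted (see Section 5):

cd $TMPDIR
cp $PBS_O_WORKDIR/input.dat .               # stage input onto the node-local SSD
$PBS_O_WORKDIR/my_app input.dat > output.dat
cp output.dat $PBS_O_WORKDIR/               # copy results back before the job ends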
3. Accessing the System
3.1. Kerberos
A Kerberos client kit must be installed on your desktop to enable you to get a Kerberos ticket. Kerberos is a network authentication tool that provides secure communication by using secret cryptographic keys. Only users with a valid HPCMP Kerberos authentication can gain access to Vulcanite. More information about installing Kerberos clients on your desktop can be found at HPC Centers: Kerberos & Authentication.
3.2. Logging In
The login nodes for the Vulcanite cluster are vulcanite01 and vulcanite02.
The preferred way to login to Vulcanite is via ssh, as follows:
% ssh vulcanite.erdc.hpc.mil
4. User Environment
4.1. Modules
A number of modules are loaded automatically as soon as you log in. To see the modules that are currently loaded, use the "module list" command. To see the entire list of available modules, use the "module avail" command. You can modify the configuration of your environment by loading and unloading modules. For complete information on how to do this, see the Modules User Guide.
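For example, the following commands display and modify your module environment (module_name is a placeholder for any module shown by "module avail"):

module list                  # show modules currently loaded
module avail                 # show all modules available on the system
module load module_name      # add a module to your environment
module unload module_name    # remove a module from your environment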
4.2. Archive Usage
All of our HPC systems have access to an online archival mass storage system that provides long-term storage for users' files on a petascale tape file system that resides on a robotic tape library system. A 672-TB disk cache sits in front of the tape file system and temporarily holds files while they are being transferred to or from tape.
Tape file systems have very slow access times. The tapes must be robotically pulled from the tape library, mounted in one of the limited number of tape drives, and wound into position for file archival or retrieval. For this reason, users should always tar up their small files into a large tarball when archiving a significant number of files. A good target size for tarballs is about 200 GB or less. At that size, the time required for file transfer and tape I/O is reasonable. Files larger than 1 TB will span more than one tape, which will greatly increase the time required for both archival and retrieval.
The environment variable $ARCHIVE_HOME is automatically set for you and can be used to reference your mass storage archive directory when using archive commands. The command getarchome can be used to display the value of $ARCHIVE_HOME.
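For example, a directory containing many small files can be bundled into a single tarball and copied to mass storage with the archive utility described in the next section. A minimal sketch; the directory and file names are placeholders, and files are assumed to land in $ARCHIVE_HOME when no other path is given:

tar -cf results.tar results/     # bundle many small files into a single tarball
archive put results.tar          # copy the tarball to your archive directory
archive ls                       # verify the file arrived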
4.2.1. Archive Command Synopsis
A synopsis of the archive utility is listed below. For information on additional capabilities, see the Archive User Guide or read the online man page that is available on each system. This command is non-Kerberized and can be used in batch submission scripts if desired.
Copy one or more files from the archive system:
archive get [-C path] [-s] file1 [file2...]
List files and directory contents on the archive system:
archive ls [lsopts] [file/dir ...]
Create directories on the archive system:
archive mkdir [-C path] [-m mode] [-p] [-s] dir1 [dir2 ...]
Copy one or more files to the archive system:
archive put [-C path] [-D] [-s] file1 [file2 ...]
Move or rename files and directories on the archive server:
archive mv [-C path] [-s] file1 [file2 ...] target
Remove files and directories from the archive server:
archive rm [-C path] [-r] [-s] file1 [file2 ...]
Check and report the status of the archive server:
archive stat [-s]
Remove empty directories from the archive server:
archive rmdir [-C path] [-p] [-s] dir1 [dir2 ...]
Change permissions of files and directories on the archive server:
archive chmod [-C path] [-R] [-s] mode file1 [file2 ...]
Change the group of files and directories on the archive server:
archive chgrp [-C path] [-R] [-h] [-s] group file1 [file2 ...]
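As an illustration, retrieving previously archived data reverses the steps shown earlier; the file name is a placeholder:

archive ls                       # list the contents of your archive directory
archive get results.tar          # copy the tarball back to the current directory
tar -xf results.tar              # unpack it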
4.3. Available Compilers
Vulcanite has the GNU and Intel compilers.
Vulcanite has four MPI suites:
- OpenMPI (GCC)
- MPICH (GCC)
- MVAPICH2 (GCC)
- IMPI (INTEL)
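A typical workflow is to load the module for one MPI suite and build with its compiler wrappers. A minimal sketch, assuming an OpenMPI module is available; the exact module name on Vulcanite may differ, so check "module avail" first:

module load openmpi              # load an MPI suite (name is illustrative)
mpicc -o mpijob.exe mpijob.c     # C compiler wrapper
mpif90 -o mpijob.exe mpijob.f90  # Fortran compiler wrapper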
4.4. Programming Models
Vulcanite supports two base programming models: Message Passing Interface (MPI) and Open Multi-Processing (OpenMP). A Hybrid MPI/OpenMP programming model is also supported.
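In the hybrid model, each MPI process spawns OpenMP threads, and the thread count is set with OMP_NUM_THREADS before the job is launched. A sketch for one 2-GPU node (12 cores) running 2 MPI processes with 6 threads each, in csh syntax to match the sample script in Section 5.6 (the executable name is a placeholder):

setenv OMP_NUM_THREADS 6
mpiexec -n 2 ./hybrid_job.exe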
5. Batch Scheduling
5.1. Scheduler
The Portable Batch System (PBS) is currently running on Vulcanite.
5.2. Queue Information
Vulcanite has a single queue, the standard queue. The maximum wall clock time is 168 hours.
5.3. Interactive Batch Sessions
To get an interactive batch session, you must first submit an interactive batch job through PBS. This is done by executing a qsub command with the "-I" option from within the interactive login environment. For example:
qsub -l select=N1:ncpus=12:mpiprocs=N2:ngpus=2 -A Project_ID -q standard -l walltime=HHH:MM:SS -I
You must specify the number of nodes requested (N1), the number of processes per node (N2), the desired maximum walltime, your project ID, and a job queue.
Your interactive batch session will be scheduled just as normal batch jobs are scheduled depending on the other queued batch jobs, so it may take quite a while. Once your interactive batch shell starts, you can run or debug interactive applications, post-process data, etc.
At this point, you can run parallel applications on your assigned set of compute nodes. You can also run interactive commands or scripts on this node.
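For example, a one-hour interactive session on a single 2-GPU node could be requested as follows (substitute your own project ID):

qsub -l select=1:ncpus=12:mpiprocs=12:ngpus=2 -A Project_ID -q standard -l walltime=01:00:00 -I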
5.4. Batch Resource Directives
Batch resource directives allow you to specify to PBS how your batch jobs should be run and what resources your job requires. Although PBS has many directives, you only need to know a few to run most jobs.
The basic syntax of PBS directives is as follows:
#PBS -option value
where some options may require values to be included. For example, to start an 8-process job on a single 2-GPU node, you would request one node of 12 cores and specify that you will be running 8 processes per node:
#PBS -l select=1:ncpus=12:mpiprocs=8:ngpus=2
The following directives are required for all jobs:
| Directive | Value | Description |
|---|---|---|
| -A | Project_ID | Name of the project |
| -q | queue_name | Name of the queue |
| -l | select=N1:ncpus=12:mpiprocs=N2:ngpus=2 | For 2-GPU nodes: N1 = number of nodes, N2 = MPI processes per node |
| -l | select=N1:ncpus=24:mpiprocs=N2:ngpus=4 | For 4-GPU nodes: N1 = number of nodes, N2 = MPI processes per node |
| -l | select=N1:ncpus=48:mpiprocs=N2:ngpus=8 | For 8-GPU nodes: N1 = number of nodes, N2 = MPI processes per node |
| -l | walltime=HHH:MM:SS | Maximum wall time |
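Taken together, a minimal set of required directives for a 24-hour job on two 4-GPU nodes might look like the following (the values are illustrative):

#PBS -A Project_ID
#PBS -q standard
#PBS -l select=2:ncpus=24:mpiprocs=24:ngpus=4
#PBS -l walltime=24:00:00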
5.5. Launch Commands
To launch an MPI executable, use mpiexec. For example:
mpiexec -n #_of_MPI_tasks ./mpijob.exe
For OpenMP executables, no launch command is needed.
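For example, an OpenMP executable is run directly after setting the desired thread count, shown here in csh syntax to match the sample script below (the executable name is a placeholder):

setenv OMP_NUM_THREADS 12
./openmp_job.exe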
5.6. Sample Script
While it is possible to include all PBS directives at the qsub command line, the preferred method is to embed the PBS directives within the batch request script using "#PBS". The following is a sample batch script:
#!/bin/csh
# Declare the project under which this job run will be charged. (required)
# Users can find eligible projects by typing "show_usage" on the command line.
#PBS -A Project_ID
# Request 1 hour of wallclock time for execution.
#PBS -l walltime=01:00:00
# Request one 2-GPU node.
#PBS -l select=1:ncpus=12:mpiprocs=12:ngpus=2
# Submit job to standard queue.
#PBS -q standard
# Declare a jobname.
#PBS -N myjob
# Send standard output (stdout) and error (stderr) to the same file.
#PBS -j oe
# Change to the working directory.
cd $PBS_O_WORKDIR
# Execute a parallel program with mpiexec (see Section 5.5); the executable name is a placeholder.
mpiexec -n 12 ./mpijob.exe
5.7. PBS Commands
The following commands provide the basic functionality for using the PBS batch system:
qsub: Used to submit jobs for batch processing.
qsub [ options ] my_job_script
qstat: Used to check the status of submitted jobs.
qstat PBS_JOBID ## check one job
qstat -u my_user_name ## check all of user's jobs
qdel: Used to kill queued or running jobs.
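qdel PBS_JOBID ## kill a queued or running job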