
Cray XC40/50 (Onyx)
User Guide

Table of Contents

1. Introduction

1.1. Document Scope and Assumptions

This document provides an overview and introduction to the use of the Cray XC40/50 (Onyx) located at the ERDC DSRC, along with a description of the specific computing environment on Onyx. The intent of this guide is to provide information that will enable the average user to perform computational tasks on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:

  • Use of the UNIX operating system
  • Use of an editor (e.g., vi or emacs)
  • Remote usage of computer systems via network or modem access
  • A selected programming language and its related tools and libraries

1.2. Policies to Review

Users are expected to be aware of the following policies for working on Onyx.

1.2.1. Login Node Abuse Policy

The login nodes, onyx01-onyx12, provide login access for Onyx and support such activities as compiling, editing, and general interactive use by all users. Consequently, memory or CPU-intensive programs running on the login nodes can significantly affect all users of the system. Therefore, only small applications requiring less than 10 minutes of runtime and less than 8 GBytes of memory are allowed on the login nodes. Any job running on the login nodes that exceeds these limits may be unilaterally terminated.
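If you want to check where you stand against these limits, a standard Linux process listing is enough; this is an illustrative one-liner, not a tool the center provides, and it only reports usage rather than enforcing the policy:

```shell
# Sum the resident set size (RSS, reported by ps in KBytes) of all of
# your processes and print the total in MBytes, for comparison against
# the 8-GByte login-node limit.
ps -u "${USER:-$(id -un)}" -o rss= | awk '{sum += $1} END {printf "%.1f MBytes\n", sum/1024}'
```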

1.2.2. Workspace Purge Policy

Close management of space in the work file system is a high priority. Files in the work file system that have not been accessed in 30 days are subject to the purge cycle. If available space becomes critically low, a manual purge may be run, and all files in the work file system are eligible for deletion. Using the touch command (or similar commands) to prevent files from being purged is prohibited. Users are expected to keep up with file archival and removal within the normal purge cycles.

Note! If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, you WILL NOT be notified prior to deletion. You are responsible for monitoring your workspace to prevent data loss.
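To see what the 30-day criterion would catch, standard find can report files by access time. The sketch below uses a throwaway demo directory so it can run anywhere; on Onyx you would point it at $WORKDIR instead:

```shell
# List files whose last access is more than 30 days old -- the same
# criterion the purge cycle applies. DIR is a disposable demo directory.
DIR=$(mktemp -d)
touch -a -t 202001010000 "$DIR/old.dat"   # back-date the access time
touch "$DIR/new.dat"                      # freshly accessed file

find "$DIR" -type f -atime +30            # reports only the stale file
```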

1.3. Obtaining an Account

The process of getting an account on the HPC systems at any of the DSRCs begins with getting an account on the HPCMP Portal to the Information Environment, commonly called a "pIE User Account." If you do not yet have a pIE User Account, please visit HPC Centers: Obtaining An Account and follow the instructions there. If you need assistance with any part of this process, please contact the HPC Help Desk at accounts@helpdesk.hpc.mil.

1.4. Requesting Assistance

The HPC Help Desk is available to help users with unclassified problems, issues, or questions. Analysts are on duty 7:00 a.m. - 7:00 p.m. Central, Monday - Friday (excluding Federal holidays).

You can contact the ERDC DSRC directly in any of the following ways for after-hours support:

  • E-mail: dsrchelp@erdc.hpc.mil
  • Phone: 1-800-500-4722 or (601) 634-4400
  • Fax: (601) 634-5126
  • U.S. Mail:
    U.S. Army Engineer Research and Development Center
    ATTN: CEERD-IH-D HPC Service Center
    3909 Halls Ferry Road
    Vicksburg, MS 39180-6199

For more detailed contact information, please see our Contact Page.

2. System Configuration

2.1. System Summary

Onyx is a Cray XC40/50. It has 12 login nodes, each with two 2.8-GHz Intel Xeon Broadwell 22-core processors and 256 GBytes of memory. It has 2,858 standard compute nodes, 4 large-memory compute nodes, 32 GPU compute nodes, and 544 Knights Landing (KNL) compute nodes (a total of 3,438 compute nodes or 161,448 compute cores). Each standard compute node has two 2.8-GHz Intel Xeon Broadwell 22-core processors (44 cores) and 128 GBytes of DDR4 memory. Each large-memory compute node has two 2.8-GHz Intel Xeon Broadwell 22-core processors (44 cores) and one TByte of DDR4 memory. Each GPU compute node has one 2.8-GHz Intel Xeon Broadwell 22-core processor, an NVIDIA Pascal GPU (Tesla P100), and 256 GBytes of DDR4 memory. Each KNL compute node has one 1.3-GHz Intel Knights Landing 64-core processor and 96 GBytes of DDR4 memory.

Compute nodes are interconnected by the Cray Aries high-speed network, and have Intel's Turbo Boost and Hyper-Threading Technology enabled. Memory is shared by cores on each node, but not between nodes.

Onyx uses Lustre to manage its high-speed, parallel, Sonexion file system, which has 15 PBytes (formatted) of disk storage.

Onyx is intended to be used as a batch-scheduled HPC system. Its login nodes are not to be used for large computational work (e.g., memory-intensive, I/O-intensive, or long-running executions). All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission.

Node Configuration
                         Login Nodes    Standard       Large-Memory   KNL            GPU-Accelerated
                                        Compute Nodes  Compute Nodes  Compute Nodes  Compute Nodes
Total Nodes              12             2,858          4              544            32
Operating System         SLES           Cray Linux Environment
Cores/Node               44             44             44             64             22 + 1 GPU
                                                                                     (1 x 3,584 GPU cores)
Core Type                Intel Xeon     Intel Xeon     Intel Xeon     Intel Xeon     Intel Xeon E5-2699v4
                         E5-2699v4      E5-2699v4      E5-2699v4      Phi 7230       Broadwell + NVIDIA
                         Broadwell      Broadwell      Broadwell      (Knights       Pascal GPU (Tesla P100)
                                                                      Landing)
Core Speed               2.8 GHz        2.8 GHz        2.8 GHz        1.3 GHz        2.8 GHz
Memory/Node              256 GBytes     128 GBytes     1 TByte        96 GBytes      256 + 16 GBytes
Accessible Memory/Node   8 GBytes       122 GBytes     998 GBytes     90 GBytes      247 + 16 GBytes
Memory Model             Shared on      Shared on node; distributed across cluster
                         node
Interconnect Type        Ethernet       Cray Aries

File Systems on Onyx
Path                     Capacity       Type
/p/home ($HOME)          897 TBytes     Lustre
/p/work ($WORKDIR)       13 PBytes      Lustre

2.2. Processors

Onyx uses 2.8-GHz Intel Xeon E5-2699v4 Broadwell processors on its login nodes. There are two processors per node, each with twenty-two cores, for a total of 44 cores per login node. These processors have 64 KBytes of L1 cache per core, 256 KBytes of L2 cache per core, and 55 MBytes of shared L3 cache.

Onyx uses 2.8-GHz Intel Xeon E5-2699v4 Broadwell processors on its standard compute nodes. There are two processors per node, each with twenty-two cores, for a total of 44 cores per node. All cores on a processor share a 55-MByte L3 cache. Each processor has one NUMA (non-uniform memory access).

Onyx's GPU nodes use a 2.8-GHz Intel Xeon E5-2699v4 Broadwell processor and one NVIDIA Pascal GPU (Tesla P100). The GPU features 3,584 cores operating at 1.3 GHz.

Onyx's large-memory nodes are equipped with two 2.8-GHz Intel Xeon E5-2699v4 Broadwell processors.

Single-core (serial) jobs running on the above Broadwell processors may take advantage of Intel's Turbo Boost, which results in a speed increase to 3.6 GHz.

Onyx's KNL nodes have one 1.3-GHz Intel Xeon Phi 7230 (Knights Landing) processor.

2.3. Memory

Onyx uses both shared- and distributed-memory models. Memory is shared among all the cores on a node, but is not shared among the nodes across the cluster.

Each login node contains 256 GBytes of main memory. All memory and cores on the node are shared among all users who are logged in. Therefore, users should not use more than 8 GBytes of memory at any one time.

Each standard compute node contains 128 GBytes of memory, of which 122 GBytes is user-accessible shared memory.

Each large-memory node features 1 TByte of memory, of which 998 GBytes are user-accessible shared memory.

Each GPU node has 256 GBytes of memory, of which 247 GBytes is user-accessible shared memory. The GPU accelerator has 16 GBytes of accessible memory.

Each KNL node has 96 GBytes of memory, of which 16 GBytes consists of very high-bandwidth memory, called MCDRAM. By default, MCDRAM is configured as a cache for main memory. The user-accessible memory is 90 GBytes.

2.4. Operating System

The operating system on Onyx's login nodes is SUSE Linux Enterprise Server (SLES). The compute nodes use a reduced-functionality Linux kernel designed for computational work. The combination of these two operating systems is known as the Cray Linux Environment (CLE). The compute nodes can provide access to dynamically shared objects and most of the typical Linux commands and basic functionality by including the Cluster Compatibility Mode (CCM) option in your PBS batch submission script or command. See section 6.5 for more information on using CCM.

2.5. File Systems

Onyx has the following file systems available for user storage:

2.5.1. /p/home

This file system is locally mounted from Onyx's Lustre file system. It has a formatted capacity of 897 TBytes. All users have a home directory located on this file system which can be referenced by the environment variable $HOME.

2.5.2. /p/work

This file system is locally mounted from Onyx's Lustre file system that is tuned for parallel I/O. It has a formatted capacity of 13 PBytes. All users have a work directory on this file system which can be referenced by the environment variable $WORKDIR.

Raid/Striping Concerns for Large Files

It is important to note that the work file system is a parallel, striped file system. This means that as files are written, they are automatically divided into chunks and written across multiple disk sets, or "OSTs," simultaneously. This process, called "striping," plays a vital role in running very large jobs because it significantly improves file I/O speed. Without parallel striping, large jobs, many of which require hundreds of GBytes of disk space, would spend much of their time just reading from and writing to disk.

The default stripe size for files in the work file systems is 1 MByte, and the default stripe count is one stripe. Increasing the stripe count is advisable when creating files that are larger than 100 MBytes. The I/O strategy of the application is also a factor in deciding the stripe count. As a rule, when writing a large volume of data, an application should try to use all the OSTs. If writing a single file, set the stripe count to the number of OSTs. When writing more files than there are OSTs, set the stripe count to 1. If the number of files being written is fewer than the number of OSTs, set the stripe count so that all OSTs will be used. For an explanation of how to do this and recommendations for striping specific file sizes, see the document, "Lustre File Systems."

You can also see $SAMPLES_HOME/Data_Management/OST_Stripes on Onyx for an example of how to increase stripe counts in a batch job.
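As a sketch of how these rules look in practice, a batch script can pre-create a striped output directory before a run. This assumes the standard Lustre lfs client tools available on Onyx; the directory name and the stripe count of 8 are placeholder choices, to be matched to your file sizes per the guidance above:

```shell
#PBS -l select=2:ncpus=44:mpiprocs=44
#PBS -l walltime=01:00:00

cd ${WORKDIR}
mkdir -p striped_out
lfs setstripe -c 8 striped_out   ## new files in this directory inherit
lfs getstripe striped_out        ## 8 stripes; getstripe confirms the layout
aprun -n 88 ./a.out striped_out/large_output.dat
```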

2.5.3. /p/cwfs

This path is directed to the Center-Wide File System (CWFS), which is meant for short-term storage (no longer than 120 days). All users have a directory on this file system, referenced by the environment variable $CENTER. It is accessible from the login nodes of the HPC systems and from the HPC Portal. The CWFS has a formatted capacity of 3,300 TBytes and is managed by IBM Spectrum Scale (formerly GPFS).

2.6. Peak Performance

Onyx is rated at 4.25 peak PFLOPS: 3.4 PFLOPS from its Broadwell nodes plus 0.85 PFLOPS from its KNL nodes. This equates to 28.2 GFLOPS per core (Broadwell) and 25.6 GFLOPS per core (KNL), or roughly 26 GFLOPS per core averaged across the system.

3. Accessing the System

3.1. Kerberos

A Kerberos client kit must be installed on your desktop to enable you to get a Kerberos ticket. Kerberos is a network authentication tool that provides secure communication by using secret cryptographic keys. Only users with a valid HPCMP Kerberos authentication can gain access to Onyx. More information about installing Kerberos clients on your desktop can be found at HPC Centers: Kerberos & Authentication.

3.2. Logging In

The system host name for the Onyx cluster is onyx.erdc.hpc.mil, which will redirect the user to one of twelve (12) login nodes. Hostnames and IP addresses to these nodes are available upon request from the HPC Help Desk.

The preferred way to log in to Onyx is via ssh, as follows:

% ssh onyx.erdc.hpc.mil

Kerberized rlogin is also allowed.

3.3. File Transfers

File transfers to DSRC systems (except transfers to the local archive system) must be performed using Kerberized versions of the following tools: scp, mpscp, sftp, ftp, and kftp. Before using any Kerberized tool, you must use a Kerberos client to obtain a Kerberos ticket. Information about installing and using a Kerberos client can be found at HPC Centers: Kerberos & Authentication.

The command below uses secure copy (scp) to copy a single local file into a destination directory on an Onyx login node. The mpscp command is similar to the scp command, but it has a different underlying means of data transfer and may enable a greater transfer rate. The mpscp command has the same syntax as scp.

% scp local_file user@onyx.erdc.hpc.mil:/target_dir

Both scp and mpscp can be used to send multiple files. This example command transfers all files with the .txt extension to the same destination directory. More information about mpscp can be found at the Overview of mpscp page.

% scp *.txt user@onyx.erdc.hpc.mil:/target_dir

The example below uses the secure file transfer protocol (sftp) to connect to Onyx, then uses the sftp cd and put commands to change to the destination directory and copy a local file there. The sftp quit command ends the sftp session. Use the sftp help command to see a list of all sftp commands.

% sftp user@onyx.erdc.hpc.mil

sftp> cd target_dir
sftp> put local_file
sftp> quit

The Kerberized file transfer protocol (kftp) command differs from sftp in that your username is not specified on the command line, but given later when prompted. The kftp command may not be available in all environments.

% kftp onyx.erdc.hpc.mil

username> user
kftp> cd target_dir
kftp> put local_file
kftp> quit

Windows users may use a graphical file transfer protocol (ftp) client such as FileZilla.

4. User Environment

4.1. User Directories

The following user directories are provided for all users on Onyx.

4.1.1. Home Directory

When you log on to Onyx, you will be placed in your home directory, /p/home/username. The environment variable $HOME is automatically set for you and refers to this directory. $HOME is visible to both the login and compute nodes. It has a 30-GByte quota and is periodically backed up. $HOME is not to be used for running jobs.

4.1.2. Work Directory

Onyx has a large file system (/p/work) for temporary storage of data files needed for executing programs. You may access your personal working directory, located on this file system, by using the $WORKDIR environment variable, which is set for you upon login. Your $WORKDIR directory has no disk quota. Because of high usage, the work file system tends to fill up frequently. Please review the Purge Policy and be mindful of your disk usage.

REMEMBER: The work file system is a "scratch" file system and is not backed up. You are responsible for managing files in your $WORKDIR by backing up files to the archive system and deleting unneeded files when your jobs end. See the section below on Archive Usage for details.

All of your jobs should execute from your $WORKDIR directory, not from $HOME. While not technically forbidden, jobs that are run from $HOME are subject to disk space quotas and have much poorer performance.

To avoid unusual errors that can arise from two jobs using the same scratch directory, a common technique is to create a unique subdirectory for each batch job by including the following lines in your batch script:

TMPD=${WORKDIR}/${PBS_JOBID}
mkdir -p ${TMPD}
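You can preview what those two lines do outside of PBS by substituting stand-in values; both variables below are placeholders, since in a real job the login environment sets $WORKDIR and PBS sets $PBS_JOBID for you:

```shell
# Stand-ins for variables that are set for you in a real batch job.
WORKDIR=$(mktemp -d)            # placeholder for your real $WORKDIR
PBS_JOBID=123456.onyx-batch     # placeholder; PBS assigns the real job ID

TMPD=${WORKDIR}/${PBS_JOBID}    # one scratch directory per job, so two
mkdir -p ${TMPD}                # concurrent jobs can never collide
ls -d ${TMPD}
```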

4.1.3. Center Directory

The Center-Wide File System (CWFS) provides file storage that is accessible from Onyx's login nodes, and from the HPC Portal. The CWFS allows file transfers and other file and directory operations from Onyx using simple Linux commands. Each user has their own directory in the CWFS. The name of your CWFS directory may vary between machines and between centers, but the environment variable $CENTER will always refer to this directory.

The example below shows how to copy a file from your work directory on Onyx to the CWFS ($CENTER).

While logged into Onyx, copy your file from your Onyx work directory to the CWFS.

% cp $WORKDIR/filename $CENTER

4.2. Shells

The following shells are available on Onyx: csh, bash, ksh, tcsh, zsh, and sh. To change your default shell, please email a request to require@hpc.mil. Your preferred shell will become your default shell on the Onyx cluster within 1-2 working days.

4.3. Environment Variables

A number of environment variables are provided by default on all HPCMP HPC systems. We encourage you to use these variables in your scripts where possible. Doing so will help to simplify your scripts and reduce portability issues if you ever need to run those scripts on other systems.

4.3.1. Common Environment Variables

The following environment variables are common to both the login and batch environments:

Common Environment Variables
Variable Description
$ARCHIVE_HOME Your directory on the archive system.
$ARCHIVE_HOST The host name of the archive system.
$BC_HOST The generic (not node specific) name of the system.
$CC The currently selected C compiler. This variable is automatically updated when a new compiler environment is loaded.
$CENTER Your directory on the Center-Wide File System (CWFS).
$COST_HOME This variable contains the path to the base directory of the default installation of the Common Open Source Tools (COST) installed on a particular compute platform. (See BC policy FY13-01 for COST details.)
$CSI_HOME The directory containing the following list of heavily used application packages: ABAQUS, Accelrys, ANSYS, CFD++, Cobalt, EnSight, Fluent, GASP, Gaussian, LS-DYNA, MATLAB, and TotalView, formerly known as the Consolidated Software Initiative (CSI) list. Other application software may also be installed here by our staff.
$CXX The currently selected C++ compiler. This variable is automatically updated when a new compiler environment is loaded.
$DAAC_HOME The directory containing the DAAC-supported visualization tools: ParaView, VisIt, and EnSight.
$F77 The currently selected Fortran 77 compiler. This variable is automatically updated when a new compiler environment is loaded.
$F90 The currently selected Fortran 90 compiler. This variable is automatically updated when a new compiler environment is loaded.
$HOME Your home directory on the system.
$JAVA_HOME The directory containing the default installation of JAVA.
$KRB5_HOME The directory containing the Kerberos utilities.
$PET_HOME The directory containing the tools formerly installed and maintained by the PET staff. This variable is deprecated and will be removed from the system in the future. Certain tools will be migrated to $COST_HOME, as appropriate.
$PROJECTS_HOME A common directory where group-owned and supported applications and codes may be maintained for use by members of a group. Any project may request a group directory under $PROJECTS_HOME.
$SAMPLES_HOME The Sample Code Repository. This is a collection of sample scripts and codes provided and maintained by our staff to help users learn to write their own scripts. There are a number of ready-to-use scripts for a variety of applications.
$WORKDIR Your work directory on the local temporary file system (i.e., local high-speed disk).
4.3.2. Batch-Only Environment Variables

In addition to the variables listed above, the following variables are automatically set only in your batch environment. That is, your PBS batch scripts will be able to see them when they run. These variables are supplied for your convenience and are intended for use inside your batch scripts.

Batch-Only Environment Variables
Variable Description
$BC_CORES_PER_NODE The number of cores per node for the compute node on which a job is running.
$BC_MEM_PER_NODE The approximate maximum user-accessible memory per node (in integer MBytes) for the compute node on which a job is running.
$BC_MPI_TASKS_ALLOC The number of MPI tasks allocated for a job.
$BC_NODE_ALLOC The number of nodes allocated for a job.

4.4. Modules

Software modules are a convenient way to set needed environment variables and include necessary directories in your path so that commands for particular applications can be found. Onyx also uses modules to initialize your environment with application software, system commands, libraries, and compiler suites.

A number of modules are loaded automatically as soon as you log in. To see the modules which are currently loaded, use the "module list" command. To see the entire list of available modules, use the "module avail" command. You can modify the configuration of your environment by loading and unloading modules. For complete information on how to do this, see the Modules User Guide.

4.5. Archive Usage

All of our HPC systems have access to an online archival mass storage system that provides long-term storage for users' files on a petascale tape file system residing on a robotic tape library. A 672-TByte disk cache sits in front of the tape file system and temporarily holds files while they are being transferred to or from tape.

Tape file systems have very slow access times. The tapes must be robotically pulled from the tape library, mounted in one of the limited number of tape drives, and wound into position for file archival or retrieval. For this reason, users should always tar up their small files in a large tarball when archiving a significant number of files. A good maximum target size for tarballs is about 200 GBytes or less. At that size, the time required for file transfer and tape I/O is reasonable. Files larger than 1 TByte will span more than one tape, which will greatly increase the time required for both archival and retrieval.
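Following that guidance, a tarball can be built with standard tar before any transfer to the archive system; the directory and file names below are placeholders for illustration:

```shell
# Bundle many small files into a single tarball before archiving.
cd "$(mktemp -d)"                    # demo workspace; use $WORKDIR on Onyx
mkdir -p case01
printf 'data\n' > case01/run1.out
printf 'data\n' > case01/run2.out

tar -czf case01.tar.gz case01/       # one tarball instead of many files
tar -tzf case01.tar.gz               # list the contents to verify
```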

The environment variable $ARCHIVE_HOME is automatically set for you and can be used to reference your mass storage archive directory when using archive commands. The command getarchome can be used to display the value of $ARCHIVE_HOME.

4.5.1. Archive Command Synopsis

A synopsis of the archive utility is listed below. For information on additional capabilities, see the Archive User Guide or read the online man page that is available on each system. This command is non-Kerberized and can be used in batch submission scripts if desired.

Copy one or more files from the archive system:

archive get [-C path] [-s] file1 [file2...]

List files and directory contents on the archive system:

archive ls [lsopts] [file/dir ...]

Create directories on the archive system:

archive mkdir [-C path] [-m mode] [-p] [-s] dir1 [dir2 ...]

Copy one or more files to the archive system:

archive put [-C path] [-D] [-s] file1 [file2 ...]

Move or rename files and directories on the archive server:

archive mv [-C path] [-s] file1 [file2 ...] target

Remove files and directories from the archive server:

archive rm [-C path] [-r] [-s] file1 [file2 ...]

Check and report the status of the archive server:

archive stat [-s]

Remove empty directories from the archive server:

archive rmdir [-C path] [-p] [-s] dir1 [dir2 ...]

Change permissions of files and directories on the archive server:

archive chmod [-C path] [-R] [-s] mode file1 [file2 ...]

Change the group of files and directories on the archive server:

archive chgrp [-C path] [-R] [-h] [-s] group file1 [file2 ...]
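Putting the synopsis together, the tail end of a batch script might archive its results as follows; since archive is non-Kerberized, this works from batch jobs. The directory and file names are placeholders:

```shell
## End-of-job archival sketch for a batch script; names are placeholders.
cd ${WORKDIR}/my_case
tar -czf results.tar.gz output/               ## bundle small files first
archive mkdir -C ${ARCHIVE_HOME} -p my_case   ## create the target directory
archive put -C ${ARCHIVE_HOME}/my_case results.tar.gz
archive ls ${ARCHIVE_HOME}/my_case            ## confirm the transfer
```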

5. Program Development

5.1. Programming Models

Onyx supports five parallel programming models: Message Passing Interface (MPI), Shared-Memory (SHMEM), Open Multi-Processing (OpenMP), and two Partitioned Global Address Space (PGAS) models: Co-Array Fortran (CAF) and Unified Parallel C (UPC). A Hybrid MPI/OpenMP programming model is also supported. MPI and SHMEM are examples of message- or data-passing models, while OpenMP uses only shared memory on a node by spawning threads. PGAS programming using Co-Array Fortran and UPC shares a partitioned address space, where variables and arrays can be directly addressed by any processing element.

Cray provides the XC Series Programming Environment User Guide (17.05) S-2529 as a complete programmer's reference for all supported programming models.

5.1.1. Message Passing Interface (MPI)

This release of MPI-2 derives from Argonne National Laboratory's MPICH-2 and implements the MPI-2.2 standard, except for spawn support, as documented by the MPI Forum in "MPI: A Message Passing Interface Standard, Version 2.2."

The Message Passing Interface (MPI) is part of the software support for parallel programming across a network of computer systems through a technique known as message passing. MPI establishes a practical, portable, efficient, and flexible standard for message passing that makes use of the most attractive features of a number of existing message-passing systems, rather than selecting one of them and adopting it as the standard. See "man intro_mpi" for additional information.

When creating an MPI program on Onyx, ensure the following:

  • That the default MPI module (cray-mpich) has been loaded. To check this, run the "module list" command. If cray-mpich is not listed and a different MPI module is listed, use the following command:

    module swap other_mpi_module cray-mpich

    If no MPI module is loaded, load the cray-mpich module.

    module load cray-mpich

  • That the source code includes one of the following lines:
    INCLUDE "mpif.h"        ## for Fortran, or
    #include <mpi.h>        ## for C/C++

To compile an MPI program for the Intel Broadwell nodes, do the following:

  • Ensure the craype-broadwell module is loaded. This module is loaded by default.
  • Compile using one of the following command examples:
    ftn -o mpi_program mpi_program.f         ## for Fortran, or
    cc -o mpi_program mpi_program.c          ## for C
    CC -o mpi_program mpi_program.cpp        ## for C++

The program can then be launched using the aprun command, as follows:

aprun -n mpi_procs mpi_program [user_arguments]

where mpi_procs is the number of MPI processes being started. For example:

#### starts 88 mpi processes; 44 on each node, one per core
## request 2 nodes, each with 44 cores and 44 processes per node
#PBS -l select=2:ncpus=44:mpiprocs=44
aprun -n 88 ./a.out
Accessing More Memory Per MPI Process

By default, one MPI process is started on each core of a node. This means that on Onyx, the available memory on the node is split 44 ways. A common concern for MPI users is the need for more memory for each process. To allow an individual process to use more of the node's memory, you need to allow some cores to remain idle, using the "-N" option, as follows:

aprun -n mpi_procs -N mpi_procs_per_node mpi_program [user_args]

where mpi_procs_per_node is the number of MPI processes to be started on each node. For example:

####   starts 44 mpi processes; only 22 on each node
## request 2 nodes, each with 44 cores and 22 processes per node
#PBS -l select=2:ncpus=44:mpiprocs=22
aprun -n 44 -N 22 ./a.out  ## (assigns only 22 processes per node)

For more information about aprun, see the aprun man page.

To compile an MPI program for the KNL nodes, do the following:

  • Ensure the craype-mic-knl and craype-hugepages2M modules are loaded.
    module swap craype-broadwell craype-mic-knl
    module load craype-hugepages2M

    A different craype-hugepagesSIZE can be used, where SIZE can vary in MBytes from 2M to 512M. Run "module avail" for a complete list of huge pages. The size of 2 MBytes is the runtime default on KNL nodes.

  • Compile using one of the following command examples:
    ftn -o mpi_program mpi_program.f         ## for Fortran, or
    cc -o mpi_program mpi_program.c          ## for C
    CC -o mpi_program mpi_program.cpp        ## for C++

The program can then be launched using the aprun command, as follows:

aprun -n mpi_procs mpi_program [user_arguments]

where mpi_procs is the number of MPI processes being started. For example:

#### starts 128 mpi processes; 64 on each node, one per core
## request 2 nodes, each with 64 cores and 64 processes per node
#PBS -l select=2:ncpus=64:mpiprocs=64:nmics=1
aprun -n 128 ./a.out
5.1.2. Shared Memory (SHMEM)

These logically shared, distributed-memory access (SHMEM) routines provide high-performance, high-bandwidth communication for use in highly parallelized scalable programs. The SHMEM data-passing library routines are similar to the MPI library routines: they pass data between cooperating parallel processes. The SHMEM data-passing routines can be used in programs that perform computations in separate address spaces and that explicitly pass data to and from different processes in the program.

The SHMEM routines minimize the overhead associated with data-passing requests, maximize bandwidth, and minimize data latency. Data latency is the length of time between a process initiating a transfer of data and that data becoming available for use at its destination.

SHMEM routines support remote data transfer through "put" operations that transfer data to a different process and "get" operations that transfer data from a different process. Other supported operations are work-shared broadcast and reduction, barrier synchronization, and atomic memory updates. An atomic memory operation is an atomic read and update operation, such as a fetch and increment, on a remote or local data object. The value read is guaranteed to be the value of the data object just prior to the update. See "man intro_shmem" for details on the SHMEM library after swapping to the cray-shmem module (covered below).

When creating a pure SHMEM program on Onyx, ensure the following:

  • That the default MPI module (cray-mpich) is not loaded. To check this, run the "module list" command. If cray-mpich is listed, use the following command:

    module unload cray-mpich

  • That the logically shared distributed memory access routines (module cray-shmem) are loaded. To check this, run the "module list" command. If cray-shmem is not listed, use the following command:

    module load cray-shmem

  • That the source code includes one of the following lines:

    INCLUDE 'mpp/shmem.fh'  ## for Fortran, or
    #include <mpp/shmem.h>  ## for C/C++

To compile a SHMEM program, use the following examples:

ftn -o shmem_program shmem_program.f90   ## for Fortran
cc -o shmem_program shmem_program.c      ## for C
CC -o shmem_program shmem_program.cpp    ## for C++

The ftn, cc, and CC wrappers resolve all SHMEM routine calls automatically. Specific mention of the SHMEM library is not required on the compilation line.

The program can then be launched using the aprun command, as follows:

aprun -n N shmem_program [user_arguments]

where N is the number of processes being started, with each process utilizing one core. The aprun command launches executables across a set of compute nodes. When each member of the parallel application has exited, aprun exits. For more information about aprun, type "man aprun".

5.1.3. Open Multi-Processing (OpenMP)

OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications. It supports shared-memory multiprocessing programming in C, C++, and Fortran, and consists of a set of compiler directives, library routines, and environment variables that influence compilation and run-time behavior.

When creating an OpenMP program on Onyx, ensure the following:

  • If using OpenMP functions (for example, omp_get_wtime), that the source code includes one of the following lines:

    INCLUDE 'omp_lib.h' ## for Fortran, or
    #include <omp.h>    ## for C/C++

    Or, if the code is written in Fortran 90 or later, the following line may be used instead:

    USE omp_lib

  • That the compile command includes an option to reference the OpenMP library. The Cray, Intel, and GNU compilers support OpenMP, and each one uses a different option.

To compile an OpenMP program for the Intel Broadwell nodes, do the following:

  • Ensure the craype-broadwell module is loaded. This module is loaded by default.
  • Compile using one of the following command examples:

    For C codes:

    cc -o OpenMP_program OpenMP_program.c       ## Cray 
    cc -o OpenMP_program -qopenmp OpenMP_program.c      ## Intel
    cc -o OpenMP_program -fopenmp OpenMP_program.c     ## GNU

    For C++ codes:

    CC -o OpenMP_program OpenMP_program.cpp     ## Cray 
    CC -o OpenMP_program -qopenmp OpenMP_program.cpp    ## Intel
    CC -o OpenMP_program -fopenmp OpenMP_program.cpp   ## GNU

    For Fortran codes:

    ftn -o OpenMP_program OpenMP_program.f      ## Cray 
    ftn -o OpenMP_program -qopenmp OpenMP_program.f     ## Intel
    ftn -o OpenMP_program -fopenmp OpenMP_program.f    ## GNU
  • See section 5.2 for additional information on available compilers.

    When running OpenMP applications, the $OMP_NUM_THREADS environment variable must be used to specify the number of threads. For example:

    #### starts 44 threads on one node, one per core
    #PBS -l select=1:ncpus=44:mpiprocs=1
    export OMP_NUM_THREADS=44
    aprun -d 44 ./OpenMP_program [user_arguments]

    Note: when the Intel compiler is used, set the environment variable $KMP_AFFINITY to either "compact" or "scatter". For example, using bash syntax:

    export KMP_AFFINITY=compact # either compact
    export KMP_AFFINITY=scatter # or scatter

    The aprun command line must also contain the option "-cc none".

    In the example above, the application starts OpenMP_program on one node and spawns a total of 44 threads. Since the Broadwell compute nodes on Onyx have 44 cores each, this yields 1 thread per core.

    To compile an OpenMP program for the KNL nodes, do the following:

    • Ensure the craype-mic-knl and craype-hugepages2M modules are loaded.
      module swap craype-broadwell craype-mic-knl
      module load craype-hugepages2M

      A different craype-hugepagesSIZE module can be used, where SIZE varies from 2M to 512M (in MBytes). Run "module avail" for a complete list of huge page modules. The size of 2 MBytes is the runtime default on KNL nodes.

    • Compile using one of the following command examples:

      For C codes:

      cc -o OpenMP_program OpenMP_program.c       ## Cray 
      cc -o OpenMP_program -qopenmp OpenMP_program.c     ## Intel
      cc -o OpenMP_program -fopenmp OpenMP_program.c     ## GNU

      For C++ codes:

      CC -o OpenMP_program OpenMP_program.cpp     ## Cray 
      CC -o OpenMP_program -qopenmp OpenMP_program.cpp   ## Intel
      CC -o OpenMP_program -fopenmp OpenMP_program.cpp   ## GNU

      For Fortran codes:

      ftn -o OpenMP_program OpenMP_program.f      ## Cray 
      ftn -o OpenMP_program -qopenmp OpenMP_program.f    ## Intel
      ftn -o OpenMP_program -fopenmp OpenMP_program.f    ## GNU

    When running OpenMP applications, the $OMP_NUM_THREADS environment variable must be used to specify the number of threads. For example:

    export OMP_NUM_THREADS=64
    aprun -d 64 ./OpenMP_program [user_arguments]

    Note, when the Intel compiler is used, the aprun command line must contain the option "-cc none".

    In the example above, the application starts OpenMP_program on one node and spawns a total of 64 threads. Since the KNL nodes have 64 cores each, this yields 1 thread per core.

    5.1.4. Hybrid Processing (MPI/OpenMP)

    An application built with the hybrid model of parallel programming can run on Onyx using both OpenMP and Message Passing Interface (MPI). In hybrid applications, OpenMP threads can be spawned by MPI processes, but MPI calls should not be issued from OpenMP parallel regions or by an OpenMP thread.

    When creating a hybrid (MPI/OpenMP) program on Onyx, follow the instructions in the MPI and OpenMP sections above for creating your program. Then use the compilation instructions for OpenMP.

    Use the aprun command and the $OMP_NUM_THREADS environment variable to run a hybrid program. You may need aprun options "-n", "-N", and "-d" to get the desired combination of MPI processes, nodes, and cores.

    aprun -n mpi_procs -N mpi_procs_per_node -d threads_per_mpi_proc mpi_program

    Note that the product of mpi_procs_per_node and threads_per_mpi_proc (-N * -d) should not exceed 44 for Intel Broadwell nodes or 64 for KNL nodes.
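This capacity rule is simple arithmetic, sketched below in C (the function name is hypothetical, purely for illustration):

```c
/* Sketch: check whether an aprun placement fits on one node.
   cores_per_node is 44 for Broadwell nodes or 64 for KNL nodes.
   The function name is hypothetical, purely for illustration. */
int placement_fits(int mpi_per_node, int threads_per_proc, int cores_per_node)
{
    /* the product -N * -d must not exceed the cores on a node */
    return mpi_per_node * threads_per_proc <= cores_per_node;
}
```

For example, "aprun -N 2 -d 22" exactly fills a 44-core Broadwell node, while "-N 2 -d 32" would oversubscribe it.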

    In the following example, we want to run 8 MPI processes on Broadwell nodes, and each MPI process needs about half the memory available on a node. We therefore request 4 nodes (176 cores). We also want each MPI process to launch 8 OpenMP threads, so we set the environment variable accordingly and assign 8 threads per MPI process in the aprun command.

    ####  MPI/OpenMP on 4 nodes, 8 MPI processes total with 8 threads each
    ## request 4 nodes, each with 44 cores and 2 processes per node
    #PBS -l select=4:ncpus=44:mpiprocs=2
    export OMP_NUM_THREADS=8      ## create 8 threads per MPI process
    aprun -n 8 -N 2 -d 8 ./xthi.x ## assigns 8 MPI processes with 2 MPI processes per node

    Note, when the Intel compiler is used, the aprun command line must contain the option "-cc none". For the Cray compiler "-cc depth" is recommended.

    Alternatively, if we wanted each MPI process to launch 22 threads, we would do the following:

    #PBS -l select=4:ncpus=44:mpiprocs=2        
    export OMP_NUM_THREADS=22  
    aprun -n 8 -N 2 -d 22 ./xthi.x

    In this example, each node gets two MPI processes, and all cores are assigned a thread. See the aprun man page for more detail on how MPI processes and threads are allocated on the nodes.

    5.1.5. Partitioned Global Address Space (PGAS)

    The Cray Fortran compiler supports Co-Array Fortran (CAF), and the Cray C compiler supports Unified Parallel C (UPC). These are PGAS extensions that enable the user to reference memory locations on any node, without the need for message-passing protocols. This can greatly simplify writing and debugging a parallel code. These compilers also allow the user to combine PGAS programming constructs with the flexibility of message-passing protocols. The PGAS extensions are not available for the Intel or GNU compilers.

    Cray Fortran and C reference manuals currently refer the reader to external sources for details on the CAF and UPC concepts and syntax. Users should follow the Cray document links in the Links to Vendor Documentation at the end of this guide to locate these reference manuals. The manuals will provide further links to the external sources.

    Compilation of UPC and CAF codes is straightforward. Simply use the standard Cray compilers with the following flags:

    ftn -o myprog -h caf  myprog.f     ## for Fortran
    cc -o myprog -h upc  myprog.c      ## for C
    CC -o myprog -h upc  myprog.cpp    ## for C++

    Use the aprun command to execute the program as described above for MPI programs:

    #PBS -l select=2:ncpus=44:mpiprocs=44
    aprun -n 88 ./myprog

    Many users of PGAS extensions will also use MPI or SHMEM calls in their codes. In such cases, be sure to use the appropriate include statements in your source code, as described above.

    5.2. Available Compilers

    Onyx has three programming environment suites:

    • Cray Fortran and C/C++
    • Intel
    • GNU

    On Onyx, different sets of compilers are used to compile codes for serial vs. parallel execution.

    Compiling for the Compute Nodes

    Codes compiled to run on the compute nodes may be serial or parallel. To compile codes for execution on the compute nodes, the same compile commands are available in all programming environment suites as shown in the following table:

    Compute Node Compiler Commands
    Language     Cray   Intel   GNU    Serial/Parallel
    C            cc     cc      cc     Serial/Parallel
    C++          CC     CC      CC     Serial/Parallel
    Fortran 77   ftn    ftn     ftn    Serial/Parallel
    Fortran 90   ftn    ftn     ftn    Serial/Parallel

    Compiling for the Login Nodes

    Codes may be compiled to run on the login nodes by using serial compiler commands from the table below.

    Serial-Only Compiler Commands
    Language     Cray   Intel   GNU        Serial/Parallel
    C            cc     icc     gcc        Serial
    C++          CC     icpc    g++        Serial
    Fortran 77   ftn    ifort   gfortran   Serial
    Fortran 90   ftn    ifort   gfortran   Serial

    Changing Compiler Suites

    The Cray programming environment is loaded for you by default. To use a different suite, you will need to swap modules. See Relevant Modules (below) to learn how.

    5.2.1. Cray Compiler Environment

    The Cray compiler suite has a long tradition of high performance, with excellent vectorization (it vectorizes more loops than other compilers) and cache optimization (automatic blocking and automatic management of what stays in cache).

    The Partitioned Global Address Space (PGAS) languages, such as Unified Parallel C (UPC) and Co-Array Fortran (CAF), are supported on Onyx via the Cray compilers.

    The following table lists some of the more common options that you may use:

    Cray Compiler Options
    Option              Purpose
    -c                  Generate intermediate object file but do not attempt to link.
    -I directory        Search in directory for include or module files.
    -L directory        Search in directory for libraries.
    -o outfile          Name executable "outfile" rather than the default "a.out".
    -Olevel             Set the optimization level. For more information on optimization, see the section on Profiling and Optimization.
    -f free             Process Fortran codes using free form.
    -h byteswapio       Big-endian files; the default is little-endian.
    -g                  Generate symbolic debug information.
    -s integer64        Treat integer variables as 64-bit.
    -s real64           Treat real variables as 64-bit.
    -s default64        Pass -s integer64 and -s real64 to the compiler.
    -h noomp            Disable OpenMP directives.
    -h upc (C only)     Recognize UPC.
    -h caf              Recognize Co-Array Fortran.
    -h dynamic          Use shared objects. With the ftn, cc, or CC wrappers, use "-dynamic".
    -Ktrap=*            Trap errors such as floating point, overflow, and divide by zero (see man page).
    -fPIC               Generate position-independent code for shared libraries.

    Detailed information about these and other compiler options is available in the Cray compiler (ftn, cc, and CC) man pages on Onyx.

    5.2.2. Intel Compiler Environment

    The following table lists some of the more common options that you may use:

    Intel Compiler Options
    Option                  Purpose
    -c                      Generate intermediate object file but do not attempt to link.
    -I directory            Search in directory for include or module files.
    -L directory            Search in directory for libraries.
    -o outfile              Name executable "outfile" rather than the default "a.out".
    -Olevel                 Set the optimization level. For more information on optimization, see the section on Profiling and Optimization.
    -free                   Process Fortran codes using free form.
    -convert big_endian     Big-endian files; the default is little-endian.
    -g                      Generate symbolic debug information.
    -qopenmp                Recognize OpenMP directives.
    -Bdynamic               Use shared objects. With the ftn, cc, or CC wrappers, use "-dynamic".
    -fpe-all=0              Trap floating point, divide by zero, and overflow exceptions.
    -fPIC                   Generate position-independent code for shared libraries.
    -assume buffered_io,
    -assume nobuffered_io   Alter buffering and may improve I/O performance for certain applications.

    Detailed information about these and other compiler options is available in the Intel compiler (ifort, icc, and icpc) man pages on Onyx.

    5.2.3. GNU Compiler Collection

    The GNU Programming Environment provides a large number of options that are the same for all compilers in the suite. The following table lists some of the more common options that you may use:

    GNU Compiler Options
    Option                  Purpose
    -c                      Generate intermediate object file but do not attempt to link.
    -I directory            Search in directory for include or module files.
    -L directory            Search in directory for libraries.
    -o outfile              Name executable "outfile" rather than the default "a.out".
    -Olevel                 Set the optimization level. For more information on optimization, see the section on Profiling and Optimization.
    -g                      Generate symbolic debug information.
    -fconvert=big-endian    Big-endian files; the default is little-endian.
    -Wextra, -Wall          Turn on increased warning reporting.

    Detailed information about these and other compiler options is available in the GNU compiler (gfortran, gcc, and g++) man pages on Onyx.

    5.2.4. Static Versus Dynamic Executables

    On Onyx, the default for linking with the ftn, cc, or CC wrappers is to create a static executable. The option "-dynamic" selects the dynamic link option appropriate for the underlying compiler. Moreover, rather than changing Makefiles or other build procedures, you can use the environment variable $CRAYPE_LINK_TYPE to set the default linker behavior. To set the default to dynamic linking, do the following:

    export CRAYPE_LINK_TYPE=dynamic

    Cray-supplied libraries such as NetCDF, HDF5, and FFTW do have static versions. On the other hand, static libraries are not available for some basic operating system libraries, such as those for X11. When using those libraries, setting $CRAYPE_LINK_TYPE to "dynamic" is necessary.

    Note that for large arrays, it is sometimes necessary to compile with "-mcmodel=medium" or "-mcmodel=large". In these two cases, the linking must be dynamic.

    5.2.5. Huge Pages

    When large arrays are used, the page size of virtual memory should be increased above the default of 4 KBytes. Otherwise, there will be too much memory access latency due to frequent table lookup of virtual memory placement. For this reason, the module "craype-hugepages2M" is loaded automatically. For programs with very large arrays, even larger pages are available. The choices can be seen using the command:

    module avail craype-hugepages

    The choice of craype-hugepages* module should be consistent between compiling and running the executable.

    Huge pages are not supported for jobs executing on a login node or batch node. If a serial executable compiled with the compiler wrappers (ftn, cc, or CC) is run on a login or batch node (e.g., a short preprocessing step), a warning that huge pages are not available will be printed. In other words, the craype-hugepages2M module on the login nodes supports compiling and linking for the compute nodes, not execution on the login nodes. To avoid the warning message, use the command:

    module unload craype-hugepages2M

    5.3. Relevant Modules

    By default, Onyx loads the Cray programming environment for you. The Intel and GNU environments are also available. To use either of these, the Cray environment must be unloaded and replaced with the one you wish to use. To do this, use the "module swap" command as follows:

    module swap PrgEnv-cray PrgEnv-intel         ## To switch to Intel
    module swap PrgEnv-cray PrgEnv-gnu           ## To switch to GNU

    In addition to the compiler suites, all of these modules also load the MPICH and LibSci modules. The MPICH module initializes MPI. The LibSci module, cray-libsci, includes solvers and single-processor and parallel routines that have been tuned for optimal performance on Cray XC systems (BLAS, LAPACK, ScaLAPACK, etc.). For additional information on the MPICH and LibSci modules, see the intro_mpi and intro_libsci man pages on Onyx.

    The table below shows the naming convention for various programming environment modules.

    Programming Environment Modules
    Compiler Suite   Module Name
    Cray CCE         PrgEnv-cray
    Intel            PrgEnv-intel
    GNU              PrgEnv-gnu

    Under each programming environment, the compiler version can be changed. With the default Cray programming environment, for example, the compiler version can be changed from the default to version 8.5.6 with this command:

    module swap cce cce/8.5.6

    Use the "module avail" command to see all the available compiler versions for Cray CCE, Intel, and GNU.

    A number of Cray-optimized libraries (e.g., FFTW, HDF5, NetCDF, and PETSc) are available on Onyx with associated module files to set up the necessary environment. As the environment depends on the active PrgEnv-* module, users should load library-related module files after changing the PrgEnv-* module.

    When using SHMEM, load the cray-shmem module, as follows:

    module load cray-shmem

    For more information on using modules, see the Modules User Guide.

    5.4. Libraries

    Cray's Scientific and Math Libraries and Intel's Math Kernel Library (MKL) are both available on Onyx. In addition, an extensive suite of math and science libraries is available in the $COST_HOME directory.

    5.4.1. Cray Scientific and Math Libraries (CSML, also known as LibSci)

    Onyx provides the Cray Scientific and Math Libraries (CSML) as part of the modules that are loaded by default. These libraries are a collection of single-processor and parallel numerical routines that have been tuned for optimal performance on Cray XC systems. The CSML libraries contain optimized versions of the BLAS math routines as well as a host of tuned numerical routines for common mathematical computations. Users can utilize the CSML routines, instead of public domain or user-written versions, to optimize application performance on Onyx.

    Users familiar with Cray LibSci on XT and older XC systems should note that Cray LibSci on Onyx is quite different.

    There are serial, MPI, and multithreaded versions of Cray LibSci. Use the command "man intro_libsci" for more information. The correct routines in the LibSci module (cray-libsci) are automatically included when using the ftn, cc, or CC commands. You do not need to use the "-l sci" flag in your compile command line.

    LibSci includes the following:

    • Basic Linear Algebra Subroutines (BLAS) - Levels 1, 2, and 3
    • CBLAS (C interface to the legacy BLAS)
    • Linear Algebra Package (LAPACK)
    • Basic Linear Algebra Communication Subprograms (BLACS)
    • Scalable LAPACK (ScaLAPACK) (distributed-memory parallel set of LAPACK routines)
    and two libraries unique to Cray:
    • Iterative Refinement Toolkit (IRT)
    • CrayBLAS (a library of BLAS routines autotuned for the Cray XC series).

    The IRT routines may be used by setting the environment variable $IRT_USE_SOLVERS to 1, or by coding an explicit call to an IRT routine. More information is available by using the "man intro_irt" command.

    Other components of CSML are as follows:

    • Fourier Transformations (FFTW versions 2 and 3)
    • PETSc (Portable, Extensible Toolkit for Scientific Computation)
    • Trilinos
    • Cray LibSci_ACC

    The FFTW routines are not loaded by default. Access to FFTW releases 2 and 3 is available by loading the module fftw/2.1.5.9 or fftw/3.3.4.11.

    Use the command "module avail cray-petsc" to see the available versions. Use "man intro_petsc" for more information on PETSc contents and use.

    The Trilinos Project package developed by Sandia National Laboratories is available by loading the module cray-trilinos. More information is available by using "man intro_trilinos".

    Cray supplies a collection of third party scientific libraries (TPSL) and solvers by loading the module cray-tpsl or cray-tpsl-64. The libraries contain the following:

    • MUMPS
    • ParMetis
    • SuperLU
    • SuperLU_DIST
    • Hypre
    • Scotch
    • Sundials

    TPSL are required to support the PETSc libraries, and the appropriate TPSL module is loaded automatically when the PETSc module is loaded. TPSL is also required to support Trilinos. Development of Trilinos is asynchronous from PETSc, so always use the version of TPSL that is loaded automatically when Trilinos is used.

    The Cray LibSci_ACC library contains BLAS, LAPACK, and ScaLAPACK routines optimized for GPU accelerators. After loading the module craype-accel-nvidia60, more information is available with the command "man intro_libsci_acc".

    5.4.2. Intel Math Kernel Library (Intel MKL)

    Onyx provides the Intel Math Kernel Library (Intel MKL), a set of numerical routines tuned specifically for Intel platform processors and optimized for math, scientific, and engineering applications. The routines, which are available via both FORTRAN and C interfaces, include:

    • LAPACK plus BLAS (Levels 1, 2, and 3)
    • ScaLAPACK plus PBLAS (Levels 1, 2, and 3)
    • Fast Fourier Transform (FFT) routines for single-precision, double-precision, single-precision complex, and double-precision complex data types
    • Discrete Fourier Transforms (DFTs)
    • Fast Math and Fast Vector Library
    • Vector Statistical Library Functions (VSL)
    • Vector Transcendental Math Functions (VML)

    The MKL routines are part of the Intel Programming Environment as Intel's MKL is bundled with the Intel Compiler Suite.

    Linking to the Intel Math Kernel Libraries can be complex and is beyond the scope of this introductory guide. Documentation explaining the full feature set along with instructions for linking can be found at the Intel Math Kernel Library documentation page.

    Intel also makes a link advisor available to assist users with selecting proper linker and compiler options: http://software.intel.com/sites/products/mkl/.

    5.4.3. Additional Math Libraries

    There is also an extensive set of Math libraries available in the $COST_HOME directory (/p/app/unsupported/COST) on Onyx. The modules for accessing these libraries are also available in the default module path. Information about these libraries may be found on the Baseline Configuration website at BC policy FY13-01.

    5.5. Debuggers

    Onyx provides the TotalView debugger and the DDT Debugger to assist users in debugging their code.

    5.5.1. TotalView and DDT

    TotalView is a debugger that supports threads, MPI, OpenMP, C/C++, Fortran, and mixed-language codes. It provides advanced features such as on-demand memory leak detection, other heap allocation debugging features, and the Standard Template Library Viewer (STLView). Unique features such as dive, a wide variety of breakpoints, the Message Queue Graph/Visualizer, powerful data analysis, and control at the thread level are also available.

    DDT is a debugger that supports threads, MPI, OpenMP, C/C++, Fortran, Co-Array Fortran, UPC, and CUDA. Memory debugging and data visualization are supported for large-scale parallel applications. The Parallel Stack Viewer is a unique way to see the program state of all processes and threads at a glance.

    TotalView

    Follow the steps below to use TotalView on Onyx via a UNIX X-Windows interface.

    1. Ensure that an X server is running on your local system. Linux users will likely have this by default, but MS Windows users will need to install a third party X Windows solution. There are various options available. Currently, we recommend Xming.
    2. For Linux users, connect to Onyx using "ssh -Y". Windows users will need to use PuTTY with X11 forwarding enabled (Connection->SSH->X11->Enable X11 forwarding).
    3. Compile your program on Onyx with the "-g" option.
    4. Submit an interactive job:

      qsub -l select=1:ncpus=44:mpiprocs=44 -A Project_ID -l walltime=00:30:00 -q debug -l application=Application_Name -X -I

      Once your job has been scheduled, you will be logged into an interactive batch session on a service node that is shared with other users.

    5. Load the TotalView module:

      module load totalview
    6. Start program execution:

      totalview aprun -a -n 4 ./my_mpi_prog.exe arg1 arg2 ...

    7. After a short delay, the TotalView windows will pop up. Click "GO" and then "Yes" to start program execution.

    An example of using TotalView can be found in $SAMPLES_HOME/Programming/Totalview_Example on Onyx. For more information on using TotalView, see the TotalView Documentation page.

    DDT

    To use DDT on Onyx, follow steps 1 through 4 as for TotalView above, but load and use the DDT debugger instead.

    1. Load the DDT module:

      module load ddt

    2. Start program execution:

      ddt -n 4 ./my_mpi_prog.exe arg1 arg2 ...

    3. The DDT window will pop up. Verify the application name and number of MPI processes. Click "Run".

    An example of using DDT can be found in $SAMPLES_HOME/Programming/DDT_Example on Onyx. For more information on using DDT, see the DDT User Guide.

    5.6. Code Profiling and Optimization

    Profiling is the process of analyzing the execution flow and characteristics of your program to identify sections of code that are likely candidates for optimization. Optimization increases the performance of a program by modifying certain aspects of the code for greater efficiency.

    We provide CrayPat to assist you in the profiling process. In addition, a basic overview of optimization methods with information about how they may improve the performance of your code can be found in Performance Optimization Methods (below).

    5.6.1. CrayPat

    CrayPat is an optional performance analysis tool used to evaluate program behavior on Cray supercomputer systems. The simplest approach is to use the module perftools-lite. Additional features are available by loading the module perftools instead of perftools-lite.

    The first step is to load the perftools-base module.

    module load perftools-base

    After loading perftools-base, the command

    module avail perftools

    will show all of the modules associated with Perftools. Moreover, after loading perftools-base, the following man pages will be available: intro_craypat, pat_build, pat_help, craypat_lite, grid_order, app2, and reveal. For additional information, see the Cray Performance Measurement and Analysis Tools User Guide (6.4.5) S-2376 on Cray's documentation website, https://pubs.cray.com.

    In order for the Perftools modules to have an effect, the Cray compiler wrappers (ftn, cc and CC) must be used. To use the "lite" versions, use the following commands:

    module load perftools-base
    module load perftools-lite

    Or, instead of perftools-lite, utilize perftools-lite-events, perftools-lite-gpu, perftools-lite-hbm, or perftools-lite-loops. Basic usage information can be found using "module help". For example,

    module help perftools-lite-loops

    or,

    man perftools-lite

    Using the "lite" versions, it is not necessary to change your Makefile or other build procedures -- though the source does need to be rebuilt after the Perftools-related modules have been loaded. After the executable, prog.exe for example, is run using aprun,

    aprun prog.exe [program-options]

    the standard output will include performance statistics, which are also written to the file prog.exe.rpt. The report includes the high-water mark of memory usage.

    If the non-lite version of Perftools is used, e.g.

    module load perftools-base
    module load perftools

    then the compile and linking stages should be separate, using "-c" for the compiling stage. This is because the object files (*.o files) need to be available. After generating an executable, the pat_build command is used to generate an instrumented executable that is launched with aprun. Two examples are:

    pat_build prog.exe
    pat_build -u -g mpi prog.exe

    Either command generates an executable, prog.exe+pat, which is launched with aprun. See the Cray Performance Measurement and Analysis Tools User Guide (6.4.5) S-2376 to learn about the many options and implementation details. For example, sampling or tracing are available. As another example, the Automatic Program Analysis (APA) feature results in a text file being generated by the run which contains suggestions for pat_build tracing options.

    After a run, an output file with suffix ".xf" is available, which can be used as input to the command pat_report. Performance can be visualized using the command app2. Moreover, the GUI reveal is a tool developed by Cray to help with developing the hybrid MPI/OpenMP programming model.

    5.6.2. Additional Profiling Tools

    There is also a set of profiling tools available in the $COST_HOME directory (/p/app/unsupported/COST) on Onyx. The modules for accessing these tools are also available in the default module path. Information about these tools may be found on the Baseline Configuration website at BC policy FY13-01.

    5.6.3. Program Development Reminders

    If an application is not programmed for distributed memory, then only the cores on a single node can be used. This is limited to 44 cores on Onyx.

    Keep the system architecture in mind during code development. For instance, if your program requires more memory than is available on a single node, then you will need to parallelize your code so that it can function across multiple nodes.

    5.6.4. Compiler Optimization Options

    The "-Olevel" option enables code optimization when compiling. The level that you choose (0-4) will determine how aggressive the optimization will be. Increasing levels of optimization may increase performance significantly, but you should note that a loss of precision may also occur. There are also additional options that may enable further optimizations. The following table contains the most commonly used options.

    Compiler Optimization Options
    Option Description Compiler Suite
    -O0 No Optimization. (default in GNU) All
    -O1 Scheduling within extended basic blocks is performed. Some register allocation is performed. No global optimization. All
    -O2 Level 1 plus traditional scalar optimizations such as induction recognition and loop invariant motion are performed by the global optimizer. Generally safe and beneficial. (default in Cray & Intel) All
    -O3 Levels 1 and 2 plus more aggressive code hoisting and scalar replacement optimizations that may or may not be profitable. Generally beneficial. All
    -fipa-* The GNU compilers automatically enable IPA at various -O levels. To set these manually, see the options beginning with -fipa in the gcc man page. GNU
    -O ipan Specifies various levels of inlining (n=0-5) (default: n=3) Cray
    -O vectorn Specifies various levels of vectorization (n = 0-3) (default: n=2) Cray
    -finline-functions Enables function inlining within a single file Intel
    -ipon Enables interprocedural optimization between files and produces up to n object files Intel
    -inline-level=n Number of levels of inlining (default: n=2) Intel
    -ra Creates a listing file with optimization info Cray
    -opt-reportn Generate optimization report with n levels of detail Intel
    5.6.5. Performance Optimization Methods

    Performance can be optimized through several means, not all of which require rewriting code. For example, core affinity, the manner in which processes are assigned to cores, can be optimized from the aprun command line. Some code optimization can be achieved simply by changing compiler flags. Even greater performance improvement may come from altering the code. The latter two techniques may increase compilation time, executable size, and debugging difficulty. However, optimization can result in code that runs significantly faster. The optimizations that you can use will vary depending on your code and the system on which you are running.

    Note: Before considering optimization, you should always ensure that your code runs correctly and produces valid output.

    Some of the core affinity optimizations include:

    • Hybrid and Threaded Code Considerations
    • More Memory Per Process
    • Core Specialization

    In general, there are four main categories of code optimization:

    • Global Optimization
    • Loop Optimization
    • Interprocedural Analysis and Optimization (IPA)
    • Function Inlining
    Global Optimization

    A technique that looks at the program as a whole and may perform any of the following actions:

    • Is performed on code over all its basic blocks
    • Performs control-flow and data-flow analysis for an entire program
    • Detects all loops, including those formed by IF and GOTO statements, and performs general optimizations, such as:
    • Constant propagation
    • Copy propagation
    • Dead store elimination
    • Global register allocation
    • Invariant code motion
    • Induction variable elimination
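    The effect of several of these transformations can be seen in a small example. The following hypothetical C fragment (not from the guide) shows a function before optimization and the form the optimizer effectively reduces it to via constant propagation, copy propagation, constant folding, and dead store elimination:

    ```c
    #include <stdio.h>

    /* Before optimization: the compiler sees a constant, a copy,
     * a foldable expression, and a dead store. */
    int before(void) {
        int a = 4;      /* constant */
        int b = a;      /* copy propagation: b is known to be 4 */
        int c = b * b;  /* constant folding: c is known to be 16 */
        a = 99;         /* dead store: a is never read again, so it is eliminated */
        return c;
    }

    /* What the optimizer effectively reduces before() to. */
    int after(void) {
        return 16;
    }

    int main(void) {
        printf("%d %d\n", before(), after());  /* prints "16 16" */
        return 0;
    }
    ```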
    Loop Optimization

    A technique that focuses on loops (for, while, etc.) in your code and looks for ways to reduce loop iterations or parallelize the loop operations. The following types of actions may be performed:

    • Vectorization - rewrites loops to improve memory access performance. Some compilers may also support automatic loop vectorization by converting loops to utilize low-level hardware instructions and registers if they meet certain criteria.
    • Loop unrolling - (also known as "unwinding") replicates the body of loops to reduce loop branching overhead and provide better opportunities for local optimization.
    • Parallelization - divides loop operations over multiple processors where possible.
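    Loop unrolling, for instance, can be written out by hand to show what the compiler does automatically. This is an illustrative sketch, not code from the guide; it assumes the trip count is a multiple of the unroll factor:

    ```c
    #include <stdio.h>

    #define N 1000

    /* Straightforward loop: one add and one branch test per element. */
    static long sum_rolled(const int *v) {
        long s = 0;
        for (int i = 0; i < N; i++)
            s += v[i];
        return s;
    }

    /* Unrolled by 4: a quarter of the branch tests, and the four
     * independent partial sums give the CPU more instruction-level
     * parallelism. N is assumed to be a multiple of 4 here; real
     * code needs a cleanup loop for the remainder. */
    static long sum_unrolled(const int *v) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < N; i += 4) {
            s0 += v[i];
            s1 += v[i + 1];
            s2 += v[i + 2];
            s3 += v[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    int main(void) {
        int v[N];
        for (int i = 0; i < N; i++)
            v[i] = i;
        printf("%ld %ld\n", sum_rolled(v), sum_unrolled(v));  /* prints "499500 499500" */
        return 0;
    }
    ```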
    Interprocedural Analysis and Optimization (IPA)

    A technique that allows the use of information across function call boundaries to perform optimizations that would otherwise be unavailable.

    Function Inlining

    A technique that seeks to reduce function call and return overhead. It:

    • Is used with functions that are called numerous times from relatively few locations.
    • Allows a function call to be replaced by a copy of the body of that function.
    • May create opportunities for other types of optimization.
    • May not be beneficial. Improper use may increase code size and actually result in less efficient code.
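    A minimal C sketch (illustrative only) of a function that is a good inlining candidate — small, hot, and called from one place:

    ```c
    #include <stdio.h>

    /* "static inline" invites the compiler to substitute the function
     * body at each call site, removing call/return overhead for this
     * tiny, frequently called function. Flags such as
     * -finline-functions and -inline-level=n (Intel) control this
     * behavior globally. */
    static inline int squared(int x) {
        return x * x;
    }

    int main(void) {
        long total = 0;
        for (int i = 1; i <= 10; i++)
            total += squared(i);   /* effectively total += i * i; */
        printf("%ld\n", total);    /* prints "385" */
        return 0;
    }
    ```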

    6. Batch Scheduling

    6.1. Scheduler

    The Portable Batch System (PBS) is currently running on Onyx. It schedules jobs and manages resources and job queues, and can be accessed through the interactive batch environment or by submitting a batch request. PBS is able to manage both single-processor and multiprocessor jobs. The PBS module is automatically loaded for you when you log in.

    6.2. Queue Information

    The following table describes the PBS queues available on Onyx. Jobs with high, frontier, and standard priority are handled differently depending on the requested walltime and core count.

    Users should submit directly to high, frontier, or standard queues, which are routing queues. Jobs will be moved automatically into the appropriate large job "_lg", small job "_sm", or long walltime "_lw" queues.

    Job priority starts at an initial value based on core count and the queue to which the job was submitted. It then increases for each hour that the job has been waiting to run.

    Queue Descriptions and Limits
    (Queues are listed in order of decreasing priority, from highest to lowest.)

    Priority | Queue Name  | Job Class  | Max Wall Clock Time | Max Jobs | Min Cores Per Job | Max Cores Per Job | Comments
    Highest  | urgent      | Urgent     | 24 Hours  | N/A | 22     | 7,260   | Designated urgent jobs by DoD HPCMP
             | test        | N/A        | 24 Hours  | N/A | 22     | N/A     | Staff-only testing
             | debug       | Debug      | 1 Hour    | 4   | 22     | 11,484  | User testing
             | HIE         | Debug      | 24 Hours  | 2   | 22     | 110     | HPC Interactive Environment
             | high_lg     | High       | 24 Hours  | 2   | 8,449  | 105,820 | Designated high-priority jobs by Service/Agency (large jobs)
             | high_sm     | High       | 24 Hours  | 70  | 22     | 8,448   | Designated high-priority jobs by Service/Agency (small jobs)
             | high_lw     | High       | 168 Hours | 3   | 22     | 10,824  | Designated high-priority jobs by Service/Agency (long walltime)
             | frontier_lg | Frontier   | 24 Hours  | 2   | 7,261  | 143,968 | Frontier projects only (large jobs)
             | frontier_sm | Frontier   | 48 Hours  | 70  | 22     | 7,260   | Frontier projects only (small jobs)
             | frontier_lw | Frontier   | 168 Hours | 15  | 22     | 15,708  | Frontier projects only (long walltime)
             | frontier_md | Frontier   | 96 Hours  | 2   | 15,709 | 34,540  | Frontier projects only (medium sized, long walltime)
             | standard_lg | Standard   | 24 Hours  | 2   | 7,261  | 105,820 | Normal priority jobs (large jobs)
             | standard_sm | Standard   | 24 Hours  | 70  | 22     | 7,260   | Normal priority jobs (small jobs)
             | standard_lw | Standard   | 168 Hours | 3   | 22     | 5,808   | Normal priority jobs (long walltime)
             | transfer    | N/A        | 48 Hours  | 6   | 1      | 1       | Data transfer jobs; access to long-term storage
    Lowest   | background  | Background | 4 Hours   | 6   | 22     | 7,260   | Unrestricted access - no allocation charge

    6.3. Interactive Logins

    When you log in to Onyx, you will be running in an interactive shell on a login node. The login nodes provide login access for Onyx and support such activities as compiling, editing, and general interactive use by all users. Please note the Login Node Abuse policy. The preferred method to run resource-intensive executions is to use Cluster Compatibility Mode from within an interactive batch session.

    6.4. Interactive Batch Sessions

    To get an interactive batch session, you must first submit an interactive batch job through PBS. This is done by executing a qsub command with the "-I" option from within the interactive login environment. For example:

    qsub -l select=N1:ncpus=44:mpiprocs=N2 -A Project_ID -q queue_name -l walltime=HHH:MM:SS -l application=Application_Name -I

    You must specify the number of nodes requested (N1), the number of processes per node (N2), the desired maximum walltime, your project ID, and a job queue. Valid values for N2 are 1, 2, 4, 11, 22, and 44.
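    As a concrete sketch, a one-hour, two-node interactive session in the debug queue might be requested as follows ("ABC123" is a placeholder project ID; substitute your own values):

    ```shell
    # 2 nodes (88 cores), 22 MPI processes per node, debug queue, 1 hour.
    # -I makes the job interactive; "ABC123" is a placeholder project ID.
    qsub -l select=2:ncpus=44:mpiprocs=22 -A ABC123 -q debug -l walltime=01:00:00 -l application=other -I
    ```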

    Your interactive batch sessions will be scheduled just as normal batch jobs are scheduled depending on the other queued batch jobs, so it may take quite a while. Once your interactive batch shell starts, it will be running on a service node that is shared by other users. At this point, you can launch parallel applications onto your assigned set of compute nodes by using the aprun command. You can also run interactive commands or scripts on this service node, but you should limit your memory and CPU usage. Use the Cluster Compatibility Mode for executing memory- and process-intensive commands such as tar and gzip/gunzip and certain serial applications directly on a dedicated compute node.

    6.5. Cluster Compatibility Mode (CCM)

    You can also request direct access to a compute node by including the "ccm" option in your PBS interactive batch job submission. For example:

    qsub -l ccm=1 -l select=N1:ncpus=44:mpiprocs=N2 -A Project_ID -q queue_name -l walltime=HHH:MM:SS -l application=Application_Name -I

    You must specify the number of nodes requested (N1), the number of processes per node (N2), the desired maximum walltime, your project ID, and a job queue.

    Once scheduled by the PBS scheduler, you will again have an interactive shell session on a shared service node. Then, issue the ccmlogin command, and you will be logged onto the first compute node in the set of nodes to which you have been assigned. Your environment will react much the same as a normal shared service node. However, you will now have dedicated access to the entire compute node which will allow you to run serial applications as well as memory- and process-intensive commands such as tar and gzip/gunzip without affecting other users.
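    The whole CCM workflow can be sketched as follows (illustrative only; "ABC123" is a placeholder project ID):

    ```shell
    # 1. Submit an interactive job with CCM enabled. The shell you get
    #    runs on a shared service node.
    qsub -l ccm=1 -l select=1:ncpus=44:mpiprocs=44 -A ABC123 -q debug -l walltime=01:00:00 -I

    # 2. From that shell, log on to the first compute node assigned to you.
    ccmlogin

    # 3. You now have dedicated access to the entire node, so memory- and
    #    process-intensive serial work is safe here.
    tar -czf results.tar.gz ./results
    ```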

    6.6. Batch Request Submission

    PBS batch jobs are submitted via the qsub command. The format of this command is:

    qsub [ options ] batch_script_file

    qsub options may be specified on the command line or embedded in the batch script file by lines beginning with "#PBS".

    For a more thorough discussion of PBS batch submission on Onyx, see the Onyx PBS Guide.

    6.7. Batch Resource Directives

    Batch resource directives allow you to specify to PBS how your batch jobs should be run and what resources your job requires. Although PBS has many directives, you only need to know a few to run most jobs.

    The basic syntax of PBS directives is as follows:

    #PBS option[[=]value]

    where some options may require values to be included. For example, to start a 22-process job, you would request one node of 44 cores and specify that you will be running 22 processes per node:

    #PBS -l select=1:ncpus=44:mpiprocs=22

    The following directives are required for all jobs:

    Required PBS Directives
    Directive Value Description
    -A Project_ID Name of the project
    -q queue_name Name of the queue
    -l select=N1:ncpus=44:mpiprocs=N2 Standard compute node:
    N1 = Number of nodes
    N2 = MPI processes per node
    (N2 must be: 1, 2, 4, 11, 22, or 44)
    -l select=N1:ncpus=22:mpiprocs=N2:ngpus=1 GPU node:
    N1 = Number of nodes
    N2 = MPI processes per node
    (N2 must be: 1, 2, 11, or 22)
    -l select=N1:ncpus=44:mpiprocs=N2:bigmem=1 Large-memory node:
    N1 = Number of nodes
    N2 = MPI processes per node
    (N2 must be: 1, 2, 4, 11, 22, or 44)
    -l select=N1:ncpus=64:mpiprocs=N2:nmics=1:aoe=M_C
    (or, without aoe: -l select=N1:ncpus=64:mpiprocs=N2:nmics=1)
    Knights Landing node:
    N1 = Number of nodes
    N2 = MPI processes per node
    (N2 must be: 1, 2, 4, 8, 16, 32, or 64)
    M = cluster configuration (snc4 or quad)
    C = percent of MCDRAM as cache (0, 50, or 100)
    Without aoe, the default is quad_100
    -l walltime=HHH:MM:SS Maximum wall time

    A more complete listing of batch resource directives is available in the Onyx PBS Guide.

    The KNL configuration is determined by the cluster mode ("quad" or "snc4") and the memory mode ("0", "50", or "100"), both described in the Onyx KNL Quick Start Guide. To summarize: in "quad" mode, each KNL node acts as one NUMA node, whereas in "snc4" mode, each KNL node consists of four NUMA nodes. Memory access may be faster when each NUMA node uses the memory port nearest its cores, which can occur when the cluster mode is "snc4".

    The memory mode can have a significant impact on performance. The high-bandwidth MCDRAM can act transparently as a cache for all main memory accesses (mode "100") or as a separate memory (mode "0"), called "flat" memory. For mode "50", half of the MCDRAM serves as cache and half serves as a separate memory. If any part of MCDRAM acts as a separate memory, the user can utilize specialized memory allocation routines to specify which arrays use MCDRAM and which use the lower-bandwidth DDR main memory; see "man memkind" and "man hbwmalloc". The most commonly used options are "aoe=quad_100", "aoe=quad_0", and "aoe=snc4_100". If a program has been written to use MCDRAM as separate memory (e.g., "aoe=quad_0") and it is found that less than half of MCDRAM is actually being used, then the option "aoe=quad_50" may improve performance.
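    A minimal sketch of flat-mode MCDRAM allocation with the memkind hbwmalloc interface follows. This is illustrative, not from the guide: on a KNL node booted with MCDRAM as separate memory (e.g., aoe=quad_0), compile with -DUSE_HBW and link against memkind; without -DUSE_HBW it falls back to plain malloc so the sketch builds anywhere.

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    /* Place only the bandwidth-critical array in MCDRAM via
     * hbw_malloc(); everything else stays in DDR main memory. */
    #ifdef USE_HBW
    #include <hbwmalloc.h>          /* memkind high-bandwidth memory API */
    #define BW_ALLOC(sz) hbw_malloc(sz)
    #define BW_FREE(p)   hbw_free(p)
    #else
    #define BW_ALLOC(sz) malloc(sz) /* fallback when built off-KNL */
    #define BW_FREE(p)   free(p)
    #endif

    int main(void) {
        size_t n = 1u << 20;                    /* 1 Mi doubles (~8 MB) */
        double *a = BW_ALLOC(n * sizeof *a);    /* bandwidth-critical array */
        if (!a) { perror("alloc"); return 1; }

        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            a[i] = 1.0;
            s += a[i];
        }
        printf("sum = %.0f\n", s);              /* prints "sum = 1048576" */

        BW_FREE(a);
        return 0;
    }
    ```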

    6.8. Launch Commands

    On Onyx, PBS batch scripts and PBS interactive login sessions run on a service node, not a compute node. The only way to send your executable to the compute nodes is to use the aprun command. The following example command line could be used within your batch script or in a PBS interactive session; it sends the executable ./a.out to 88 compute cores.

    aprun -n 88 ./a.out

    Common Options for aprun
    Option Description
    -n #   The total number of MPI processes.
    -N #   The number of MPI processes to place per node.
           Useful for getting more memory per MPI process.
    -d #   The number of OpenMP threads per MPI process.
           May also be used to control the placement of processes on cores.
    -B     Directs aprun to get the values for -n, -N, and -d from the PBS directives
           instead of from the aprun command line. Simplifies the command and saves time.
    -S #   The number of MPI processes to place per NUMA node.
           Useful for getting more L3 cache per process.
    -j 2   Run using hyperthreads. (Values of 2 or 4 are permitted on KNL nodes.)

    For more in-depth discussion of the aprun options, consult the aprun man page or the Onyx PBS Guide.
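    A few placement sketches show how these options combine (illustrative only; ./a.out and ./hybrid.out are placeholder executables):

    ```shell
    # 88 MPI processes with default placement: 44 per node on 2 nodes.
    aprun -n 88 ./a.out

    # 44 processes spread over 2 nodes at 22 per node: each process
    # sees roughly twice the memory.
    aprun -n 44 -N 22 ./a.out

    # Hybrid MPI/OpenMP: 4 MPI processes per node, 11 threads each,
    # filling the 44 cores of each of 2 nodes.
    export OMP_NUM_THREADS=11
    aprun -n 8 -N 4 -d 11 ./hybrid.out
    ```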

    A serial executable can be sent to one compute node using aprun or ccmrun:

    aprun -n 1 serial_executable ## OR
    ccmrun serial_executable

    It is also possible to run a script on one compute node using ccmrun when the Cluster Compatibility Mode has been invoked (-l ccm=1).

    ccmrun script_to_run

    Use aprun to launch MPI, SHMEM, OpenMP, Hybrid MPI/OpenMP, and PGAS executables. For examples of this, see MPI, SHMEM, OpenMP, Hybrid MPI/OpenMP, and PGAS (above) or look in the $SAMPLES_HOME directory on Onyx. For more information about aprun, see the aprun man page.

    6.9. Sample Script

    While it is possible to include all PBS directives at the qsub command line, the preferred method is to embed the PBS directives within the batch request script using "#PBS". The following script is a basic example and contains all of the required directives, some frequently used optional directives, and common script components. It starts 88 processes on 2 nodes of 44 cores each, with one MPI process per core. More thorough examples are available in the Onyx PBS Guide and in the Sample Code Repository ($SAMPLES_HOME) on Onyx.

    #!/bin/bash
    ## The first line (above) specifies the shell to use for parsing the
    ## remaining lines of the batch script.
    
    ## Required PBS Directives --------------------------------------
    #PBS -A Project_ID
    #PBS -q standard
    #PBS -l select=2:ncpus=44:mpiprocs=44
    #PBS -l walltime=12:00:00
    
    ## Optional PBS Directives --------------------------------------
    #PBS -l application=Application_Name
    #PBS -N Test_Run_1
    #PBS -j oe
    #PBS -S /bin/bash
    #PBS -V
    #PBS -l ccm=1
    ## ccm=1 option for Cluster Compatibility Mode, including using
    ## executables compiled with Dynamic Shared Libraries 
    
    
    ## Execution Block ----------------------------------------------
    # Environment Setup
    # cd to your personal directory in the scratch file system
    cd ${WORKDIR}
    
    # create a job-specific subdirectory based on JOBID and cd to it
    JOBID=`echo ${PBS_JOBID} | cut -d '.' -f 1`
    if [ ! -d ${JOBID} ]; then
      mkdir -p ${JOBID}
    fi
    cd ${JOBID}
    
    # Launching -----------------------------------------------------
    # copy executable from $HOME and submit it
    cp ${HOME}/my_prog.exe .
    aprun -n 88 ./my_prog.exe > my_prog.out
    
    # Clean up ------------------------------------------------------
    # archive your results
    # Using the "here document" syntax, create a job script
    # for archiving your data.
    cd ${WORKDIR}
    rm -f archive_job
    cat >archive_job <<END
    #!/bin/bash
    #PBS -l walltime=12:00:00
    #PBS -q transfer
    #PBS -A Project_ID
    #PBS -l select=1:ncpus=1
    #PBS -l application=transfer
    #PBS -j oe
    #PBS -S /bin/bash
    cd ${WORKDIR}/${JOBID}
    archive mkdir -C ${ARCHIVE_HOME} ${JOBID}
    archive put -C ${ARCHIVE_HOME}/${JOBID} *.out
    archive ls ${ARCHIVE_HOME}/${JOBID}
    
    # Remove scratch directory from the file system.
    cd ${WORKDIR}
    rm -rf ${JOBID}
    END
    
    # Submit the archive job script.
    qsub archive_job

    Additional examples are available in the Onyx PBS Guide and in the Sample Code Repository ($SAMPLES_HOME) on Onyx.

    6.10. PBS Commands

    The following commands provide the basic functionality for using the PBS batch system:

    qsub: Used to submit jobs for batch processing.
    qsub [ options ] my_job_script

    qstat: Used to check the status of submitted jobs.
    qstat PBS_JOBID ## check one job
    qstat -u my_user_name ## check all of user's jobs

    qdel: Used to kill queued or running jobs.
    qdel PBS_JOBID

    A more complete list of PBS commands is available in the Onyx PBS Guide.

    6.11. Advance Reservations

    A subset of Onyx's nodes has been set aside for use as part of the Advance Reservation Service (ARS). The ARS allows users to reserve a user-designated number of nodes for a specified number of hours starting at a specific date/time. This service enables users to execute interactive or other time-critical jobs within the batch system environment. The ARS is accessible via most modern web browsers at https://reservation.hpc.mil. Authenticated access is required. The ARS User Guide is available on HPC Centers.

    7. Software Resources

    7.1. Application Software

    A complete listing of the software versions installed on Onyx can be found on our software page. The general rule for all COTS software packages is that the two latest versions will be maintained on our systems. For convenience, modules are also available for most COTS software packages.

    7.2. Useful Utilities

    The following utilities are available on Onyx:

    Useful Utilities
    Utility      Description
    archive Perform basic file-handling operations on the archive system
    blocker Convert a file to fixed-record-length format
    bull Display the system bulletin board
    cal2jul Convert a date into the corresponding Julian day
    datecalc Print an offset from today's date in various formats
    extabs Expand tab characters to spaces in a text file
    getarchome Display the value of $ARCHIVE_HOME for a given userid
    getarchost Display the value of $ARCHIVE_HOST for a given userid
    justify Justify a character string with padding
    lss Show unprintable characters in file names
    mpibzip2 Parallel implementation of the bzip2 compression utility
    mpscp High-performance remote file copy
    node_use Display the amount of free and used memory for login nodes
    qlim Report current batch queue usage and limits
    qpeek Display spooled stdout and stderr for an executing batch job
    qview Display information about batch jobs and queues
    show_queues Report current batch queue status, usage, and limits
    show_resv Show status of advance reservations
    show_storage Display archive system allocation and usage by subproject
    show_usage Display CPU allocation and usage by subproject
    stripdos Strip DOS end-of-record control characters from a text file
    stripes Report the OST striping pattern of files in a Lustre file system
    tails Display the last five lines of one or more files
    trim Trim trailing blanks from text file lines
    vman Browse an online man page using the view command

    7.3. Sample Code Repository

    The Sample Code Repository is a directory that contains examples for COTS batch scripts, building and using serial and parallel programs, data management, and accessing and using serial and parallel math libraries. The $SAMPLES_HOME environment variable contains the path to this area, and is automatically defined in your login environment. Users should look in the $SAMPLES_HOME directory for simple examples. Examples and useful tips are periodically added to this directory.

    Sample Code Repository on Onyx

    Application_Name - Use of the application name resource.
    Sub-Directory          Description
    application_names      README and list of valid strings for application names, intended for use in every PBS script preamble. The HPCMP encourages users to denote applications not specifically named in the list as "other".

    Applications - Application-specific examples; interactive job submit scripts; software license use.
    Sub-Directory          Description
    ansys                  Sample input files and job script for the ANSYS application.
    ls_dyna                Sample input files and job script for the LS-DYNA application.
    cfd++                  Sample input files and job script for the cfd++ application.
    star-ccm+              Sample input files and job script for the Star-CCM+ application.
    matlab                 Sample input file and job script for MATLAB.
    namd                   Sample input files and job script for the NAMD application.
    nwchem                 Sample input files and job script for the NWCHEM application.
    gaussian               Job script for the Gaussian application.

    Data_Management - Archiving and retrieving files; Lustre striping; file searching; $WORKDIR use.
    Sub-Directory          Description
    OST_Stripes            Instructions and examples for striping large files on the $WORKDIR Lustre file system.

    Parallel_Environment - MPI, OpenMP, and hybrid examples; large-memory jobs; packing nodes with single-core jobs; running multiple applications within a single PBS job.
    Sub-Directory          Description
    Hello_World_Example    Example "Hello World" codes demonstrating how to compile and execute MPI, OpenMP, and hybrid MPI/OpenMP codes using PBS on the Broadwell compute nodes. Each paradigm is contained in the corresponding named subdirectory: MPI, OPENMP, HYBRID. Examples to compile and execute on KNL nodes are in $SAMPLES_HOME/Programming/KNL/.

    Parallel_IO - Tools for performing parallel I/O.
    Sub-Directory          Description
    There are currently no samples available in this category.

    Programming - Basic code compilation; debugging; use of library files; static vs. dynamic linking; Makefiles; Endian conversion.
    Sub-Directory          Description
    TotalView_Example      How to use TotalView to debug a small example code in an interactive PBS job.
    DDT_Example            How to use DDT to debug a small example code in an interactive PBS job.
    Memory_Usage           A routine callable from Fortran or C for determining how much memory a process is using.
    Timers_Fortran         Serial timers using Fortran intrinsics from FORTRAN 77 and Fortran 90/95.
    GPU/cudainfo           Program that returns information about the NVIDIA device.
    GPU/vecadd             Example CUDA program for vector addition.
    GPU/matmult            Example CUDA program for matrix multiplication.
    GPU/opencl             Example of OpenCL for the GPU.
    GPU/direct             Example of combining MPI and CUDA. Cray's implementation of MPICH2 allows GPU memory buffers to be passed directly to MPI function calls, eliminating the need to manually copy GPU data to the host before passing data through MPI.
    KNL/HYBRID             Example of the hybrid programming model on KNL. The program xthi.c shows the placement of processes on cores.

    User_Environment - Use of modules; customizing the login environment; use of common environment variables to facilitate portability of work between systems.
    Sub-Directory          Description
    Module_Swap_Example    Batch script demonstrating how to use the module command to choose other programming environments.

    Workload_Management - Basic batch scripting; use of the transfer queue; job arrays; job dependencies; Secure Remote Desktop; job monitoring; generating core/MPI process or core/MPI process-OpenMP thread associativity.
    Sub-Directory          Description
    BatchScript_Example    Simple PBS batch script showing all required preamble statements and a few optional statements; a more advanced batch script showing more optional statements and a few ways to set up PBS jobs; a description of the system hardware.
    Core_Info_Example      Description and C-language routine, suitable for Fortran and C, showing how to determine the node and core placement information for MPI, OpenMP, and hybrid MPI/OpenMP PBS jobs. The example works in the Cray, GCC, and Intel programming environments, though Intel requires use of the KMP_AFFINITY environment variable.
    Interactive_Example    C and Fortran code samples and scripts for running an interactive MPI job on Onyx.

    8. Links to Vendor Documentation

    Cray Home: http://pubs.cray.com
    XC Series Programming Environment User Guide (17.05) S-2529
    https://pubs.cray.com/content/00463350-DA/DD00397053
    CLE User Application Placement Guide S-2496-5204 S-2496
    https://pubs.cray.com/content/S-2496/CLE5.2.UP04/cle-user-application-placement-guide-s-2496-5204
    XC Series User Application Placement Guide (CLE 6.0.UP01) S-2496
    https://pubs.cray.com/content/S-2496/CLE%206.0.UP01/xctm-series-user-application-placement-guide-cle-60up01

    Novell Home: http://www.novell.com/linux
    Novell SUSE Linux Enterprise Server: http://www.novell.com/products/server

    GNU Home: http://www.gnu.org
    GNU Compiler: http://gcc.gnu.org/onlinedocs

    Intel Documentation: http://software.intel.com/en-us/intel-software-technical-documentation
    Intel Compiler List: http://software.intel.com/en-us/intel-compilers

    TotalView Documentation: http://www.roguewave.com/support/product-documentation/totalview.aspx
    DDT Tutorials: https://developer.arm.com/products/software-development-tools/hpc/arm-forge/arm-ddt/video-demos-and-tutorials-for-arm-ddt