
Cray XC40/50 (Onyx) Broadwell Process Placement

Table of Contents

1. Introduction
2. MPI Process Placement
3. OpenMP Thread Placement
   3.1. Overview
   3.2. Cray and GNU Compilers
   3.3. Intel Compiler
4. Hybrid MPI/OpenMP Process Placement

1. Introduction

This document describes the aprun command options and environment variables that control the placement of processes on a standard Onyx Broadwell compute node. The processes covered are MPI processes and OpenMP threads. One goal is process "pinning", in which a given process remains on the same physical core and therefore continues to use the same L1 and L2 cache. Another goal is to ensure that processes are distributed uniformly among the cores. The next section covers the placement of MPI processes, followed by the placement of OpenMP threads, and finally the placement of processes in a hybrid MPI/OpenMP program. Hyper-threading is discussed in each section. Note that, with regard to thread affinity, each node presents 88 logical cores: 44 physical cores, with cores 44-87 corresponding to the hyper-threads of physical cores 0-43, respectively. The 44 physical cores on each node are split across two 22-core sockets, and each socket contains a single NUMA node.
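
The layout can be confirmed directly from a compute node. The following is a minimal sketch, assuming the lscpu and numactl utilities are available on the compute-node image (run from within a batch or interactive job):

# Summarize sockets, cores per socket, threads per core, and NUMA nodes
aprun -n 1 lscpu | grep -E 'Socket|Core|Thread|NUMA'
# Show which logical CPUs belong to each NUMA node
aprun -n 1 numactl --hardware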

All of the following examples use the PBS select line of

#PBS -l select=1:ncpus=44:mpiprocs=44

even when hyper-threading is used. All examples are run on a single node. If more than one node is selected, the "-N procs_per_node" option must also be included on the aprun command line.
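
For reference, below is a minimal batch script sketch for a two-node run; the account string, queue name, and walltime are placeholders, and the "-N" option sets the number of MPI processes per node.

#!/bin/bash
#PBS -A my_project                        # placeholder account
#PBS -q standard                          # placeholder queue
#PBS -l walltime=01:00:00
#PBS -l select=2:ncpus=44:mpiprocs=44

cd $PBS_O_WORKDIR

# 2 nodes x 44 processes per node = 88 MPI processes in total
aprun -n 88 -N 44 -cc cpu ./xthi.x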

When the Cray compiler is used, setting the environment variable $CRAY_OMP_CHECK_AFFINITY to TRUE causes thread placement information to be printed to standard error whenever an OpenMP executable is run.
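
In a job script this might look like the following sketch (redirecting standard error to a file is optional):

# Ask the Cray OpenMP runtime to report where each thread is bound
export CRAY_OMP_CHECK_AFFINITY=TRUE
export OMP_NUM_THREADS=4
aprun -n 1 -cc cpu -d 4 ./xthi.x 2> affinity_report.txt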

The program used to determine process placement was xthi.c. The program can be found on Onyx in the directory $SAMPLES_HOME/Parallel_Environment/Hello_World_Example/HYBRID.
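
For reference, a sketch of how xthi.c might be built under each programming environment is shown below. The OpenMP flags are the usual ones for each compiler and are stated here as assumptions about the local defaults (with the Cray compiler, OpenMP is often enabled by default).

cp $SAMPLES_HOME/Parallel_Environment/Hello_World_Example/HYBRID/xthi.c .

# PrgEnv-cray (CCE): -h omp makes OpenMP explicit
cc -h omp xthi.c -o xthi.x

# PrgEnv-intel
cc -qopenmp xthi.c -o xthi.x

# PrgEnv-gnu
cc -fopenmp xthi.c -o xthi.x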

Please see "man aprun" for more information on aprun arguments.

2. MPI Process Placement

For the placement of MPI processes, the aprun options have the same effect for programs compiled with the Cray, Intel, or GNU compilers. (In contrast, for OpenMP thread placement, the effect of the aprun options and environment variables differs between the Cray and GNU compilers on the one hand and the Intel compiler on the other.)

The default when no "-cc" option is specified is equivalent to "-cc cpu". For MPI processes, the placement under "-cc depth" differs from "-cc cpu" only when the "-d" option is used. With "-cc depth" and a depth ("-d") value greater than 1, the processes are not pinned to specific cores, so cache is not used effectively.

The placement rules are demonstrated in the following tables. For some tables, more than one aprun command line is shown due to more than one command having the same result.

With just "-cc cpu" the MPI processes are placed on sequential cores.

PrgEnv-cray, PrgEnv-intel or PrgEnv-gnu
aprun -n 4 -cc cpu ./xthi.x

Rank Core
0 0
1 1
2 2
3 3

Using "-S 2", two processes are placed on the second socket.

PrgEnv-cray, PrgEnv-intel or PrgEnv-gnu
aprun -n 4 -S 2 -cc cpu ./xthi.x

Rank Core
0 0
1 1
2 22
3 23

Using "-cc cpu" and "-d 11", processes are scattered over the cores.

PrgEnv-cray, PrgEnv-intel or PrgEnv-gnu
aprun -n 4 -d 11 -cc cpu ./xthi.x
aprun -n 4 -d 11 -S 2 -cc cpu ./xthi.x

Rank Core
0 0
1 11
2 22
3 33

Several different aprun commands will uniformly distribute MPI processes over the cores. Since "-d" is not specified, "-cc cpu" and "-cc depth" give the same result.

PrgEnv-cray, PrgEnv-intel or PrgEnv-gnu
aprun -n 44 -cc cpu ./xthi.x
aprun -n 44 -S 22 -cc cpu ./xthi.x
aprun -n 44 -cc depth ./xthi.x
aprun -n 44 -S 22 -cc depth ./xthi.x

Rank Core
0 0
1 1
... ...
42 42
43 43

Below are two aprun commands that use hyper-threading ("-j 2") to uniformly place one MPI process per hyper-thread.

PrgEnv-cray, PrgEnv-intel or PrgEnv-gnu
aprun -n 88 -j 2 -cc cpu ./xthi.x
aprun -n 88 -j 2 -S 44 -cc cpu ./xthi.x

Rank Core (Hyper-thread)
0 0 (0)
1 0 (44)
2 1 (1)
3 1 (45)
... ...
86 43 (43)
87 43 (87)

3. OpenMP Thread Placement

3.1. Overview

To examine OpenMP thread placement on Broadwell nodes, we set the number of MPI processes to 1 and vary the value of $OMP_NUM_THREADS. For thread placement, the effect of the aprun options and environment variables is the same for the Cray and GNU compilers but differs for the Intel compiler.
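
In all of the examples below, the "-d" value matches $OMP_NUM_THREADS. One optional convenience (a style choice, not a requirement) is to pass the variable itself to "-d" so the two values cannot drift apart:

export OMP_NUM_THREADS=4
# Keep the aprun depth in sync with the thread count
# (use "-cc cpu" for Cray/GNU, "-cc depth" for Intel, as described below)
aprun -n 1 -cc cpu -d $OMP_NUM_THREADS ./xthi.x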

3.2. Cray and GNU Compilers

For the Cray and GNU compilers, the best choice for OpenMP is "-cc cpu" with the "-d" option set to the number of OpenMP threads.

PrgEnv-cray or PrgEnv-gnu
export OMP_NUM_THREADS=4
aprun -n 1 -cc cpu -d 4 ./xthi.x

Thread Core
0 0
1 1
2 2
3 3

PrgEnv-cray or PrgEnv-gnu
export OMP_NUM_THREADS=44
aprun -n 1 -cc cpu -d 44 ./xthi.x

Thread Core
0 0
1 1
... ...
42 42
43 43

For 88 OpenMP threads, the best distribution utilizes "-j 2" and "-d 88".

PrgEnv-cray or PrgEnv-gnu
export OMP_NUM_THREADS=88
aprun -n 1 -cc cpu -d 88 -j 2 ./xthi.x

Thread Core (Hyper-thread)
0 0 (0)
1 0 (44)
2 1 (1)
3 1 (45)
... ...
86 43 (43)
87 43 (87)

3.3. Intel Compiler

For the Intel compiler, the best choice for OpenMP is "-cc depth" with the "-d" option set to the number of OpenMP threads, combined with the environment variable $KMP_AFFINITY, which can be set to "compact" or "scatter". The difference between the two values becomes apparent when all cores on the node are used.
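
To see the Intel runtime's binding decisions directly, the "verbose" modifier of $KMP_AFFINITY can be added; a minimal sketch:

# Print the Intel OpenMP runtime's thread bindings to standard error
export KMP_AFFINITY=verbose,scatter
export OMP_NUM_THREADS=44
aprun -n 1 -cc depth -d 44 ./xthi.x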

PrgEnv-intel
export OMP_NUM_THREADS=4
export KMP_AFFINITY=compact or export KMP_AFFINITY=scatter
aprun -n 1 -cc depth -d 4 ./xthi.x

Thread Core
0 0
1 1
2 2
3 3

When the number of OpenMP threads equals the number of cores, threads can be distributed evenly over the cores in two ways, depending on the value of $KMP_AFFINITY.

PrgEnv-intel
export OMP_NUM_THREADS=44
export KMP_AFFINITY=compact
aprun -n 1 -cc depth -d 44 ./xthi.x

Thread Core
0 0
1 1
... ...
42 42
43 43

PrgEnv-intel
export OMP_NUM_THREADS=44
export KMP_AFFINITY=scatter
aprun -n 1 -cc depth -d 44 ./xthi.x

Thread Core
0 0
1 22
2 1
3 23
... ...
42 21
43 43

For 88 OpenMP threads, the best distribution utilizes "-j 2" and "-d 88". The threads are not pinned to a single hyper-thread but instead mapped to pairs of hyper-threads.

PrgEnv-intel
export OMP_NUM_THREADS=88
export KMP_AFFINITY=compact
aprun -n 1 -cc depth -d 88 -j 2 ./xthi.x

Thread Core (Hyper-thread)
0 0 (0,44)
1 0 (0,44)
2 1 (1,45)
3 1 (1,45)
... ...
86 43 (43,87)
87 43 (43,87)

PrgEnv-intel
export OMP_NUM_THREADS=88
export KMP_AFFINITY=scatter
aprun -n 1 -cc depth -d 88 -j 2 ./xthi.x

Thread Core (Hyper-thread)
0 0 (0,44)
1 22 (22,66)
2 1 (1,45)
3 23 (23,67)
... ...
86 21 (21,65)
87 43 (43,87)

4. Hybrid MPI/OpenMP Process Placement

For hybrid testing, we illustrate 11 MPI processes with 4 OpenMP threads each, and 4 MPI processes with 11 OpenMP threads each; both combinations fill the 44 physical cores (11 x 4 = 4 x 11 = 44). In addition, twice the number of threads per process is considered when hyper-threading is enabled, filling all 88 hyper-threads.

For the Cray or GNU compiler, the "-cc cpu" option is needed, whereas for the Intel compiler, the "-cc depth" option needs to be used along with the environment variable $KMP_AFFINITY set to either "compact" or "scatter".
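
Putting these pieces together, below is a minimal hybrid job script sketch for the Intel case; the account string, queue name, and walltime are placeholders, and the Cray/GNU version would use "-cc cpu" and omit $KMP_AFFINITY.

#!/bin/bash
#PBS -A my_project                        # placeholder account
#PBS -q standard                          # placeholder queue
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=44:mpiprocs=44

cd $PBS_O_WORKDIR

# 11 MPI ranks x 4 OpenMP threads = 44 processes, one per physical core
export OMP_NUM_THREADS=4
export KMP_AFFINITY=compact
aprun -n 11 -cc depth -d 4 ./xthi.x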

PrgEnv-cray or PrgEnv-gnu
export OMP_NUM_THREADS=4
aprun -n 11 -cc cpu -d 4 ./xthi.x

PrgEnv-intel
export OMP_NUM_THREADS=4
export KMP_AFFINITY=compact or export KMP_AFFINITY=scatter
aprun -n 11 -cc depth -d 4 ./xthi.x

Rank Thread Core
0 0 0
0 1 1
0 2 2
0 3 3
... ... ...
10 0 40
10 1 41
10 2 42
10 3 43

When hyper-threading is used with the Cray or GNU compilers, the assignment of physical cores to ranks follows the same pattern as without hyper-threading.

PrgEnv-cray or PrgEnv-gnu
export OMP_NUM_THREADS=8
aprun -n 11 -cc cpu -d 8 -j 2 ./xthi.x

Rank Thread Core (Hyper-thread)
0 0 0 (0)
0 1 0 (44)
0 2 1 (1)
0 3 1 (45)
... ... ...
10 4 42 (42)
10 5 42 (86)
10 6 43 (43)
10 7 43 (87)

For the Intel compiler, the threads are not pinned to a single hyper-thread but instead mapped to pairs of hyper-threads.

PrgEnv-intel
export OMP_NUM_THREADS=8
export KMP_AFFINITY=compact
aprun -n 11 -cc depth -d 8 -j 2 ./xthi.x

Rank Thread Core (Hyper-thread)
0 0 0 (0,44)
0 1 0 (0,44)
0 2 1 (1,45)
0 3 1 (1,45)
... ... ...
10 4 42 (42,86)
10 5 42 (42,86)
10 6 43 (43,87)
10 7 43 (43,87)

PrgEnv-intel
export OMP_NUM_THREADS=8
export KMP_AFFINITY=scatter
aprun -n 11 -cc depth -d 8 -j 2 ./xthi.x

Rank Thread Core (Hyper-thread)
0 0 0 (0,44)
0 1 1 (1,45)
0 2 2 (2,46)
0 3 3 (3,47)
0 4 0 (0,44)
0 5 1 (1,45)
0 6 2 (2,46)
0 7 3 (3,47)
... ... ...
10 0 40 (40,84)
10 1 41 (41,85)
10 2 42 (42,86)
10 3 43 (43,87)
10 4 40 (40,84)
10 5 41 (41,85)
10 6 42 (42,86)
10 7 43 (43,87)

Now consider the opposite hybrid example: fewer MPI processes combined with more OpenMP threads per process.

PrgEnv-cray or PrgEnv-gnu
export OMP_NUM_THREADS=11
aprun -n 4 -cc cpu -d 11 ./xthi.x

PrgEnv-intel
export OMP_NUM_THREADS=11
export KMP_AFFINITY=compact or export KMP_AFFINITY=scatter
aprun -n 4 -cc depth -d 11 ./xthi.x

Rank Thread Core (Hyper-thread)
0 0 0 (0)
0 1 1 (1)
0 2 2 (2)
0 3 3 (3)
... ... ...
3 7 40 (40)
3 8 41 (41)
3 9 42 (42)
3 10 43 (43)

As before, when hyper-threading is used with the Cray or GNU compilers, the assignment of physical cores to ranks follows the same pattern as without hyper-threading.

PrgEnv-cray or PrgEnv-gnu
export OMP_NUM_THREADS=22
aprun -n 4 -cc cpu -d 22 -j 2 ./xthi.x

Rank Thread Core (Hyper-thread)
0 0 0 (0)
0 1 0 (44)
0 2 1 (1)
0 3 1 (45)
... ... ...
3 18 42 (42)
3 19 42 (86)
3 20 43 (43)
3 21 43 (87)

For the Intel compiler, the threads are not pinned to a single hyper-thread but instead mapped to pairs of hyper-threads.

PrgEnv-intel
export OMP_NUM_THREADS=22
export KMP_AFFINITY=compact
aprun -n 4 -cc depth -d 22 -j 2 ./xthi.x

Rank Thread Core (Hyper-thread)
0 0 0 (0,44)
0 1 0 (0,44)
0 2 1 (1,45)
0 3 1 (1,45)
... ... ...
3 18 42 (42,86)
3 19 42 (42,86)
3 20 43 (43,87)
3 21 43 (43,87)

PrgEnv-intel
export OMP_NUM_THREADS=22
export KMP_AFFINITY=scatter
aprun -n 4 -cc depth -d 22 -j 2 ./xthi.x

Rank Thread Core (Hyper-thread)
0 0 0 (0,44)
0 1 1 (1,45)
0 2 2 (2,46)
0 3 3 (3,47)
... ... ...
0 11 0 (0,44)
0 12 1 (1,45)
0 13 2 (2,46)
0 14 3 (3,47)
... ... ...
3 18 40 (40,84)
3 19 41 (41,85)
3 20 42 (42,86)
3 21 43 (43,87)