Cray XC40/50 (Onyx)
KNL Quick Start Guide

Table of Contents

1. Knights Landing (KNL) Nodes
2. How to Compile and Submit a KNL Job
3. High Bandwidth MCDRAM
4. Cluster Modes
5. Memory Modes
6. Thread Distribution Among Cores
7. Setting KNL Modes (Provisioning)
8. Profiling
9. Additional Optimization Advice

1. Knights Landing (KNL) Nodes

The Knights Landing (KNL) nodes on Onyx are similar to the standard compute nodes, which have Broadwell processors. The Knights Landing Xeon Phi is not an attached coprocessor, unlike the Knights Corner found on Armstrong, Conrad, Gordon, and Thunder. On the KNL nodes, the MPI, OpenMP, or hybrid (MPI with OpenMP) programming models can be used.

A comparison of the floating-point capabilities of the standard compute nodes, with Xeon Broadwell processors, versus the KNL nodes shows the need for vectorizable code. The KNL nodes have 64 cores per node, and each core has two vector processing units (VPUs), each with 8 double-precision lanes. In contrast, each standard compute node has 44 cores (two Broadwell sockets), and each core has 4 double-precision lanes for vector processing. The clock speed of the KNL cores is 1.3 GHz, whereas the Broadwell cores have a clock speed of 2.8 GHz. Hence, in order to take advantage of the KNL nodes, a code should be highly vectorizable. In most cases, MPI without multithreading (without OpenMP) gives high performance using 64 ranks per node. Hybrid (MPI with OpenMP) programs can improve performance when attention is given to data locality between cores and memory (both main memory and cache).

Another feature of the KNL nodes is that, in addition to 96 GBytes of main memory (called DDR memory), 16 GBytes of high-bandwidth memory (called MCDRAM) is available. More information is given in Section 3 of this document.

This KNL Quick Start Guide does not describe optimization techniques in detail. The KNL architecture and program optimization techniques are described only briefly, to motivate the compilation and runtime options.

2. How to Compile and Submit a KNL Job

Programs that run on the standard compute nodes can run on the KNL nodes without source changes, but they must be recompiled. Before compiling, change the programming environment using the command:

module swap craype-broadwell craype-mic-knl

The user does not need to set any special compiler flags when using the compiler wrappers ftn, cc, and CC. With the craype-mic-knl module loaded, the compiler wrappers supply the appropriate KNL flags for all three programming environments: PrgEnv-cray, PrgEnv-intel, and PrgEnv-gnu.

In addition, a huge page module should be loaded. This is particularly important for large arrays, so that array accesses do not often cross a virtual memory page boundary. For static executables, the page size at runtime (specified in the job script) can differ from the page size used during compilation. For dynamic executables (which use shared libraries), the page size should be consistent between compilation and runtime. The default page size when compiling on the login nodes is 4 KBytes. The default page size on the KNL nodes is expected to be 2 MBytes, corresponding to the module craype-hugepages2M. In that case, before compiling, use the command:

module load craype-hugepages2M

The user can try other page sizes for compilation and runtime. The page sizes range from 2 MBytes to 512 MBytes. The available page sizes can be seen with the command:

module avail craype-hugepages
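
For example, to try 16-MByte pages instead of the 2-MByte default, the following sketch could be used (it assumes craype-hugepages16M appears in the listing above):

# On the login node, before compiling, swap to the larger page size:
module swap craype-hugepages2M craype-hugepages16M

# In the job script, load the same module (required for a dynamic executable):
module load craype-hugepages16M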

For the Cray compiler, which is the default, the module and compile/link commands are:

module swap craype-broadwell craype-mic-knl
module load craype-hugepages2M
cc -o my_prog.exe my_c_code.c
CC -o my_prog.exe my_c++_code.cpp
ftn -o my_prog.exe my_fortran_code.F90

For the Intel compiler, the module and compile/link commands are:

module swap PrgEnv-cray PrgEnv-intel
module swap craype-broadwell craype-mic-knl
module load craype-hugepages2M
cc -o my_prog.exe -qopenmp my_c_code.c
CC -o my_prog.exe -qopenmp my_c++_code.cpp
ftn -o my_prog.exe -qopenmp my_fortran_code.F90
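
The GNU programming environment follows the same pattern. A sketch (the -fopenmp flag, the GNU equivalent of -qopenmp, is needed only if OpenMP is used):

module swap PrgEnv-cray PrgEnv-gnu
module swap craype-broadwell craype-mic-knl
module load craype-hugepages2M
cc -o my_prog.exe -fopenmp my_c_code.c
CC -o my_prog.exe -fopenmp my_c++_code.cpp
ftn -o my_prog.exe -fopenmp my_fortran_code.F90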

In order to request a KNL node, the PBS job script should have the following line:

#PBS -l select=NUMNODES:ncpus=64:mpiprocs=64:nmics=1

A complete example can be found in the directory $SAMPLES_HOME/Programming/KNL/HYBRID.
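
For reference, a minimal KNL batch script might look like the following sketch; the project ID, queue name, walltime, and executable name are placeholders to replace with your own values.

#!/bin/bash
## Placeholders: substitute your own project ID and queue name.
#PBS -A Project_ID
#PBS -q standard
#PBS -l select=2:ncpus=64:mpiprocs=64:nmics=1
#PBS -l walltime=01:00:00
#PBS -j oe

cd $PBS_O_WORKDIR

# Use the same modules that were used at compile time.
module swap craype-broadwell craype-mic-knl
module load craype-hugepages2M

# Pure MPI: 64 ranks on each of the 2 requested KNL nodes.
export OMP_NUM_THREADS=1
aprun -n 128 -N 64 -d1 -j1 -cc depth ./my_prog.exe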

3. High Bandwidth MCDRAM

A key feature of the KNL processor is the availability of 16 GBytes of high-bandwidth memory, called MCDRAM. In addition, each node has 90 GBytes of DDR memory. (Each node has 96 GBytes of physical DDR memory, of which approximately 90 GBytes is available for user programs.) The latencies of MCDRAM and DDR are similar: DDR memory has a latency of around 125 nanoseconds, whereas MCDRAM has a latency of around 150-175 nanoseconds, depending on the memory mode. On the other hand, MCDRAM can have a bandwidth as much as five times that of DDR memory (when used in flat mode).

Several memory modes are available; the default is "cache", in which the MCDRAM is not visible to the program but serves as a cache for the DDR memory. In this mode, the effective bandwidth of main memory accesses is enhanced for data cached in MCDRAM. The MCDRAM is not a low-latency L3 cache: the KNL processor has only L1 and L2 caches, and no L3 cache. Each core has a 32-KByte L1 cache, and each pair of cores (called a "tile") shares a 1-MByte L2 cache.

Another memory mode is "flat", for which the MCDRAM is addressable and does not serve as cache for the DDR memory. In flat mode, the most heavily used arrays can be assigned to the MCDRAM memory.

When OpenMP is used, memory should be allocated in a bank near the core that will use it. Briefly, memory is actually allocated when it is first touched, that is, when values are first assigned. Therefore, the OpenMP threads that initialize the memory should be the same threads that later use it.

4. Cluster Modes

The default cluster mode on Onyx is "quadrant". Despite the name implying four parts, a KNL node in quadrant mode is a single Non-Uniform Memory Access (NUMA) node. That is, memory access speed is uniform over cores in the quadrant mode. (The "quadrant" term refers to the organization of tags that indicate which memory port to use for a given memory page, an aspect that is transparent to the user.) On the other hand, the cluster mode "SNC4" results in four NUMA nodes per KNL node. The cluster mode SNC2 (two NUMA nodes) is not available on Onyx in order to reduce the amount of mode switching.

The aprun option "-S N" specifies the number of processes, N, per NUMA node. For 64 MPI processes per KNL node, in SNC4 mode, the processes can be uniformly distributed using the following aprun command.

aprun -n total-ranks -N 64 -S 16[ ... ]

5. Memory Modes

The default memory mode on Onyx is "cache". A node can be rebooted into one of three memory modes: 100%, 50%, or 0% of MCDRAM dedicated to cache. (A fourth memory mode, 25%, is not available on Onyx in order to reduce the amount of mode switching.) The last mode is called "flat". Many applications run well with 100% cache. If all of an application's data fits within 16 GBytes, the flat mode can improve performance. For the "quadrant" cluster mode, to place all data in MCDRAM when not in cache mode, launch the executable under numactl as follows:

aprun [ ... ] numactl --membind=1 ./myprog.exe

When "membind" is used, the program will fail if more than 16 GBytes are needed. If the program might use more than 16 GBytes, yet use of MCDRAM is preferred, the following aprun option can be used:

aprun [ ... ] numactl --preferred=1 ./myprog.exe
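
As a concrete illustration, a pure-MPI flat-mode launch that prefers MCDRAM might look like the following sketch (the rank count and affinity option are only examples):

export OMP_NUM_THREADS=1
aprun -n 64 -N 64 -d1 -j1 -cc depth numactl --preferred=1 ./my_prog.exe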

The argument of "1" for "--membind" or "--preferred" refers to MCDRAM. The DDR memory would be referred-to by an argument of "0". The portion of memory assigned to cache is invisible to numactl. For SNC4, the memory partitions are more complex and will not be described in the quick start guide. The memory partitions can be seen using the command:

numactl --hardware (numactl -H)

Recall that for a Cray supercomputer, an interactive PBS job does not put the user onto a compute node, but rather, puts the user on a batch node. Similarly, a batch script is interpreted by a batch node. So the command "numactl -H" that is run in an interactive session or a batch script would not show you the configuration of a KNL node. To obtain information about a KNL compute node, use

aprun -n 1 -b numactl -H

The "-b" option means that the executable "numactl" is found on the KNL node, and does not need to be transferred from the file system.

6. Thread Distribution Among Cores

Several examples will be given of aprun options that show how — when using the hybrid programming model — MPI ranks and OpenMP threads can be distributed among cores. The KNL processor can support up to four hyper-threads per core. The first 64 hyper-threads are numbered 0-63. These are hardware threads, which are different from OpenMP threads. The hyper-threads can also be considered cores. The operating system sees 4 x 64 = 256 cores numbered 0-255. (Try, for example, the command cat /proc/cpuinfo.) Hyper-thread 64 (core 64) runs on the first core, sharing resources with hyper-thread 0. Similarly, hyper-thread 127 runs on the last core and shares resources with hyper-thread 63. The third and fourth sets of hyper-threads are numbered 128-191 and 192-255, respectively. In order to use two-way hyper-threading, the aprun option is "-j2". To use all four hyper-threads, the aprun option is "-j4". If all 64 cores are active, two-way or four-way hyper-threading may not be efficient because the 64 processes may saturate the memory bandwidth. In some cases, however, hyper-threading can hide memory latency: some hyper-threads may be waiting for data from memory while other hyper-threads are able to use the computational resources.

The program $SAMPLES_HOME/Programming/KNL/HYBRID/xthi.c, which shows thread affinity, will be used to show how various aprun options affect the placement of ranks and threads. The following applies to the Cray compiler. Consider a pure MPI job with one thread per core, as follows:

export OMP_NUM_THREADS=1
aprun -n 64 -N 64 -d1 -j1 -cc depth ./xthi.x

MPI rank 0 is placed on core 0 and MPI rank 63 is placed on core 63. Using "-cc none"

aprun -n 64 -N 64 -d1 -j1 -cc none ./xthi.x

results in no affinity. Every rank has core affinity 0-255. For a pure MPI job, using "-cc cpu"

aprun -n 64 -N 64 -d1 -j1 -cc cpu ./xthi.x

has the same effect as "-cc depth". For quadrant cluster mode, using "-cc numa_node" is similar to "-cc none"

aprun -n 64 -N 64 -d1 -j1 -cc numa_node ./xthi.x

results in every rank having core affinity 0-63. While it may not be practical to use more than 64 MPI ranks per node, let us see the result. The PBS select line would still have mpiprocs=64, even though there are more ranks per node. In the table below, the "Hyper-thread" column refers to the logical cores seen by /proc/cpuinfo (shown there as "processor"), which reports 256 cores on the node, numbered 0-255.

export OMP_NUM_THREADS=1
aprun -n 128 -d1 -j2 -cc depth ./xthi.x

Process Placement
Rank   Hyper-thread   Physical Core
  0          0              0
  1         64              0
  2          1              1
  3         65              1
 ...
 64         32             32
 65         96             32
 66         33             33
 67         97             33
 ...
127        127             63

If -N 64 was used in the above command, only the first 32 physical cores would be used; rank 64 would be placed on core 0. Now let us consider the command lines:

export OMP_NUM_THREADS=1
aprun -n 256 -d1 -j4 -cc depth ./xthi.x

Process Placement
Rank   Hyper-thread   Physical Core
  0          0              0
  1         64              0
  2        128              0
  3        192              0
 ...
 64         16             16
 65         80             16
 66        144             16
 67        208             16
 ...
128         32             32
129         96             32
 ...
255        255             63

If -N 128 was used in the above command, only the first 32 physical cores would be used; rank 128 would be placed on core 0. Now let us consider two threads per core. The value of "-n 32" used in the example below could be more than 32 if more than one node was selected. In this example, MPI is running on only one node. The choice of "-N 32" puts no more than 32 MPI ranks per node.

export OMP_NUM_THREADS=2
aprun -n 32 -N 32 -d2 -j1 -cc depth ./xthi.x

Process Placement
Rank   OMP Thread   Hyper-thread   Physical Core
  0        0           0, 1           0, 1
  0        1           0, 1           0, 1
  1        0           2, 3           2, 3
  1        1           2, 3           2, 3
 ...
 31        0          62, 63         62, 63
 31        1          62, 63         62, 63

export OMP_NUM_THREADS=2
aprun -n 32 -N 32 -d2 -j1 -cc cpu ./xthi.x

Process Placement
Rank   OMP Thread   Hyper-thread   Physical Core
  0        0             0               0
  0        1             1               1
  1        0             2               2
  1        1             3               3
 ...
 31        0            62              62
 31        1            63              63

In the above two examples, there is only one OpenMP thread per physical core. With "-cc depth", each rank's two threads share the same two cores, whereas with "-cc cpu" each thread is bound to a single core.

Suppose we want more than one thread on each physical core.

export OMP_NUM_THREADS=4
aprun -n 32 -N 32 -d4 -j2 -cc depth ./xthi.x

Process Placement
Rank   OMP Thread   Hyper-thread        Physical Core
  0        0        0, 1, 64, 65            0, 1
  0        1        0, 1, 64, 65            0, 1
  0        2        0, 1, 64, 65            0, 1
  0        3        0, 1, 64, 65            0, 1
  1        0        2, 3, 66, 67            2, 3
  1        1        2, 3, 66, 67            2, 3
  1        2        2, 3, 66, 67            2, 3
  1        3        2, 3, 66, 67            2, 3
 ...
 31        0        62, 63, 126, 127       62, 63
 31        1        62, 63, 126, 127       62, 63
 31        2        62, 63, 126, 127       62, 63
 31        3        62, 63, 126, 127       62, 63

export OMP_NUM_THREADS=4
aprun -n 32 -N 32 -d4 -j2 -cc cpu ./xthi.x

Process Placement
Rank   OMP Thread   Hyper-thread   Physical Core
  0        0             0               0
  0        1            64               0
  0        2             1               1
  0        3            65               1
  1        0             2               2
  1        1            66               2
  1        2             3               3
  1        3            67               3
 ...
 31        0            62              62
 31        1           126              62
 31        2            63              63
 31        3           127              63

When a rank spans just two physical cores, choosing a core affinity that gives each thread a range of cores, as in the "-cc depth" example, may be better than "-cc cpu", because each pair of cores shares one L2 cache.

An example with few MPI processes per node and heavy use of OpenMP threads is shown below. Notice that the depth, "-d 64", is greater than the distance between MPI ranks. It bears repeating that the total number of MPI processes could be, for example, "-n 100" if "select=50".

export OMP_NUM_THREADS=64
aprun -n 2 -N 2 -d 64 -j2 -cc depth ./xthi.x

Process Placement
Rank   OMP Thread   Hyper-thread      Physical Core
  0      0-63        0-31, 64-95          0-31
  1      0-63        32-63, 96-127       32-63

There can be more threads than cores.

export OMP_NUM_THREADS=128
aprun -n 2 -N 2 -d 128 -j4 -cc depth ./xthi.x

Process Placement
Rank   OMP Thread   Hyper-thread                         Physical Core
  0      0-127       0-31, 64-95, 128-159, 192-223           0-31
  1      0-127       32-63, 96-127, 160-191, 224-255         32-63

For cluster mode SNC4 there are four NUMA nodes, and the "-S" option should be used to distribute MPI ranks between the NUMA nodes. As an example, consider the case of 64 ranks per node. For SNC4, the aprun command should include:

-N 64 -S 16
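
Putting it together, a full launch line for two SNC4 nodes might look like the following sketch (the total rank count assumes select=2):

aprun -n 128 -N 64 -S 16 -d1 -j1 -cc depth ./my_prog.exe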

Another aprun option to consider is "-r N", where N is a small number (such as 1). This reserves threads for MPI work and kernel processes, thereby reducing jitter in the computational processes. On KNL, a Cray expert reports that this has a very beneficial effect at large node counts for MPI_Allreduce() and other latency-dependent collectives. The threads used for "-r" must be taken from the computation, for example:

aprun -r 4 -n 16 -N 16 -d 8 -j2

For the Intel compiler, $KMP_PLACE_THREADS and $KMP_AFFINITY can be used to control thread affinity in combination with "-cc none". With OpenMP version 4, the user can control the allocation of hardware threads using $OMP_PLACES and can pin threads to hardware using $OMP_PROC_BIND. See the online document http://software.intel.com/en-us/articles/process-and-thread-affinity-for-intel-xeon-phi-processors-x200. For Intel compilers before version 17, each node had a helper thread, which made thread placement more complicated. For Intel version 17, there is no helper thread.
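
For example, under PrgEnv-intel the OpenMP 4 runtime can be left to handle the binding; a sketch (the thread count and place values are only illustrative):

export OMP_NUM_THREADS=4
export OMP_PLACES=cores
export OMP_PROC_BIND=close
aprun -n 16 -N 16 -d 4 -j1 -cc none ./xthi.x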

The program $SAMPLES_HOME/Programming/KNL/HYBRID/xthi.c can be used to see the thread affinity. Another way to see thread affinity, when using CCE, is to set the environment variable:

export CRAY_OMP_CHECK_AFFINITY=TRUE

A Cray KNL expert reports that getting affinity wrong can result in a 2 times (or more) slowdown.

7. Setting KNL Modes (Provisioning)

At the start of a batch job, the user can specify the cluster mode and memory mode of the KNL nodes assigned to the job. Three memory modes are available: MCDRAM used entirely as cache (denoted "100"), flat mode, in which MCDRAM is an addressable memory bank (denoted "0"), and a mode in which half of the MCDRAM is used as cache and half as an addressable bank, as in flat mode (denoted "50"). Two cluster modes are available: quadrant (denoted "quad") and SNC4 (denoted "snc4"). To be more precise, the part of MCDRAM that is not used for cache is seen as one memory bank in quad mode and as four memory banks in snc4 mode. The text strings used to denote the six available modes are: quad_100, quad_50, quad_0, snc4_100, snc4_50, and snc4_0.

Users can choose a mode by adding ":aoe=mode" to the select line of the batch script, for example

#PBS -l select=1:ncpus=64:mpiprocs=64:nmics=1:aoe=quad_0

If a batch script does not contain the aoe parameter, the default will be quad_100.
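
For example, to request flat quadrant mode and keep a program's data in MCDRAM, a job might combine the aoe setting with the numactl usage from Section 5; a sketch:

#PBS -l select=1:ncpus=64:mpiprocs=64:nmics=1:aoe=quad_0
[ ... ]
aprun -n 64 -N 64 -d1 -j1 -cc depth numactl --membind=1 ./my_prog.exe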

Keep in mind that it can take as long as 20 minutes to reboot a node into the desired mode, unless a free KNL node is already available in that mode.

8. Profiling

The Cray compiler can add hooks to the executable that are understood by the Cray profiler, "Perftools". The Intel compiler can add hooks to the executable that are understood by VTune and other Intel tools.

9. Additional Optimization Advice

This section provides additional advice related to optimization.

9.1. OMP Wait Policy

The environment variable $OMP_WAIT_POLICY specifies whether waiting threads should be active or passive. If the value is "passive", waiting threads should not consume CPU cycles while waiting; the value "active" specifies that they may spin. The default Cray $OMP_WAIT_POLICY is "active"; the default Intel $OMP_WAIT_POLICY is "passive". The value "passive" is usually best for high-level threading, whereas the value "active" is usually best for loop-level threading. For the Intel programming environment, the environment variable $KMP_BLOCKTIME controls how many milliseconds a thread spins before it goes to sleep.
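
For example, a job script might set the following (the values shown are only illustrative):

# Ask waiting threads to sleep rather than spin (CCE defaults to "active"):
export OMP_WAIT_POLICY=passive

# Intel only: milliseconds a thread spins after finishing its work before
# sleeping; 0 means sleep immediately.
export KMP_BLOCKTIME=0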