1. Knights Landing (KNL) Nodes

The Knights Landing (KNL) nodes on Onyx are similar to the standard compute nodes, which have Broadwell processors. The Knights Landing Xeon Phi is not an attached coprocessor; it is a standalone, self-hosted node. On the KNL nodes, the MPI, OpenMP, or hybrid (MPI with OpenMP) programming models can be used.

A comparison of the floating-point capabilities of the standard compute nodes, with Xeon Broadwell processors, versus the KNL nodes shows the need for vectorizable code. The KNL nodes have 64 cores per node, and each core has two vector processing units (VPUs), each with 8 double-precision lanes. In contrast, each standard compute node has 44 cores (two Broadwell sockets), and each core has 4 double-precision lanes for vector processing. The clock speed of the KNL cores is 1.3 GHz, whereas the Broadwell cores have a clock speed of 2.8 GHz. Hence, to take advantage of the KNL nodes, a code should be highly vectorizable. In most cases, MPI without multithreading (without OpenMP) gives high performance using 64 ranks per node. Hybrid (MPI with OpenMP) programs can improve performance when attention is given to data locality between cores and memory (both main memory and cache).

Another feature of the KNL nodes is that in addition to 96 GB of main memory (called DDR memory), 16 GB of high-bandwidth memory (called MCDRAM memory) is available. More information is given in section 3 of this document.

This KNL Quick Start Guide does not describe optimization techniques in detail; the KNL architecture and program optimization techniques are described only briefly, to motivate the compilation and runtime options.

2. How to Compile and Submit a KNL Job

Programs that run on the standard compute nodes can run on the KNL nodes with no source changes, though they must be recompiled. It is necessary to change the programming environment using the command:

module swap craype-broadwell craype-mic-knl

The user does not need to set any special compiler flags when using the compiler wrappers ftn, cc, and CC. With the craype-mic-knl module, compiler wrappers will have the appropriate KNL flags for all three overall programming environments: PrgEnv-cray, PrgEnv-intel, and PrgEnv-gnu.

In addition, a huge page module should be loaded. This is particularly important for large arrays, so that array accesses do not often cross a virtual memory page boundary. For static executables, the page size chosen at runtime (specified in the job script) can differ from the page size used during compilation. For dynamic executables (which use shared libraries), the page size should be consistent between compilation and runtime. The default page size when compiling on the login nodes is 4 KB. The default page size on the KNL nodes is expected to be 2 MB, corresponding to the module craype-hugepages2M, in which case the following command should be used before compiling:

module load craype-hugepages2M

The user can try other page sizes for compilation and runtime. The page sizes range from 2 MB to 512 MB. The available page sizes can be seen with the command:

module avail craype-hugepages

For the Cray compiler, which is the default, the module and compile/link commands are:

module swap craype-broadwell craype-mic-knl
module load craype-hugepages2M
cc -o my_prog.exe my_c_code.c
CC -o my_prog.exe my_c++_code.cpp
ftn -o my_prog.exe my_fortran_code.F90

For the Intel compiler, the module and compile/link commands are:

module swap PrgEnv-cray PrgEnv-intel
module swap craype-broadwell craype-mic-knl
module load craype-hugepages2M
cc -o my_prog.exe -qopenmp my_c_code.c
CC -o my_prog.exe -qopenmp my_c++_code.cpp
ftn -o my_prog.exe -qopenmp my_fortran_code.F90

In order to request a KNL node, the PBS job script should have the following line:

#PBS -l select=NUMNODES:ncpus=64:mpiprocs=64:nmics=1

A complete example can be found in the directory $SAMPLES_HOME/Programming/KNL/HYBRID.
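The pieces above can be combined into a minimal batch script sketch. The account string, queue name, walltime, and node count below are placeholders, not values prescribed by this guide:

```shell
#!/bin/bash
#PBS -A your_project_id                          # placeholder account
#PBS -q standard                                 # placeholder queue
#PBS -l walltime=01:00:00
#PBS -l select=2:ncpus=64:mpiprocs=64:nmics=1

cd $PBS_O_WORKDIR

# Match the environment used at compile time: KNL target plus the
# 2-MB huge page module, so the runtime page size is consistent.
module swap craype-broadwell craype-mic-knl
module load craype-hugepages2M

# 2 nodes x 64 ranks per node = 128 total MPI ranks
aprun -n 128 -N 64 ./my_prog.exe
```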

3. High Bandwidth MCDRAM

A key feature of the KNL processor is the availability of 16 GB of high-bandwidth memory, called MCDRAM memory. In addition, each node has 90 GB of DDR memory. (Each node has 96 GB of physical DDR memory, of which approximately 90 GB is available for user programs.) The latencies of MCDRAM and DDR are similar: DDR memory has a latency of around 125 nanoseconds, whereas MCDRAM memory has a latency of around 150-175 nanoseconds, depending on the memory mode. On the other hand, MCDRAM memory can have a bandwidth as much as five times that of DDR memory (when used in flat mode).

Several memory modes are available; the default is "cache", in which the MCDRAM is not visible to the program but serves as a cache for the DDR memory. In this mode, the effective bandwidth of main memory accesses is enhanced for data cached in MCDRAM. Note that the MCDRAM is not a low-latency L3 cache: the KNL processors have only L1 and L2 caches, and no L3 cache. Each core has a 32-KB L1 cache, and each pair of cores (called a "tile") shares a 1-MB L2 cache.

Another memory mode is "flat", for which the MCDRAM is addressable and does not serve as cache for the DDR memory. In flat mode, the most heavily used arrays can be assigned to the MCDRAM memory.

When OpenMP is used, memory should be allocated in a bank near the core that will use its contents. Briefly, memory is actually allocated when it is first touched, that is, when values are first assigned. Therefore, the same OpenMP threads that initialize memory should be the ones that later use it.

4. Cluster Modes

The default cluster mode on Onyx is "quadrant". Despite the name implying four parts, a KNL node in quadrant mode is a single Non-Uniform Memory Access (NUMA) node. That is, memory access speed is uniform over cores in the quadrant mode. (The "quadrant" term refers to the organization of tags that indicate which memory port to use for a given memory page, an aspect that is transparent to the user.) On the other hand, the cluster mode "SNC4" results in four NUMA nodes per KNL node. The cluster mode SNC2 (two NUMA nodes) is not available on Onyx in order to reduce the amount of mode switching.

The aprun option "-S N" specifies the number of processes, N, per NUMA node. For 64 MPI processes per KNL node in SNC4 mode, the processes can be uniformly distributed using the following aprun command:

aprun -n total-ranks -N 64 -S 16 [ ... ]

5. Memory Modes

The default memory mode on Onyx is "cache". A node can be rebooted into three memory modes, with 100%, 50%, or 0% of MCDRAM dedicated to cache. (A fourth memory mode, 25%, is not available on Onyx, in order to reduce the amount of mode switching.) The last mode is called "flat". Many applications run well with 100% cache. If all of the data of an application fits in less than 16 GB, then the flat mode can improve performance. For cluster mode "quadrant", to place all data in MCDRAM when not in cache mode, run the program under numactl within the aprun command:

aprun [ ... ] numactl --membind=1 ./myprog.exe

When "membind" is used, the program will fail if more than 16 GB is needed. If the program might use more than 16 GB, yet use of MCDRAM is preferred, the following command can be used:

aprun [ ... ] numactl --preferred=1 ./myprog.exe

The argument of "1" for "--membind" or "--preferred" refers to MCDRAM; the DDR memory is referred to by an argument of "0". The portion of MCDRAM assigned to cache is invisible to numactl. For SNC4, the memory partitions are more complex and are not described in this quick start guide. The memory partitions can be seen using the command:

numactl --hardware (or, equivalently, numactl -H)

Recall that on a Cray supercomputer, an interactive PBS job does not put the user onto a compute node; rather, it puts the user on a batch node. Similarly, a batch script is interpreted by a batch node. So the command "numactl -H", run in an interactive session or a batch script, will not show the configuration of a KNL node. To obtain information about a KNL compute node, use:

aprun -n 1 -b numactl -H

The "-b" option tells aprun that the executable ("numactl") is already present on the KNL node and does not need to be transferred from the file system.

6. Examples

Consider a pure MPI job on one node with one thread per core, as follows:

aprun -n 64 ./xthi.x

The KNL processor can support up to four hyper-threads per core. To use two-way hyper-threading, for 128 threads per node, the aprun option is "-j2". To use all four hyper-threads, for 256 threads per node, the aprun option is "-j4". The PBS select line should still specify mpiprocs=64, even though there are more than 64 ranks per node.

Examples of using pure MPI with 2 or 4 hyper-threads:

aprun -n 128 -j2 ./xthi.x
aprun -n 256 -j4 ./xthi.x

Now consider a hybrid job with two OpenMP threads per MPI rank, which places one thread on each of the 64 cores. The depth option "-d" is always necessary with OpenMP.

export OMP_NUM_THREADS=2
aprun -n 32 -d2 ./xthi.x

Suppose we want more than one thread on each physical core; with "-j2", the 32 ranks of four threads each place two threads on each core.

export OMP_NUM_THREADS=4
aprun -n 32 -d4 -j2 ./xthi.x

An example of a few MPI processes per node with heavy use of OpenMP threads is shown below.

export OMP_NUM_THREADS=64
aprun -n 2 -d 64 -j2 ./xthi.x

export OMP_NUM_THREADS=128
aprun -n 2 -d 128 -j4 ./xthi.x

Note that when the Intel compiler is used with an OpenMP or hybrid model, the aprun command line should contain the option "-cc depth" or "-cc numa_node". For the Intel compiler, $KMP_PLACE_THREADS and $KMP_AFFINITY can be used to control thread affinity.
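For example, a hybrid run built with the Intel compiler might look like the following sketch; the rank and thread counts are illustrative, chosen so that 16 ranks x 8 threads fill the 128 hardware threads available with "-j2":

```shell
export OMP_NUM_THREADS=8
# "-cc depth" binds each rank's 8 threads to its own set of cores
aprun -n 16 -N 16 -d 8 -j2 -cc depth ./xthi.x
```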

For cluster mode SNC4 there are four NUMA nodes, and the "-S" option should be used to distribute MPI ranks between the NUMA nodes. As an example, consider the case of 64 ranks per node. For SNC4, the aprun command should include:

-N 64 -S 16

Another aprun option to consider is "-r N", where "N" is some small number (such as 1). This reserves cores for MPI work and kernel processes, thereby reducing jitter in the computational processes. On KNL, a Cray KNL expert reports that a very beneficial effect has been seen at large node counts for MPI_Allreduce() and other latency-dependent collectives. The cores used for "-r" must be taken from the computation; for example: aprun -r 4 -n 16 -N 16 -d 8 -j2.

For OpenMP version 4, the user can control the allocation of hardware threads using $OMP_PLACES and can pin threads to hardware using $OMP_PROC_BIND. See the online document Process and Thread Affinity for Intel Xeon Phi Processors.
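As a sketch, the standard OpenMP 4 variables can pin one thread to each physical core; the rank and depth values below are illustrative:

```shell
export OMP_NUM_THREADS=8
export OMP_PLACES=cores     # one place per physical core
export OMP_PROC_BIND=close  # keep a rank's threads on adjacent places
aprun -n 8 -d 8 ./xthi.x
```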

The program $SAMPLES_HOME/Programming/KNL/HYBRID/xthi.c can be used to see the thread affinity. Another way to see thread affinity, when using CCE, is to set the environment variable CRAY_OMP_CHECK_AFFINITY to TRUE, which causes each OpenMP thread to report its binding.

7. Setting KNL Modes (Provisioning)

At the start of a batch job, the user can specify the cluster mode and memory mode of the KNL nodes assigned to the job. Three memory modes are available: MCDRAM used as cache (denoted "100"), flat mode, for which MCDRAM is an addressable memory bank (denoted "0"), and a mode in which half of the MCDRAM is used as cache and half is used as a memory bank, as in flat mode (denoted "50"). Two cluster modes are available: quadrant (denoted "quad") and SNC4 (denoted "snc4"). To be more precise, the part of MCDRAM that is not used for cache is seen as one memory bank in quad mode and as four memory banks in snc4 mode. The text strings used to denote the six available modes are: quad_100, quad_50, quad_0, snc4_100, snc4_50, and snc4_0.

Users can choose a mode by adding ":aoe=mode" to the select line of the batch script, for example:

#PBS -l select=1:ncpus=64:mpiprocs=64:nmics=1:aoe=quad_0

If a batch script does not contain the aoe parameter, the default will be quad_100.

Keep in mind that it can take as long as 20 minutes to reboot a node into the desired mode, unless a free KNL node is already available in that mode.

8. Profiling

The Cray compiler can add hooks to the executable that are understood by the Cray profiler, "Perftools". The Intel compiler can add hooks to the executable that are understood by VTune and other Intel tools.

9. Additional Optimization Advice

This section provides additional advice related to optimization.

9.1. OMP Wait Policy

The environment variable $OMP_WAIT_POLICY specifies whether waiting threads should be active or passive. If the value is "passive", waiting threads do not consume CPU cycles while waiting; the value "active" specifies that they spin. The default Cray $OMP_WAIT_POLICY is "active"; the default Intel $OMP_WAIT_POLICY is "passive". The value "passive" is usually best for high-level threading, whereas the value "active" is usually best for loop-level threading. For the Intel programming environment, there is also the environment variable $KMP_BLOCKTIME, which controls how many milliseconds a thread spins before it goes to sleep.
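For example, a code that spends long stretches between parallel regions might use a passive policy, sketched below; the value 0 for $KMP_BLOCKTIME is one common choice, telling Intel threads to sleep immediately rather than spin:

```shell
export OMP_WAIT_POLICY=passive
export KMP_BLOCKTIME=0    # Intel only: spin for 0 ms before sleeping
```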