Cray XC40/50 (Onyx)
MLA Quick Start Guide
1. Machine Learning Accelerator (MLA) Nodes
The Machine Learning Accelerator (MLA) nodes on Onyx each have a pair of Intel Skylake processors, for a total of 40 cores per node. Each MLA node also has either two or ten NVIDIA V100 GPUs, which are the MLAs. Nodes with two MLAs contain SXM2 V100s with 16 GB of memory each. Nodes with ten MLAs contain PCIe V100s with 32 GB of memory each.
MLA nodes on Onyx communicate over InfiniBand and are not part of the Onyx high-speed Aries network. MLA nodes are scheduled by PBS Pro but are not managed by the Cray internal scheduler, ALPS. Therefore, jobs that run on MLA nodes do not first land on a batch node and do not require "aprun" to reach a compute node. As on a traditional cluster, an MLA job places you directly on the first compute node in your job.
The operating system on the MLA nodes is CentOS, unlike the login nodes which use SLES and the compute nodes which use Cray Linux Environment (CLE). Available and default modules will differ between the login and MLA nodes. Therefore, users cannot compile codes for MLA nodes on the login nodes. Users will need to start an interactive PBS job and do their compiling directly on an MLA node.
2. How to Run an MLA Job
Compiling for an MLA run must be done on an MLA node, not on a standard Onyx login node. To get a prompt on an MLA node, submit an interactive batch job through PBS by executing the qsub command with the "-I" argument. For example, the following command gives you a prompt on a node with 2 MLAs:
qsub -A Project_ID -q debug -l select=1:ncpus=40:mpiprocs=40:nmlas=2 -l walltime=1:00:00 -I
Your interactive batch job will be scheduled just as normal batch jobs are scheduled, so it may take a while to start. This example uses the debug queue, but you can use any other queue for which you are eligible. Once your interactive batch session starts, you will have a prompt on a node with a host name of the form nid0####, where #### is a number between 6001 and 6064. Here you can modify your modules and compile, execute, and debug your code interactively.
PBS batch jobs for MLA nodes accept the same arguments as jobs for non-MLA nodes, with project, queue, walltime, and select statements being required. The select statement adds nmlas, as follows:
-l select=N1:ncpus=40:mpiprocs=N2:nmlas=N3
where you specify the number of nodes requested (N1), the number of MPI processes per node (N2), and the number of MLAs per node (N3). The number of MLAs per node can only be 2 or 10.
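The rules for the select statement can be sketched as a small shell helper. This is an illustration only: the function name mla_select is hypothetical and is not part of PBS or Onyx, and ncpus is fixed at 40 on the assumption that every MLA node exposes 40 cores, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of PBS or Onyx): prints the select statement
# for an MLA job and rejects MLA counts other than 2 or 10.
mla_select() {
    nodes=$1      # N1: number of nodes requested
    mpiprocs=$2   # N2: MPI processes per node
    nmlas=$3      # N3: MLAs per node

    case "$nmlas" in
        2|10) ;;  # the only values the guide allows
        *) echo "nmlas must be 2 or 10, got: $nmlas" >&2
           return 1 ;;
    esac

    # ncpus=40 is assumed here because MLA nodes have 40 cores per node.
    echo "select=${nodes}:ncpus=40:mpiprocs=${mpiprocs}:nmlas=${nmlas}"
}

# Example: two nodes, 40 MPI processes per node, ten MLAs per node.
mla_select 2 40 10
```

The printed string can be passed to qsub after "-l", or placed on a "#PBS -l" line in a batch script.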
3. MLA Allocation
MLA jobs are not billed against your allocation by CPU hours, but instead are billed by MLA hours. On a standard compute node, an hour of use would reduce your allocation by 40 CPU hours, which is the number of standard compute cores (ncpus) on the node. An hour of use of an MLA node, however, reduces your allocation by 2 or 10 MLA hours, depending on the size of the MLA node requested. MLA jobs use a different allocation from jobs on standard compute nodes, even though both may share the same 13-character project ID in their PBS batch scripts.
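The billing arithmetic can be illustrated with a short shell sketch. The function name mla_hours is hypothetical, and the sketch assumes that the charge scales linearly with the number of nodes and with whole wall-clock hours. The guide only states the charge for one node for one hour, so treat the multi-node figure as an estimate.

```shell
#!/bin/sh
# Hypothetical helper: estimates the MLA hours charged for a job.
# Assumption: charge = nodes x MLAs per node x wall-clock hours.
mla_hours() {
    nodes=$1   # number of MLA nodes in the job
    nmlas=$2   # MLAs per node (2 or 10)
    hours=$3   # wall-clock hours used (whole hours only)
    echo $(( nodes * nmlas * hours ))
}

mla_hours 1 2 1     # one 2-MLA node for one hour     -> 2
mla_hours 1 10 1    # one 10-MLA node for one hour    -> 10
mla_hours 2 10 3    # two 10-MLA nodes for three hours -> 60
```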
4. How to Compile and Execute
Users must be on an MLA node in order to compile codes to run on MLA nodes. See section 2 above for obtaining an interactive PBS session on an MLA node. Once the binaries have been compiled, users can submit typical batch jobs from Onyx login nodes for execution on MLA nodes.
Several compilers are available on MLA nodes, including GNU, Cray, and Intel. CUDA and a number of MPI varieties are also available. Run "module avail" on an MLA node to see the full list. MLA users should not compile for MPI with "cc" but should instead use "mpicc" after loading their desired MPI module. Also, instead of "aprun", the MPI launch command is "mpirun" or "mpiexec". These commands are more familiar to non-Cray users, as they are standard on traditional clusters.
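As a rough sketch of how these pieces fit together, the shell function below prints a batch script for an MLA job that launches a previously compiled binary with mpirun. Several names are assumptions rather than anything defined by this guide: the function write_mla_job, the executable ./my_app, and the placeholder MPI_MODULE_NAME, which must be replaced with a module actually listed by "module avail" on an MLA node. The "-np" flag is the usual process count option of mpirun, but check the documentation for the MPI module you load.

```shell
#!/bin/sh
# Hypothetical generator: prints a PBS batch script for an MLA job to stdout.
# Assumed names (not from the guide): ./my_app and MPI_MODULE_NAME.
write_mla_job() {
    nodes=$1      # number of MLA nodes
    mpiprocs=$2   # MPI processes per node
    nmlas=$3      # MLAs per node (2 or 10)
    total=$(( nodes * mpiprocs ))   # total MPI processes across the job

    cat <<EOF
#!/bin/bash
#PBS -A Project_ID
#PBS -q debug
#PBS -l select=${nodes}:ncpus=40:mpiprocs=${mpiprocs}:nmlas=${nmlas}
#PBS -l walltime=1:00:00

cd \$PBS_O_WORKDIR
module load MPI_MODULE_NAME
mpirun -np ${total} ./my_app
EOF
}

# Example: one node, 40 MPI processes, two MLAs.
write_mla_job 1 40 2
```

Redirecting the output to a file (for example, mla_job.pbs, a name chosen here for illustration) gives a script that can be submitted with qsub from a login node once the binary has been built on an MLA node.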
Some examples can be found in the Sample Code Repository at $SAMPLES_HOME/Programming/MLA.