Accessing Adastra

This document is a quick start guide for the Adastra machine. You can find additional information on the GENCI’s website and in this booklet.

Account opening

To access Adastra you need to have an account on the Demande d’Attribution de Ressources Informatique (DARI) website. Then, on eDARI, you need to ask to be associated with a research project that has been attributed Adastra compute hours. Following that, you can ask on eDARI for your personal account to be created on the machine (Adastra in this context). You will have to fill in a form which, to be valid, needs to be dated and electronically signed by the three parties below:

  • The person who made the request;

  • the user’s security representative (often affiliated with the user’s laboratory);

  • the laboratory director.

You will then receive, via email, the instructions containing your credentials.

Connecting

To connect to Adastra, ssh to adastra.cines.fr.

$ ssh <login>@adastra.cines.fr

Warning

Authenticating to Adastra using ssh keys is not permitted. You will have to enter your password.

To connect to a specific login node, use:

$ ssh <login>@adastra<login_node_number>.cines.fr

Where <login_node_number> represents an integer login node identifier. For instance, ssh anusername@adastra5.cines.fr will connect you to login node number 5.

X11 forwarding

Automatic forwarding of the X11 display to a remote computer is possible with the use of SSH and a local (i.e., on your desktop) X server. To set up automatic X11 forwarding within SSH, you can do one of the following:

Invoke ssh with -X:

$ ssh -X <login>@adastra.cines.fr

Note that use of the -x flag (lowercase) will disable X11 forwarding. Users should not manually set the ${DISPLAY} environment variable for X11 forwarding.

Warning

If you have issues when launching a GUI application, make sure they are not related to the .Xauthority file. If they are, or you are not sure, check out the .Xauthority file document.

Login unique

The login unique (in English: single sign-on or unique login) is a new feature of the CINES supercomputer that enables a user to work on multiple projects using a single, unique login. These logins (also called usernames) will be valid for the lifetime of the machine (though the data may not be, see Quotas for more details). This simplifies authentication over time. This procedure is already used in the other two national centres (IDRIS and TGCC). The method for logging into the machine remains the same as before and as described above. Once you are logged in, you get access to one of your home directories, namely the home associated with your current project (if you have one). At this stage, you can adapt your environment to the project you wish to work on with the help of the myproject command.

The unique login tools will modify your Unix group and some environment variables. If you use scripts that are automatically loaded or that are expected in a specific location (say .bashrc), check out the notes in the Layout of common files and directories and Accessing the storage areas documents.

In this section we present the myproject command. When freshly connected, your shell’s working directory will be your current project’s personal home directory or, if your account is not linked to any project, your personal home. Again, refer to Accessing the storage areas for more details on the various storage areas. Your first step could be to list the flags myproject supports, which can be done like so:

$ myproject --help
usage: my_project.py [-h] [-s [project] | -S | -l | -a project | -c | -C | -m [project]]

Manage your hpc projects. The active project is the current project in your
session.

optional arguments:
-h, --help            show this help message and exit
-s [project], --state [project]
                        Get current HPC projects state
-S, --stateall        Get all HPC projects state
-l, --list            List all authorized HPC projects
-a project, --activate project
                        Activate the indicated project
-c, --cines           List projects directories CINES variables
-C, --ccfr            List projects directories CCFR variables
-m [project], --members [project]
                        List all members of a project

The most used flags are -l to list the projects you are assigned to, -a to switch project and -c to list the environment variables described in Accessing the storage areas.

Listing the environment variables and their value

This is done like so (assuming a user with login someuser):

$ myproject -c
Liste des variables CINES permettant l'accès aux répertoires dans les différents espaces de stockage
----------------------------------------------------------------------------------------------------
Project actif: dci

OWN_HOMEDIR :     /lus/home/PERSO/grp_someuser/someuser

HOMEDIR :          /lus/home/BCINES/dci/someuser
SHAREDHOMEDIR :    /lus/home/BCINES/dci/SHARED
SCRATCHDIR :       /lus/scratch/BCINES/dci/someuser
SHAREDSCRATCHDIR : /lus/scratch/BCINES/dci/SHARED
WORKDIR :          /lus/work/BCINES/dci/someuser
SHAREDWORKDIR :    /lus/work/BCINES/dci/SHARED
STOREDIR :         /lus/store/BCINES/dci/someuser


gda2212_HOMEDIR :          /lus/home/NAT/gda2212/someuser
gda2212_SHAREDHOMEDIR :    /lus/home/NAT/gda2212/SHARED
gda2212_SCRATCHDIR :       /lus/scratch/NAT/gda2212/someuser
gda2212_SHAREDSCRATCHDIR : /lus/scratch/NAT/gda2212/SHARED
gda2212_WORKDIR :          /lus/work/NAT/gda2212/someuser
gda2212_SHAREDWORKDIR :    /lus/work/NAT/gda2212/SHARED
gda2212_STOREDIR :         /lus/store/NAT/gda2212/someuser

dci_HOMEDIR :          /lus/home/BCINES/dci/someuser
dci_SHAREDHOMEDIR :    /lus/home/BCINES/dci/SHARED
dci_SCRATCHDIR :       /lus/scratch/BCINES/dci/someuser
dci_SHAREDSCRATCHDIR : /lus/scratch/BCINES/dci/SHARED
dci_WORKDIR :          /lus/work/BCINES/dci/someuser
dci_SHAREDWORKDIR : /lus/work/BCINES/dci/SHARED
dci_STOREDIR :         /lus/store/BCINES/dci/someuser

Observe that the actif project (French for active, i.e., current project) is dci in the example above. This should be interpreted as: the shell is currently set up so that the generic environment variables point to that project’s filesystem directories. For instance, ${SHAREDSCRATCHDIR} would point to the actif project’s group shared scratch space, in this case /lus/scratch/BCINES/dci/SHARED. For more details on the file system spaces CINES offers, see Accessing the storage areas.

As such, an actif project does not relate to the DARI notions of activated, valid, ongoing projects, etc.

Listing associated projects

This is done like so (assuming a user with login someuser):

$ myproject -l
Projet actif: dci

Liste des projets de calcul associés à l'utilisateur 'someuser' : ['gda2212', 'dci']

Switching project

You can rely on the ${ACTIVE_PROJECT} environment variable to obtain the currently used project:

$ echo ${ACTIVE_PROJECT}
dci

This is done like so (assuming a user with login someuser):

$ myproject -a gda2212
Projet actif :dci

Bascule du projet "dci" vers le projet "gda2212"
Projet " gda2212 " activé.
$ myproject -c
Liste des variables CINES permettant l'accès aux répertoires dans les différents espaces de stockage
----------------------------------------------------------------------------------------------------
Project actif: gda2212

OWN_HOMEDIR :     /lus/home/PERSO/grp_someuser/someuser

HOMEDIR :          /lus/home/NAT/gda2212/someuser
SHAREDHOMEDIR :    /lus/home/NAT/gda2212/SHARED
SCRATCHDIR :       /lus/scratch/NAT/gda2212/someuser
SHAREDSCRATCHDIR : /lus/scratch/NAT/gda2212/SHARED
WORKDIR :          /lus/work/NAT/gda2212/someuser
SHAREDWORKDIR :    /lus/work/NAT/gda2212/SHARED
STOREDIR :         /lus/store/NAT/gda2212/someuser


gda2212_HOMEDIR :          /lus/home/NAT/gda2212/someuser
gda2212_SHAREDHOMEDIR :    /lus/home/NAT/gda2212/SHARED
gda2212_SCRATCHDIR :       /lus/scratch/NAT/gda2212/someuser
gda2212_SHAREDSCRATCHDIR : /lus/scratch/NAT/gda2212/SHARED
gda2212_WORKDIR :          /lus/work/NAT/gda2212/someuser
gda2212_SHAREDWORKDIR :    /lus/work/NAT/gda2212/SHARED
gda2212_STOREDIR :         /lus/store/NAT/gda2212/someuser

dci_HOMEDIR :          /lus/home/BCINES/dci/someuser
dci_SHAREDHOMEDIR :    /lus/home/BCINES/dci/SHARED
dci_SCRATCHDIR :       /lus/scratch/BCINES/dci/someuser
dci_SHAREDSCRATCHDIR : /lus/scratch/BCINES/dci/SHARED
dci_WORKDIR :          /lus/work/BCINES/dci/someuser
dci_SHAREDWORKDIR :    /lus/work/BCINES/dci/SHARED
dci_STOREDIR :         /lus/store/BCINES/dci/someuser

As you can see, ${HOMEDIR}, ${SHAREDHOMEDIR}, etc. have changed when the user switched project (compare with the output presented earlier). That said, the prefixed variables such as ${dci_HOMEDIR} did not change, and using them is the recommended way to reference a directory when you do not know which project will be active at the time the variable is used (say, in a script).
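As a hedged illustration (reusing the dci project from the example output above and hypothetical directory names), a script can reference a given project’s storage explicitly through the prefixed variables, whatever the currently active project is:

#!/bin/bash
# Always refers to the dci project's spaces, even if another project is active.
input_directory="${dci_WORKDIR}/my_simulation"
output_directory="${dci_SCRATCHDIR}/my_postprocessing"
mkdir -p "${output_directory}"
cp -r "${input_directory}" "${output_directory}"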

Some issues can be encountered when using tools that are unaware of the multiple-home structure. Yet again, check the Layout of common files and directories and Accessing the storage areas documents.

Layout of common files and directories

Due to new functionalities introduced through Login unique, you may find the Accessing the storage areas document useful. It describes the multiple home directories and how to access them through environment variables (${HOMEDIR}, ${OWN_HOMEDIR}, etc.).

Some subtleties need addressing, see below.

.bashrc file

Your .bashrc file should be accessible in the ${HOMEDIR} directory (project personal home).

Using symbolic links, you can prevent file redundancy by first storing your .bashrc in your ${OWN_HOMEDIR} and then creating a link to it in your ${HOMEDIR}. Effectively, you are factorizing the .bashrc:

$ ln -s "${OWN_HOMEDIR}/.bashrc" "${HOMEDIR}/.bashrc"

If you want your .bashrc to be loaded when you login to the machine you need to make sure a file called .bash_profile is present in your ${HOMEDIR} directory (project personal home). This file, if not present, should thus be created to contain:

if [ -f ~/.bashrc ]; then
    source ~/.bashrc
fi

Similarly to the .bashrc, you can use links to factorize this file.
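For instance, a minimal sketch, keeping the real file in ${OWN_HOMEDIR} and linking it from the project personal home:

$ cat > "${OWN_HOMEDIR}/.bash_profile" << 'EOF'
if [ -f ~/.bashrc ]; then
    source ~/.bashrc
fi
EOF
$ ln -s "${OWN_HOMEDIR}/.bash_profile" "${HOMEDIR}/.bash_profile"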

.ssh directory

Your .ssh directory should be accessible in the ${OWN_HOMEDIR} directory (personal home).

Optionally, you can create a link in your ${HOMEDIR} pointing to ${OWN_HOMEDIR}/.ssh.
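For instance (a minimal sketch):

$ ln -s "${OWN_HOMEDIR}/.ssh" "${HOMEDIR}/.ssh"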

.Xauthority file

Your .Xauthority file should be accessible in the ${HOMEDIR} directory (project personal home).

In practice, this file gets created by the system in the ${OWN_HOMEDIR} directory (personal home). You need to create a link like so:

$ ln -s "${OWN_HOMEDIR}/.Xauthority" "${HOMEDIR}/.Xauthority"

Note

Make sure your ${XAUTHORITY} environment variable correctly points to ${OWN_HOMEDIR}/.Xauthority.

Programming environment

The programming environment includes compiler toolchains, libraries, performance analysis and debugging tools and optimized scientific libraries. Adastra being a Cray machine, it uses the Cray Programming Environment, abbreviated CrayPE or CPE. In practice a CrayPE is simply a set of modules. This section tries to shed light on the subtleties of the system’s environment.

The Cray documentation is available in the man pages (prefixed with intro_) and is starting to be mirrored and enhanced at this URL https://cpe.ext.hpe.com/docs/.

Module, why and how

Like on many HPC machines, the software is presented through modules. A module can mostly be seen as a set of environment variables. Variables such as ${PATH} and ${LD_LIBRARY_PATH} are modified to introduce new tools in the environment. The software providing the module concept is Lmod, a Lua-based module system for dynamically altering a shell environment.

General usage

The interface to Lmod is provided by the module command:

module list
    Shows the list of the currently loaded modules.

module overview
    Shows a view of modules aggregated over the versions.

module available
    Shows a table of the currently available modules.

module --show_hidden available
    Shows a table of the currently available modules, including hidden modules (very useful!).

module purge
    Unloads all modules.

module show <modulename>
    Shows the environment changes made by the <modulename> modulefile.

module load <modulename> [...]
    Loads the given <modulename>(s) into the current environment.

module help <modulename>
    Shows help information about <modulename>.

module spider <string>
    Searches all possible modules according to <string>.

module use <path>
    Adds <path> to the modulefile search cache and ${MODULEPATH}.

module unuse <path>
    Removes <path> from the modulefile search cache and ${MODULEPATH}.

module update
    Reloads all currently loaded modules.

Lmod introduces the concepts of default and currently loaded modules. When a user enters the module available command, they may get something similar to the small example given below.

$ module available
---- /opt/cray/pe/lmod/modulefiles/comnet/crayclang/14.0/ofi/1.0 ----
cray-mpich/8.1.20 (L,D)    cray-mpich/8.1.21

Where:
 L:  Module is loaded
 D:  Default Module

Note the L and D markers described at the end of the example. They show you what is loaded and what is loaded by default when you do not specify the version of a module (that is, you omit the /8.1.21 for instance). Note that D does not mean the module is loaded automatically but that, if a module is to be loaded (say cray-mpich) and the version is not specified, then the module marked by D will be loaded (say cray-mpich/8.1.20). It is considered good practice to specify the full name to avoid issues related to more complex topics (compilation, linkage, etc.).

Note

By default some modules are loaded and this differs from older machines hosted at CINES such as Occigen.

Note

The --terse option can be useful when the output of the module command needs to be parsed in scripts.
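For instance, a minimal sketch listing the currently loaded modules one per line (note that the module command writes its output to the standard error stream, hence the redirection):

$ module --terse list 2>&1 | sort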

Looking for a specific module or an already installed software

Modules with dependencies are only available (shown by module available) when their dependencies, such as compilers, are loaded. To search the entire hierarchy across all possible dependencies, the module spider command can be used, as summarized in the following table.

module spider
    Shows the entire possible graph of modules.

module spider <modulename>
    Searches for modules named <modulename> in the graph of possible modules.

module spider <modulename>/<version>
    Searches for a specific version of <modulename> in the graph of possible modules.

module spider <string>
    Searches for modulefiles containing <string>.

CrayPE basics

The CrayPE is often feared due to its apparent complexity. We will try to present the basic building blocks and show how to assemble them.

At a high level, a Cray environment is made up of:

  • External libraries (such as the ones in ROCm);

  • Cray libraries (MPICH, libsci);

  • Architecture modules (craype-accel-amd-gfx90a);

  • Compilers (craycc as the cce module, amdclang as the amd module, gcc as gnu module);

  • The Cray compiler wrappers (cc, CC, ftn) offered by the craype module;

  • The PrgEnv modules (PrgEnv-cray);

  • And the cpe/XX.YY modules.

The external libraries refer to libraries the CrayPE requires but that are not the property of Cray; AMD’s ROCm is such an example. The Cray libraries are closed source software; there are multiple variants of the same library to accommodate GPU support and the many supported compilers. The architecture modules change the wrappers’ behavior (see Cray compiler wrapper) by helping choose which library to link against (say, the MPICH GPU plugin), or by adding flags such as -march=zen4. The compilers are not meant to be used directly; they should instead be used through the Cray compiler wrappers, which interpret the PrgEnv, the loaded Cray libraries and the architecture modules to handle the compatibility matrix transparently (with few visible artifacts). The PrgEnv modules are preset environments; you can choose to use them or cherry-pick your own set of modules, at your own risk. The cpe/XX.YY modules are used to change the default version of the above mentioned modules and allow you to operate a set of intercompatible default modules.

[Figure: graphical representation of the CrayPE component interactions.]

Note

There is an order in which we recommend loading the modules. See the note in Targeting an architecture.

Important

Do not forget to export the appropriate environment variables such as CC, CXX, etc. and make them point to the correct compiler or Cray compiler wrapper by loading the correct PrgEnv. This can be crucial for tools like CMake and Make.
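For instance, a minimal sketch (the exact variables your build system honors may differ):

$ module load PrgEnv-cray
$ export CC=cc CXX=CC FC=ftn
$ # Or pass the wrappers explicitly, for example to CMake:
$ # cmake -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC -DCMAKE_Fortran_COMPILER=ftn ..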

Changing CrayPE version

A Cray Programming Environment (CrayPE) can simply be viewed as a set of modules (of particular versions). Switching CrayPE is like switching modules and defining new default versions.

You can load a cpe/XX.YY module to prepare your environment with the modules associated with a specific XX.YY version of the CPE. In practice, it will change the version of your loaded modules to match the versions the cpe/XX.YY in question expects and, in addition, will modify the default version of the Cray modules.

Warning

If you use a cpe/XX.YY module, it must come first before you load any other Cray modules.

Important

You can preload a cpe/XX.YY module before preparing your environment to be sure you are using the correct version of the modules you load.

As an example:

 1$ module available cpe
 2-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
 3    cpe/22.11    cpe/22.12    cpe/23.02 (D)
 4$ module purge
 5-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
 6    cce/15.0.0    cce/15.0.1 (D)
 7$ module load PrgEnv-cray
 8$ module list
 9Currently Loaded Modules:
10    1) cce/15.0.1   2) craype/2.7.19   3) cray-dsmml/0.2.2
11    4) libfabric/1.15.2.0   5) craype-network-ofi   6) cray-mpich/8.1.24
12    7) cray-libsci/23.02.1.1   8) PrgEnv-cray/8.3.3
13$ module load cpe/22.12
14The following have been reloaded with a version change:
15  1) cce/15.0.1 => cce/15.0.0
16  2) cray-libsci/23.02.1.1 => cray-libsci/22.12.1.1
17  3) cray-mpich/8.1.24 => cray-mpich/8.1.23
18$ module available cce
19-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
20    cce/15.0.0 (L,D)    cce/15.0.1
21$ module load cpe/23.02
22Unloading the cpe module is insufficient to restore the system defaults.
23Please run 'source /opt/cray/pe/cpe/22.12/restore_lmod_system_defaults.[csh|sh]'.
24
25The following have been reloaded with a version change:
26  1) cce/15.0.0 => cce/15.0.1
27  2) cpe/22.12 => cpe/23.02
28  3) cray-libsci/22.12.1.1 => cray-libsci/23.02.1.1
29  4) cray-mpich/8.1.23 => cray-mpich/8.1.24
30$ module available cce
31-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
32    cce/15.0.0    cce/15.0.1 (L,D)

As we can see, loading cpe/22.12 changed the loaded module versions and also changed the default module versions.

Note

Loading a cpe module will lead to a quirk, shown on line 22. The quirk comes from the fact that unloading a module that switches other modules does not bring the environment back to its state before the switch; in fact, it does nothing. Once the cpe module is unloaded, the default module versions are not restored automatically; restoring them is the role of the above mentioned script (restore_lmod_system_defaults.sh).

Cray compiler wrapper

As you may know, compatibility between compilers and libraries is not always guaranteed; a compatibility matrix can be given to users who are then left to themselves to figure out how to combine the software components. Loading a PrgEnv-<compiler>[-<compiler2>] module introduces a compiler wrapper (also called a driver) which will interpret environment variables introduced by other Cray modules such as craype-accel-amd-gfx90a (see Targeting an architecture for more details), cray-mpich, etc. The driver creates the toolchain needed to satisfy the request (compilation, optimization, link, etc.). It also uses the information gathered in the environment to specify the include paths, link flags, architecture specific flags, etc. that the underlying compiler needs to produce code. Effectively, these compiler wrappers abstract the compatibility matrix away from the user; linking and providing the correct headers at compile and run time is only a subset of the features provided by the Cray compiler wrappers. If you do not use the wrappers, you will have to do more work and expose yourself to error prone manipulations.

PrgEnv and compilers

The compilers available on Adastra are provided through the Cray environment modules. Most readers already know about the GNU software stack. Adastra comes with additional supported compilers. The Cray and the AMD Radeon Open Compute (ROCm) compilers are both based on the state of the art LLVM compiler infrastructure. In fact, you can treat these compilers as vendor recompiled Clang/Flang LLVM compilers with added optimization passes or, in the case of the Cray compiler, an added OpenMP backend (but not much more). The AMD Optimizing C/C++ Compiler (AOCC) resembles the Intel ICC compiler, but for AMD; it is also based on LLVM. There is also a system (OS provided) version of GCC available in /usr/bin (try not to use it).

The Programming environment column of the table below represents the module to load to benefit from a specific environment. You can load a compiler module after loading a PrgEnv to choose a specific version of a compiler belonging to a given PrgEnv. That is, load cce/15.0.0 after loading PrgEnv-cray to make sure you get the cce/15.0.0 compiler. The modules loaded by a PrgEnv will change as the environment evolves. After the first load of a PrgEnv, you are advised to save the modules implicitly loaded (module list) and to explicitly load them later to avoid future breakage.
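For example, a hedged sketch of this practice (the file name is arbitrary and the module versions shown are merely illustrative, taken from the listing earlier in this document):

$ module load PrgEnv-cray
$ # Save the implicitly loaded modules (the terse list is written to stderr).
$ module --terse list 2> prgenv_cray_modules.txt
$ # Later, load them explicitly, pinning the versions, for instance:
$ # module load cce/15.0.1 craype/2.7.19 cray-mpich/8.1.24 cray-libsci/23.02.1.1 PrgEnv-cray/8.3.3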

Cray, PrgEnv-cray, compiler module cce:
    C       : cc  (wraps craycc)
    C++     : CC  (wraps craycxx or crayCC)
    Fortran : ftn (wraps crayftn)
    Usage and notes: for CPU and GPU compilations. craycc and craycxx are LLVM based while crayftn is entirely proprietary. cce means Cray Compiling Environment.

AMD, PrgEnv-amd, compiler module amd:
    C       : cc  (wraps amdclang)
    C++     : CC  (wraps amdclang++)
    Fortran : ftn (wraps amdflang, new Flang)
    Usage and notes: for CPU and GPU compilations. This module introduces the ROCm stack. ROCm is AMD’s GPGPU software stack. These compilers are open source and available on GitHub. You can contact AMD via GitHub issues.

AMD, PrgEnv-aocc, compiler module aocc:
    C       : cc  (wraps clang)
    C++     : CC  (wraps clang++)
    Fortran : ftn (wraps flang, classic Flang)
    Usage and notes: for CPU compilations. These compilers are LLVM based but the LLVM fork is not open source.

GNU, PrgEnv-gnu, compiler module gcc:
    C       : cc  (wraps gcc)
    C++     : CC  (wraps g++)
    Fortran : ftn (wraps gfortran)
    Usage and notes: for CPU compilations.

Intel, PrgEnv-intel, compiler module intel:
    C       : cc  (wraps icx)
    C++     : CC  (wraps icpx)
    Fortran : ftn (wraps ifort)
    Usage and notes: for CPU compilations. The historical ifort compiler and the OneAPI icx and icpx compilers.

Intel, PrgEnv-intel, compiler module intel-classic:
    C       : cc  (wraps icc)
    C++     : CC  (wraps icpc)
    Fortran : ftn (wraps ifort)
    Usage and notes: for CPU compilations. Intel’s historical (but good) toolchain.

Intel, PrgEnv-intel, compiler module intel-oneapi:
    C       : cc  (wraps icx)
    C++     : CC  (wraps icpx)
    Fortran : ftn (wraps ifx)
    Usage and notes: for CPU compilations. Intel’s new toolchain, based on LLVM and trying to democratize SYCL.

Note

Reading (and understanding) the craycc or crayftn man pages will provide you with valuable knowledge on the usage of the Cray compilers.

Important

It is highly recommended to use the Cray compiler wrappers (cc, CC, and ftn) whenever possible. These are provided whichever programming environment is used. These wrappers are somewhat like the mpicc provided by other vendors.

Switching compiler is as simple as loading another PrgEnv. The user only needs to recompile the software, assuming the build scripts or build script generators (say, CMake scripts) are properly engineered.
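For instance, a minimal sketch compiling hypothetical source files through the wrappers (my_program.c, my_program.cpp and my_program.f90 are placeholder file names):

$ module load PrgEnv-cray
$ cc  -O2 -o my_c_program       my_program.c
$ CC  -O2 -o my_cxx_program     my_program.cpp
$ ftn -O2 -o my_fortran_program my_program.f90
$ # Loading another PrgEnv (say, PrgEnv-gnu) and recompiling is enough to switch compiler.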

For CPU compilations:

  • C/C++ codes can rely on PrgEnv-gnu, PrgEnv-aocc or PrgEnv-cray;

  • Fortran codes can rely on PrgEnv-gnu, PrgEnv-cray or PrgEnv-intel.

Note

If you target the Genoa CPUs, you must ensure that the GCC version is greater than or equal to gcc/13.2.0.

For GPU compilations:

  • C/C++ codes can rely on PrgEnv-amd, PrgEnv-cray or potentially PrgEnv-gnu with rocm;

  • Fortran codes can rely on PrgEnv-cray (required for OpenMP target/OpenACC + Fortran).

To know which compiler/PrgEnv to use depending on the parallelization technology your program relies on (OpenMP, OpenACC, HIP, etc.), check this table.

Note

Understand that, while both are AMD software, PrgEnv-amd and PrgEnv-aocc target fundamentally different node kinds: the first one is part of the ROCm stack (analogous to NVHPC), the second one is a historical CPU compiler (analogous to Intel’s ICC).

The PrgEnv-cray (CCE), PrgEnv-amd (ROCm), PrgEnv-gnu and PrgEnv-aocc environments all support the following C++ standards (and implied C standards): c++11, gnu++11, c++14, gnu++14, c++17, gnu++17, c++20, gnu++20, c++2b, gnu++2b. Some caveats exist regarding C++ modules with C++20. All these compilers (except GNU) are based on Clang.

The Fortran compilers all support the following standards: f90, f95, f03.

Warning

If your code has, all along its life, relied on non standard, vendor specific extensions, you may have issues using another compiler.

PrgEnv mixing and subtleties

Cray provides the PrgEnv-<compiler>[-<compiler2>] modules (say, PrgEnv-cray-amd) that load a given <compiler> toolchain and optionally, if set, introduce an additional <compiler2>. In case a <compiler2> is specified, the Cray environment will use <compiler> to compile Fortran sources and <compiler2> for C and C++ sources. The user can then enrich their environment by loading other libraries through modules (though some of these libraries are loaded by default with the PrgEnv).

Introducing an environment, toolchain or tool through the use of modules means that loading a module will modify environment variables such as ${PATH}, ${ROCM_PATH} and ${LD_LIBRARY_PATH} to make the tool or toolchain available to the user’s shell.

For example, say you wish to use the Cray compiler to compile CPU or GPU code, introduce the CCE toolchain this way:

$ module load PrgEnv-cray

Say you want to use the Cray compiler to compile Fortran sources and use the AMD compiler for C and C++ sources, introduce the CCE and ROCm toolchains this way:

$ module load PrgEnv-cray-amd

Say you want to use the AMD compiler to compile CPU or GPU code, introduce the ROCm toolchain this way:

$ module load PrgEnv-amd

Mixing PrgEnv and toolchain

Say you want to use the Cray compiler to compile CPU or GPU code and also have access to the ROCm tools and libraries, introduce the CCE and ROCm tooling this way:

$ module load PrgEnv-cray amd-mixed

Mixing compilers and tooling is achieved through the *-mixed modules. *-mixed modules do not significantly alter the Cray compiler wrappers’ behavior. They can be used to steer the compiler into using, say, the correct ROCm version instead of the default one (/opt/rocm).

*-mixed modules can be viewed as an alias to the underlying software. For instance, amd-mixed would be an alias for the rocm module.

Targeting an architecture

In a Cray environment, one can load modules to target architectures instead of adding compiler flags explicitly.

On Adastra’s accelerated nodes, we have AMD Trento (host CPU) and AMD MI250X (accelerator) as the two target architectures. The command module available craype- will show all the installed modules for the available target architectures. For AMD Trento the module is craype-x86-trento, for AMD MI250X it is craype-accel-amd-gfx90a and for MI300A it is craype-accel-amd-gfx942. These modules add environment variables used by the Cray compiler wrapper to trigger the flags used by the compilers to optimize or produce code for these architectures.

Warning

If you load a non-CPU target module, say craype-accel-amd-gfx90a, please also load the *-mixed or toolchain module (rocm) associated with the target device, else you expose yourself to a debugging penance.

For example, to setup a MI250X GPU programming environment:

$ module purge
$ # A CrayPE environment version
$ module load cpe/24.07
$ # An architecture
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ # A compiler to target the architecture
$ module load PrgEnv-cray
$ # Some architecture related libraries and tools
$ module load amd-mixed

You get a C/C++/Fortran compiler configured to compile for Trento CPUs and MI250X GPUs and automatically link with the appropriate Cray MPICH release, that is, if you use the Cray compiler wrappers.

Warning

If you get a warning such as "Load a valid targeting module or set CRAY_CPU_TARGET", it is probably because you did not load a craype-x86-<architecture> module.

Note

Try to always load, first, the CPU and GPU architecture modules (say, craype-x86-genoa for the GENOA partition and craype-x86-trento, craype-accel-amd-gfx90a for the MI250 partition), then the PrgEnv and the rest of your modules.
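For example, a hedged sketch of a GENOA (CPU only) environment following this order (mirroring the GENOA batch script given later in this document):

$ module purge
$ # A CrayPE environment version
$ module load cpe/24.07
$ # The CPU architecture
$ module load craype-x86-genoa
$ # A compiler to target the architecture
$ module load PrgEnv-cray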

Intra-process parallelization technologies

When you are not satisfied with high level tools such as the vendor optimized BLAS, you have the option to program the machine yourself. These technologies are harder to use and more error prone, but more versatile. Some technologies are given below; the list is obviously not complete.

We can define at least two classes of accelerator programming technologies: the ones based on directives (say, pragma omp parallel for) and the ones based on kernels. A kernel is a treatment, generally the inner loops or the body of the inner loops of what you would write in a serial code. The kernel is given data to transform and is explicitly mapped to the hardware compute units.

Note

NVHPC is Nvidia’s GPU software stack, ROCm is AMD’s GPU software stack (amd-mixed or PrgEnv-amd), CCE is part of CPE which is Cray’s CPU/GPU compiler toolchain (PrgEnv-cray), LLVM is your plain old LLVM toolchain, OneAPI is Intel’s new CPU/GPU Sycl based software stack (contains the DPC++, aka Sycl compiler).

For C/C++ codes

Directive based:

OpenACC v2
    Compiler support: GCC~ (AMD GPUs); NVHPC/GCC~ (Nvidia GPUs); NVHPC/GCC~ (x86 CPUs)
    Fine tuning: Low-medium
    Implementation complexity/maintainability: Low
    Community support/availability (expected longevity in years): Medium/high (+5 y)

OpenMP v5
    Compiler support: CCE/LLVM (AMD GPUs); NVHPC/CCE/LLVM (Nvidia GPUs); OneAPI (Intel GPUs); GCC/LLVM/NVHPC/CCE/OneAPI (x86 CPUs)
    Fine tuning: Low-medium
    Implementation complexity/maintainability: Low
    Community support/availability: High (+10 y)

Kernel based:

Sycl
    Compiler support: AdaptiveCPP/OneAPI (AMD, Nvidia and Intel GPUs, x86 CPUs)
    Fine tuning: High
    Implementation complexity/maintainability: Medium/high
    Community support/availability: High (+10 y)

CUDA/HIP
    Compiler support: LLVM/CCE (AMD GPUs); NVHPC/LLVM/CCE (Nvidia GPUs)
    Fine tuning: High
    Implementation complexity/maintainability: Medium/high
    Community support/availability: High (+10 y)

Kokkos
    Compiler support: LLVM/AdaptiveCPP/OneAPI/CCE (AMD GPUs); NVHPC/LLVM/AdaptiveCPP/OneAPI/CCE (Nvidia GPUs); AdaptiveCPP/OneAPI (Intel GPUs); All (x86 CPUs)
    Fine tuning: Medium/high
    Implementation complexity/maintainability: Low/medium
    Community support/availability: High (+10 y)

Sycl, the Khronos consortium’s successor to OpenCL, is quite complex, like its predecessor. Obviously, time will tell if it is worth investing in this technology, but there is a significant ongoing open standardization effort.

Kokkos in itself is not on the same level as OpenACC, OpenMP, CUDA/HIP or Sycl because it serves as an abstraction over all of these.

Note

Cray’s CCE, AMD’s ROCm, Intel’s OneAPI (intel-llvm) and LLVM’s Clang share the same front end (what reads the code). Most are just recompiled/extended versions of Clang, generally open source. Cray’s C/C++ compiler is a Clang compiler with a modified proprietary backend (code optimization and libraries such as the OpenMP backend implementation).

For Fortran codes

Directive based:

OpenACC v2
    Compiler support: CCE/LLVM~/GCC~ (AMD GPUs); NVHPC/CCE/LLVM~/GCC~ (Nvidia GPUs); NVHPC/CCE/LLVM~/GCC~ (x86 CPUs)
    Fine tuning: Low-medium
    Implementation complexity/maintainability: Low
    Community support/availability (expected longevity in years): Medium/High (+5 y)

OpenMP v5
    Compiler support: CCE/LLVM~/GCC~ (AMD GPUs); NVHPC/CCE/LLVM~/GCC~ (Nvidia GPUs); OneAPI (Intel GPUs); NVHPC/CCE/LLVM/GCC/OneAPI (x86 CPUs)
    Fine tuning: Low-medium
    Implementation complexity/maintainability: Low
    Community support/availability: High (+10 y)

Kernel based: no kernel based technology is listed for Fortran.

AMD - Here, means the AMD stack, be it the AOCC compiler or the ROCm toolchain.
Intel - Here, means the Intel stack, be it the ICC compiler or the OneAPI toolchain.

Some wrapper, preprocessor definitions, compiler and linker flags

A very thorough list of compiler flag meanings across different vendors is given in this document.

Flag conversion for Fortran programs

ifort -g | gfortran -g | crayftn -g
    Embed debug info into the binary. Useful for stack traces and GDB.

gfortran -Og | crayftn -eD
    Compile in debug mode. The crayftn option does a lot more than adding debug info though.

ifort -O1 | gfortran -O1 | crayftn -O1

ifort -O2 | gfortran -O2 | crayftn -O1

ifort -O3 | gfortran -O3 | crayftn -O2

ifort -fast | gfortran -Ofast | crayftn -O3

ifort -xHost | gfortran -march=native | crayftn -h cpu=<> (defined by the craype-* modules)
    Careful, these flags assume the machine on which you compile has CPUs similar to the ones on which your code runs.

ifort -integer-size 32 | crayftn -s integer32

ifort -integer-size 64 | gfortran -fdefault-integer-8 | crayftn -s integer64

ifort -real-size 64 | gfortran -fdefault-real-8 | crayftn -s real64

ifort -ftz | gfortran ieee_support_underflow_control/ieee_set_underflow_mode | crayftn ieee_support_underflow_control/ieee_set_underflow_mode
    Flush denormals To Zero. If well designed, your code should not be very sensitive to that. See the Fortran 2003 standard.

ifort -convert big_endian | gfortran -fconvert=big-endian

ifort -fpe0 | gfortran -ffpe-trap=invalid,zero,overflow | crayftn ~ -K trap=divz,inv,ovf
    For debug builds only.

-flto=thin | crayftn -hipa3
    Link Time Optimization (LTO), sometimes called InterProcedural Optimization (IPO) or IPA.

In case you use the GNU Fortran compiler and are subject to interface mismatch, use the -fallow-argument-mismatch flag. An interface mismatch, that is, when you pass arguments of different types to the same interface (subroutine) is not standard conforming Fortran code! Here is an excerpt of the GNU Fortran compiler manual: Some code contains calls to external procedures with mismatches between the calls and the procedure definition, or with mismatches between different calls. Such code is non-conforming, and will usually be flagged with an error. Using -fallow-argument-mismatch is strongly discouraged. It is possible to provide standard-conforming code which allows different types of arguments by using an explicit interface and TYPE(*).

Vectorizing for GCC and LLVM (clang) based compilers

To enable vectorization of multiply/add operations and transcendental functions, use -O3 -fno-math-errno -fno-trapping-math -ffp-contract=fast. Note that instructions may also be reordered ((a+b)+c may be rewritten to a+(b+c)).

Some LLVM details are given in this document.

Given this simple C++ code:

#include <cmath>

void square(double*a) {
    a[0] = std::sqrt(a[0]);
    a[1] = std::sqrt(a[1]);
    a[2] = std::sqrt(a[2]);
    a[3] = std::sqrt(a[3]);
}

Without the above flags one would get this horrible code:

square(double*):
        push    rbx
        mov     rbx, rdi
        vmovsd  xmm0, qword ptr [rdi]
        vxorpd  xmm1, xmm1, xmm1
        vucomisd        xmm0, xmm1
        jb      .LBB0_2
        vsqrtsd xmm0, xmm0, xmm0
        vmovsd  qword ptr [rbx], xmm0
        vmovsd  xmm0, qword ptr [rbx + 8]
        vucomisd        xmm0, xmm1
        jae     .LBB0_4
.LBB0_5:
        call    sqrt@PLT
        jmp     .LBB0_6
.LBB0_2:
        call    sqrt@PLT
        vxorpd  xmm1, xmm1, xmm1
        vmovsd  qword ptr [rbx], xmm0
        vmovsd  xmm0, qword ptr [rbx + 8]
        vucomisd        xmm0, xmm1
        jb      .LBB0_5
.LBB0_4:
        vsqrtsd xmm0, xmm0, xmm0
.LBB0_6:
        vmovsd  qword ptr [rbx + 8], xmm0
        vmovsd  xmm0, qword ptr [rbx + 16]
        vxorpd  xmm1, xmm1, xmm1
        vucomisd        xmm0, xmm1
        jb      .LBB0_8
        vsqrtsd xmm0, xmm0, xmm0
        vmovsd  qword ptr [rbx + 16], xmm0
        vmovsd  xmm0, qword ptr [rbx + 24]
        vucomisd        xmm0, xmm1
        jae     .LBB0_10
.LBB0_11:
        call    sqrt@PLT
        vmovsd  qword ptr [rbx + 24], xmm0
        pop     rbx
        ret
.LBB0_8:
        call    sqrt@PLT
        vxorpd  xmm1, xmm1, xmm1
        vmovsd  qword ptr [rbx + 16], xmm0
        vmovsd  xmm0, qword ptr [rbx + 24]
        vucomisd        xmm0, xmm1
        jb      .LBB0_11
.LBB0_10:
        vsqrtsd xmm0, xmm0, xmm0
        vmovsd  qword ptr [rbx + 24], xmm0
        pop     rbx
        ret

Properly vectorized it would look like so:

square(double*):
        vsqrtpd ymm0, ymmword ptr [rdi]
        vmovupd ymmword ptr [rdi], ymm0
        vzeroupper
        ret
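To reproduce this kind of comparison yourself, here is a minimal sketch (square.cpp being a hypothetical file containing the function above; the loaded craype-x86-* module already makes the wrapper pass the CPU target flags):

$ CC -O3 -S -o square_baseline.s square.cpp
$ CC -O3 -fno-math-errno -fno-trapping-math -ffp-contract=fast -S -o square_vectorized.s square.cpp
$ diff square_baseline.s square_vectorized.s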

Debugging with crayftn

Note

To flush the output stream (stdout) in a standard way, use the output_unit named constant from the ISO_Fortran_env module, e.g. flush(output_unit). This is useful when debugging using the classic print/comment approach.

-eD
    Enables all debugging options. This option is equivalent to specifying the -G0 option with the -m2, -rl, -R bcdsp, and -e0 options.

-e0
    Initializes all undefined local stack, static, and heap variables to 0 (zero). If a user variable is of type character, it is initialized to NUL. If logical, initialized to false. The stack variables are initialized upon each execution of the procedure. When used in combination with -ei, Real and Complex variables are initialized to signaling NaNs, while all other typed objects are initialized to 0. Objects in common blocks will be initialized if the common block is declared within a BLOCKDATA program unit compiled with this option.

-ei
    Initializes all undefined local stack, static, and heap variables of type REAL or COMPLEX to an invalid value (signaling NaN).

-en
    Generates messages to note nonstandard Fortran usage.

-hfp0=noapprox
    Controls the level of floating point optimizations (-hfp<n>, where n is a value between 0 and 4, with 0 giving the compiler minimum freedom to optimize floating point operations and 4 giving it maximum freedom). noapprox prevents rewrites of square root and divide expressions using hardware reciprocal approximations.

-hflex_mp=intolerant
    Has the highest probability of repeatable results, but also the highest performance penalty.

-hlist=m
    Produces a source listing with loopmark information. To provide a more complete report, this option automatically enables the -O negmsg option to show why loops were not optimized. If you do not require this information, use the -O nonegmsg option on the same command line. Loopmark information will not be displayed if the -d B option has been specified.

-hlist=a
    Includes all reports in the listing (including source, cross references, options, lint, loopmarks, common block, and options used during compilation).

-hbounds
    Enables bounds checking.

A typical set of debugging flags could be -eD -ei -en -hbounds -K trap=divz,inv,ovf.
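For instance, a minimal sketch of a debug build (main.f90 being a hypothetical source file):

$ module load PrgEnv-cray
$ ftn -eD -ei -en -hbounds -K trap=divz,inv,ovf -o my_program_debug main.f90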

crayftn also offers sanitizers which turn on runtime checks for various forms of undefined or suspicious behavior. This is an experimental feature (in CrayFTN 17). If a check fails, a diagnostic message is produced at runtime explaining the problem.

-fsanitize=address
    Enables a memory error detector.

-fsanitize=thread
    Enables a data race detector.

Further reading: man crayftn.

Debugging with gfortran

A typical set of debugging flags could be -O1 -g -fcheck=all -ffpe-trap=invalid,zero,overflow -fbacktrace, or -O1 -g -fcheck=all -ffpe-trap=invalid,zero,overflow -fbacktrace -finit-real=snan -finit-integer=42 -finit-logical=true -finit-character=0 (this set of options will silence -Wuninitialized).
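For instance, a minimal sketch (main.f90 being a hypothetical source file; with PrgEnv-gnu loaded, the ftn wrapper forwards to gfortran):

$ module load PrgEnv-gnu
$ ftn -O1 -g -fcheck=all -ffpe-trap=invalid,zero,overflow -fbacktrace -o my_program_debug main.f90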

Making the Cray wrappers spew their implicit flags

Assuming you have loaded an environment such as:

$ module purge
$ # A CrayPE environment version
$ module load cpe/24.07
$ # An architecture
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ # A compiler to target the architecture
$ module load PrgEnv-cray

The CC, cc and ftn Cray wrappers imply a lot of flags that you may want to retrieve. This can be done like so:

$ CC --cray-print-opts=cflags
-I/opt/cray/pe/libsci/24.07.0/CRAY/18.0/x86_64/include -I/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/include -I/opt/cray/pe/dsmml/0.3.0/dsmml/include
$ CC --cray-print-opts=libs
-L/opt/cray/pe/libsci/24.07.0/CRAY/18.0/x86_64/lib -L/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib -L/opt/cray/pe/mpich/8.1.30/gtl/lib -L/opt/cray/pe/dsmml/0.3.0/dsmml/lib -Wl,--as-needed,-lsci_cray_mpi,--no-as-needed -lmpi_gtl_hsa -Wl,--as-needed,-lsci_cray,--no-as-needed -ldl -Wl,--as-needed,-lmpi_cray,--no-as-needed -lmpi_gtl_hsa -Wl,--as-needed,-ldsmml,--no-as-needed -L/opt/cray/pe/cce/18.0.0/cce/x86_64/lib/pkgconfig/../ -Wl,--as-needed,-lstdc++,--no-as-needed -Wl,--as-needed,-lpgas-shmem,--no-as-needed -lfi -lquadmath -lmodules -lfi -lcraymath -lf -lu -lcsup

We observe the implied compile and link flags for Cray MPICH (the GTL is here too) and the LibSci. Had you loaded cray-hdf5 or some other Cray library module, it would have appeared in the commands’ output.

Warning

The libs option returns a list of linker flags containing instances of -Wl. This can create serious CMake confusion. For this reason, we recommend that you strip them away like so: CRAY_WRAPPER_LINK_FLAGS="$({ cc --cray-print-opts=libs; CC --cray-print-opts=libs; ftn --cray-print-opts=libs; } | tr '\n' ' ' | sed -e 's/-Wl,--as-needed,//g' -e 's/,--no-as-needed//g')".

Once you have extracted the flags for a given CPE version you can store them in a machine/toolchain file.

Say you use CMake, here is an example of what you could use the above for:

$ CRAY_WRAPPER_LINK_FLAGS="$({ cc --cray-print-opts=libs; CC --cray-print-opts=libs; ftn --cray-print-opts=libs; } | tr '\n' ' ' | sed -e 's/-Wl,--as-needed,//g' -e 's/,--no-as-needed//g')"
$ cmake \
    -DCMAKE_C_COMPILER=craycc \
    -DCMAKE_CXX_COMPILER=crayCC \
    -DCMAKE_Fortran_COMPILER=crayftn \
    -DCMAKE_C_FLAGS="$(cc --cray-print-opts=cflags)" \
    -DCMAKE_CXX_FLAGS="$(CC --cray-print-opts=cflags)" \
    -DCMAKE_Fortran_FLAGS="$(ftn --cray-print-opts=cflags)" \
    -DCMAKE_EXE_LINKER_FLAGS="${CRAY_WRAPPER_LINK_FLAGS}" \
    ..

Here we bypass all the Cray wrappers (C/C++ and Fortran) and give CMake all the flags the wrappers would have implicitly added. This is clearly the recommended way in case the wrappers cause you problems. We give multiple examples for compilers other than Cray in this document, for a build of Kokkos with a HIP and an OpenMP CPU backend; the build is done using the Cray, amdclang++ or hipcc drivers. The above is transposable to build systems/generators other than CMake.

Note

The Cray wrappers use -I and not -isystem, which is suboptimal for strict code using many warning flags (as it should).
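A hedged workaround is to rewrite the -I flags into -isystem flags before handing them to your build system, for instance:

$ CC --cray-print-opts=cflags | sed 's/-I\//-isystem \//g'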

Note

Use the -craype-verbose flag to display the command line produced by the Cray wrapper. This must be called on a file to see the full output (i.e., CC -craype-verbose test.cpp). You may also try the --verbose flag to ask the underlying compiler to show the command it itself launches.

crayftn optimization level details

Now we provide a list of the differences between the flags implicitly enabled when either -O1, -O2 or -O3 is used. Understand that -O3 under the crayftn compiler is very aggressive and could be said to at least equate to -Ofast under your typical Clang or GCC when it comes to floating point optimizations.

Warning

Cray reserves the right to change, for a new crayftn version, the options enabled through -On.

The options given below correspond to Cray Fortran : Version 15.0.1. They may differ for past and future versions.

-O1 provides:

-h scalar1,vector1,unroll2,fusion2,cache0,cblock0,noaggress
-h ipa1,mpi0,pattern,modinline
-h fp2=approx,flex_mp=default,alias=default:standard_restrict
-h fma
-h autoprefetch,noconcurrent,nooverindex,shortcircuit2
-h noadd_paren,nozeroinc,noheap_allocate
-h align_arrays,nocontiguous,nocontiguous_pointer
-h nocontiguous_assumed_shape
-h fortran_ptr_alias,fortran_ptr_overlap
-h thread1,nothread_do_concurrent,noautothread,safe_addr
-h noomp -f openmp-simd
-h caf,noacc
-h nofunc_trace,noomp_analyze,noomp_trace,nopat_trace
-h nobounds
-h nomsgs,nonegmsgs,novector_classic
-h dynamic
-h cpu=x86-64,x86-trento,network=slingshot10
-h nofp_trap -K trap=none
-s default32
-d 0abcdefgijnpvxzBDEGINPQSZ
-e hmqwACFKRTX

The discrepancies between -O1 and -O2 are:

-h scalar2,vector2
-h ipa3
-h thread2

The discrepancies between -O2 and -O3 or -Ofast are:

-h scalar3,vector3
-h ipa4
-h fp3=approx

AOCC flags

AMD gives a detailed description of the CPU optimization flags here: https://rocm.docs.amd.com/en/docs-5.5.1/reference/rocmcc/rocmcc.html#amd-optimizations-for-zen-architectures.

Understanding your compiler

GCC offers the following two flag combinations that allow you to dig deeper into the default choices made by the compiler for your architecture.

$ gcc -Q --help=target
$ # Works for clang too:
$ gcc -dM -E -march=znver3 - < /dev/null

Predefined preprocessor definitions

It can be useful to wrap code inside preprocessor control flow (ifdef). The definitions below can help choose a path for workaround code.

__INTEL_CLANG_COMPILER
    For the C/C++ languages, the compiler is Intel’s new Clang based compiler (icx/icpx).

__INTEL_COMPILER
    For the C/C++ languages, the compiler is Intel’s old compiler icc, or its new one.

__clang__=1
    For the C/C++ languages, the compiler is Clang or one of its downstream forks.

__GNUC__=1
    For the C/C++ and Fortran languages, the compiler is GNU or a compiler mimicking it.

_MSC_VER=1
    For the C/C++ languages, the compiler is Microsoft MSVC or a compiler mimicking it.

__cray__=1
    For the C/C++ languages, the compiler is Cray’s (a superset of Clang).

_CRAYFTN=1
    For the Fortran language, the compiler is Cray’s.

Advanced tips and flags and environment variable for debugging

See LLVM Optimization Remarks by Ofek Shilon for more details on what Clang can tell you about how it optimizes your code and what tools are available to process that information.

Note

The crayftn compiler does not provide an option to generate debug info without also lowering the optimization level.

Note

The crayftn compiler possesses an extremely powerful optimizer which does some of the most aggressive optimizations a compiler can afford to do. This means that, at high optimization levels, the optimizer will assume your code strictly complies with the standard. Any slight deviation from the standard can lead to significant issues in the code, from crashes to silent corruption. crayftn’s -O2 is considered stable, safe and comparable to the -O3 of other compilers. -hipa4 has led to issues in some codes. crayftn also has its share of internal bugs which can mess up your code too.

Job submission

SLURM is the workload manager used to interact with the compute nodes on Adastra. In the following subsections, the most commonly used SLURM commands for submitting, running, and monitoring jobs will be covered, but users are encouraged to visit the official documentation and man pages for more information. This section describes how to run programs on the Adastra compute nodes, including a brief overview of SLURM and also how to map processes and threads to CPU cores and GPUs.

The SLURM batch scheduler and job launcher

SLURM provides multiple ways of submitting and launching jobs on Adastra’s compute nodes: batch scripts, interactive, and single-command. The SLURM commands allowing these methods are shown in the table below and examples of their use can be found in the related subsections. Please note that regardless of the submission method used, the job will launch on compute nodes, with the first node in the allocation serving as head-node.

With SLURM, you first ask for resources (a number of nodes, GPUs, CPUs) and then you distribute these resources across your tasks.

sbatch
    Used to submit a batch script. The batch script can contain information on the amount of resources to allocate and how to distribute them. Options can be specified via the sbatch command flags or inside the script, at the top of the file, after the #SBATCH prefix. The sbatch options do not necessarily lead to the resource distribution per rank that you would expect (!). sbatch allocates, srun distributes. See Batch scripts for more details.

srun
    Used to run a parallel job (job step) on the resources allocated with sbatch or salloc. If necessary, srun will first create a resource allocation in which to run the parallel job(s).

salloc
    Used to allocate an interactive SLURM job allocation, where one or more job steps (i.e., srun commands) can then be launched on the allocated resources (i.e., nodes). See Interactive jobs for more details and the sketch below.
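For instance, a hedged sketch of an interactive allocation on the GENOA partition, followed by a job step (the values are illustrative):

$ salloc --account=<account_to_charge> --constraint=GENOA --nodes=1 --exclusive --time=1:00:00
$ # Once the allocation is granted, launch job steps on the allocated node:
$ srun --ntasks-per-node=24 --cpus-per-task=8 --threads-per-core=1 -- hostname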

Batch scripts

A batch script can be used to submit a job to run on the compute nodes at a later time (the modules used in the scripts below are given as an indication, you may not need them if you use PyTorch, TensorFlow or the CINES Spack modules). In this case, stdout and stderr will be written to files that can be opened after the job completes. Here is an example of a simple batch script for the GPU (MI250) partition:

 1#!/bin/bash
 2#SBATCH --account=<account_to_charge>
 3#SBATCH --job-name="<job_name>"
 4#SBATCH --constraint=MI250
 5#SBATCH --nodes=1
 6#SBATCH --exclusive
 7#SBATCH --time=1:00:00
 8
 9module purge
10
11# A CrayPE environment version
12module load cpe/24.07
13# An architecture
14module load craype-accel-amd-gfx90a craype-x86-trento
15# A compiler to target the architecture
16module load PrgEnv-cray
17# Some architecture related libraries and tools
18module load amd-mixed
19
20module list
21
22export MPICH_GPU_SUPPORT_ENABLED=1
23
24# export OMP_<ICV=XXX>
25
26srun --ntasks-per-node=8 --cpus-per-task=8 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>

Here is an example of a simple batch script for the GPU (MI300A) partition:

 1#!/bin/bash
 2#SBATCH --account=<account_to_charge>
 3#SBATCH --job-name="<job_name>"
 4#SBATCH --constraint=MI300
 5#SBATCH --nodes=1
 6#SBATCH --exclusive
 7#SBATCH --time=1:00:00
 8
 9module purge
10
11# A CrayPE environment version
12module load cpe/24.07
13# An architecture
14module load craype-accel-amd-gfx942 craype-x86-genoa
15# A compiler to target the architecture
16module load PrgEnv-cray
17# Some architecture related libraries and tools
18module load amd-mixed
19
20module list
21
22export MPICH_GPU_SUPPORT_ENABLED=1
23
24# export OMP_<ICV=XXX>
25
26srun --ntasks-per-node=4 --cpus-per-task=24 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>

Here is an example of a simple batch script for the CPU (GENOA) partition:

 1#!/bin/bash
 2#SBATCH --account=<account_to_charge>
 3#SBATCH --job-name="<job_name>"
 4#SBATCH --constraint=GENOA
 5#SBATCH --nodes=1
 6#SBATCH --exclusive
 7#SBATCH --time=1:00:00
 8
 9module purge
10
11# A CrayPE environment version
12module load cpe/24.07
13# An architecture
14module load craype-x86-genoa
15# A compiler to target the architecture
16module load PrgEnv-cray
17
18module list
19
20
21
22
23
24# export OMP_<ICV=XXX>
25
26srun --ntasks-per-node=24 --cpus-per-task=8 --threads-per-core=1 -- <executable> <arguments>

Assuming the file is called job.sh on the disk, you would launch it like so: sbatch job.sh.

Options encountered after the first non-comment line will not be read by SLURM. In the example scripts, the notable lines are:

Line 1
    Shell interpreter line.

Line 2
    GENCI/DARI project to charge. More on that below.

Line 3
    Job name.

Line 4
    Type of Adastra node requested (here, the GPU MI250/MI300 or CPU GENOA partition).

Line 5
    Number of compute nodes requested.

Line 6
    Ask SLURM to reserve whole nodes. If this is not wanted, see Shared mode vs exclusive mode.

Line 7
    Wall time requested (HH:MM:SS).

Lines 9-20
    Setup of the module environment, always starting with a purge.

Line 22
    (For the MI250/MI300 partition scripts) Enable GPU aware MPI. You can pass GPU buffers directly to the MPI APIs.

Line 24
    Potentially, set up some OpenMP environment variables.

Line 26
    Implicitly ask to use all of the nodes allocated, then distribute the work over the given number of tasks per node. We also specify how many cores each task is bound to, without Simultaneous Multithreading (SMT), and, on the GPU partitions, that each task is bound to the GPU closest to its cores.

The SLURM submission options are preceded by #SBATCH, making them appear as comments to the shell (since comments begin with #). SLURM will look for submission options from the first line through the first non-comment line. The mandatory SLURM flags are: the account identifier (also called project ID or project name, specified via --account=), more on that later; the type of node (via --constraint=); the maximal job runtime (via --time=); and the number of nodes (via --nodes=).

Some more advanced scripts are available in this document and this repository (though, the scripts of this repository are quite old).

Warning

A proper binding is often critical for HPC applications. We strongly recommend that you either make sure your binding is correct (say, using this tool hello_cpu_binding) or that you take a look at the binding scripts presented in Proper binding, why and how.

Note

The binding srun does is only able to restrict a rank to a set of hardware threads (process affinity towards hardware threads). It does not do what is called thread pinning/affinity. To exploit thread pinning, you may want to check OpenMP’s ${OMP_PROC_BIND} and ${OMP_PLACES} Internal Control Variables (ICVs)/environment variables. Bad thread pinning can be detrimental to performance. Check this document for more details.

The typical OpenMP ICVs used to prevent and diagnose thread affinity issues rely on the following environment variables:

# Logs the thread affinity so you can check that the rank to core/thread placement is correct.
export OMP_DISPLAY_AFFINITY=TRUE
export OMP_PROC_BIND=CLOSE
export OMP_PLACES=THREADS
# This should be redundant because srun already restrict the rank's CPU
# access.
export OMP_NUM_THREADS=<N>

Shared mode

In shared mode, you will have to share resources with other users. That means you will likely be using a node with someone else (and potentially, suffer some performance degradation).

Note

For both CPU and GPU partitions, we recommend 3 additional sbatch flags: --ntasks-per-node=, --cpus-per-task= and --threads-per-core=. These describe the “sub node” amount of resources you wish to consume.

Note

For the GPU partitions, you also have to specify --gpus-per-node=.

Here is an example of a simple batch script for the GPU (MI250) partition, with 2 tasks, 1 GPU and 8 threads (using 8 cores and no hyperthreading) per task:

#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=MI250
#SBATCH --nodes=1
# #SBATCH --exclusive # Shared !
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH --threads-per-core=1
#SBATCH --time=1:00:00

module purge

# A CrayPE environment version
module load cpe/24.07
# An architecture
module load craype-accel-amd-gfx90a craype-x86-trento
# A compiler to target the architecture
module load PrgEnv-cray
# Some architecture related libraries and tools
module load amd-mixed

module list

export MPICH_GPU_SUPPORT_ENABLED=1

# export OMP_<ICV=XXX>

srun --ntasks-per-node=2 --cpus-per-task=8 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>

Here is an example of a simple batch script for the CPU (GENOA) partition, with 20 tasks and 16 threads (using 8 cores and hyperthreading) per task:

#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=GENOA
#SBATCH --nodes=1
# #SBATCH --exclusive # Shared !
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=16 # 16 threads or 16/2 cores.
#SBATCH --threads-per-core=2
#SBATCH --time=1:00:00

module purge

# A CrayPE environment version
module load cpe/24.07
# An architecture
module load craype-x86-genoa
# A compiler to target the architecture
module load PrgEnv-cray

module list






# export OMP_<ICV=XXX>

srun --ntasks-per-node=20 --cpus-per-task=16 --threads-per-core=2 -- <executable> <arguments>

Common SLURM submission options

The table below summarizes commonly-used SLURM job submission options:

--account=<account_to_charge> or -A <account_to_charge>
    Account identifier (also called project ID) to use and charge for the compute resources consumption. More on that below.

--constraint=<node_type>
    Type of Adastra node. The accepted values are MI250, GENOA and HPDA. The first two values represent the two main partitions of Adastra.

--time=<maximum_duration> or -t <maximum_duration>
    Maximum duration as wall clock time HH:MM:SS.

--nodes=<number_of_nodes> or -N <number_of_nodes>
    Number of compute nodes.

--job-name="<job_name>" or -J <job_name>
    Name of the job.

--output=<file_name> or -o <file_name>
    Standard output file name.

--error=<file_name> or -e <file_name>
    Standard error file name.

For more information about these or other options, please see the sbatch man page.

Resource consumption and charging

French computing site resources are expressed in hours of use of a given resource type. For instance, at CINES, if you have been given 100’000 hours on Adastra’s MI250X partition, it means that you could use a single unit of MI250X resource for 100’000 hours. It also means that you could use 400 units of MI250X resource for 250 hours. The units are given below:

Computing resource

Unit description

Example

MI250X partition

2 GCD (GPU device) of an MI250X card, that is, a whole MI250X.

1 hour on a MI250X node (exclusive) = 4 MI250X hours.

MI300A partition

1 GPU device.

1 hour on a MI300A node (exclusive) = 4 MI300A hours.

GENOA partition

1 core (2 logical threads).

1 hour on a GENOA node (exclusive) = 192 GENOA core hours.

Warning

Due to a historical mistake, the eDARI website uses, for the MI250X partition, a unit corresponding to a whole MI250X instead of a GCD, which is half of an MI250X. If you ask for 50 MI250X hours on eDARI, you can, in practice, use 100 MI250X GCD hours.

The resources you consume have to be charged to a project. This document invokes the --account=<account_to_charge> SLURM flag multiple times. Before submitting a job, make sure you have set a valid <account_to_charge>. You can obtain the list of accounts you are attached to by running the myproject -l command. The account names you can charge are given on the last line of the command output (e.g.: Liste des projets de calcul associés au user someuser : ['bae1234', 'eat4567', 'afk8901']). More on myproject in the Login unique section.

We do not charge for HPDA resources.

In addition, the <constraint> in --constraint=<constraint> should be set to a proper value, as this SLURM flag describes the kind of resource you request and thus, what CINES will charge.

Note

To monitor your compute hours consumption, use the myproject --state [project] command or visit https://reser.cines.fr/.

Warning

Charging gets slightly more involved when you use the shared nodes.

Shared mode vs exclusive mode

Some nodes are reserved for what we call shared mode, which differs from the exclusive mode found in many of the batch scripts presented in this documentation (observe the --exclusive SLURM flag). The role of these shared nodes is the following: when a resource allocation, be it through salloc or sbatch, asks for less than what a whole node offers, your allocation will automatically be rerouted to the pool of shared nodes (the [genoa|mi250]-shared SLURM partitions). On these, you may have to share the node with other users. That said, we maintain isolation of the resources: one job cannot access the resources (GPUs/cores/memory) allocated by another job (or another user), even if both jobs reside on the same node. For now, the shared mechanism is available on some nodes of both the MI250 and GENOA partitions and active on the whole HPDA partition.

On the GENOA and MI250 partitions, the smallest number of cores you can get charged for is 8. This is so that we can map an allocation onto a hardware resource (an L3 cache or a NUMA node), limiting the impact on users sharing the same node.

Note

Conceptually, if you ask for, say, 2/4 of a node’s CPU cores and for 3/4 of a node’s memory, we will charge you for 3/4 of a node’s CPU cores. Note that, in this case, you will not have access to 3/4 of the CPU cores even though we charge for them. To specify the amount of memory per node, use --mem=<N>G where <N> is an amount in GiB. The same logic applies to GPUs.
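
As an illustration of this rule, the hypothetical request below asks for only 8 cores but for half of a GENOA node’s memory:

$ salloc --account=<account_to_charge> --job-name="interactive" --constraint=GENOA \
      --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --threads-per-core=1 \
      --mem=372G --time=1:00:00

Since 372 GiB is about half of the ~744 GiB of a GENOA node, roughly 96 of the 192 cores would be charged, even though only 8 cores are usable by the job (see the Shared mode charging formula section below).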

Let’s have a practical example: say you want to schedule a very small job running 2 tasks, each using 8 cores of the GENOA partition. You could reserve a whole node via --exclusive, but CINES would then charge you for all 192 cores of the node, effectively wasting 176 cores; instead, to benefit from the shared nodes, do not use the --exclusive flag:

$ salloc --account=<account_to_charge> --job-name="interactive" --constraint=GENOA --nodes=1 --ntasks-per-node=2 --cpus-per-task=8 --time=1:00:00
salloc: INFO : Considering its requirements, this job is treated in SHARED mode.
salloc: INFO : We cannot guarantee the performance reproducibility of such small jobs in this mode,
salloc: INFO : but they are only charged for the needed resources.
salloc:
salloc: INFO : As you didn't ask threads_per_core in your request: 2 was taken as default
salloc: INFO : This job requests 8 cores. Due to shared usage this job will be charged for 8 cores (of the 192 cores of the node)
salloc: Pending job allocation 41328
salloc: job 41328 queued and waiting for resources
salloc: job 41328 has been allocated resources
salloc: Granted job allocation 41328

Note the messages written to your shell. We see that 8 cores will be allocated and charged, instead of 192. Also, the shared mechanism informs us that we will have 2 threads per core (called hyperthreading or SMT), for a total of 16 threads available to your program. This may not be what one wants; indeed, cores and threads are different! In the textual description above, we wanted 8 cores per rank, not 8 threads. We thus have to add the --threads-per-core=1 SLURM flag, giving:

$ salloc --account=<account_to_charge> --job-name="interactive" --constraint=GENOA --nodes=1 --ntasks-per-node=2 --cpus-per-task=8 --time=1:00:00 --threads-per-core=1
salloc: INFO : Considering its requirements, this job is treated in SHARED mode.
salloc: INFO : We cannot guarantee the performance reproducibility of such small jobs in this mode,
salloc: INFO : but they are only charged for the needed resources.
salloc:
salloc: INFO : This job requests 16 cores. Due to shared usage this job will be charged for 16 cores (of the 192 cores of the node)
salloc: Pending job allocation 41342
salloc: job 41342 queued and waiting for resources
salloc: job 41342 has been allocated resources
salloc: Granted job allocation 41342

This time, we observe that 16 cores are allocated, corresponding to the number of tasks per node times the number of threads (one per core) per rank.

If we now seek to use hyperthreading (SMT), it would look like so:

$ salloc --account=<account_to_charge> --job-name="interactive" --constraint=GENOA --nodes=1 --ntasks-per-node=2 --cpus-per-task=16 --time=1:00:00 --threads-per-core=2
salloc: INFO : Considering its requirements, this job is treated in SHARED mode.
salloc: INFO : We cannot guarantee the performance reproducibility of such small jobs in this mode,
salloc: INFO : but they are only charged for the needed resources.
salloc:
salloc: INFO : This job requests 16 cores. Due to shared usage this job will be charged for 16 cores (of the 192 cores of the node)
salloc: Pending job allocation 41342
salloc: job 41342 queued and waiting for resources
salloc: job 41342 has been allocated resources
salloc: Granted job allocation 41342

Note that we doubled the number of threads per core and per rank (--threads-per-core=2 and --cpus-per-task=16).

Finally, understand that we recommend that you prepare your work so that it uses full nodes (exclusive mode). Then, all the charging complexity goes away:

$ salloc --account=<account_to_charge> --job-name="interactive" --constraint=GENOA --nodes=1 --ntasks-per-node=2 --cpus-per-task=8 --time=1:00:00 --exclusive
salloc: Pending job allocation 41382
salloc: job 41382 queued and waiting for resources
salloc: job 41382 has been allocated resources
salloc: Granted job allocation 41382

You can find ways to pack small amounts of work into one large node allocation in this document.

Warning

If a shared partition is completely exhausted, your job will stay pending. If you switch to exclusive mode, your job may start earlier because the pool of non-shared nodes is less used. This differs from, say, IDRIS’ JeanZay machine where all the GPU nodes are shared. We assume HPC is synonymous with large jobs, spanning and scaling over multiple nodes. The shared mode is meant for debugging purposes, codes that do not scale and computations of short duration (some script or post-processing). If you run many small jobs that can, put together, fill a whole node, you should use a whole node, not a shared one; check this document to learn how.

Shared mode charging formula

On the GENOA partition, the memory and core charging is roughly computed as follows (a small shell sketch of this computation is shown after the definitions below):

CORES_CHARGED = CEIL_TO_MULTIPLE[ MAX((MEMORY_PER_NODE_ASKED/MEMORY_PER_NODE)*CORE_PER_NODE; CORE_PER_NODE_ASKED) ; CORE_PHYSICAL_BOUNDARY]

With:
    MEMORY_PER_NODE_ASKED and CORE_PER_NODE_ASKED being the amounts of resources you wish to reserve.
    CORE_PHYSICAL_BOUNDARY=8 being a group of cores we do not wish to split (an L3 cache).
    MEMORY_PER_NODE~=744 (GiB) and CORE_PER_NODE=192 being the amounts of resources a node has.
    CEIL_TO_MULTIPLE[X; N] = ROUND_AWAY_FROM_ZERO(X/N)*N
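
The shell sketch below (a hypothetical genoa_shared_charge.sh helper, not an official tool) mirrors this formula; the actual accounting is performed by the scheduler:

#!/bin/bash
# Illustrative sketch of the GENOA shared-mode charging rule described above.
CORE_PER_NODE=192
MEMORY_PER_NODE=744        # GiB, approximate.
CORE_PHYSICAL_BOUNDARY=8   # Group of cores (an L3 cache) we do not split.

MEMORY_PER_NODE_ASKED=${1:-0}   # In GiB.
CORE_PER_NODE_ASKED=${2:-0}

# Integer approximation of (MEMORY_PER_NODE_ASKED/MEMORY_PER_NODE)*CORE_PER_NODE, rounded up.
CORES_FROM_MEMORY=$(((MEMORY_PER_NODE_ASKED * CORE_PER_NODE + MEMORY_PER_NODE - 1) / MEMORY_PER_NODE))
# MAX(...; CORE_PER_NODE_ASKED).
CORES=$((CORES_FROM_MEMORY > CORE_PER_NODE_ASKED ? CORES_FROM_MEMORY : CORE_PER_NODE_ASKED))
# CEIL_TO_MULTIPLE[CORES; CORE_PHYSICAL_BOUNDARY].
CORES_CHARGED=$(((CORES + CORE_PHYSICAL_BOUNDARY - 1) / CORE_PHYSICAL_BOUNDARY * CORE_PHYSICAL_BOUNDARY))

echo "Charged for ${CORES_CHARGED} cores."

For example, bash genoa_shared_charge.sh 100 16 reports 32 charged cores: MAX(100/744*192 ≈ 26; 16), rounded up to the next multiple of 8.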

In addition to the cores and the memory, the GPU nodes offer, well, GPUs. On these nodes, we charge you GPU hours, not CPU hours. See the Resource consumption and charging section.

On the MI250 partition, this means that we use the following formula instead, giving the number of GPUs to charge for (again, a shell sketch follows the definitions below):

GPUS_CHARGED = CEIL_TO_MULTIPLE[ MAX((MEMORY_PER_NODE_ASKED/MEMORY_PER_NODE)*CORE_PER_NODE; CORE_PER_NODE_ASKED; GPU_PER_NODE_ASKED*CORE_PHYSICAL_BOUNDARY) ; CORE_PHYSICAL_BOUNDARY] / CORE_PHYSICAL_BOUNDARY

With:
    MEMORY_PER_NODE_ASKED, CORE_PER_NODE_ASKED and GPU_PER_NODE_ASKED being the amounts of resources you wish to reserve.
    CORE_PHYSICAL_BOUNDARY=8 being a group of cores we do not wish to split (a GPU NUMA domain).
    MEMORY_PER_NODE~=232 (GiB) and CORE_PER_NODE=64 being the amounts of resources a node has.
    CEIL_TO_MULTIPLE[X; N] = ROUND_AWAY_FROM_ZERO(X/N)*N
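
A similar sketch (again hypothetical, here named mi250_shared_charge.sh) for the MI250 rule:

#!/bin/bash
# Illustrative sketch of the MI250 shared-mode charging rule described above.
CORE_PER_NODE=64
MEMORY_PER_NODE=232        # GiB, approximate.
CORE_PHYSICAL_BOUNDARY=8   # 8 cores per GCD (GPU NUMA domain).

MEMORY_PER_NODE_ASKED=${1:-0}   # In GiB.
CORE_PER_NODE_ASKED=${2:-0}
GPU_PER_NODE_ASKED=${3:-0}      # In GCDs.

CORES_FROM_MEMORY=$(((MEMORY_PER_NODE_ASKED * CORE_PER_NODE + MEMORY_PER_NODE - 1) / MEMORY_PER_NODE))
CORES_FROM_GPUS=$((GPU_PER_NODE_ASKED * CORE_PHYSICAL_BOUNDARY))

# MAX of the three terms.
CORES=${CORE_PER_NODE_ASKED}
((CORES_FROM_MEMORY > CORES)) && CORES=${CORES_FROM_MEMORY}
((CORES_FROM_GPUS > CORES)) && CORES=${CORES_FROM_GPUS}

# CEIL_TO_MULTIPLE[...; CORE_PHYSICAL_BOUNDARY] / CORE_PHYSICAL_BOUNDARY, i.e., a ceiling division.
GPUS_CHARGED=$(((CORES + CORE_PHYSICAL_BOUNDARY - 1) / CORE_PHYSICAL_BOUNDARY))

echo "Charged for ${GPUS_CHARGED} GCD(s)."

For example, bash mi250_shared_charge.sh 0 15 1 reports 2 charged GCDs, which matches the note below.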

Note

We allocate MI250X GCDs on the basis of 8 cores per GCD. So if you ask for 15 cores, we’ll charge you for 2 GCDs.

Now that we have the amount of resources (GPUs or CPU cores), we multiply it by the duration it is allocated for. For example, 8 cores allocated for 2 hours amounts to 16 core hours. If your job finishes early, we do not charge for the unused time.

Quality Of Service (QoS) queues

CINES respects the SLURM scheduler fair share constraints described by GENCI and common to CINES, IDRIS and TGCC.

On Adastra, queues are transparent: CINES does not publicize the QoS. The user should not try to specify anything related to that subject (such as --qos=). The SLURM scheduler will automatically place your job in the right QoS depending on the duration and amount of resources asked for.

Queue priority rules are harmonized between the 3 computing centres (CINES, IDRIS and TGCC). A higher priority is given to large jobs, as Adastra is primarily dedicated to running large HPC jobs. The SLURM fair-share mechanism is active: assuming a linear consumption of the allocated hours over the allocation period, a user who is above that line will have a lower priority than a user who is below it. We may artificially lower a user’s priority if we notice bad practices (such as launching thousands of small jobs on an HPC machine). Priorities are calculated over a sliding window of one week. With a little patience, your job will eventually be processed.

The best advice we can give you is to correctly size your jobs. First, check which node configuration works best: adjust the number of MPI ranks, the number of OpenMP threads and the binding on a single node. Then do some scaling tests. Finally, do not specify a SLURM --time argument larger than what you really need; this is the most common scheduler misconfiguration on the user’s side.

srun

The default job launcher for Adastra is srun. The srun command is used to execute an MPI-enabled binary on one or more compute nodes in parallel. It is responsible for distributing the resources allocated by a salloc or sbatch command onto the MPI ranks.

$ # srun [OPTIONS... [executable [args...]]]
$ srun --ntasks-per-node=24 --cpus-per-task=8 --threads-per-core=1 -- <executable> <arguments>
<output printed to terminal>

The output options have been removed since stdout and stderr are typically desired in the terminal window in this usage mode.

srun accepts the following common options:

-N, --nodes

Number of nodes

-n, --ntasks

Total number of MPI tasks (default is 1).

-c, --cpus-per-task=<ncpus>


Logical cores per MPI task (default is 1).
When used with --threads-per-core=1: -c is equivalent to physical cores per task.
We do not advise that you use this option when using --cpu-bind=none.
--cpu-bind=threads

Bind tasks to CPUs.
threads - (default, recommended) Automatically generate masks binding tasks to threads.
--threads-per-core=<threads>



In task layout, use the specified maximum number of hardware threads per core.
(default is 2; there are 2 hardware threads per physical CPU core).
Must also be set in salloc or sbatch if using --threads-per-core=2 in your srun command.
--threads-per-core= should always be used instead of --hint=nomultithread or --hint=multithread.

--kill-on-bad-exit=1

Terminate the whole step as soon as one of its processes fails (exits with a non-zero exit code).

-m, --distribution=<value>:<value>:<value>


Specifies the distribution of MPI ranks across compute nodes, sockets (L3 regions), and cores, respectively.
The default values are block:cyclic:cyclic, see man srun for more information.
Currently, the distribution setting for cores (the third <value> entry) has no effect on Adastra.
--ntasks-per-node=<ntasks>

If used without -n: requests that a specific number of tasks be invoked on each node.
If used with -n: treated as a maximum count of tasks per node.

--gpus

Specify the number of GPUs required for the job (total GPUs across all nodes).

--gpus-per-node

Specify the number of GPUs per node required for the job.

--gpu-bind=closest

Binds each task to the GPU which is on the same NUMA domain as the CPU core the MPI rank is running on.
See the --gpu-bind=closest example in Proper binding, why and how for more details.
--gpu-bind=map_gpu:<list>





Bind tasks to specific GPUs by setting GPU masks on tasks (or ranks) as specified, where
<list> is <gpu_id_for_task_0>,<gpu_id_for_task_1>,.... If the number of tasks (or
ranks) exceeds the number of elements in this list, elements in the list will be reused as
needed, starting from the beginning of the list. To simplify support for large task
counts, the list may include an asterisk followed by a repetition count (for example
map_gpu:0*4,1*4).

--ntasks-per-gpu=<ntasks>

Request that there are ntasks tasks invoked for every GPU.

--label

Prefix every line written to stderr or stdout with <rank index>:, where <rank index> starts at zero
and matches the MPI rank of the writing process.

Interactive jobs

Most users will find batch jobs an easy way to use the system. Indeed, they allow the user to hand off a job to the scheduler, allowing the user to focus on other tasks while the job waits in the queue and eventually runs. Occasionally, it is necessary to run interactively, especially when developing, testing, modifying or debugging a code.

Since all compute resources are managed and scheduled by SLURM, it is not possible to simply log into the system and immediately begin running parallel codes interactively. Rather, you must request the appropriate resources from SLURM and, if necessary, wait for them to become available. This is done through an interactive batch job. Interactive batch jobs are submitted with the salloc command. Resources are requested via the same options that are passed via #SBATCH in a regular batch script (but without the #SBATCH prefix). For example, to request an interactive batch job with MI250 resources, you would use salloc --account=<account_to_charge> --constraint=MI250 --job-name="<job_name>" --nodes=1 --time=1:00:00 --exclusive. Note that there is no option for an output file: you are running interactively, so standard output and standard error are displayed in the terminal.

You can then run the command you would generally put in the batch script: srun --ntasks-per-node=2 --cpus-per-task=8 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>.

If you want to connect to the node, you can directly ssh to it, assuming you have allocated it.

You can also start a shell environment as a SLURM step (which on some machines is the only way to get interactive node access): srun --pty -- "${SHELL}".
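
Putting it together, a typical interactive session could look like the following sketch (only the commands are shown; the salloc messages are omitted):

$ salloc --account=<account_to_charge> --job-name="<job_name>" --constraint=MI250 --nodes=1 --time=1:00:00 --exclusive
$ # From the shell spawned by salloc, launch job steps as you would in a batch script:
$ srun --ntasks-per-node=2 --cpus-per-task=8 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>
$ # Or open a shell on the first allocated compute node:
$ srun --pty -- "${SHELL}"

Exiting the shell spawned by salloc (exit or Ctrl+D) releases the allocation before the time limit expires.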

Small job

Allocating a single GPU

The command below will allocate 1 GPU and 8 cores (no SMT) for 60 minutes.

$ srun \
      --account=<account_to_charge> \
      --constraint=MI250 \
      --nodes=1 \
      --time=1:00:00 \
      --gpus-per-node=1 \
      --ntasks-per-node=1 \
      --cpus-per-task=8 \
      --threads-per-core=1 \
      -- <executable> <arguments>

Note

This is more of a hack than a serious usage of SLURM concepts or of HPC resources.

Packing

Note

We strongly advise that you get familiar with Adastra’s SLURM queuing concepts.

If your workflow consists of many small jobs, you may rely on the shared mode. That said, if you run many small jobs that can, put together, fill a whole node, you should use a whole node, not a shared one. This may shorten your queue time, as we have, and want to keep, a small pool of shared nodes.

This is how we propose you use a whole node:

#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=GENOA
#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --time=1:00:00

set -eu
set -x

# How many runs your logic needs.
STEP_COUNT=128

# Due to the parallel nature of the SLURM steps launched below, we need a
# way to properly log each one of them. See the:
# 2>&1 | tee "StepLogs/${SLURM_JOBID}.${I}"
mkdir -p StepLogs

for ((I = 0; I < STEP_COUNT; I++)); do
    srun --exclusive --nodes=2 --ntasks-per-node=3 --cpus-per-task="4" --threads-per-core=1 --label \
        -- ./work.sh 2>&1 | tee "StepLogs/${SLURM_JOBID}.${I}" &
done

# We started STEP_COUNT steps AKA srun processes, wait for them.
wait

In the script above, all the steps are initiated at once, but each step only starts when enough resources are available within the allocation (here we asked for 4 nodes). work.sh represents your workload; it will be executed STEP_COUNT*nodes*ntasks-per-node=128*2*3=768 times, each instance using 4 cores. SLURM automatically fills the allocated resources (here, 4 nodes), queuing and starting the steps as needed.
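
As an illustration, a minimal work.sh could look like the sketch below (the echo line is a placeholder for your actual computation). Do not forget to make the script executable (chmod +x work.sh).

#!/bin/bash
# Hypothetical per-task workload used by the packing example above.
# ${SLURM_STEP_ID} identifies the step and ${SLURM_PROCID} the rank within it.
echo "Step ${SLURM_STEP_ID:-?}, rank ${SLURM_PROCID:-?} running on $(hostname)"
# Your actual computation goes here; it can use the 4 cores allocated to this task.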

Chained job

SLURM offers a feature allowing the user to chain jobs. The user can, in fact, define a dependency graph of the jobs.

As an example, we want to start a job represented by my_first_job.sh and another job, my_second_job.sh, which should start only when my_first_job.sh has finished (the chain can be extended further, as shown with my_other_job.sh):

$ sbatch my_first_job.sh
Submitted batch job 189562
$ sbatch --dependency=afterok:189562 my_second_job.sh
Submitted batch job 189563
$ sbatch --dependency=afterok:189563 my_other_job.sh
Submitted batch job 189564

In this example we use the afterok trigger, meaning that a dependent job starts only if its parent job ends successfully (exit code 0).

You will then see something like this in squeue:

$ squeue --me
JOBID  PARTITION NAME USER ST TIME NODES NODELIST(REASON)
189562 mi250     test bae  R  0:04 1     g1057
189563 mi250     test bae  PD 0:00 1     (Dependency)
189564 mi250     test bae  PD 0:00 1     (Dependency)

Note the Dependency status.

You can replace afterok by after, afterany, afternotok or singleton. More information here: https://slurm.schedmd.com/sbatch.html#OPT_dependency

Job array

Warning

If you launch job arrays, ensure that they do not contain more than 128 jobs or you will get an error related to AssocMaxSubmitJobLimit.
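
As a syntax reminder, the sketch below submits a 128-element array (the maximum allowed, see the warning above); the input file naming scheme and the resource amounts are purely illustrative:

#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=GENOA
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --threads-per-core=1
#SBATCH --time=1:00:00
#SBATCH --array=0-127   # 128 elements at most.

# Each array element receives its own index in ${SLURM_ARRAY_TASK_ID}.
srun --ntasks-per-node=1 --cpus-per-task=8 --threads-per-core=1 \
    -- <executable> "input_${SLURM_ARRAY_TASK_ID}.dat"

If your array elements are small enough to fit several per node, prefer the packing approach presented above.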

Other common SLURM commands

The table below summarizes commonly-used SLURM commands:

sinfo

Used to view partition and node information.
e.g., to view user-defined details about the batch queue:
sinfo -p batch -o "%15N %10D %10P %10a %10c %10z"

squeue

Used to view job and job step information for jobs in the scheduling queue.
e.g., to see your own jobs:
squeue -l --me

sacct

Used to view accounting data for jobs and job steps in the job accounting log (currently in the queue or recently completed).
e.g., to see specified information about all jobs submitted/run by a user since 1 PM on January 4, 2023:
sacct -u <login> -S 2023-01-04T13:00:00 -o "jobid%5,jobname%25,user%15,nodelist%20" -X

scancel

Used to signal or cancel jobs or job steps.
e.g., to cancel a job:
scancel <job_id>

We describe some of the usage of these commands below in Monitoring and modifying batch jobs.

Job state

A job will transition through several states during its lifetime. Common ones include:

State code

State

Description

CA

Canceled

The job was canceled (could’ve been by the user or an administrator).

CD

Completed

The job completed successfully (exit code 0).

CG

Completing

The job is in the process of completing (some processes may still be running).

PD

Pending

The job is waiting for resources to be allocated.

R

Running

The job is currently running.

Job reason codes

In addition to state codes, jobs that are pending will have a reason code to explain why the job is pending. Completed jobs will have a reason describing how the job ended. Some codes you might see include:

Reason

Meaning

Dependency

Job has dependencies that have not been met.

JobHeldUser

Job is held at user’s request.

JobHeldAdmin

Job is held at system administrator’s request.

Priority

Other jobs with higher priority exist for the partition/reservation.

Reservation

The job is waiting for its reservation to become available.

AssocMaxJobsLimit

The job is being held because the user/project has hit the limit on running jobs.

AssocMaxSubmitJobLimit

The limit on the number of jobs a user is allowed to have running or pending at a given time has been met for the requested association (array).

ReqNodeNotAvail

The user requested a particular node, but it is currently unavailable (it is in use, reserved, down, draining, etc.).

JobLaunchFailure

Job failed to launch (could be due to system problems, an invalid program name, etc.).

NonZeroExitCode

The job exited with some code other than 0.

Many other states and job reason codes exist. For a more complete description, see the squeue man page (either on the system or online).

More reasons are given in the official SLURM documentation.

Monitoring and modifying batch jobs

scancel: Cancel or signal a job

SLURM allows you to signal a job with the scancel command. Typically, this is used to remove a job from the queue. In this use case, you do not need to specify a signal and can simply provide the jobid. For example, scancel 12345.

In addition to removing a job from the queue, the command gives you the ability to send other signals to the job with the -s option. For example, if you want to send SIGUSR1 to a job, you would use scancel -s USR1 12345.

squeue: View the job queue

The squeue command is used to show the batch queue. You can filter the level of detail through several command-line options. For example:

squeue --long

Show all jobs currently in the queue.

squeue --long --me

Show all of your jobs currently in the queue.

squeue --me --start

Show all of your jobs that have yet to start and show their expected start time.

sacct: Get job accounting information

The sacct command gives detailed information about jobs currently in the queue and recently-completed jobs. You can also use it to see the various steps within a batch job.

sacct -a -X

Show all jobs (-a) in the queue, but summarize the whole allocation instead of showing individual steps (-X).

sacct -u ${USER}

Show all of your jobs, and show the individual steps (since there was no -X option).

sacct -j 12345

Show all job steps that are part of job 12345.

sacct -u ${USER} -S 2022-07-01T13:00:00 -o "jobid%5,jobname%25,nodelist%20" -X

Show all of your jobs since 1 PM on July 1, 2022 using a particular output format.

scontrol show job: Get Detailed Job Information

In addition to holding, releasing, and updating the job, the scontrol command can show detailed job information via the show job subcommand. For example, scontrol show job 12345.

Note

scontrol show job can only report information on a job that is in the queue, that is, pending or running (among other states). A finished job is no longer in the queue and cannot be queried with scontrol show job.

Obtaining the energy consumption of a job

On Adastra, users can monitor the energy their jobs consume.

$ sacct --format=JobID,ElapsedRaw,ConsumedEnergyRaw,NodeList --jobs=<job_id>
JobID          ElapsedRaw ConsumedEnergyRaw        NodeList
-------------- ---------- ----------------- ---------------
<job_id>              104          12934230 c[1000-1003,10+
<job_id>.batch        104             58961           c1000
<job_id>.0             85          12934230 c[1000-1003,10+

The user obtains, for a given <job_id>, the elapsed time in seconds and the energy consumption in joules for the whole job, for the execution of the batch script and for each job step. The job steps are suffixed with \.[0-9]+ (in regex form).

Each time you execute the srun command in a batch script, it creates a new job step. Here, there is only one srun step, which took 85 seconds and 12934230 joules.

Note

The duration of the step as reported by SLURM is not reliable for a short step. There may be an additional ~10 seconds.

Note

You will only get meaningful values regarding a job step once the job step has ended.

Note

The energy returned represents the aggregated node consumption. We do not include the network and storage costs, as these are trickier to obtain and represent a near fixed cost anyway (that is, whether or not you run your code).

Note

Some compute nodes may not return an energy consumed value. This leads to a value of 0 or an empty field under ConsumedEnergyRaw. To work around the issue, one can use the following command: scontrol show node | grep -e "CurrentWatts=n/s" -e "CurrentWatts=0" -B15 | grep "NodeName=" | cut -d '=' -f 2 | awk '{print $1}' | tr '\n' ',' and feed the result to the SLURM commands’ --exclude= option. For instance: sbatch --exclude="$(scontrol show node | grep -e "CurrentWatts=n/s" -e "CurrentWatts=0" -B15 | grep "NodeName=" | cut -d '=' -f 2 | awk '{print $1}' | tr '\n' ',')" job.sh.

Note

The counters SLURM uses to compute the energy consumption are visible in the following files: /sys/cray/pm_counters/*.
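
For instance, assuming the usual Cray pm_counters layout (where an energy file exposes the accumulated node energy in joules), you can read these counters from a job step running on a compute node:

$ srun --ntasks-per-node=1 -- ls /sys/cray/pm_counters/
$ srun --ntasks-per-node=1 -- cat /sys/cray/pm_counters/energy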

Coredump files

If you start a program through our batch scheduler (SLURM), and if your program crashes, you will find your coredump files in the ${SCRATCHDIR}/COREDIR/<job_id>/<hostname> directory. ${SCRATCHDIR} corresponds to the scratch directory associated with your user and the project specified in the #SBATCH --account=<account_to_charge> batch script option. The files are stored in different folders depending on the <job_id>. Additionally, if your job ran on multiple nodes, it is useful to be able to tell which coredump file originates from which node; thus, the <hostname> of the node is part of the coredump file path.

The coredump filename has the following structure: core_<signal>_<timestamp>_<process_name>.dump.<process_identifier> (the equivalent core pattern being core_%s_%t_%e.dump). As an example, you could get a coredump filename such as:

core_11_1693986879_segfault_testco.dump.2745754

You can then exploit a coredump file by using tools such as GDB like so:

$ gdb ./path/to/program.file ./path/to/coredump.file

You can find more information on GDB and coredump files here.
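
Once GDB has loaded the coredump, a few standard GDB commands help locate the crash:

(gdb) bt             # Backtrace of the thread that received the signal.
(gdb) info threads   # List the threads captured in the core.
(gdb) thread 2       # Switch to another thread, then run bt again.
(gdb) frame 1        # Select a frame of the current backtrace.
(gdb) info locals    # Print the local variables of the selected frame.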

Warning

Be careful not to fill your scratch space quota with coredump files, notably if you run a large job that crashes.

Note

On Adastra, ulimit -c unlimited is the default. The coredump placement to scratch works on the HPDA, MI250 and GENOA partitions. To deactivate the core dumping, run the following command in, say, your batch script: ulimit -c 0.

Note

Use gcore <pid> to explicitly generate a core file of a running program.

Warning

For the placement of the coredumps on the scratch to work, one needs to use either a batch script or the salloc + srun commands. Simply allocating (salloc) and ssh-ing to the node will not properly configure the coredump placement mechanism. Also, one needs to request nodes in exclusive mode for the placement to work (it will not work in shared mode).