Accessing Adastra
This document is a quick start guide for the Adastra machine. You can find additional information on GENCI’s website and in this booklet.
Account opening
To access Adastra you need an account on the Demande d’Attribution de Ressources Informatique (DARI) website. Then, on eDARI, you need to ask to be associated with a research project that has been attributed Adastra compute hours. Following that, you can ask on eDARI for your personal account to be created on the machine (Adastra in this context). You will have to fill in a form which, to be valid, needs to be dated and electronically signed by the three parties below:
The person who made the request;
the user’s security representative (often related to their laboratory);
the laboratory director.
You will then receive, via email, the instructions containing your credentials.
Connecting
To connect to Adastra, ssh to adastra.cines.fr:
$ ssh <login>@adastra.cines.fr
Warning
Authenticating to Adastra using ssh keys is not permitted. You will have to enter your password.
To connect to a specific login node, use:
$ ssh <login>@adastra<login_node_number>.cines.fr
Where <login_node_number> represents an integer login node identifier. For instance, ssh anusername@adastra5.cines.fr will connect you to login node number 5.
X11 forwarding
Automatic forwarding of the X11 display to a remote computer is possible with the use of SSH and a local (i.e., on your desktop) X server. To set up automatic X11 forwarding within SSH, you can do one of the following:
Invoke ssh with -X:
$ ssh -X <login>@adastra.cines.fr
Note that use of the -x flag (lowercase) will disable X11 forwarding. Users should not manually set the ${DISPLAY} environment variable for X11 forwarding.
Warning
If you have issues when launching a GUI application, make sure this is not related to the .Xauthority file. If it is, or you are not sure, check out the .Xauthority file document.
Login unique
The login unique (in English: single sign-on or unique login) is a feature of the CINES supercomputer that enables a user to work on multiple projects using a single, unique login. These logins (also called usernames) will remain valid for the lifetime of the machine (though the data may not, see Quotas for more details). This simplifies authentication over time. This procedure is already used in the other two national centres (IDRIS and TGCC). The method for logging into the machine remains the same as described above. Once you are logged in, you get access to one of your home directories, the one associated with your current project (if you have one). At this stage, you can adapt your environment to the project you wish to work on with the help of the myproject command.
The unique login tools will modify your Unix group and some environment variables. If you use scripts that are automatically loaded or that are expected in a specific location (say .bashrc), check out the notes in the Layout of common files and directories and Accessing the storage areas documents.
In this section we will present the myproject command. When freshly connected, your shell’s working directory will be your current project’s personal home directory or, if your account is not linked to any project, your personal home. Again, refer to Accessing the storage areas for more details on the various storage areas. Your first step could be to list the flags myproject supports, which can be done like so:
$ myproject --help
usage: my_project.py [-h] [-s [project] | -S | -l | -a project | -c | -C | -m [project]]
Manage your hpc projects. The active project is the current project in your
session.
optional arguments:
-h, --help show this help message and exit
-s [project], --state [project]
Get current HPC projects state
-S, --stateall Get all HPC projects state
-l, --list List all authorized HPC projects
-a project, --activate project
Activate the indicated project
-c, --cines List projects directories CINES variables
-C, --ccfr List projects directories CCFR variables
-m [project], --members [project]
List all members of a project
The most used flags are -l to list the projects we are assigned to, -a to switch project and -c to list the environment variables described in Accessing the storage areas.
Listing the environment variables and their value
This is done like so (assuming a user with login someuser):
$ myproject -c
Liste des variables CINES permettant l'accès aux répertoires dans les différents espaces de stockage
----------------------------------------------------------------------------------------------------
Project actif: dci
OWN_HOMEDIR : /lus/home/PERSO/grp_someuser/someuser
HOMEDIR : /lus/home/BCINES/dci/someuser
SHAREDHOMEDIR : /lus/home/BCINES/dci/SHARED
SCRATCHDIR : /lus/scratch/BCINES/dci/someuser
SHAREDSCRATCHDIR : /lus/scratch/BCINES/dci/SHARED
WORKDIR : /lus/work/BCINES/dci/someuser
SHAREDWORKDIR : /lus/work/BCINES/dci/SHARED
STOREDIR : /lus/store/BCINES/dci/someuser
gda2212_HOMEDIR : /lus/home/NAT/gda2212/someuser
gda2212_SHAREDHOMEDIR : /lus/home/NAT/gda2212/SHARED
gda2212_SCRATCHDIR : /lus/scratch/NAT/gda2212/someuser
gda2212_SHAREDSCRATCHDIR : /lus/scratch/NAT/gda2212/SHARED
gda2212_WORKDIR : /lus/work/NAT/gda2212/someuser
gda2212_SHAREDWORKDIR : /lus/work/NAT/gda2212/SHARED
gda2212_STOREDIR : /lus/store/NAT/gda2212/someuser
dci_HOMEDIR : /lus/home/BCINES/dci/someuser
dci_SHAREDHOMEDIR : /lus/home/BCINES/dci/SHARED
dci_SCRATCHDIR : /lus/scratch/BCINES/dci/someuser
dci_SHAREDSCRATCHDIR : /lus/scratch/BCINES/dci/SHARED
dci_WORKDIR : /lus/work/BCINES/dci/someuser
dci_SHAREDWORKDIR : /lus/work/BCINES/dci/SHARED
dci_STOREDIR : /lus/store/BCINES/dci/someuser
Observe that the actif project (current project, in English) is dci in the example above. This should be interpreted as: the shell is currently set up so that the generic environment variables point to that project’s filesystem directories. For instance, ${SHAREDSCRATCHDIR} would point to the actif project’s group shared scratch space, in this case /lus/scratch/BCINES/dci/SHARED. For more details on the file system spaces CINES offers, see Accessing the storage areas.
As such, an actif project does not relate to a DARI notion of activated, valid, ongoing, etc.
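For instance, a quick way to check where the generic variables point (a sketch; the output corresponds to the example above):

$ echo "${SHAREDSCRATCHDIR}"
/lus/scratch/BCINES/dci/SHARED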
Listing associated projects
This is done like so (assuming a user with login someuser):
$ myproject -l
Projet actif: dci
Liste des projets de calcul associés à l'utilisateur 'someuser' : ['gda2212', 'dci']
Switching project
You can rely on the ${ACTIVE_PROJECT} environment variable to obtain the currently used project:
$ echo ${ACTIVE_PROJECT}
dci
This is done like so (assuming a user with login someuser):
$ myproject -a gda2212
Projet actif :dci
Bascule du projet "dci" vers le projet "gda2212"
Projet " gda2212 " activé.
$ myproject -c
Liste des variables CINES permettant l'accès aux répertoires dans les différents espaces de stockage
----------------------------------------------------------------------------------------------------
Project actif: gda2212
OWN_HOMEDIR : /lus/home/PERSO/grp_someuser/someuser
HOMEDIR : /lus/home/NAT/gda2212/someuser
SHAREDHOMEDIR : /lus/home/NAT/gda2212/SHARED
SCRATCHDIR : /lus/scratch/NAT/gda2212/someuser
SHAREDSCRATCHDIR : /lus/scratch/NAT/gda2212/SHARED
WORKDIR : /lus/work/NAT/gda2212/someuser
SHAREDWORKDIR : /lus/work/NAT/gda2212/SHARED
STOREDIR : /lus/store/NAT/gda2212/someuser
gda2212_HOMEDIR : /lus/home/NAT/gda2212/someuser
gda2212_SHAREDHOMEDIR : /lus/home/NAT/gda2212/SHARED
gda2212_SCRATCHDIR : /lus/scratch/NAT/gda2212/someuser
gda2212_SHAREDSCRATCHDIR : /lus/scratch/NAT/gda2212/SHARED
gda2212_WORKDIR : /lus/work/NAT/gda2212/someuser
gda2212_SHAREDWORKDIR : /lus/work/NAT/gda2212/SHARED
gda2212_STOREDIR : /lus/store/NAT/gda2212/someuser
dci_HOMEDIR : /lus/home/BCINES/dci/someuser
dci_SHAREDHOMEDIR : /lus/home/BCINES/dci/SHARED
dci_SCRATCHDIR : /lus/scratch/BCINES/dci/someuser
dci_SHAREDSCRATCHDIR : /lus/scratch/BCINES/dci/SHARED
dci_WORKDIR : /lus/work/BCINES/dci/someuser
dci_SHAREDWORKDIR : /lus/work/BCINES/dci/SHARED
dci_STOREDIR : /lus/store/BCINES/dci/someuser
As you can see, ${HOMEDIR}, ${SHAREDHOMEDIR}, etc. changed when the user switched project (compare with the output presented earlier). That said, the prefixed variables like ${dci_HOMEDIR} did not change; using them is the recommended way to reference a directory when you do not know which project will be active at the time the variable is used (say, in a script).
Some issues can be encountered when using tools that are unaware of the multiple-home structure. Yet again, check the Layout of common files and directories and Accessing the storage areas documents.
Layout of common files and directories
Due to new functionalities introduced through Login unique, you may find the Accessing the storage areas document useful. It describes the multiple home directories and how to access them through environment variables (${HOMEDIR}, ${OWN_HOMEDIR}, etc.).
Some subtleties need addressing, see below.
.bashrc file
Your .bashrc file should be accessible in the ${HOMEDIR} directory (project personal home).
Using symbolic links, you can prevent file redundancy by first storing your .bashrc in your ${OWN_HOMEDIR} and then creating a link in your ${HOMEDIR}. Effectively, you are factorizing the .bashrc:
$ ln -s "${OWN_HOMEDIR}/.bashrc" "${HOMEDIR}/.bashrc"
If you want your .bashrc to be loaded when you log in to the machine, you need to make sure a file called .bash_profile is present in your ${HOMEDIR} directory (project personal home). This file, if not present, should be created to contain:
if [ -f ~/.bashrc ]; then
source ~/.bashrc
fi
Similarly to the .bashrc, you can use links to factorize this file, as shown below.
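A minimal sketch of that factorization (assuming you keep the master copy in ${OWN_HOMEDIR}):

$ cat > "${OWN_HOMEDIR}/.bash_profile" << 'EOF'
if [ -f ~/.bashrc ]; then
    source ~/.bashrc
fi
EOF
$ ln -s "${OWN_HOMEDIR}/.bash_profile" "${HOMEDIR}/.bash_profile"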
.ssh directory
Your .ssh directory should be accessible in the ${OWN_HOMEDIR} directory (personal home).
Optionally, you can create a link in your ${HOMEDIR} pointing to ${OWN_HOMEDIR}/.ssh.
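A minimal sketch (assuming ${HOMEDIR} does not already contain a .ssh entry):

$ ln -s "${OWN_HOMEDIR}/.ssh" "${HOMEDIR}/.ssh"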
Programming environment
The programming environment includes compiler toolchains, libraries, performance analysis and debugging tools, and optimized scientific libraries. Adastra being a Cray machine, it uses the Cray Programming Environment, abbreviated CrayPE or CPE. In practice, a CrayPE is simply a set of modules. This section tries to shed light on the subtleties of the system’s environment.
The Cray documentation is available in the man pages (prefixed with intro_) and is starting to be mirrored and enhanced at this URL: https://cpe.ext.hpe.com/docs/.
Module, why and how
Like on many HPC machines, the software is presented through modules. A module can mostly be seen as a set of environment variables. Variables such as ${PATH} and ${LD_LIBRARY_PATH} are modified to introduce new tools in the environment. The software providing the module concept is Lmod, a Lua-based module system for dynamically altering a shell environment.
General usage
The interface to Lmod is provided by the module command:
Command | Description
---|---
module list | Shows the list of the currently loaded modules.
module overview | Shows a view of modules aggregated over the versions.
module available | Shows a table of the currently available modules.
module --show_hidden available | Shows a table of the currently available modules, also showing hidden modules (very useful!).
module purge | Unloads all modules.
module show <modulename> | Shows the environment changes made by the <modulename> modulefile.
module load <modulename> | Loads the given <modulename>.
module help <modulename> | Shows help information about <modulename>.
module spider <string> | Searches all possible modules according to <string>.
module use <path> | Adds <path> to the modulefile search path (${MODULEPATH}).
module unuse <path> | Removes <path> from the modulefile search path (${MODULEPATH}).
module update | Reloads all currently loaded modules.
Lmod introduces the concept of default and currently loaded modules. When a user enters the module available command, they may get something similar to the small example given below.
$ module available
---- /opt/cray/pe/lmod/modulefiles/comnet/crayclang/14.0/ofi/1.0 ----
cray-mpich/8.1.20 (L,D) cray-mpich/8.1.21
Where:
L: Module is loaded
D: Default Module
Note the L and D markers shown at the end of the example. They indicate what is loaded and what gets loaded by default when you do not specify the version of a module (that is, when you omit the /8.1.21, for instance). Note that D does not mean the module is loaded automatically but that, if a module is to be loaded (say cray-mpich) and the version is not specified, then the module marked with D (say cray-mpich/8.1.20) will be loaded. It is considered good practice to specify the full name to avoid issues related to more complex topics (compilation, linkage, etc.).
Note
By default some modules are loaded and this differs from older machines hosted at CINES such as Occigen.
Note
The --terse option can be useful when the output of the module command needs to be parsed in scripts.
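For instance, a sketch of such parsing (Lmod writes to stderr, hence the redirection; the grep pattern is illustrative):

$ module --terse list 2>&1 | grep '^cray-mpich'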
Looking for a specific module or an already installed software
Modules with dependencies are only available (shown in module available) when their dependencies, such as compilers, are loaded. To search the entire hierarchy across all possible dependencies, the module spider command can be used, as summarized in the following table.
Command | Description
---|---
module spider | Shows the entire possible graph of modules.
module spider <modulename> | Searches for modules named <modulename>.
module spider <modulename>/<version> | Searches for a specific <version> of <modulename>.
module spider <string> | Searches for modulefiles containing <string>.
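For example, to look for an MPI implementation and then for how to load a specific version of it (the version shown is illustrative):

$ module spider cray-mpich
$ module spider cray-mpich/8.1.24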
CrayPE basics
The CrayPE is often feared due to its apparent complexity. We will try to present the basic building blocks and show how to assemble them.
At a high level, a Cray environment is made up of:
External libraries (such as the ones in ROCm);
Cray libraries (MPICH, libsci);
Architecture modules (craype-accel-amd-gfx90a);
Compilers (craycc as the cce module, amdclang as the amd module, gcc as the gnu module);
The Cray compiler wrappers (cc, CC, ftn) offered by the craype module;
The PrgEnv modules (PrgEnv-cray);
And the cpe/XX.YY modules.
The external libraries refer to libraries the CrayPE requires but that are not the property of Cray; AMD’s ROCm is such an example. The Cray libraries are closed source software; there are multiple variants of the same library to accommodate GPU support and the many compilers. The architecture modules change the wrapper’s behavior (see Cray compiler wrapper) by helping choose which library to link against (say, the MPICH GPU plugin), or by modifying flags such as -march=zen4. The compilers should not be used directly; they should instead be used through the Cray compiler wrappers, which interpret the PrgEnv, the loaded Cray libraries and the architecture modules to handle the compatibility matrix transparently (with few visible artifacts). The PrgEnv are preset environments; you can choose to use them or cherry-pick your own set of modules, at your own risk. The cpe/XX.YY modules are used to change the default version of the above mentioned modules and allow you to operate with a set of intercompatible default modules.
Note
There is an order in which we recommend loading the modules. See the note in Targeting an architecture.
Important
Do not forget to export the appropriate environment variables such as CC, CXX, etc. and make them point to the correct compiler or Cray compiler wrapper by loading the correct PrgEnv. This can be crucial for tools like CMake and Make.
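A minimal sketch of such exports when relying on the Cray compiler wrappers (the variable names are the usual build system conventions, not something Adastra specific):

$ export CC=cc CXX=CC FC=ftn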
Changing CrayPE version
A Cray Programming Environment (CrayPE) can simply be viewed as a set of modules (of a particular version). Switching CrayPE is like switching modules and defining new default versions.
You can load a cpe/XX.YY module to prepare your environment with the modules associated with a specific XX.YY version of cpe. In practice, it will change the version of your loaded modules to match the versions the cpe/XX.YY in question expects and, in addition, will modify the default version of the Cray modules.
Warning
If you use a cpe/XX.YY module, it must come first, before you load any other Cray modules.
Important
You can preload a cpe/XX.YY module before preparing your environment to be sure you are using the correct version of the modules you load.
As an example:
1$ module available cpe
2-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
3 cpe/22.11 cpe/22.12 cpe/23.02 (D)
4$ module available cce
5-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
6 cce/15.0.0 cce/15.0.1 (D)
7$ module load PrgEnv-cray
8$ module list
9Currently Loaded Modules:
10 1) cce/15.0.1 2) craype/2.7.19 3) cray-dsmml/0.2.2
11 4) libfabric/1.15.2.0 5) craype-network-ofi 6) cray-mpich/8.1.24
12 7) cray-libsci/23.02.1.1 8) PrgEnv-cray/8.3.3
13$ module load cpe/22.12
14The following have been reloaded with a version change:
15 1) cce/15.0.1 => cce/15.0.0
16 2) cray-libsci/23.02.1.1 => cray-libsci/22.12.1.1
17 3) cray-mpich/8.1.24 => cray-mpich/8.1.23
18$ module available cce
19-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
20 cce/15.0.0 (L,D) cce/15.0.1
21$ module load cpe/23.02
22Unloading the cpe module is insufficient to restore the system defaults.
23Please run 'source /opt/cray/pe/cpe/22.12/restore_lmod_system_defaults.[csh|sh]'.
24
25The following have been reloaded with a version change:
26 1) cce/15.0.0 => cce/15.0.1
27 2) cpe/22.12 => cpe/23.02
28 3) cray-libsci/22.12.1.1 => cray-libsci/23.02.1.1
29 4) cray-mpich/8.1.23 => cray-mpich/8.1.24
30$ module available cce
31-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
32 cce/15.0.0 cce/15.0.1 (L,D)
As we can see, the cpe/22.12 module changed the loaded module versions and also changed the default module versions.
Note
Loading a cpe module will lead to a quirk, shown on line 22. The quirk comes from the fact that unloading a module that switches other modules does not bring the environment back to its state before the switch; in fact, it does nothing. Once the cpe module is unloaded, the default module versions are restored but the modules have to be loaded back. This is the role of the above mentioned script (restore_lmod_system_defaults.sh).
Cray compiler wrapper
As you may know, compatibility between compilers and libraries is not always guaranteed; a compatibility matrix can be given to users who are then left to themselves to figure out how to combine the software components. Loading the PrgEnv-<compiler>[-<compiler2>] module introduces a compiler wrapper (also called driver) which interprets environment variables introduced by other Cray modules such as craype-accel-amd-gfx90a (see Targeting an architecture for more details), cray-mpich, etc. The driver creates the toolchain needed to satisfy the request (compilation, optimization, link, etc.). It also uses the information gathered in the environment to specify the include paths, link flags, architecture specific flags, etc. that the underlying compiler needs to produce code. Effectively, these compiler wrappers abstract the compatibility matrix away from the user; linking and providing the correct headers at compile and run time is only a subset of the features provided by the Cray compiler wrappers. If you do not use the wrappers, you will have to do more work and expose yourself to error prone manipulations.
PrgEnv and compilers
The compilers available on Adastra are provided through the Cray environment modules. Most readers already know about the GNU software stack. Adastra comes with three more supported compilers. The Cray and the AMD Radeon Open Compute (ROCm) compilers are both based on the state of the art LLVM compiler infrastructure. In fact, you can treat these compilers as vendor recompiled Clang/Flang LLVM compilers with added optimization passes or, in the case of the Cray compiler, an OpenMP backend (but not much more). The AMD Optimizing C/C++ Compiler (AOCC) resembles the Intel ICC compiler, but for AMD; it is also based on LLVM. There is also a system (OS provided) version of GCC available in /usr/bin (avoid using it).
The Programming environment column of the table below represents the module to load to benefit from a specific environment. You can load a compiler module after loading a PrgEnv to choose a specific version of a compiler belonging to a given PrgEnv. That is, load cce/15.0.0 after loading PrgEnv-cray to make sure you get the cce/15.0.0 compiler. The modules loaded by a PrgEnv will change as the environment evolves. After the first load of a PrgEnv, we recommend noting the implicitly loaded modules (module list) and explicitly loading them to avoid future breakage.
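A sketch of that practice (the versions shown are illustrative, taken from the examples later in this section):

$ module load PrgEnv-cray
$ module list
$ # Later, in your scripts, pin what the PrgEnv gave you:
$ module load PrgEnv-cray cce/15.0.1 craype/2.7.19 cray-mpich/8.1.24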
Vendor | Programming environment | Compiler module | Language | Compiler wrapper | Raw compiler | Usage and notes
---|---|---|---|---|---|---
Cray | PrgEnv-cray | cce | C | cc | craycc | For CPU and GPU compilations.
 | | | C++ | CC | crayCC |
 | | | Fortran | ftn | crayftn |
AMD | PrgEnv-amd | amd | C | cc | amdclang | For CPU and GPU compilations. This module introduces the ROCm stack. ROCm is AMD’s GPGPU software stack. These compilers are open source and available on Github. You can contact AMD via Github issues.
 | | | C++ | CC | amdclang++ |
 | | | Fortran | ftn | amdflang |
AMD | PrgEnv-aocc | aocc | C | cc | clang | For CPU compilations. These compilers are LLVM based but the LLVM fork is not open sourced.
 | | | C++ | CC | clang++ |
 | | | Fortran | ftn | flang |
GNU | PrgEnv-gnu | gcc | C | cc | gcc | For CPU compilations.
 | | | C++ | CC | g++ |
 | | | Fortran | ftn | gfortran |
Intel | PrgEnv-intel | intel | C | cc | icx | For CPU compilations.
 | | | C++ | CC | icpx |
 | | | Fortran | ftn | ifx |
Intel | PrgEnv-intel | intel-classic | C | cc | icc | For CPU compilations. Intel’s historical (but good) toolchain.
 | | | C++ | CC | icpc |
 | | | Fortran | ftn | ifort |
Intel | PrgEnv-intel | intel-oneapi | C | cc | icx | For CPU compilations. Intel’s new toolchain based on LLVM and trying to democratize Sycl.
 | | | C++ | CC | icpx |
 | | | Fortran | ftn | ifx |
Note
Reading (and understanding) the craycc or crayftn man pages will provide you with valuable knowledge on the usage of the Cray compilers.
Important
It is highly recommended to use the Cray compiler wrappers (cc, CC, and ftn) whenever possible. These are provided whichever programming environment is used. These wrappers are somewhat like the mpicc wrapper provided by other vendors.
Switching compiler is as simple as loading another PrgEnv; a sketch is given after the lists below. The user only needs to recompile the software, assuming the build scripts or build script generator scripts (say, CMake scripts) are properly engineered.
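A minimal sketch of such a switch (Lmod swaps the conflicting PrgEnv automatically; remember to rebuild from a clean build directory):

$ module load PrgEnv-gnu
$ cmake -S . -B build-gnu -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC -DCMAKE_Fortran_COMPILER=ftn
$ cmake --build build-gnu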
For CPU compilations:
C/C++ codes can rely on PrgEnv-gnu, PrgEnv-aocc or PrgEnv-cray;
Fortran codes can rely on PrgEnv-gnu, PrgEnv-cray or PrgEnv-intel.
Note
If you target the Genoa CPUs, you must ensure that the GCC version is gcc/13.2.0 or more recent.
For GPU compilations:
C/C++ codes can rely on PrgEnv-amd, PrgEnv-cray or, potentially, PrgEnv-gnu with rocm;
Fortran codes can rely on PrgEnv-cray (required for OpenMP target/OpenACC + Fortran).
To know which compiler/PrgEnv to use depending on the parallelization technology your program relies on (OpenMP, OpenACC, HIP, etc.), check this table.
Note
Understand that, while both are AMD software, PrgEnv-amd and PrgEnv-aocc target fundamentally different node kinds: the first one is part of the ROCm stack (analogous to NVHPC), the second one is a historical CPU compiler (analogous to Intel’s ICC).
The PrgEnv-cray (CCE), PrgEnv-amd (ROCm), PrgEnv-gnu and PrgEnv-aocc environments all support the following C++ standards (and implied C standards): c++11, gnu++11, c++14, gnu++14, c++17, gnu++17, c++20, gnu++20, c++2b, gnu++2b. Some caveats exist regarding C++ modules with C++20. All these compilers (except GNU) are based on Clang.
The Fortran compilers all support the following standards: f90, f95, f03.
Warning
If your code has, all along its life, relied on non standard, vendor specific extensions, you may have issues using another compiler.
PrgEnv mixing and subtleties
Cray provides the PrgEnv-<compiler>[-<compiler2>] modules (say, PrgEnv-cray-amd) that load a given <compiler> and toolchain and optionally, if set, introduce an additional <compiler2>. In case a <compiler2> is specified, the Cray environment will use <compiler> to compile Fortran sources and <compiler2> for C and C++ sources. The user can then enrich their environment by loading other libraries through modules (though some of these libraries are loaded by default with the PrgEnv).
Introducing an environment, toolchain or tool through the use of modules means that loading a module will modify environment variables such as ${PATH}, ${ROCM_PATH} or ${LD_LIBRARY_PATH} to make the tool or toolchain available to the user’s shell.
For example, say you wish to use the Cray compiler to compile CPU or GPU code, introduce the CCE toolchain this way:
$ module load PrgEnv-cray
Say you want to use the Cray compiler to compile Fortran sources and use the AMD compiler for C and C++ sources, introduce the CCE and ROCm toolchains this way:
$ module load PrgEnv-cray-amd
Say you want to use the AMD compiler to compile CPU or GPU code, introduce the ROCm toolchain this way:
$ module load PrgEnv-amd
Mixing PrgEnv and toolchain
Say you want to use the Cray compiler to compile CPU or GPU code and also have access to the ROCm tools and libraries, introduce the CCE and ROCm tooling this way:
$ module load PrgEnv-cray amd-mixed
Mixing compilers and tooling is achieved through the *-mixed modules. *-mixed modules do not significantly alter the Cray compiler wrapper’s behavior. They can be used to steer the compiler into using, say, the correct ROCm version instead of the default one (/opt/rocm).
*-mixed modules can be viewed as an alias to the underlying software. For instance, amd-mixed would be an alias for the rocm module.
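One can observe the effect on the environment like so (a sketch; the path shown is illustrative and depends on the default ROCm version at the time):

$ module load PrgEnv-cray amd-mixed
$ echo "${ROCM_PATH}"
/opt/rocm-5.7.1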
Targeting an architecture
In a Cray environment, one can load modules to target architectures instead of adding compiler flags explicitly.
On Adastra’s accelerated nodes, we have AMD-Trento (host CPU) and AMD-MI250X (accelerator) as the two target architectures. The command module available craype- will show all the installed modules for the available target architectures. For AMD-Trento the module is craype-x86-trento, for AMD-MI250X it is craype-accel-amd-gfx90a and for MI300A it is craype-accel-amd-gfx942. These modules add environment variables used by the Cray compiler wrapper to trigger the flags used by the compilers to optimize or produce code for these architectures.
Warning
If you load a non-CPU target module, say craype-accel-amd-gfx90a, please also load the *-mixed or toolchain module (rocm) associated with the target device, else you expose yourself to a debugging penance.
For example, to setup a MI250X GPU programming environment:
$ module purge
$ # A CrayPE environment version
$ module load cpe/24.07
$ # An architecture
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ # A compiler to target the architecture
$ module load PrgEnv-cray
$ # Some architecture related libraries and tools
$ module load amd-mixed
You get a C/C++/Fortran compiler configured to compile for Trento CPUs and MI250X GPUs and automatically link with the appropriate Cray MPICH release, that is, if you use the Cray compiler wrappers.
Warning
If you get a warning such as Load a valid targeting module or set CRAY_CPU_TARGET, it is probably because you did not load a craype-x86-<architecture> module.
Note
Try to always load, first, the CPU and GPU architecture modules (say, craype-x86-genoa for the GENOA partition; craype-x86-trento and craype-accel-amd-gfx90a for the MI250 partition), then the PrgEnv and the rest of your modules.
Intra-process parallelization technologies
When you are not satisfied with high level tools such as the vendor optimized BLAS, you have the option to program the machine yourself. These technologies are harder to use and more error prone, but more versatile. Some technologies are given below, but the list is obviously not complete.
We could define at least two classes of accelerator programming technologies: the ones based on directives (say, pragma omp parallel for) and the ones based on kernels. A kernel is a unit of work, generally the inner loops (or the bodies of the inner loops) of what you would write in a serial code. The kernel is given data to transform and is explicitly mapped to the hardware compute units.
Note
NVHPC is Nvidia’s GPU software stack; ROCm is AMD’s GPU software stack (amd-mixed or PrgEnv-amd); CCE is part of CPE, Cray’s CPU/GPU compiler toolchain (PrgEnv-cray); LLVM is your plain old LLVM toolchain; OneAPI is Intel’s new CPU/GPU Sycl based software stack (it contains DPC++, the Sycl compiler).
For C/C++ codes
Class | Name | Compiler support on AMD GPUs | Compiler support on Nvidia GPUs | Compiler support on Intel GPUs | Compiler support on x86 CPUs | Fine tuning | Implementation complexity/maintainability | Community support/availability (expected longevity in years)
---|---|---|---|---|---|---|---|---
Directive | OpenACC v2 | GCC~ | NVHPC/GCC~ | | NVHPC/GCC~ | Low-medium | Low | Medium/high (+5 y)
Directive | OpenMP v5 | CCE/LLVM | NVHPC/CCE/LLVM | OneAPI | GCC/LLVM/NVHPC/CCE/OneAPI | Low-medium | Low | High (+10 y)
Kernel | Sycl | AdaptiveCPP/OneAPI | AdaptiveCPP/OneAPI | AdaptiveCPP/OneAPI | AdaptiveCPP/OneAPI | High | Medium/high | High (+10 y)
Kernel | CUDA/HIP | LLVM/CCE | NVHPC/LLVM/CCE | | | High | Medium/high | High (+10 y)
Kernel | Kokkos | LLVM/AdaptiveCPP/OneAPI/CCE | NVHPC/LLVM/AdaptiveCPP/OneAPI/CCE | AdaptiveCPP/OneAPI | All | Medium/high | Low/medium | High (+10 y)
Sycl, the Khronos consortium’s successor to OpenCL, is quite complex, like its predecessor. Obviously, time will tell if it is worth investing in this technology, but there is a significant ongoing open standardization effort.
Kokkos in itself is not on the same level as OpenACC, OpenMP, Cuda/HIP or Sycl because it serves as an abstraction over all of these.
Note
Cray’s CCE, AMD’s ROCm, Intel’s OneAPI (intel-llvm) and LLVM’s Clang share the same front end (what reads the code). Most are just recompiled/extended versions of Clang, generally open source. Cray’s C/C++ compiler is a Clang compiler with a modified proprietary backend (code optimization and libraries such as the OpenMP backend implementation).
For Fortran codes
Class | Name | Compiler support on AMD GPUs | Compiler support on Nvidia GPUs | Compiler support on Intel GPUs | Compiler support on x86 CPUs | Fine tuning | Implementation complexity/maintainability | Community support/availability (expected longevity in years)
---|---|---|---|---|---|---|---|---
Directive | OpenACC v2 | CCE/LLVM~/GCC~ | NVHPC/CCE/LLVM~/GCC~ | | NVHPC/CCE/LLVM~/GCC~ | Low-medium | Low | Medium/High (+5 y)
Directive | OpenMP v5 | CCE/LLVM~/GCC~ | NVHPC/CCE/LLVM~/GCC~ | OneAPI | NVHPC/CCE/LLVM/GCC/OneAPI | Low-medium | Low | High (+10 y)
Kernel | | | | | | | |
Some wrappers, preprocessor definitions, compiler and linker flags
A very thorough list of compiler flag meanings across different vendors is given in this document.
Flags conversion for Fortran programs
Intel’s | GNU’s | Cray’s | Note
---|---|---|---
-g | -g | -g | Embed debug info into the binary. Useful for stack traces and GDB.
-O0 | -O0 | -O0 | Compile in debug mode.
-xHost | -march=native | | Careful, this flag assumes the machine on which you compile has CPUs similar to the ones on which your code will run.
-ftz | | | Flush denormals To Zero. If well designed, your code should not be very sensitive to that. See the Fortran 2003 standard.
-check bounds | -fcheck=bounds | -hbounds ~ | For debug builds only.
-ipo | -flto | -hipa4 | Link Time Optimization (LTO), sometimes called InterProcedural Optimization (IPO) or IPA.
In case you use the GNU Fortran compiler and are subject to interface mismatches, use the -fallow-argument-mismatch flag. An interface mismatch, that is, passing arguments of different types to the same interface (subroutine), is not standard conforming Fortran code! Here is an excerpt of the GNU Fortran compiler manual: Some code contains calls to external procedures with mismatches between the calls and the procedure definition, or with mismatches between different calls. Such code is non-conforming, and will usually be flagged with an error. Using -fallow-argument-mismatch is strongly discouraged. It is possible to provide standard-conforming code which allows different types of arguments by using an explicit interface and TYPE(*).
Vectorizing for GCC and LLVM (Clang) based compilers
To enable vectorization of multiply/add operations and transcendental functions, use -O3 -fno-math-errno -fno-trapping-math -ffp-contract=fast. Note that instructions may also be reordered: ((a+b)+c) may be rewritten as (a+(b+c)).
Some LLVM details are given in this document.
Given this simple C++ code:
#include <cmath>
void square(double*a) {
a[0] = std::sqrt(a[0]);
a[1] = std::sqrt(a[1]);
a[2] = std::sqrt(a[2]);
a[3] = std::sqrt(a[3]);
}
Without the above flags one would get this horrible code:
square(double*):
push rbx
mov rbx, rdi
vmovsd xmm0, qword ptr [rdi]
vxorpd xmm1, xmm1, xmm1
vucomisd xmm0, xmm1
jb .LBB0_2
vsqrtsd xmm0, xmm0, xmm0
vmovsd qword ptr [rbx], xmm0
vmovsd xmm0, qword ptr [rbx + 8]
vucomisd xmm0, xmm1
jae .LBB0_4
.LBB0_5:
call sqrt@PLT
jmp .LBB0_6
.LBB0_2:
call sqrt@PLT
vxorpd xmm1, xmm1, xmm1
vmovsd qword ptr [rbx], xmm0
vmovsd xmm0, qword ptr [rbx + 8]
vucomisd xmm0, xmm1
jb .LBB0_5
.LBB0_4:
vsqrtsd xmm0, xmm0, xmm0
.LBB0_6:
vmovsd qword ptr [rbx + 8], xmm0
vmovsd xmm0, qword ptr [rbx + 16]
vxorpd xmm1, xmm1, xmm1
vucomisd xmm0, xmm1
jb .LBB0_8
vsqrtsd xmm0, xmm0, xmm0
vmovsd qword ptr [rbx + 16], xmm0
vmovsd xmm0, qword ptr [rbx + 24]
vucomisd xmm0, xmm1
jae .LBB0_10
.LBB0_11:
call sqrt@PLT
vmovsd qword ptr [rbx + 24], xmm0
pop rbx
ret
.LBB0_8:
call sqrt@PLT
vxorpd xmm1, xmm1, xmm1
vmovsd qword ptr [rbx + 16], xmm0
vmovsd xmm0, qword ptr [rbx + 24]
vucomisd xmm0, xmm1
jb .LBB0_11
.LBB0_10:
vsqrtsd xmm0, xmm0, xmm0
vmovsd qword ptr [rbx + 24], xmm0
pop rbx
ret
Properly vectorized it would look like so:
square(double*):
vsqrtpd ymm0, ymmword ptr [rdi]
vmovupd ymmword ptr [rdi], ymm0
vzeroupper
ret
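To compare both outputs yourself, a sketch (assuming the function above is stored in square.cpp and a Clang based compiler or wrapper is used):

$ CC -O3 -S -o scalar.s square.cpp
$ CC -O3 -fno-math-errno -fno-trapping-math -ffp-contract=fast -S -o vectorized.s square.cpp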
Debugging with crayftn
Note
To flush the output stream (stdout) in a standard way, use the output_unit named constant from the ISO_Fortran_env module, e.g. flush(output_unit). This is useful when debugging using the classic print/comment approach.
Feature/flag/environment variable | Explanation
---|---
-eD | The -eD option enables all debugging options. This option is equivalent to specifying the -G0 option with the -m2, -rl, -R bcdsp, and -e0 options.
-e0 | Initializes all undefined local stack, static, and heap variables to 0 (zero). If a user variable is of type character, it is initialized to NUL. If logical, initialized to false. The stack variables are initialized upon each execution of the procedure. When used in combination with -ei, Real and Complex variables are initialized to signaling NaNs, while all other typed objects are initialized to 0. Objects in common blocks will be initialized if the common block is declared within a BLOCKDATA program unit compiled with this option.
-ei | Initializes all undefined local stack, static, and heap variables of type REAL or COMPLEX to an invalid value (signaling NaN).
-en | Generates messages to note nonstandard Fortran usage.
-h fp<n> | Controls the level of floating point optimizations, where n is a value between 0 and 4, with 0 giving the compiler minimum freedom to optimize floating point operations and 4 giving it maximum freedom.
-h fp0 | Has the highest probability of repeatable results, but also the highest performance penalty.
-h list=m | Produces a source listing with loopmark information. To provide a more complete report, this option automatically enables the -O negmsg option to show why loops were not optimized. If you do not require this information, use the -O nonegmsg option on the same command line. Loopmark information will not be displayed if the -d B option has been specified.
-h list=a | Include all reports in the listing (including source, cross references, options, lint, loopmarks, common block, and options used during compilation).
-hbounds | Enable bounds checking.
A typical set of debugging flags could be -eD -ei -en -hbounds -K trap=divz,inv,ovf.
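For instance (a sketch; the file names are illustrative):

$ ftn -eD -ei -en -hbounds -K trap=divz,inv,ovf -o my_code_debug my_code.f90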
crayftn also offers sanitizers, which turn on runtime checks for various forms of undefined or suspicious behavior. This is an experimental feature (in CrayFTN 17). If a check fails, a diagnostic message is produced at runtime explaining the problem.
Feature/flag/environment variable | Explanation
---|---
-fsanitize=address | Enables a memory error detector.
-fsanitize=thread | Enables a data race detector.
Further reading: man crayftn.
Debugging with gfortran
A typical set of debugging flags could be -O1 -g -fcheck=all -ffpe-trap=invalid,zero,overflow -fbacktrace, or -O1 -g -fcheck=all -ffpe-trap=invalid,zero,overflow -fbacktrace -finit-real=snan -finit-integer=42 -finit-logical=true -finit-character=0 (this set of options will silence -Wuninitialized).
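For instance (a sketch; the file names are illustrative):

$ gfortran -O1 -g -fcheck=all -ffpe-trap=invalid,zero,overflow -fbacktrace -o my_code_debug my_code.f90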
Making the Cray wrappers spew their implicit flags
Assuming you have loaded an environment such as:
$ module purge
$ # A CrayPE environment version
$ module load cpe/24.07
$ # An architecture
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ # A compiler to target the architecture
$ module load PrgEnv-cray
The CC, cc and ftn Cray wrappers imply a lot of flags that you may want to retrieve. This can be done like so:
$ CC --cray-print-opts=cflags
-I/opt/cray/pe/libsci/24.07.0/CRAY/18.0/x86_64/include -I/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/include -I/opt/cray/pe/dsmml/0.3.0/dsmml/include
$ CC --cray-print-opts=libs
-L/opt/cray/pe/libsci/24.07.0/CRAY/18.0/x86_64/lib -L/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib -L/opt/cray/pe/mpich/8.1.30/gtl/lib -L/opt/cray/pe/dsmml/0.3.0/dsmml/lib -Wl,--as-needed,-lsci_cray_mpi,--no-as-needed -lmpi_gtl_hsa -Wl,--as-needed,-lsci_cray,--no-as-needed -ldl -Wl,--as-needed,-lmpi_cray,--no-as-needed -lmpi_gtl_hsa -Wl,--as-needed,-ldsmml,--no-as-needed -L/opt/cray/pe/cce/18.0.0/cce/x86_64/lib/pkgconfig/../ -Wl,--as-needed,-lstdc++,--no-as-needed -Wl,--as-needed,-lpgas-shmem,--no-as-needed -lfi -lquadmath -lmodules -lfi -lcraymath -lf -lu -lcsup
We observe the implied compile and link flags for Cray MPICH (the GTL is here too) and the LibSci. Had you used cray-hdf5 or some other Cray library module, it would have appeared in the commands’ output.
Warning
The libs option returns a list of linker flags containing instances of -Wl. This can create serious CMake confusion. For this reason, we recommend that you strip them away like so: CRAY_WRAPPER_LINK_FLAGS="$({ cc --cray-print-opts=libs; CC --cray-print-opts=libs; ftn --cray-print-opts=libs; } | tr '\n' ' ' | sed -e 's/-Wl,--as-needed,//g' -e 's/,--no-as-needed//g')".
Once you have extracted the flags for a given CPE version you can store them in a machine/toolchain file.
Say you use CMake, here is an example of what you could use the above for:
$ CRAY_WRAPPER_LINK_FLAGS="$({ cc --cray-print-opts=libs; CC --cray-print-opts=libs; ftn --cray-print-opts=libs; } | tr '\n' ' ' | sed -e 's/-Wl,--as-needed,//g' -e 's/,--no-as-needed//g')"
$ cmake \
-DCMAKE_C_COMPILER=craycc \
-DCMAKE_CXX_COMPILER=crayCC \
-DCMAKE_Fortran_COMPILER=crayftn \
-DCMAKE_C_FLAGS="$(cc --cray-print-opts=cflags)" \
-DCMAKE_CXX_FLAGS="$(CC --cray-print-opts=cflags)" \
-DCMAKE_Fortran_FLAGS="$(ftn --cray-print-opts=cflags)" \
-DCMAKE_EXE_LINKER_FLAGS="${CRAY_WRAPPER_LINK_FLAGS}" \
..
Here we bypass all Cray wrappers (C/C++ and Fortran) and give CMake all the flags the wrappers would have implicitly added. This is clearly the recommended way in case the wrappers cause you problems. We give multiple examples for compilers other than Cray in this document, for a build of Kokkos with a HIP and an OpenMP CPU backend; the build is done using the Cray, amdclang++ or hipcc drivers. The above is transposable to build systems/generators other than CMake.
Note
The Cray wrappers use -I and not -isystem, which is suboptimal for strict code using many warning flags (as it should).
Note
Use the -craype-verbose flag to display the command line produced by the Cray wrapper. This must be called on a file to see the full output (i.e., CC -craype-verbose test.cpp). You may also try the --verbose flag to ask the underlying compiler to show the commands it itself launches.
crayftn optimization level details
We now provide a list of the differences between the flags implicitly enabled by -O1, -O2 and -O3. Understand that -O3 under the crayftn compiler is very aggressive and, when it comes to floating point optimizations, could be said to at least equate -Ofast under your typical Clang or GCC.
Warning
Cray reserves the right to change, for a new crayftn version, the options enabled through -O<n>.
The options given below are bound to Cray Fortran : Version 15.0.1. This may change with past and future versions.
-O1 provides:
-h scalar1,vector1,unroll2,fusion2,cache0,cblock0,noaggress
-h ipa1,mpi0,pattern,modinline
-h fp2=approx,flex_mp=default,alias=default:standard_restrict
-h fma
-h autoprefetch,noconcurrent,nooverindex,shortcircuit2
-h noadd_paren,nozeroinc,noheap_allocate
-h align_arrays,nocontiguous,nocontiguous_pointer
-h nocontiguous_assumed_shape
-h fortran_ptr_alias,fortran_ptr_overlap
-h thread1,nothread_do_concurrent,noautothread,safe_addr
-h noomp -f openmp-simd
-h caf,noacc
-h nofunc_trace,noomp_analyze,noomp_trace,nopat_trace
-h nobounds
-h nomsgs,nonegmsgs,novector_classic
-h dynamic
-h cpu=x86-64,x86-trento,network=slingshot10
-h nofp_trap -K trap=none
-s default32
-d 0abcdefgijnpvxzBDEGINPQSZ
-e hmqwACFKRTX
The discrepancies shown between O1
and O2
are:
-h scalar2,vector2
-h ipa3
-h thread2
The discrepancies shown between O2
and O3
or Ofast
are:
-h scalar3,vector3
-h ipa4
-h fp3=approx
AOCC flags
AMD gives a detailed description of the CPU optimization flags here: https://rocm.docs.amd.com/en/docs-5.5.1/reference/rocmcc/rocmcc.html#amd-optimizations-for-zen-architectures.
Understanding your compiler
GCC offers the following two flag combinations that allow you to dig deeper into the default choices made by the compiler for your architecture.
$ gcc -Q --help=target
$ # Works for clang too:
$ gcc -dM -E -march=znver3 - < /dev/null
Predefined preprocessor definitions
It can be useful to wrap code inside preprocessor control flow (ifdef). We provide some definitions that can help choose a path for workaround code.
Feature/flag/environment variable | Explanation
---|---
__INTEL_LLVM_COMPILER | For the C/C++ languages, the compiler is Intel’s new compiler (icx/icpx).
__INTEL_COMPILER | For the C/C++ languages, the compiler is Intel’s old compiler (icc/icpc).
__clang__ | For the C/C++ languages, the compiler is Clang or one of its downstream forks.
__GFORTRAN__ | For the Fortran language, the compiler is GNU or a compiler mimicking it.
_MSC_VER | For the C/C++ languages, the compiler is Microsoft MSVC or a compiler mimicking it.
__cray__ | For the C/C++ languages, the compiler is Cray (a superset of Clang).
_CRAYFTN | For the Fortran language, the compiler is Cray.
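One can quickly inspect which macros a given C compiler predefines; a sketch using the Cray wrapper (the grep pattern is illustrative):

$ cc -E -dM -x c /dev/null | grep -iE 'cray|clang|intel'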
Advanced tips, flags and environment variables for debugging
See LLVM Optimization Remarks by Ofek Shilon for more details on what Clang can tell you about how it optimizes your code and what tools are available to process that information.
Note
The crayftn compiler does not provide an option to trigger debug info generation while not lowering the optimization level.
Note
The crayftn compiler possesses an extremely powerful optimizer which does some of the most aggressive optimizations a compiler can afford to do. This means that, at high optimization levels, the optimizer will assume your code strongly complies with the standard. Any slight deviation from the standard can lead to significant issues in the code, from crashes to silent corruption. crayftn’s -O2 is considered stable, safe and comparable to the -O3 of other compilers. -hipa4 has led to issues in some codes. crayftn also has its share of internal bugs which can mess up your code too.
Job submission
SLURM is the workload manager used to interact with the compute nodes on Adastra. In the following subsections, the most commonly used SLURM commands for submitting, running, and monitoring jobs will be covered, but users are encouraged to visit the official documentation and man pages for more information. This section describes how to run programs on the Adastra compute nodes, including a brief overview of SLURM and also how to map processes and threads to CPU cores and GPUs.
The SLURM batch scheduler and job launcher
SLURM provides multiple ways of submitting and launching jobs on Adastra’s compute nodes: batch scripts, interactive, and single-command. The SLURM commands allowing these methods are shown in the table below and examples of their use can be found in the related subsections. Please note that regardless of the submission method used, the job will launch on compute nodes, with the first node in the allocation serving as head-node.
With SLURM, you first ask for resources (a number of nodes, GPUs, CPUs) and then you distribute these resources onto your tasks.
Command | Description
---|---
sbatch | Used to submit a batch script. The batch script can contain information on the amount of resources to allocate and how to distribute them. Options can be specified via the sbatch command flags or inside the script, at the top of the file, after the #SBATCH prefix. The sbatch options do not necessarily lead to the resource distribution per rank that you would expect (!). sbatch allocates, srun distributes. See Batch scripts for more details.
srun | Used to run a parallel job (job step) on the resources allocated with sbatch or salloc. If necessary, srun will first create a resource allocation in which to run the parallel job(s).
salloc | Used to allocate an interactive SLURM job allocation, where one or more job steps (i.e., srun commands) can then be launched on the allocated resources (i.e., nodes). See Interactive jobs for more details.
Batch scripts
A batch script can be used to submit a job to run on the compute nodes at a later time (the modules used in the scripts below are given as an indication; you may not need them if you use PyTorch, TensorFlow or the CINES Spack modules). In this case, stdout and stderr will be written to a file(s) that can be opened after the job completes. Here is an example of a simple batch script for the GPU (MI250) partition:
1#!/bin/bash
2#SBATCH --account=<account_to_charge>
3#SBATCH --job-name="<job_name>"
4#SBATCH --constraint=MI250
5#SBATCH --nodes=1
6#SBATCH --exclusive
7#SBATCH --time=1:00:00
8
9module purge
10
11# A CrayPE environment version
12module load cpe/24.07
13# An architecture
14module load craype-accel-amd-gfx90a craype-x86-trento
15# A compiler to target the architecture
16module load PrgEnv-cray
17# Some architecture related libraries and tools
18module load amd-mixed
19
20module list
21
22export MPICH_GPU_SUPPORT_ENABLED=1
23
24# export OMP_<ICV=XXX>
25
26srun --ntasks-per-node=8 --cpus-per-task=8 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>
Here is an example of a simple batch script for the GPU (MI300A) partition:
1#!/bin/bash
2#SBATCH --account=<account_to_charge>
3#SBATCH --job-name="<job_name>"
4#SBATCH --constraint=MI300
5#SBATCH --nodes=1
6#SBATCH --exclusive
7#SBATCH --time=1:00:00
8
9module purge
10
11# A CrayPE environment version
12module load cpe/24.07
13# An architecture
14module load craype-accel-amd-gfx942 craype-x86-genoa
15# A compiler to target the architecture
16module load PrgEnv-cray
17# Some architecture related libraries and tools
18module load amd-mixed
19
20module list
21
22export MPICH_GPU_SUPPORT_ENABLED=1
23
24# export OMP_<ICV=XXX>
25
26srun --ntasks-per-node=4 --cpus-per-task=24 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>
Here is an example of a simple batch script for the CPU (GENOA) partition:
1#!/bin/bash
2#SBATCH --account=<account_to_charge>
3#SBATCH --job-name="<job_name>"
4#SBATCH --constraint=GENOA
5#SBATCH --nodes=1
6#SBATCH --exclusive
7#SBATCH --time=1:00:00
8
9module purge
10
11# A CrayPE environment version
12module load cpe/24.07
13# An architecture
14module load craype-x86-genoa
15# A compiler to target the architecture
16module load PrgEnv-cray
17
18module list
19
20
21
22
23
24# export OMP_<ICV=XXX>
25
26srun --ntasks-per-node=24 --cpus-per-task=8 --threads-per-core=1 -- <executable> <arguments>
Assuming the file is called job.sh on disk, you would launch it like so: sbatch job.sh.
Options encountered after the first non-comment line will not be read by SLURM. In the example scripts, the lines are:
Line | Description
---|---
1 | Shell interpreter line.
2 | GENCI/DARI project to charge. More on that below.
3 | Job name.
4 | Type of Adastra node requested (here, the GPU MI250/MI300 or CPU GENOA partition).
5 | Number of compute nodes requested.
6 | Ask SLURM to reserve whole nodes. If this is not wanted, see Shared mode vs exclusive mode.
7 | Wall time requested (HH:MM:SS).
9-20 | Setup of the module environment, always starting with a purge.
22 | (For the MI250/MI300 partition scripts) Enable GPU aware MPI. You can pass GPU buffers directly to the MPI APIs.
24 | Potentially, setup some OpenMP environment variables.
26 | Implicitly ask to use all of the nodes allocated. Then we distribute the work onto 8, 4 or 24 tasks per node, depending on the partition. We also specify that the tasks should be bound to 8 or 24 cores, without Simultaneous Multithreading (SMT), and, on the GPU partitions, to the GPU closest to these cores.
The SLURM submission options are preceded by #SBATCH, making them appear as comments to a shell (since comments begin with #). SLURM will look for submission options from the first line through the first non-comment line. The mandatory SLURM flags are: the account identifier (also called project ID or project name, specified via --account=), more on that later; the type of node (via --constraint=); the maximal job runtime duration (via --time=); and the number of nodes (via --nodes=).
Some more advanced scripts are available in this document and this repository (though, the scripts of this repository are quite old).
Warning
A proper binding is often critical for HPC applications. We strongly recommend that you either make sure your binding is correct (say, using this tool hello_cpu_binding) or that you take a look at the binding scripts presented in Proper binding, why and how.
Note
The binding srun does is only able to restrict a rank to a set of hardware threads (process affinity towards hardware threads). It does not do what is called thread pinning/affinity. To exploit thread pinning, you may want to check OpenMP’s ${OMP_PROC_BIND} and ${OMP_PLACES} Internal Control Variables (ICVs)/environment variables. Bad thread pinning can be detrimental to performance. Check this document for more details.
The typical OpenMP ICVs used to prevent and diagnose thread affinity issues are the following environment variables:
# Log the rank to core/thread placement so you can check it is correct.
export OMP_DISPLAY_AFFINITY=TRUE
export OMP_PROC_BIND=CLOSE
export OMP_PLACES=THREADS
# This should be redundant because srun already restricts the rank's CPU
# access.
export OMP_NUM_THREADS=<N>
Common SLURM submission options
The table below summarizes commonly-used SLURM job submission options:
Command (long or short) | Description
---|---
--account=<account>, -A <account> | Account identifier (also called project ID) to use and charge for the compute resource consumption. More on that below.
--constraint=<constraint>, -C <constraint> | Type of Adastra node. The accepted values include MI250, MI300 and GENOA.
--time=<duration>, -t <duration> | Maximum duration as wall clock time (HH:MM:SS).
--nodes=<nnodes>, -N <nnodes> | Number of compute nodes.
--job-name=<name>, -J <name> | Name of the job.
--output=<filename>, -o <filename> | Standard output file name.
--error=<filename>, -e <filename> | Standard error file name.
For more information about these or other options, please see the sbatch man page.
Resource consumption and charging
French computing site resources are accounted in hours of use of a given resource type. For instance, at CINES, if you have been given 100'000 hours on Adastra’s MI250X partition, it means that you could use a single unit of MI250X resource for 100'000 hours. It also means that you could use 400 units of MI250X resource for 250 hours. The units are given below:
Computing resource | Unit description | Example
---|---|---
MI250X partition | 2 GCDs (GPU devices) of an MI250X card, that is, a whole MI250X. | 1 hour on an MI250X node (exclusive) = 4 MI250X hours.
MI300A partition | 1 GPU device. | 1 hour on an MI300A node (exclusive) = 4 MI300A hours.
GENOA partition | 1 core (2 logical threads). | 1 hour on a GENOA node (exclusive) = 192 GENOA core hours.
Warning
Due to an historical mistake, the eDARI website uses a unit for the MI250X partition that is a whole MI250X instead of the GCD, which is half of an MI250X. If you ask for 50 MI250X hours on eDARI, you can, in practice, use 100 MI250X GCD hours.
The resources you consume have to be charged to a project. Multiple times in this document we have invoked the --account=<account_to_charge> SLURM flag. Before submitting a job, make sure you have set a valid <account_to_charge>. You can obtain the list of accounts you are attached to by running the myproject -l command. The values representing the account names you can charge are on the last line of the command output (i.e.: Liste des projets de calcul associés au user someuser : ['bae1234', 'eat4567', 'afk8901']). More on myproject in the Login unique section.
We do not charge for HPDA resources.
In addition, the <constraint> in --constraint=<constraint> should be set to a proper value, as it is this SLURM flag that describes the kind of resource you request and thus what CINES will charge.
Note
To monitor your compute hour consumption, use the myproject --state [project] command or visit https://reser.cines.fr/.
Warning
The charging gets a little bit less simple when you use the shared nodes.
Quality Of Service (QoS) queues
CINES respects the SLURM scheduler fair share constraints described by GENCI and common to CINES, IDRIS and TGCC.
On Adastra, queues are transparent; CINES does not publicize the QoS. The user should not try to specify anything related to that subject (such as --qos=). The SLURM scheduler will automatically place your job in the right QoS depending on the duration and quantity of resources asked.
Queue priority rules are harmonized between the 3 computing centers (CINES, IDRIS and TGCC). A higher priority is given to large jobs, as Adastra is primarily dedicated to running large HPC jobs. The SLURM fairshare concept is up and running, meaning that, assuming a linear consumption over the allocation period, a user above the line will have a lower priority than a user who is below the line. We may artificially lower a user’s priority if we notice bad practices (such as launching thousands of small jobs on an HPC machine). Priorities are calculated over a sliding window of one week. With a little patience, your job will eventually be processed.
The best advice we can give you is to correctly size your jobs. First check which node configuration works best: adjust the number of MPI ranks, OpenMP threads and the binding on a single node. Then do some scaling tests. Finally, do not specify a SLURM --time argument larger than what you really need; this is the most common scheduler misconfiguration on the user’s side.
srun
The default job launcher for Adastra is srun. The srun command is used to execute an MPI enabled binary on one or more compute nodes in parallel. It is responsible for distributing the resources allocated by an salloc or sbatch command onto MPI ranks.
$ # srun [OPTIONS... [executable [args...]]]
$ srun --ntasks-per-node=24 --cpus-per-task=8 --threads-per-core=1 -- <executable> <arguments>
<output printed to terminal>
The output options have been removed since stdout and stderr are typically desired in the terminal window in this usage mode.
srun accepts the following common options:
|
Number of nodes |
|
Total number of MPI tasks (default is 1). |
-c, --cpus-per-task=<ncpus> |
Logical cores per MPI task (default is 1).
When used with
--threads-per-core=1 : -c is equivalent to physical cores per task.We do not advise that you use this option when using
--cpu-bind=none . |
--cpu-bind=threads |
Bind tasks to CPUs.
threads - (default, recommended) Automatically generate masks binding tasks to threads. |
--threads-per-core=<threads> |
In task layout, use the specified maximum number of hardware threads per core.
(default is 2; there are 2 hardware threads per physical CPU core).
Must also be set in
salloc or sbatch if using --threads-per-core=2 in your srun command.threads-per-core should always be used instead of hint=nomultithread `` or ``hint=multithread . |
|
Try harder at killing the whole step if a process fails and return an error code different than 1. |
-m, --distribution=<value>:<value>:<value> |
Specifies the distribution of MPI ranks across compute nodes, sockets (L3 regions), and cores, respectively.
The default values are
block:cyclic:cyclic , see man srun for more information.Currently, the distribution setting for cores (the third
<value> entry) has no effect on Adastra |
--ntasks-per-node=<ntasks> |
If used without
-n : requests that a specific number of tasks be invoked on each node.If used with
-n : treated as a maximum count of tasks per node. |
|
Specify the number of GPUs required for the job (total GPUs across all nodes). |
|
Specify the number of GPUs per node required for the job. |
--gpu-bind=closest |
Binds each task to the GPU which is on the same NUMA domain as the CPU core the MPI rank is running on.
See the
--gpu-bind=closest example in Proper binding, why and how for more details. |
--gpu-bind=map_gpu:<list> |
Bind tasks to specific GPUs by setting GPU masks on tasks (or ranks) as specified where
<list> is <gpu_id_for_task_0>,<gpu_id_for_task_1>,... . If the number of tasks (orranks) exceeds the number of elements in this list, elements in the list will be reused as
needed starting from the beginning of the list. To simplify support for large task
counts, the lists may follow a map with an asterisk and repetition count. (For example
map_gpu:0*4,1*4 ). |
|
Request that there are ntasks tasks invoked for every GPU. |
--label |
Prefix every written lines from stderr or stdout with
<rank index>: where <rank index> starts at zeroand matches the MPI rank index that the writing process is.
|
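As an illustration, several of these options are commonly combined; below is a sketch for two nodes, assuming a node exposing 8 GPUs (the rank count and binding must be adapted to your application and the partition you target):
$ srun --nodes=2 --ntasks-per-node=8 --cpus-per-task=8 --threads-per-core=1 --gpus-per-node=8 --gpu-bind=closest --label -- <executable> <arguments>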
Interactive jobs
Most users will find batch jobs to be an easy way to use the system: they allow the user to hand off a job to the scheduler and focus on other tasks while the job waits in the queue and eventually runs. Occasionally, it is necessary to run interactively, especially when developing, testing, modifying or debugging a code.
Since all compute resources are managed and scheduled by SLURM, it is not possible to simply log into the system and immediately begin running parallel codes interactively. Rather, you must request the appropriate resources from SLURM and, if necessary, wait for them to become available. This is done through an interactive batch job. Interactive batch jobs are submitted with the salloc
command. Resources are requested via the same options that are passed via #SBATCH
in a regular batch script (but without the #SBATCH
prefix). For example, to request an interactive batch job with MI250 resources, you would use salloc --account=<account_to_charge> --constraint=MI250 --job-name="<job_name>" --nodes=1 --time=1:00:00 --exclusive
. Note that there is no option for an output file: you are running interactively, so standard output and standard error will be displayed to the terminal.
You can then run the command you would generally put in the batch script: srun --ntasks-per-node=2 --cpus-per-task=8 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>
.
If you want to connect to a node, you can directly ssh into it, assuming you have it allocated.
You can also start a shell environment as a SLURM step (which on some machines is the only way to get interactive node access): srun --pty -- "${SHELL}"
.
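Put together, an interactive session could look like the sketch below (the account, constraint and sizes are placeholders):
$ salloc --account=<account_to_charge> --constraint=MI250 --job-name="<job_name>" --nodes=1 --time=1:00:00 --exclusive
$ srun --ntasks-per-node=2 --cpus-per-task=8 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>
$ # Alternatively, spawn an interactive shell on the allocated node as a SLURM step:
$ srun --pty -- "${SHELL}"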
Small job
Allocating a single GPU
The command below will allocate 1 GPU and 8 cores (no SMT) for 60 minutes.
$ srun \
--account=<account_to_charge> \
--constraint=MI250 \
--nodes=1 \
--time=1:00:00 \
--gpus-per-node=1 \
--ntasks-per-node=1 \
--cpus-per-task=8 \
--threads-per-core=1 \
-- <executable> <arguments>
Note
This is more of a hack than a serious usage of SLURM concepts or of HPC resources.
Packing
Note
We strongly advise that you get familiar with Adastra’s SLURM’s queuing concepts.
If your workflow consists of many small jobs, you may rely on the shared mode. That said, if you run many small jobs that can, put together, fill a whole node, you should use a whole node, not a shared one. This may shorten your queue time, as we have, and want to keep, a small shared node count.
This is how we propose you use a whole node:
#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=GENOA
#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --time=1:00:00
set -eu
set -x
# How many runs your logic needs.
STEP_COUNT=128
# Due to the parallel nature of the SLURM steps described below, we need a
# way to properly log each one of them. See the:
# 2>&1 | tee "StepLogs/${SLURM_JOBID}.${I}"
mkdir -p StepLogs
for ((I = 0; I < STEP_COUNT; I++)); do
srun --exclusive --nodes=2 --ntasks-per-node=3 --cpus-per-task="4" --threads-per-core=1 --label \
-- ./work.sh 2>&1 | tee "StepLogs/${SLURM_JOBID}.${I}" &
done
# We started STEP_COUNT steps AKA srun processes, wait for them.
wait
In the script above, the steps will all be initiated, but each one will start only when enough resources are available in the set of allocated resources (here, we asked for 4 nodes). Here, work.sh represents your workload. This workload would be executed STEP_COUNT*nodes*ntasks-per-node = 128*2*3 = 768 times, each instance with 4 cores. SLURM will automatically fill the allocated resources (here, 4 nodes), queuing and starting steps as needed.
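For completeness, a hypothetical work.sh could look like the sketch below (the program name and input naming are illustrative; SLURM_PROCID and SLURM_STEP_ID are environment variables SLURM exports to every task of a step):
#!/bin/bash
# One instance of this script runs per task of a step (6 per step here).
echo "Step ${SLURM_STEP_ID}, rank ${SLURM_PROCID}, on node $(hostname)."
./my_program --input "input_${SLURM_STEP_ID}.dat"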
Chained job
SLURM offers a feature allowing the user to chain jobs. The user can, in fact, define a dependency graph of jobs.
As an example, say we want to start a job represented by my_first_job.sh, then a second job my_second_job.sh which should start only when my_first_job.sh has finished (the chain can be extended further, as shown with my_other_job.sh):
$ sbatch my_first_job.sh
Submitted batch job 189562
$ sbatch --dependency=afterok:189562 my_second_job.sh
Submitted batch job 189563
$ sbatch --dependency=afterok:189563 my_other_job.sh
Submitted batch job 189564
In this example, we use the afterok trigger, meaning that a job will start only if its parent job ended successfully (exit code 0).
You will then see something like this in squeue:
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
189562 mi250 test bae R 0:04 1 g1057
189563 mi250 test bae PD 0:00 1 (Dependency)
189564 mi250 test bae PD 0:00 1 (Dependency)
Note the Dependency reason on the jobs waiting for their parent.
You can replace afterok by after, afterany, afternotok or singleton.
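A job can also depend on several parent jobs at once, by chaining job identifiers; a sketch (the identifiers and script name are illustrative):
$ sbatch --dependency=afterok:189562:189563 my_final_job.sh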
More information here: https://slurm.schedmd.com/sbatch.html#OPT_dependency
Job array
Warning
If you launch job arrays, ensure that they do not contain more than 128 jobs, or you will get an error related to AssocMaxSubmitJobLimit.
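As a reminder, a job array is submitted by passing an index range to sbatch; each array task then receives its index through the SLURM_ARRAY_TASK_ID environment variable. A minimal sketch (the script name and range are illustrative):
$ sbatch --array=0-127 job.sh
Inside job.sh, ${SLURM_ARRAY_TASK_ID} can then be used to select the work item, for instance: ./my_program --input "input_${SLURM_ARRAY_TASK_ID}.dat".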
Other common SLURM commands
The table below summarizes commonly-used SLURM commands:

| Command | Description |
|---|---|
| `sinfo` | Used to view partition and node information. E.g., to view user-defined details about the batch queue: `sinfo -p batch -o "%15N %10D %10P %10a %10c %10z"` |
| `squeue` | Used to view job and job step information for jobs in the scheduling queue. E.g., to see your own jobs: `squeue -l --me` |
| `sacct` | Used to view accounting data for jobs and job steps in the job accounting log (currently in the queue or recently completed). E.g., to see specified information about all jobs submitted/run by a user since 1 PM on January 4, 2023: `sacct -u <login> -S 2023-01-04T13:00:00 -o "jobid%5,jobname%25,user%15,nodelist%20" -X` |
| `scancel` | Used to signal or cancel jobs or job steps. E.g., to cancel a job: `scancel <job_id>` |
We describe some of the usage of these commands below in Monitoring and modifying batch jobs.
Job state
A job will transition through several states during its lifetime. Common ones include:
| State Code | State | Description |
|---|---|---|
| CA | Canceled | The job was canceled (could’ve been by the user or an administrator). |
| CD | Completed | The job completed successfully (exit code 0). |
| CG | Completing | The job is in the process of completing (some processes may still be running). |
| PD | Pending | The job is waiting for resources to be allocated. |
| R | Running | The job is currently running. |
Job reason codes
In addition to state codes, jobs that are pending will have a reason code to explain why the job is pending. Completed jobs will have a reason describing how the job ended. Some codes you might see include:
| Reason | Meaning |
|---|---|
| Dependency | Job has dependencies that have not been met. |
| JobHeldUser | Job is held at the user’s request. |
| JobHeldAdmin | Job is held at the system administrator’s request. |
| Priority | Other jobs with higher priority exist for the partition/reservation. |
| Reservation | The job is waiting for its reservation to become available. |
| AssocMaxJobsLimit | The job is being held because the user/project has hit the limit on running jobs. |
| AssocMaxSubmitJobLimit | The limit on the number of jobs a user is allowed to have running or pending at a given time has been met for the requested association (array). |
| ReqNodeNotAvail | The user requested a particular node, but it is currently unavailable (it is in use, reserved, down, draining, etc.). |
| JobLaunchFailure | Job failed to launch (could be due to system problems, an invalid program name, etc.). |
| NonZeroExitCode | The job exited with some code other than 0. |
Many other states and job reason codes exist. For a more complete description, see the squeue man page (either on the system or online).
More reasons are given in the official SLURM documentation.
Monitoring and modifying batch jobs
scancel: Cancel or signal a job
SLURM allows you to signal a job with the scancel
command. Typically, this is used to remove a job from the queue. In this use case, you do not need to specify a signal and can simply provide the jobid. For example, scancel 12345
.
In addition to removing a job from the queue, the command gives you the ability to send other signals to the job with the -s
option. For example, if you want to send SIGUSR1
to a job, you would use scancel -s USR1 12345
.
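On the receiving side, a batch script can trap such a signal, for instance to trigger checkpointing logic. Below is a minimal sketch, assuming a hypothetical marker-file handler (you can also ask SLURM itself to send a signal shortly before the time limit with the sbatch --signal option):
#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --time=1:00:00
# Hypothetical handler: drop a marker file; a real code would checkpoint here.
checkpoint() { touch "checkpoint_requested.${SLURM_JOBID}"; }
trap checkpoint USR1
srun -- <executable> <arguments> &
# 'wait' is interrupted when the trapped signal arrives, running the handler.
wait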
squeue: View the job queue
The squeue
command is used to show the batch queue. You can filter the level of detail through several command-line options. For example:
| Command | Description |
|---|---|
| `squeue -l` | Show all jobs currently in the queue. |
| `squeue -l --me` | Show all of your jobs currently in the queue. |
| `squeue -l --me --start` | Show all of your jobs that have yet to start, and show their expected start time. |
sacct: Get job accounting information
The sacct command gives detailed information about jobs currently in the queue and recently-completed jobs. You can also use it to see the various steps within a batch job.
| Command | Description |
|---|---|
| `sacct -a -X` | Show all jobs (`-a`), but only the overall jobs and not their individual steps (`-X`). |
| `sacct` | Show all of your jobs, and show the individual steps (since there was no `-X` flag; by default, `sacct` reports on the current user). |
| `sacct -j 12345` | Show all job steps that are part of job 12345. |
| `sacct -u <login> -S 2022-07-01T13:00:00 -o "jobid%5,jobname%25,user%15,nodelist%20" -X` | Show all of your jobs since 1 PM on July 1, 2022, using a particular output format. |
scontrol show job: Get detailed job information
In addition to holding, releasing, and updating the job, the scontrol
command can show detailed job information via the show job
subcommand. For example, scontrol show job 12345
.
Note
scontrol show job can only report information on a job that is in the queue, that is, pending or running (though there are more states). A finished job is no longer in the queue and is not queryable with scontrol show job.
Obtaining the energy consumption of a job
On Adastra, we enable users to monitor the energy their jobs consume.
$ sacct --format=JobID,ElapsedRaw,ConsumedEnergyRaw,NodeList --jobs=<job_id>
JobID ElapsedRaw ConsumedEnergyRaw NodeList
-------------- ---------- ----------------- ---------------
<job_id> 104 12934230 c[1000-1003,10+
<job_id>.batch 104 58961 c1000
<job_id>.0 85 12934230 c[1000-1003,10+
The user obtains, for a given <job_id>, the elapsed time in seconds and the energy consumption in joules for the whole job, for the execution of the batch script, and for each job step. The job steps are suffixed with \.[0-9]+ (in regex form).
Each time you execute the srun command in a batch script, it creates a new job step. Here, there is only one srun step, which took 85 seconds and 12934230 joules.
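For a quick sanity check, dividing the energy by the elapsed time gives the average power draw of a step: in the example above, 12934230 J / 85 s ≈ 152 kW, aggregated over all the allocated nodes.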
Note
The duration of the step as reported by SLURM is not reliable for a short step. There may be an additional ~10 seconds.
Note
You will only get meaningful values regarding a job step once the job step has ended.
Note
The energy returned represents the aggregated node consumption. We do not include the network and storage costs, as these are trickier to obtain and represent a near fixed cost anyway (that is, whether or not you run your code).
Note
Some compute nodes may not return an energy consumed value. This leads to a value of 0 or an empty field under ConsumedEnergyRaw. To work around the issue, one can use the following command: scontrol show node | grep -e "CurrentWatts=n/s" -e "CurrentWatts=0" -B15 | grep "NodeName=" | cut -d '=' -f 2 | awk '{print $1}' | tr '\n' ',' and feed the result to the SLURM commands' --exclude= option. For instance: sbatch --exclude="$(scontrol show node | grep -e "CurrentWatts=n/s" -e "CurrentWatts=0" -B15 | grep "NodeName=" | cut -d '=' -f 2 | awk '{print $1}' | tr '\n' ',')" job.sh.
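The same workaround reads more easily in two steps (a sketch; the variable name is illustrative):
$ # Build a comma separated list of the nodes reporting no usable energy counter.
$ BROKEN_NODES="$(scontrol show node | grep -e "CurrentWatts=n/s" -e "CurrentWatts=0" -B15 | grep "NodeName=" | cut -d '=' -f 2 | awk '{print $1}' | tr '\n' ',')"
$ sbatch --exclude="${BROKEN_NODES}" job.sh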
Note
The counters SLURM uses to compute the energy consumption are visible in the following files: /sys/cray/pm_counters/*
.
Coredump files
If you start a program through our batch scheduler (SLURM) and your program crashes, you will find your coredump files in the ${SCRATCHDIR}/COREDIR/<job_id>/<hostname> directory. ${SCRATCHDIR} corresponds to the scratch directory associated with your user and the project specified via the #SBATCH --account=<account_to_charge> batch script option. The files are stored in different folders depending on the <job_id>. Additionally, if your job ran on multiple nodes, it is useful to be able to tell which coredump file originates from which node; thus, the <hostname> of the node is part of the coredump file path.
The coredump filename has the following structure: core_<signal>_<timestamp>_<process_name>.dump.<process_identifier> (the equivalent core pattern being core_%s_%t_%e.dump). As an example, you could get a coredump filename such as:
core_11_1693986879_segfault_testco.dump.2745754
You can then inspect a coredump file using tools such as GDB, like so:
$ gdb ./path/to/program.file ./path/to/coredump.file
You can find more information on GDB and coredump files here.
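Once GDB has loaded the coredump, a few standard commands are usually enough to locate the crash (a sketch; these are generic GDB commands, not specific to Adastra):
(gdb) bt            # Print the backtrace of the thread that crashed.
(gdb) info threads  # List all threads captured in the coredump.
(gdb) thread 2      # Switch to thread 2, then 'bt' again for its backtrace.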
Warning
Be careful that you do not fill your whole scratch space quota with coredump files, notably if you run a large job that crashes.
Note
On Adastra, ulimit -c unlimited is the default. The coredump placement to scratch works on the HPDA, MI250 and GENOA partitions. To deactivate core dumping, run the following command in, say, your batch script: ulimit -c 0.
Note
Use gcore <pid> to explicitly generate a core file of a running program.
Warning
For the placement of the coredumps in the scratch directory to work, one needs to use either a batch script or the salloc + srun commands. Simply allocating (salloc) and sshing into the node will not properly configure the coredump placement mechanism. Also, one needs to request nodes in exclusive mode for the placement to work (in shared mode, it will not work).