Accessing Adastra

This document is a quick start guide for the Adastra machine. You can find additional information on the GENCI’s website and in this booklet.

Account opening

To access Adastra you need an account on the Demande d’Attribution de Ressources Informatique (DARI) website. Then, on eDARI, you need to ask to be associated with a research project that has been attributed Adastra compute hours. Following that, you can ask on eDARI for your personal account to be created on the machine (Adastra in this context). You will have to fill in a form which, to be valid, must be dated and electronically signed by the three parties below:

  • The person who made the request;

  • the user’s security representative (often affiliated with their laboratory);

  • the laboratory director.

You will then receive, via email, the instructions containing your credentials.

Connecting

To connect to Adastra, ssh to adastra.cines.fr.

$ ssh <login>@adastra.cines.fr

Warning

Authenticating to Adastra using ssh keys is not permitted. You will have to enter your password.

To connect to a specific login node, use:

$ ssh <login>@adastra<login_node_number>.cines.fr

Where <login_node_number> represents an integer login node identifier. For instance, ssh anusername@adastra5.cines.fr will connect you to login node number 5.

X11 forwarding

Automatic forwarding of the X11 display to a remote computer is possible with the use of SSH and a local (i.e., on your desktop) X server. To set up automatic X11 forwarding within SSH, you can do one of the following:

Invoke ssh with -X:

$ ssh -X <login>@adastra.cines.fr

Note that use of the -x flag (lowercase) will disable X11 forwarding. Users should not manually set the ${DISPLAY} environment variable for X11 forwarding.

Warning

If you have issues when launching a GUI application, make sure they are not related to the .Xauthority file. If they are, or you are not sure, check out the .Xauthority file document.

Login unique

The login unique (in English, single sign-on or unique login) is a feature of the CINES supercomputer that enables a user to work on multiple projects using a single, unique login. These logins (also called usernames) remain valid for the lifetime of the machine (though the data may not, see Quotas for more details). This simplifies authentication over time. This procedure is already used in the other two national centres (IDRIS and TGCC). The method for logging into the machine remains the same as described above. Once you are logged in, you get access to one of your home directories, namely the home associated with your current project (if you have one). At this stage, you can adapt your environment to the project you wish to work on with the help of the myproject command.

The unique login tools modify your Unix group and some environment variables. If you use scripts that are automatically loaded or that are expected in a specific location (say .bashrc), check out the notes in the Layout of common files and directories and Accessing the storage areas documents.

In this section we present the myproject command. When freshly connected, your shell’s working directory will be your current project’s personal home directory or, if your account is not linked to any project, your personal home. Again, refer to Accessing the storage areas for more details on the various storage areas. Your first step could be to list the flags myproject supports, which can be done like so:

$ myproject --help
usage: my_project.py [-h] [-s [project] | -S | -l | -a project | -c | -C | -m [project]]

Manage your hpc projects. The active project is the current project in your
session.

optional arguments:
-h, --help            show this help message and exit
-s [project], --state [project]
                        Get current HPC projects state
-S, --stateall        Get all HPC projects state
-l, --list            List all authorized HPC projects
-a project, --activate project
                        Activate the indicated project
-c, --cines           List projects directories CINES variables
-C, --ccfr            List projects directories CCFR variables
-m [project], --members [project]
                        List all members of a project

The most used flags are -l to list the projects you are assigned to, -a to switch project and -c to list the environment variables described in Accessing the storage areas.

Listing the environment variables and their value

This is done like so (assuming a user with login someuser):

$ myproject -c
Liste des variables CINES permettant l'accès aux répertoires dans les différents espaces de stockage
----------------------------------------------------------------------------------------------------
Project actif: dci

OWN_HOMEDIR :     /lus/home/PERSO/grp_someuser/someuser

HOMEDIR :          /lus/home/BCINES/dci/someuser
SHAREDHOMEDIR :    /lus/home/BCINES/dci/SHARED
SCRATCHDIR :       /lus/scratch/BCINES/dci/someuser
SHAREDSCRATCHDIR : /lus/scratch/BCINES/dci/SHARED
WORKDIR :          /lus/work/BCINES/dci/someuser
SHAREDWORKDIR :    /lus/work/BCINES/dci/SHARED
STOREDIR :         /lus/store/BCINES/dci/someuser


gda2212_HOMEDIR :          /lus/home/NAT/gda2212/someuser
gda2212_SHAREDHOMEDIR :    /lus/home/NAT/gda2212/SHARED
gda2212_SCRATCHDIR :       /lus/scratch/NAT/gda2212/someuser
gda2212_SHAREDSCRATCHDIR : /lus/scratch/NAT/gda2212/SHARED
gda2212_WORKDIR :          /lus/work/NAT/gda2212/someuser
gda2212_SHAREDWORKDIR :    /lus/store/NAT/gda2212/SHARED
gda2212_STOREDIR :         /lus/store/NAT/gda2212/someuser

dci_HOMEDIR :          /lus/home/BCINES/dci/someuser
dci_SHAREDHOMEDIR :    /lus/home/BCINES/dci/SHARED
dci_SCRATCHDIR :       /lus/scratch/BCINES/dci/someuser
dci_SHAREDSCRATCHDIR : /lus/scratch/BCINES/dci/SHARED
dci_WORKDIR :          /lus/work/BCINES/dci/someuser
dci_SHAREDWORKDIR :    /lus/store/BCINES/dci/SHARED
dci_STOREDIR :         /lus/store/BCINES/dci/someuser

Observe that the actif project (French for active, i.e. current, project) is dci in the example above. This should be interpreted as: the shell is currently set up so that the generic environment variables point to that project’s filesystem directories. For instance, ${SHAREDSCRATCHDIR} would point to the actif project’s group shared scratch space, in this case /lus/scratch/BCINES/dci/SHARED. For more details on the file system spaces CINES offers, see Accessing the storage areas.

As such, an actif project does not relate to any DARI notion of activated, valid, or ongoing project.

Listing associated projects

This is done like so (assuming a user with login someuser):

$ myproject -l
Projet actif: dci

Liste des projets de calcul associés à l'utilisateur 'someuser' : ['gda2211', 'gda2212', 'gda2215', 'dci']

Switching project

You can rely on the ${ACTIVE_PROJECT} environment variable to obtain the currently used project:

$ echo ${ACTIVE_PROJECT}
dci

This is done like so (assuming a user with login someuser):

$ myproject -a gda2212
Projet actif :dci

Bascule du projet "dci" vers le projet "gda2212"
Projet " gda2212 " activé.
$ myproject -c
Liste des variables CINES permettant l'accès aux répertoires dans les différents espaces de stockage
----------------------------------------------------------------------------------------------------
Project actif: gda2212

OWN_HOMEDIR :     /lus/home/PERSO/grp_someuser/someuser

HOMEDIR :          /lus/home/NAT/gda2212/someuser
SHAREDHOMEDIR :    /lus/home/NAT/gda2212/SHARED
SCRATCHDIR :       /lus/scratch/NAT/gda2212/someuser
SHAREDSCRATCHDIR : /lus/scratch/NAT/gda2212/SHARED
WORKDIR :          /lus/work/NAT/gda2212/someuser
SHAREDWORKDIR :    /lus/work/NAT/gda2212/SHARED
STOREDIR :         /lus/store/NAT/gda2212/someuser


gda2212_HOMEDIR :          /lus/home/NAT/gda2212/someuser
gda2212_SHAREDHOMEDIR :    /lus/home/NAT/gda2212/SHARED
gda2212_SCRATCHDIR :       /lus/scratch/NAT/gda2212/someuser
gda2212_SHAREDSCRATCHDIR : /lus/scratch/NAT/gda2212/SHARED
gda2212_WORKDIR :          /lus/work/NAT/gda2212/someuser
gda2212_SHAREDWORKDIR :    /lus/store/NAT/gda2212/SHARED
gda2212_STOREDIR :         /lus/store/NAT/gda2212/someuser

dci_HOMEDIR :          /lus/home/BCINES/dci/someuser
dci_SHAREDHOMEDIR :    /lus/home/BCINES/dci/SHARED
dci_SCRATCHDIR :       /lus/scratch/BCINES/dci/someuser
dci_SHAREDSCRATCHDIR : /lus/scratch/BCINES/dci/SHARED
dci_WORKDIR :          /lus/work/BCINES/dci/someuser
dci_SHAREDWORKDIR :    /lus/store/BCINES/dci/SHARED
dci_STOREDIR :         /lus/store/BCINES/dci/someuser

As you can see, ${HOMEDIR}, ${SHAREDHOMEDIR}, etc. have changed when the user switched project (compared to the output presented earlier). The prefixed variables such as ${dci_HOMEDIR}, however, did not change; using them is the recommended way to reference a directory when you do not know which project will be active at the time the variable is used (say, in a script).

Some issues can be encountered when using tools that are unaware of the multiple-home structure. Yet again, check the Layout of common files and directories and Accessing the storage areas documents.

Layout of common files and directories

Due to the new functionality introduced through Login unique, you may find the Accessing the storage areas document useful. It describes the multiple home directories and how to access them through environment variables (${HOMEDIR}, ${OWN_HOMEDIR}, etc.).

Some subtleties need addressing; see below.

.bashrc file

Your .bashrc file should be accessible in the ${HOMEDIR} directory (project personal home).

Using symbolic links, you can prevent file redundancy by first storing your .bashrc in your ${OWN_HOMEDIR} and then creating a link in your ${HOMEDIR}. Effectively, you are factorizing the .bashrc:

$ ln -s "${OWN_HOMEDIR}/.bashrc" "${HOMEDIR}/.bashrc"

If you want your .bashrc to be loaded when you log in to the machine, you need to make sure a file called .bash_profile is present in your ${HOMEDIR} directory (project personal home). If this file is not present, it should be created with the following content:

if [ -f ~/.bashrc ]; then
    source ~/.bashrc
fi

Similarly to the .bashrc, you can use links to factorize this file.
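For instance, a minimal sketch, assuming you keep the master copy of .bash_profile in your personal home and link it from the project personal home:

$ ln -s "${OWN_HOMEDIR}/.bash_profile" "${HOMEDIR}/.bash_profile"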

.ssh directory

Your .ssh directory should be accessible in the ${OWN_HOMEDIR} directory (personal home).

Optionally, you can create a link in your ${HOMEDIR} pointing to ${OWN_HOMEDIR}/.ssh.
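A minimal sketch of such a link (assuming the .ssh directory already exists in ${OWN_HOMEDIR}):

$ ln -s "${OWN_HOMEDIR}/.ssh" "${HOMEDIR}/.ssh"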

.Xauthority file

Your .Xauthority file should be accessible in the ${HOMEDIR} directory (project personal home).

In practice, this file gets created by the system in the ${OWN_HOMEDIR} directory (personal home). You need to create a link like so:

$ ln -s "${OWN_HOMEDIR}/.Xauthority" "${HOMEDIR}/.Xauthority"

Note

Make sure your ${XAUTHORITY} environment variable correctly points to ${OWN_HOMEDIR}/.Xauthority.
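As a quick check (sample output reusing the someuser example of this document; your path will differ):

$ echo "${XAUTHORITY}"
/lus/home/PERSO/grp_someuser/someuser/.Xauthority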

Programming environment

The programming environment includes compiler toolchains, libraries, performance analysis and debugging tools, and optimized scientific libraries. Adastra, being a Cray machine, uses the Cray Programming Environment, abbreviated CrayPE or CPE. In practice, a CrayPE is simply a set of modules. This section tries to shed light on the subtleties of the system’s environment.

The Cray documentation is available in the man pages (prefixed with intro_) and is starting to be mirrored and enhanced at this URL https://cpe.ext.hpe.com/docs/.

Module, why and how

Like on many HPC machines, software is provided through modules. A module can mostly be seen as a set of environment variables. Variables such as ${PATH} and ${LD_LIBRARY_PATH} are modified to introduce new tools into the environment. The software providing the module concept is Lmod, a Lua-based module system for dynamically altering a shell environment.

General usage

The interface to Lmod is provided by the module command:

Command                             Description
module list                         Shows the list of the currently loaded modules.
module overview                     Shows a view of modules aggregated over the versions.
module available                    Shows a table of the currently available modules.
module --show_hidden available      Shows a table of the currently available modules, including the hidden ones (very useful!).
module purge                        Unloads all modules.
module show <modulename>            Shows the environment changes made by the <modulename> modulefile.
module load <modulename> [...]      Loads the given <modulename>(s) into the current environment.
module help <modulename>            Shows help information about <modulename>.
module spider <string>              Searches all possible modules according to <string>.
module use <path>                   Adds <path> to the modulefile search cache and ${MODULEPATH}.
module unuse <path>                 Removes <path> from the modulefile search cache and ${MODULEPATH}.
module update                       Reloads all currently loaded modules.

Lmod introduces the concepts of default and currently loaded modules. When the user enters the module available command, they may get something similar to the small example given below.

$ module available
---- /opt/cray/pe/lmod/modulefiles/comnet/crayclang/14.0/ofi/1.0 ----
cray-mpich/8.1.20 (L,D)    cray-mpich/8.1.21

Where:
 L:  Module is loaded
 D:  Default Module

Note the L and D markers described at the end of the example. They show you what is loaded and what will be loaded by default when you do not specify the version of a module (that is, when you omit the /8.1.21, for instance). Note that D does not mean the module is loaded automatically but that, if a module is to be loaded (say cray-mpich) and the version is not specified, Lmod will load the module marked by D (here, cray-mpich/8.1.20). It is considered good practice to specify the full name to avoid issues related to more complex topics (compilation, linkage, etc.).
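For example, based on the listing above, explicitly pinning the version rather than relying on the default marked by D:

$ module load cray-mpich/8.1.21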

Note

By default some modules are loaded and this differs from older machines hosted at CINES such as Occigen.

Note

The --terse option can be useful when the output of the module command needs to be parsed in scripts.
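As a sketch, the terse output is easy to filter in a script (Lmod prints it to stderr, and the exact module list depends on your environment):

$ module --terse list 2>&1 | grep cray-mpich
cray-mpich/8.1.20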

Looking for a specific module or an already installed software

Modules with dependencies are only available (shown in module available) when their dependencies, such as compilers, are loaded. To search the entire hierarchy across all possible dependencies, the module spider command can be used as summarized in the following table.

Command                                  Description
module spider                            Shows the entire possible graph of modules.
module spider <modulename>               Searches for modules named <modulename> in the graph of possible modules.
module spider <modulename>/<version>     Searches for a specific version of <modulename> in the graph of possible modules.
module spider <string>                   Searches for modulefiles containing <string>.

Spack

Before going into more technical subjects, know that CINES uses Spack to provide precompiled software, which we shall call Spack products. We recommend that you first check whether CINES already provides a given software before trying to install it yourself. To look for such preinstalled software, search the modules using module spider or browse the catalog.

Spack environment concepts

Starting from 2023/09/28, CINES proposes the concept of Spack environments. A Spack environment is a set of software targeting an architecture and built using a specific toolchain. CINES will provide Spack environments for each hardware partition of Adastra.

CINES uses modules to represent the concept of a Spack environment: to use a Spack environment, a module needs to be loaded. The Spack environments are named using the following syntax: <compiler>-[GPU|CPU]-<version>. The <version> number represents the production version number of the software stack. CINES will try to respect semantic versioning concepts.

Adastra partition            Spack environment module    Compiler used
MI250 (GPU)                  CCE-GPU-<version>           CCE
                             GCC-GPU-<version>           GNU
GENOA (CPU)                  CCE-CPU-<version>           CCE
                             GCC-CPU-<version>           GNU
HPDA (GPU, visualization)    TBD                         TBD
In addition to these Spack environments mapped to the partitions of Adastra, we provide what we call Spack user or Spack pre-configured environments (PreConf). They enable users to install their own software by reusing the CINES Spack configuration and binary cache. A Spack PreConf provides Spack support for a given partition and a set of compilers. The syntax is: spack-[MI250|GENOA]-<version>. More information on how you can reuse the CINES Spack configuration is given in the Using CINES’ pre-configured (PreConf) Spack installation section.

Running module available could give you:

--------------------------------- /opt/software/gaia/prod/latest/Core ----------------------------------
CCE-CPU-2.0.0 (D)   GCC-CPU-2.0.0 (D)   spack-MI250-2.0.0 (D) develop
CCE-GPU-2.0.0 (D)   GCC-GPU-2.0.0 (D)   spack-GENOA-2.0.0 (D)

The develop module exposes the software stacks currently being developed (release candidates). Running module load develop followed by module available could give you:

--------------------------------- /opt/software/gaia/prod/dev/Core ----------------------------------
CCE-CPU-2.1.0 (D)   GCC-CPU-2.1.0 (D)   spack-MI250-2.1.0 (D)
CCE-GPU-2.1.0 (D)   GCC-GPU-2.1.0 (D)   spack-GENOA-2.1.0 (D)

A production release will take place every 3 to 6 months. Each such release may introduce new versions of the components of the stack (say, a CCE or GCC version bump). As the machine evolves, more Spack environments (based on a CrayPE release) and their respective Spack products will be added, and some will be deprecated. We do not guarantee that a Spack environment will be produced for each CrayPE release. CINES is likely to release a new environment at the start of each DARI call. We guarantee that the CINES Spack products will remain available for at least 2 releases before being removed from the system. This should give the user something like 9 months to 1 year of support per release.

When the software stack is put into production, the old modules are moved to a space called deprecated and are still available, as shown below using module available:

--------------------------------- /opt/software/gaia/deprecated ----------------------------------
CPE-23.02-aocc-3.2.0-CPU-softs    CPE-23.02-cce-15.0.1-GPU-softs    CPE-23.02-gcc-12.2.0-GPU-softs
CPE-23.02-cce-15.0.1-CPU-softs    CPE-23.02-gcc-12.2.0-CPU-softs    CPE-23.02-rocmcc-5.3.0-GPU-softs

Finally, deprecated modules are moved to the .archive directory and become available only after the module load archive command is executed.

Using a Spack product

To see the software made available by a Spack environment, you can simply load the module corresponding to that Spack environment. That said, the recommended way is the following; supposing you want to use the GROMACS software, you would do:

$ module spider gromacs
----------------------------------------------------------------------------
gromacs:
----------------------------------------------------------------------------
    Versions:
        gromacs/2021.4-mpi-omp-plumed-ocl-python3
        gromacs/2022.3-mpi-omp-plumed-python3
        gromacs/2022.5-mpi-omp-plumed-python3
        gromacs/2023-mpi-omp-plumed-ocl-python3
        gromacs/2023-mpi-omp-plumed-python

Here, module shows multiple GROMACS versions (2021.4, 2022.3, 2022.5 and 2023), with different variants for the 2023 version. Let’s use gromacs/2023-mpi-omp-plumed-ocl-python3. To find which programming environment provides the software package, we do:

$ module spider gromacs/2023-mpi-omp-plumed-ocl-python3
----------------------------------------------------------------------------
gromacs: gromacs/2023-mpi-omp-plumed-ocl-python3
----------------------------------------------------------------------------

    You will need to load all module(s) on any one of the lines below before the "gromacs/2023-mpi-omp-plumed-ocl-python3" module is available to load.

    CCE-GPU-2.0.0
    CPE-23.02-cce-15.0.1-GPU-softs

This tells us that gromacs/2023-mpi-omp-plumed-ocl-python3 is available under the CCE-GPU-2.0.0 and CPE-23.02-cce-15.0.1-GPU-softs Spack environments. The latest stable (i.e. production) environment should preferably be used.

In your scripts, the procedure above reduces to the following commands:

$ module purge
$ module load CCE-GPU-2.0.0
$ module load gromacs/2023-mpi-omp-plumed-ocl-python3

Advanced module mixing

As an example, assume you have spotted (using module spider) a library CINES built using Spack. Assume this library was built in a Spack environment using the Cray compiler while you use the GNU toolchain (PrgEnv-gnu). Common wisdom would tell you not to mix these two.

In practice, if the interfaces are properly defined, as they are in the library used in this example, and if the library does not have toolchain dependencies such as an OpenMP runtime (which you absolutely do not want to mix), then the library should be interoperable across toolchains.

Note

Check the dependencies on shared libraries using readelf -d or ldd. It is trickier on static libraries.
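A minimal sketch, assuming a hypothetical shared library libexample.so, to check whether it pulls in a toolchain specific OpenMP runtime:

$ readelf -d libexample.so | grep NEEDED
$ ldd libexample.so | grep -i omp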

Using CINES’ pre-configured (PreConf) Spack installation

Adastra provides a Spack module that allows you to load a pre-configured instance of Spack (adapted to Adastra’s architecture). This Spack instance allows you to compile software while reusing the Cray programming environment toolchain and architecture settings. The software will be installed in a location determined by the ${SPACK_USER_PREFIX} environment variable. This variable defaults to ${HOME}; it is advisable to set it to another value, especially if you are installing on different Adastra partitions. While the CINES support team’s build cache is reused, the Spack PreConf does not use the CINES software stacks as a Spack upstream.

Pros and cons: you will be restricted to using a fixed version of Spack but, in exchange, you benefit from installations performed by the CINES teams, including access to the mirror and binary cache, resulting in inherently shorter installation times.

Getting started

To install software with Spack, perform the following steps:

  1. Initialize Spack.

We provide the following Spack PreConf modules that you can use to introduce the CINES Spack configuration into your environment.

Targeted partition           Spack PreConf module      Configured toolchains
MI250 (GPU)                  spack-MI250-<version>     CCE, GNU, ROCm
GENOA (CPU)                  spack-GENOA-<version>     CCE, GNU
HPDA (GPU, visualization)    TBD                       TBD

Note

We provide multiple compilers per Spack PreConf. This differs from the Spack environment mentioned in the previous section.

Note

A Spack environment or Spack PreConf will implicitly load Cray modules; a behavior you can observe using module show spack-[GENOA|MI250]-<version>. Running module purge before loading a Spack environment is thus recommended.

$ module purge
$ export SPACK_USER_PREFIX="${HOME}/spack-install-[GENOA|MI250]-<version>"
$ module load spack-[GENOA|MI250]-<version> # e.g.: spack-MI250-3.1.0
mkdir: created directory '<path_to_spack_configuration>/spack-install-MI250-3.1.0'
For more information on how spack user works on Adastra, please consult documentation https://dci.dci-gitlab.cines.fr/webextranet/user_support/index.html#installing-software-using-spack-on-adastra
SPACK_USER_PREFIX=<path_to_spack_configuration>/spack-install-MI250-3.1.0
SPACK_USER_CONFIG_PATH=<path_to_spack_configuration>/spack-install-MI250-3.1.0/config_user_spack

Note

We recommend that you set ${SPACK_USER_PREFIX} in an environment script or in your .bash_profile to avoid having to set it every time you want to use Spack.
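For instance, a minimal sketch of what could be appended to your .bash_profile (the path and version below are placeholders, reusing the example above):

export SPACK_USER_PREFIX="${HOME}/spack-install-MI250-3.1.0"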

Warning

Ensure that the path stored in ${SPACK_USER_PREFIX} is not too long, or you may get Spack errors of this kind: Error: Failed to install XXX due to SbangPathError: Install tree root is too long. Spack cannot patch shebang lines when script path length (YYY) exceeds limit (127).

When you load the Spack PreConf module, a fully preconfigured Spack instance is placed where ${SPACK_USER_PREFIX} points.

  2. Consult the information Spack has on the product you wish to install, in particular the configuration options that Spack calls variants:

$ spack info kokkos

Let’s assume that we intend to set up this product using the ROCm toolchain. Upon reviewing the product information, the installation command could be:

$ spack install kokkos@4.2.00%rocmcc@5.7.1~aggressive_vectorization~compiler_warnings~cuda~debug~debug_bounds_check~debug_dualview_modify_check~deprecated_code~examples~hpx~hpx_async_dispatch~hwloc~ipo~memkind~numactl~openmp~openmptarget~pic+rocm+serial+shared~sycl~tests~threads~tuning~wrapper amdgpu_target==gfx90a build_system=cmake build_type=Release cxxstd=17 generator=ninja patches=145619e arch=linux-rhel8-zen3

The +rocm flag enables AMD GPU support. Additionally, we need to specify the GPU type: amdgpu_target==gfx90a (note the double equal sign, which has the special meaning of propagating the GPU target to all dependencies). While we specify the rocmcc@5.7.1 compiler, it is hipcc from the ROCm toolchain that will be used for this build. The arch=linux-rhel8-zen3 part tells Spack to target the Zen 3 CPU architecture (say, enable AVX2).

  3. It is advisable to review the dependencies that Spack intends to install:

$ spack spec kokkos%rocmcc amdgpu_target=gfx90a arch=linux-rhel8-zen3
Input spec
--------------------------------
-   kokkos%rocmcc amdgpu_target=gfx90a arch=linux-rhel8-zen3

Concretized
--------------------------------
-   kokkos@4.2.00%rocmcc@5.7.1~aggressive_vectorization~compiler_warnings~cuda~debug~debug_bounds_check~debug_docm+serial+shared~sycl~tests~threads~tuning~wrapper amdgpu_target=gfx90a build_system=cmake build_type=Release c
-       ^cmake@3.27.7%rocmcc@5.7.1~doc+ncurses+ownlibs build_system=generic build_type=Release arch=linux-rhel8
-           ^curl@8.4.0%rocmcc@5.7.1~gssapi~ldap~libidn2~librtmp~libssh~libssh2+nghttp2 build_system=autotools
-               ^mbedtls@3.3.0%rocmcc@5.7.1+pic build_system=makefile build_type=Release libs=static arch=linux
-               ^nghttp2@1.57.0%rocmcc@5.7.1 build_system=autotools arch=linux-rhel8-zen3
-               ^pkgconf@1.9.5%rocmcc@5.7.1 build_system=autotools arch=linux-rhel8-zen3
-           ^ncurses@6.4%rocmcc@5.7.1~symlinks+termlib abi=none build_system=autotools arch=linux-rhel8-zen3
-           ^zlib-ng@2.1.4%rocmcc@5.7.1+compat+opt build_system=autotools arch=linux-rhel8-zen3
-       ^gmake@4.4.1%rocmcc@5.7.1~guile build_system=generic arch=linux-rhel8-zen3
[e]      ^hip@5.7.1%rocmcc@5.7.1~cuda+rocm build_system=cmake build_type=Release generator=make patches=3f783ae,
[e]      ^hsa-rocr-dev@5.7.1%rocmcc@5.7.1+image+shared build_system=cmake build_type=Release generator=make arch
[e]      ^llvm-amdgpu@5.7.1%rocmcc@5.7.1~link_llvm_dylib~llvm_dylib~openmp+rocm-device-libs build_system=cmake

In your Spack instance, packages you installed will be denoted by [+] in the first column, while packages already provided by Spack and cached from previous builds will display [^]. A - indicates that Spack did not find the package and will proceed to build it. An [e] indicates that the product is external to Spack, typically a system product; in our case, it is looking for the ROCm libraries.

  4. Once you’ve reviewed Spack’s plan and are satisfied with it, proceed to install the packages.

Warning

We are experiencing some signature inconsistencies; to work around issues such as Error: Failed to install XXX due to NoVerifyException, specify --no-check-signature in the spack install command. Alternatively, you can use --no-cache to bypass the CINES build cache, but you will experience longer build times.

$ spack install kokkos@4.2.00%rocmcc@5.7.1~aggressive_vectorization~compiler_warnings~cuda~debug~debug_bounds_check~debug_dualview_modify_check~deprecated_code~examples~hpx~hpx_async_dispatch~hwloc~ipo~memkind~numactl~openmp~openmptarget~pic+rocm+serial+shared~sycl~tests~threads~tuning~wrapper amdgpu_target==gfx90a build_system=cmake build_type=Release cxxstd=17 generator=ninja patches=145619e arch=linux-rhel8-zen3
...
Stage: 0.18s.  Cmake: 4.19s.  Build: 6.30s.  Install: 3.03s.  Post-install: 1.03s.  Total: 14.99s
[+] <path_to_spack_configuration>/spack-install-MI250-3.1.0/linux-rhel8-zen3/rocmcc-5.7.1/kokkos-4.2.00-52g6

The last line indicates the location of the software installation on the disk.

  5. After successful installation of the product, you may want to ask Spack to forcefully generate modules.

$ spack module tcl refresh --delete-tree --yes-to-all

Then, you can find your product in your module environment:

$ module available kokkos/4.2.00
---- <path_to_spack_configuration>/spack-install-MI250-3.1.0/modules/tcl/linux-rhel8-zen3 ----
rocmcc/5.7.1/zen3/kokkos/4.2.00

If you wish to use the modules generated by this Spack instance without loading the associated module (spack-[GENOA|MI250]-<version>), you can simply add the path of your modules to the ${MODULEPATH} (they should be under ${SPACK_USER_PREFIX}/modules/tcl/linux-rhel8-[zen3|zen4]).

$ # For MI250:
$ module use "${SPACK_USER_PREFIX}/modules/tcl/linux-rhel8-zen3"
$ # For GENOA:
$ module use "${SPACK_USER_PREFIX}/modules/tcl/linux-rhel8-zen4"

If a Spack install fails

If you encounter a failed Spack installation, consider examining the error message displayed. Additionally, Spack may direct you to an installation log for the specific product, which can be found in the same directory. The complete build directory is also available in /tmp. Examining the configure or cmake output logs may sometimes yield fruitful results. Finally, the spack -d install -v command may prove useful.

Here are some tips if the installation fails:

  • Modify the software version;

  • Modify the compiler: if no compiler is specified, Spack defaults to the cce compiler (the recommended compiler for installing our software on Cray systems). If the installation fails, try another compiler;

  • Disable variants that seem to cause problems;

  • Modify the dependencies used to build the target product (also called package);

  • Edit package.py: take a look at the package.py of the package that crashes: spack edit package_name. Unfortunately the process is not always simple since the package repository is situated in a read-only location. In these situations, it is necessary to clone your own Spack instance and configure it using our configuration files.

Assistance is available: you may refer to the official Spack documentation, open a ticket at the CINES help desk, or seek help from the Spack community via the Spack Slack.

Limitations

  • Spack evolves quickly; the tagged Spack versions (e.g. 0.22) always lag behind the develop Spack branch;

  • you may have to specify --no-check-signature.

Further reading

CrayPE basics

The CrayPE is often feared due to its apparent complexity. We will try to present the basic building blocks and show how to assemble them.

At a high level, a Cray environment is made up of:

  • External libraries (such as the ones in ROCm);

  • Cray libraries (MPICH, libsci);

  • Architecture modules (craype-accel-amd-gfx90a);

  • Compilers (craycc as the cce module, amdclang as the amd module, gcc as the gnu module);

  • The Cray compiler wrappers (cc, CC, ftn) offered by the craype module;

  • The PrgEnv modules (PrgEnv-cray);

  • And the cpe/XX.YY.

The external libraries refer to libraries the CrayPE requires but that are not the property of Cray; AMD’s ROCm is such an example. The Cray libraries are closed source software; there are multiple variants of the same library to accommodate GPU support and the many compilers. The architecture modules change the wrappers’ behavior (see Cray compiler wrapper) by helping choose which library to link against (say, the MPICH GPU plugin), or by modifying flags such as -march=zen4. The compilers are not recommended for direct use; they should instead be used through the Cray compiler wrappers, which interpret the PrgEnv, the loaded Cray libraries and the architecture modules to handle the compatibility matrix transparently (with few visible artifacts). The PrgEnv modules are preset environments; you can choose to use them or cherry-pick your own set of modules, at your own risk. The cpe/XX.YY modules are used to change the default version of the above mentioned modules and allow you to operate a set of intercompatible default modules.

Figure: Graphical representation of the CrayPE component interactions (craype_interactions.png).

Note

There is an order in which we recommend loading the modules. See the note in Targeting an architecture.

Important

Do not forget to export the appropriate environment variables such as CC, CXX, etc. and make them point to the correct compiler or Cray compiler wrapper by loading the correct PrgEnv. This can be crucial for tools like CMake and Make.
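A minimal sketch, assuming a CMake based project and that you want the build to go through the Cray wrappers (the build directory name is arbitrary):

$ export CC=cc CXX=CC FC=ftn
$ cmake -S . -B build
$ cmake --build build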

Changing CrayPE version

A Cray Programming Environment (CrayPE) can simply be viewed as a set of modules (of particular versions). Switching CrayPE is like switching modules and defining new default versions.

You can load a cpe/XX.YY module to prepare your environment with the modules associated with a specific XX.YY version of the CPE. In practice, it will change the version of your loaded modules to match the versions the cpe/XX.YY in question expects and, in addition, will modify the default version of the Cray modules.

Warning

If you use a cpe/XX.YY module, it must come first before you load any other Cray modules.

Important

You can preload a cpe/XX.YY module before preparing your environment to be sure you are using the correct version of the modules you load.

As an example:

 1$ module available cpe
 2-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
 3    cpe/22.11    cpe/22.12    cpe/23.02 (D)
 4$ module purge
 5-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
 6    cce/15.0.0    cce/15.0.1 (D)
 7$ module load PrgEnv-cray
 8$ module list
 9Currently Loaded Modules:
10    1) cce/15.0.1   2) craype/2.7.19   3) cray-dsmml/0.2.2
11    4) libfabric/1.15.2.0   5) craype-network-ofi   6) cray-mpich/8.1.24
12    7) cray-libsci/23.02.1.1   8) PrgEnv-cray/8.3.3
13$ module load cpe/22.12
14The following have been reloaded with a version change:
15  1) cce/15.0.1 => cce/15.0.0
16  2) cray-libsci/23.02.1.1 => cray-libsci/22.12.1.1
17  3) cray-mpich/8.1.24 => cray-mpich/8.1.23
18$ module available cce
19-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
20    cce/15.0.0 (L,D)    cce/15.0.1
21$ module load cpe/23.02
22Unloading the cpe module is insufficient to restore the system defaults.
23Please run 'source /opt/cray/pe/cpe/22.12/restore_lmod_system_defaults.[csh|sh]'.
24
25The following have been reloaded with a version change:
26  1) cce/15.0.0 => cce/15.0.1
27  2) cpe/22.12 => cpe/23.02
28  3) cray-libsci/22.12.1.1 => cray-libsci/23.02.1.1
29  4) cray-mpich/8.1.23 => cray-mpich/8.1.24
30$ module available cce
31-------------------- /opt/cray/pe/lmod/modulefiles/core --------------------
32    cce/15.0.0    cce/15.0.1 (L,D)

As we can see, loading cpe/22.12 changed the loaded module versions and also changed the default module versions.

Note

Loading a cpe module leads to a quirk, shown on line 22. The quirk comes from the fact that unloading a module that switches other modules does not bring the environment back to its state before the switch; in fact, it does nothing. Once the cpe module is unloaded, the default module versions are restored but the modules have to be loaded back. This is the role of the above mentioned script (restore_lmod_system_defaults.sh).

Cray compiler wrapper

As you may know, compatibility between compilers and libraries is not always guaranteed, and a compatibility matrix may be handed to users who are then left to figure out how to combine the software components. Loading the PrgEnv-<compiler>[-<compiler2>] module introduces a compiler wrapper (also called a driver) which interprets environment variables introduced by other Cray modules such as craype-accel-amd-gfx90a (see Targeting an architecture for more details), cray-mpich, etc. The driver creates the toolchain needed to satisfy the request (compilation, optimization, link, etc.). It also uses the information gathered in the environment to pass the include paths, link flags, architecture specific flags, etc. that the underlying compiler needs to produce code. Effectively, these compiler wrappers abstract the compatibility matrix away from the user; linking and providing the correct headers at compile and run time is only a subset of the features provided by the Cray compiler wrappers. If you do not use the wrappers, you will have to do more work and expose yourself to error prone manipulations.

PrgEnv and compilers

The compilers available on Adastra are provided through the Cray environment modules. Most readers already know about the GNU software stack. Adastra comes with three more supported compilers. The Cray and the AMD Radeon Open Compute (ROCm) compilers are both based on the state of the art LLVM compiler infrastructure. In fact, you can treat these compilers as vendor recompiled Clang/Flang LLVM compilers with added optimization passes or, in the case of the Cray compiler, a custom OpenMP backend (but not much more). The AMD Optimizing C/C++ Compiler (AOCC) resembles the Intel ICC compiler, but for AMD; it is also based on LLVM. There is also a system (OS provided) version of GCC available in /usr/bin (try not to use it).

The Programming environment column of the table below indicates the module to load to benefit from a specific environment. You can load a compiler module after loading a PrgEnv to choose a specific version of a compiler belonging to a given PrgEnv. That is, load cce/15.0.0 after loading PrgEnv-cray to make sure you get the cce/15.0.0 compiler. The modules loaded by a PrgEnv will change as the environment evolves. After the first load of a PrgEnv, you are advised to note the modules implicitly loaded (module list) and explicitly load them to avoid future breakage (a sketch is given after the table below).

Vendor   Programming environment   Compiler module   Language   Compiler wrapper   Raw compiler

Cray     PrgEnv-cray               cce               C          cc                 craycc
                                                     C++        CC                 craycxx or crayCC
                                                     Fortran    ftn                crayftn
         Usage and notes: for CPU and GPU compilations. craycc and craycxx are LLVM based while crayftn is entirely proprietary. cce means Cray Compiling Environment.

AMD      PrgEnv-amd                amd               C          cc                 amdclang
                                                     C++        CC                 amdclang++
                                                     Fortran    ftn                amdflang (new Flang)
         Usage and notes: for CPU and GPU compilations. This module introduces the ROCm stack. ROCm is AMD’s GPGPU software stack. These compilers are open source and available on GitHub. You can contact AMD via GitHub issues.

AMD      PrgEnv-aocc               aocc              C          cc                 clang
                                                     C++        CC                 clang++
                                                     Fortran    ftn                flang (classic Flang)
         Usage and notes: for CPU compilations. These compilers are LLVM based but the LLVM fork is not open source.

GCC      PrgEnv-gnu                gcc               C          cc                 gcc
                                                     C++        CC                 g++
                                                     Fortran    ftn                gfortran
         Usage and notes: for CPU compilations.
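As an example of the practice recommended above, a sketch reusing the module versions shown in the Changing CrayPE version example (your versions may differ):

$ module load PrgEnv-cray
$ module list
$ # Then, in your scripts, pin what was implicitly loaded:
$ module load cce/15.0.1 craype/2.7.19 cray-mpich/8.1.24 cray-libsci/23.02.1.1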

Note

Reading (and understanding) the craycc or crayftn man pages will provide you with valuable knowledge on the usage of the Cray compilers.

Important

It is highly recommended to use the Cray compiler wrappers (cc, CC, and ftn) whenever possible. These are provided whichever programming environment is used. These wrappers are somewhat like the mpicc provided by other vendors.
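As an illustration, a minimal sketch compiling hypothetical sources with the wrappers; the wrappers add the MPI, LibSci and architecture flags for you:

$ cc  -O2 solver.c   -o solver_c
$ CC  -O2 solver.cpp -o solver_cpp
$ ftn -O2 solver.f90 -o solver_f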

Switching compiler is as simple as loading another PrgEnv. The user only needs to recompile the software, assuming the build scripts or build script generators (say, CMake scripts) are properly engineered.

For CPU compilations, we recommend PrgEnv-gnu.

For GPU compilations, we recommend one of PrgEnv-cray, PrgEnv-amd or, potentially, PrgEnv-gnu with rocm. These compilers can also compile for CPU, but there may be better optimizing compilers for that purpose.

To know which compiler/PrgEnv to use depending on the parallelization technology your program relies on (OpenMP, OpenACC, HIP, etc.), check this table.

Note

Understand that while both are AMD software, PrgEnv-amd and PrgEnv-aocc target fundamentally different node kinds: the first one is part of the ROCm stack (analogous to NVHPC), the second one is a historical CPU compiler (analogous to Intel’s ICC).

The PrgEnv-cray (CCE), PrgEnv-amd (ROCm), PrgEnv-gnu and PrgEnv-aocc environments all support the following C++ standards (and the implied C standards): c++11, gnu++11, c++14, gnu++14, c++17, gnu++17, c++20, gnu++20, c++2b, gnu++2b. Some caveats exist regarding C++ modules with C++20. They are all (except GNU) based on Clang. Supported Fortran standards: f90, f95, f03.

Warning

If your code has, all along its life, relied on non standard, vendor specific extensions, you may have issues using the Cray Fortran compiler, which tends to be stricter.

PrgEnv mixing and subtleties

Cray provides the PrgEnv-<compiler>[-<compiler2>] modules (say, PrgEnv-cray-amd) that load a given <compiler> and toolchain and optionally, if set, introduce an additional <compiler2>. In case a <compiler2> is specified, the Cray environment will use <compiler> to compile Fortran sources and <compiler2> for C and C++ sources. The user can then enrich their environment by loading other libraries through modules (though some of these libraries are loaded by default with the PrgEnv).

Introducing an environment, toolchain or tool through the use of modules means that loading a module will modify environment variables such as ${PATH}, ${ROCM_PATH} and ${LD_LIBRARY_PATH} to make the tool or toolchain available to the user’s shell.

For example, say you wish to use the Cray compiler to compile CPU or GPU code, introduce the CCE toolchain this way:

$ module load PrgEnv-cray

Say you want to use the Cray compiler to compile Fortran sources and use the AMD compiler for C and C++ sources, introduce the CCE and ROCm toolchains this way:

$ module load PrgEnv-cray-amd

Say you want to use the AMD compiler to compile CPU or GPU code, introduce the ROCm toolchain this way:

$ module load PrgEnv-amd

Mixing PrgEnv and toolchain

Say you want to use the Cray compiler to compile CPU or GPU code and also have access to the ROCm tools and libraries, introduce the CCE and ROCm tooling this way:

$ module load PrgEnv-cray amd-mixed

Mixing compilers and tooling is achieved through the *-mixed modules. The *-mixed modules do not significantly alter the Cray compiler wrapper’s behavior. They can be used to steer the compiler into using, say, the correct ROCm version instead of the default one (/opt/rocm).

*-mixed modules can be viewed as an alias to the underlying software. For instance, amd-mixed would be an alias for the rocm module.

Targeting an architecture

In a Cray environment, one can load modules to target architectures instead of adding compiler flags explicitly.

On Adastra’s accelerated nodes, we have AMD-Trento (host CPU) and AMD-MI250X (accelerator) as the two target architectures. The command module available craype- will show all the installed modules for available target architectures. For AMD-Trento the module is craype-x86-trento and for AMD-MI250X it would be craype-accel-amd-gfx90a. These two modules add environment variables used by the Cray compiler wrapper to trigger flags used by the compilers to optimize or produce code for these two architectures.

For example, to setup a GPU programming environment:

$ module purge
$ # A CrayPE environment version
$ module load cpe/23.12
$ # An architecture
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ # A compiler to target the architecture
$ module load PrgEnv-cray
$ # Some architecture related libraries and tools
$ module load amd-mixed

You get C/C++/Fortran compilers configured to compile for Trento CPUs and MI250X GPUs and to automatically link against the appropriate Cray MPICH release, that is, if you use the Cray compiler wrappers.

Warning

If you get a warning such as this one Load a valid targeting module or set CRAY_CPU_TARGET, it is probably because you did not load a craype-x86-<architecture> module.

Note

Try to always load the architecture modules first (craype-x86-genoa for the GENOA partition; craype-x86-trento and craype-accel-amd-gfx90a for the MI250 partition), then the PrgEnv and the rest of your modules.

Intra-process parallelization technologies

When you are not satisfied with high level tools such as the vendor optimized BLAS, you have the option of programming the machine yourself. These technologies are harder to use and more error prone, but more versatile. Some technologies are given below, though the list is obviously not complete.

We could define at least two classes of accelerator programming technologies: the ones based on directives (say, pragma omp parallel for) and the ones based on kernels. A kernel is a treatment, generally the inner loops or the body of the inner loops of what you would write in a serial code. The kernel is given data to transform and is explicitly mapped to the hardware compute units.

Note

NVHPC is Nvidia’s GPU software stack, ROCm is AMD’s GPU software stack (amd-mixed or PrgEnv-amd), CCE is part of CPE which is Cray’s CPU/GPU compiler toolchain (PrgEnv-cray), LLVM is your plain old LLVM toolchain, OneAPI is Intel’s new CPU/GPU Sycl based software stack (contains the DPC++, aka Sycl compiler).

For C/C++ codes

Class: Directive

  OpenACC v2
    AMD GPUs: -   Nvidia GPUs: NVHPC   Intel GPUs: -   x86 CPUs: NVHPC
    Fine tuning: Low-medium   Implementation complexity/maintainability: Low   Community support/availability (expected longevity): Medium/high (+5 y)

  OpenMP v5
    AMD GPUs: LLVM/ROCm/CCE   Nvidia GPUs: LLVM/NVHPC/CCE   Intel GPUs: OneAPI   x86 CPUs: LLVM/ROCm/NVHPC/CCE/OneAPI
    Fine tuning: Low-medium   Implementation complexity/maintainability: Low   Community support/availability (expected longevity): High (+10 y)

Class: Kernel

  Sycl
    AMD GPUs: AdaptiveCPP/OneAPI   Nvidia GPUs: AdaptiveCPP/OneAPI   Intel GPUs: AdaptiveCPP/OneAPI   x86 CPUs: AdaptiveCPP/OneAPI
    Fine tuning: High   Implementation complexity/maintainability: Medium/high   Community support/availability (expected longevity): High (+10 y)

  CUDA/HIP
    AMD GPUs: LLVM/ROCm/CCE   Nvidia GPUs: LLVM/NVHPC/CCE   Intel GPUs: -   x86 CPUs: -
    Fine tuning: High   Implementation complexity/maintainability: Medium/high   Community support/availability (expected longevity): High (+10 y)

  Kokkos
    AMD GPUs: LLVM/ROCm/CCE/AdaptiveCPP/OneAPI   Nvidia GPUs: LLVM/NVHPC/CCE/AdaptiveCPP/OneAPI   Intel GPUs: AdaptiveCPP/OneAPI   x86 CPUs: LLVM/ROCm/NVHPC/CCE/AdaptiveCPP/OneAPI
    Fine tuning: Medium/high   Implementation complexity/maintainability: Low/medium   Community support/availability (expected longevity): Medium (+5 y)

Sycl, the Khronos consortium’s successor to OpenCL, is quite complex, like its predecessor. Obviously, time will tell if it is worth investing in this technology, but there is a significant ongoing open standardization effort.

Kokkos in itself is not on the same level as OpenACC, OpenMP, CUDA/HIP or Sycl because it serves as an abstraction over all of these.

Note

Cray’s CCE, AMD’s ROCm, Intel’s OneAPI (intel-llvm) and LLVM’s Clang share the same front end (what reads the code). Most are just recompiled/extended, generally open source, versions of Clang. Cray’s C/C++ compiler is a Clang compiler with a modified proprietary backend (code optimization and libraries such as the OpenMP runtime implementation).

For Fortran codes

Class: Directive

  OpenACC v2
    AMD GPUs: CCE   Nvidia GPUs: NVHPC   Intel GPUs: -   x86 CPUs: NVHPC
    Fine tuning: Low-medium   Implementation complexity/maintainability: Low   Community support/availability (expected longevity): Medium/High (+5 y)

  OpenMP v5
    AMD GPUs: ROCm/CCE/LLVM   Nvidia GPUs: NVHPC/CCE/LLVM   Intel GPUs: OneAPI   x86 CPUs: ROCm/NVHPC/CCE/LLVM/OneAPI
    Fine tuning: Low-medium   Implementation complexity/maintainability: Low   Community support/availability (expected longevity): High (+10 y)

Class: Kernel

AMD - Here, means the AMD stack, be it the AOCC compiler or the ROCm toolchain.
Intel - Here, means the Intel stack, be it the ICC compiler or the One API toolchain.

Some wrapper, compiler and linker flags

Flags conversion for Fortran programs

Equivalences between Intel’s ifort, GNU’s gfortran and Cray’s crayftn flags (notes indented below each row):

  ifort -g, gfortran -g, crayftn -g
      Embed debug info into the binary. Useful for stack traces and GDB.

  gfortran -Og, crayftn -eD
      Compile in debug mode. The crayftn option does a lot more than adding debug info though.

  ifort -O1, gfortran -O1, crayftn -O1

  ifort -O2, gfortran -O2, crayftn -O1

  ifort -O3, gfortran -O3, crayftn -O2

  ifort -fast, gfortran -Ofast, crayftn -O3

  ifort -xHost, gfortran -march=native, crayftn -h cpu=<> (defined by the craype-* modules)
      Careful, these flags assume the machine on which you compile has CPUs similar to the ones where your code will run.

  ifort -integer-size 32, crayftn -s integer32

  ifort -integer-size 64, gfortran -fdefault-integer-8, crayftn -s integer64

  ifort -real-size 64, gfortran -fdefault-real-8, crayftn -s real64

  ifort -ftz, gfortran ieee_support_underflow_control/ieee_set_underflow_mode, crayftn ieee_support_underflow_control/ieee_set_underflow_mode
      Flush denormals to zero. If well designed, your code should not be very sensitive to this. See the Fortran 2003 standard.

  ifort -convert big_endian, gfortran -fconvert=big-endian

  ifort -fpe0, gfortran -ffpe-trap=invalid,zero,overflow, crayftn ~ -hfp_trap -K trap=divz,inv,ovf
      For debug builds only.

  gfortran -flto=thin, crayftn -hipa3
      Link Time Optimization (LTO), sometimes called InterProcedural Optimization (IPO) or IPA.

Debugging with crayftn

Note

To flush the output stream (stdout) in a standard way, use the output_unit named constant of the ISO_Fortran_env module. E.g.: flush(output_unit). This is useful when debugging using the classic print/comment approach.

-eD
    The -eD option enables all debugging options. This option is equivalent to specifying the -G0 option with the -m2, -rl, -R bcdsp, and -e0 options.

-e0
    Initializes all undefined local stack, static, and heap variables to 0 (zero). If a user variable is of type character, it is initialized to NUL. If logical, it is initialized to false. The stack variables are initialized upon each execution of the procedure. When used in combination with -ei, Real and Complex variables are initialized to signaling NaNs, while all other typed objects are initialized to 0. Objects in common blocks will be initialized if the common block is declared within a BLOCKDATA program unit compiled with this option.

-en
    Generates messages to note nonstandard Fortran usage.

-hfp0=noapprox
    Controls the level of floating point optimizations (-hfpN, where N is a value between 0 and 4, with 0 giving the compiler minimum freedom to optimize floating point operations and 4 giving it maximum freedom). noapprox prevents rewrites of square root and divide expressions using hardware reciprocal approximations.

-hflex_mp=intolerant
    Has the highest probability of repeatable results, but also the highest performance penalty.

-hlist=m
    Produces a source listing with loopmark information. To provide a more complete report, this option automatically enables the -O negmsg option to show why loops were not optimized. If you do not require this information, use the -O nonegmsg option on the same command line. Loopmark information will not be displayed if the -d B option has been specified.

-hlist=a
    Includes all reports in the listing (including source, cross references, options, lint, loopmarks, common block, and options used during compilation).

-hbounds
    Enables bounds checking.

crayftn also offers sanitizers, which turn on runtime checks for various forms of undefined or suspicious behavior. This is an experimental feature (in CrayFTN 17). If a check fails, a diagnostic message is produced at runtime explaining the problem.

-fsanitize=address
    Enables a memory error detector.

-fsanitize=thread
    Enables a data race detector.
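For instance, a minimal sketch of a debug build of a hypothetical main.f90 with the address sanitizer enabled:

$ ftn -g -O0 -fsanitize=address main.f90 -o main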

Further reading: man crayftn.

Making the Cray wrappers spew their implicit flags

Assuming you have loaded an environment such as:

$ module purge
$ # A CrayPE environment version
$ module load cpe/23.12
$ # An architecture
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ # A compiler to target the architecture
$ module load PrgEnv-cray

The CC, cc and ftn Cray wrappers imply a lot of flags that you may want to retrieve. This can be done like so:

$ CC --cray-print-opts=cflags
-I/opt/cray/pe/libsci/23.12.5/CRAY/16.0/x86_64/include -I/opt/cray/pe/mpich/8.1.28/ofi/cray/16.0/include -I/opt/cray/pe/dsmml/0.2.2/dsmml/include
$ CC --cray-print-opts=libs
-L/opt/cray/pe/libsci/23.12.5/CRAY/16.0/x86_64/lib -L/opt/cray/pe/mpich/8.1.28/ofi/cray/16.0/lib -L/opt/cray/pe/mpich/8.1.28/gtl/lib -L/opt/cray/pe/dsmml/0.2.2/dsmml/lib -Wl,--as-needed,-lsci_cray_mpi,--no-as-needed -lmpi_gtl_hsa -Wl,--as-needed,-lsci_cray,--no-as-needed -ldl -Wl,--as-needed,-lmpi_cray,--no-as-needed -lmpi_gtl_hsa -Wl,--as-needed,-ldsmml,--no-as-needed -L/opt/cray/pe/cce/17.0.0/cce/x86_64/lib/pkgconfig/../ -Wl,--as-needed,-lstdc++,--no-as-needed -Wl,--as-needed,-lpgas-shmem,--no-as-needed -lfi -lquadmath -lmodules -lfi -lcraymath -lf -lu -lcsup

We observe the implied compile and link flags for Cray MPICH (the GTL is there too) and LibSci. Had you loaded cray-hdf5 or some other Cray library module, its flags would have appeared in the commands’ output.

Note

The Cray wrappers use -I and not -isystem, which is suboptimal for strict code compiled with many warning flags (as it should be).

Note

Use the -craype-verbose flag to display the command line produced by the Cray wrapper. This must be called on a file to see the full output (i.e., CC -craype-verbose test.cpp). You may also try the --verbose flag to ask the underlying compiler to show the command it itself launches.

crayftn optimization level details

We now provide a list of the differences between the flags implicitly enabled by -O1, -O2 and -O3. Understand that -O3 under the crayftn compiler is very aggressive and could be said to at least equate -Ofast under your typical Clang or GCC when it comes to floating point optimizations.

Warning

Cray reserves the right to change, for a new crayftn version, the options enabled through -On.

The options given below correspond to Cray Fortran : Version 15.0.1. They may differ for past and future versions.

-O1 provides:

-h scalar1,vector1,unroll2,fusion2,cache0,cblock0,noaggress
-h ipa1,mpi0,pattern,modinline
-h fp2=approx,flex_mp=default,alias=default:standard_restrict
-h fma
-h autoprefetch,noconcurrent,nooverindex,shortcircuit2
-h noadd_paren,nozeroinc,noheap_allocate
-h align_arrays,nocontiguous,nocontiguous_pointer
-h nocontiguous_assumed_shape
-h fortran_ptr_alias,fortran_ptr_overlap
-h thread1,nothread_do_concurrent,noautothread,safe_addr
-h noomp -f openmp-simd
-h caf,noacc
-h nofunc_trace,noomp_analyze,noomp_trace,nopat_trace
-h nobounds
-h nomsgs,nonegmsgs,novector_classic
-h dynamic
-h cpu=x86-64,x86-trento,network=slingshot10
-h nofp_trap -K trap=none
-s default32
-d 0abcdefgijnpvxzBDEGINPQSZ
-e hmqwACFKRTX

The differences when -O2 is used instead of -O1 are:

-h scalar2,vector2
-h ipa3
-h thread2

The differences when -O3 or -Ofast is used instead of -O2 are:

-h scalar3,vector3
-h ipa4
-h fp3=approx

AOCC flags

AMD gives a detailed description of the CPU optimization flags here: https://rocm.docs.amd.com/en/docs-5.5.1/reference/rocmcc/rocmcc.html#amd-optimizations-for-zen-architectures.

Advanced tips, flags and environment variables for debugging

See LLVM Optimization Remarks by Ofek Shilon for more details on what Clang can tell you about how it optimizes your code and what tools are available to process that information.

Note

The crayftn compiler does not provide an option to trigger debug info generation while not also lowering optimization.

Note

The crayftn compiler possesses an extremely powerful optimizer which does some of the most aggressive optimizations a compiler can afford to do. This means that, at a high optimization level, the optimizer will assume your code strongly complies with the standard. Any slight deviation from the standard can lead to significant issues in the code, from crashes to silent corruption. crayftn’s -O2 is considered stable, safe and comparable to the -O3 of other compilers. -hipa4 has led to issues in some codes. crayftn also has its share of internal bugs which can mess up your code too.

Running jobs

SLURM is the workload manager used to interact with the compute nodes on Adastra. In the following subsections, the most commonly used SLURM commands for submitting, running, and monitoring jobs will be covered, but users are encouraged to visit the official documentation and man pages for more information. This section describes how to run programs on the Adastra compute nodes, including a brief overview of SLURM and also how to map processes and threads to CPU cores and GPUs.

The SLURM batch scheduler and job launcher

SLURM provides multiple ways of submitting and launching jobs on Adastra’s compute nodes: batch scripts, interactive, and single-command. The SLURM commands enabling these methods are shown in the table below and examples of their use can be found in the related subsections. Please note that regardless of the submission method used, the job will launch on compute nodes, with the first node in the allocation serving as the head node.

With SLURM, you first ask for resources (a number of nodes, GPUs, CPUs) and then you distribute these resources among your tasks.

sbatch
    Used to submit a batch script. The batch script can contain information on the amount of resources to allocate and how to distribute them. Options can be specified as sbatch command line flags or inside the script, at the top of the file, on lines prefixed with #SBATCH. The sbatch options do not necessarily lead to the resource distribution per rank that you would expect (!). sbatch allocates, srun distributes. See Batch scripts for more details.

srun
    Used to run a parallel job (job step) on the resources allocated with sbatch or salloc. If necessary, srun will first create a resource allocation in which to run the parallel job(s).

salloc
    Used to allocate an interactive SLURM job allocation, where one or more job steps (i.e., srun commands) can then be launched on the allocated resources (i.e., nodes). See Interactive jobs for more details.

Batch scripts

A batch script can be used to submit a job to run on the compute nodes at a later time (the modules used in the scripts below are given as an indication; you may not need them if you use PyTorch, TensorFlow or the CINES Spack modules). In this case, stdout and stderr will be written to one or more files that can be opened after the job completes. Here is an example of a simple batch script for the GPU (MI250) partition:

 1#!/bin/bash
 2#SBATCH --account=<account_to_charge>
 3#SBATCH --job-name="<job_name>"
 4#SBATCH --constraint=MI250
 5#SBATCH --nodes=1
 6#SBATCH --exclusive
 7#SBATCH --time=1:00:00
 8
 9module purge
10
11# A CrayPE environment version
12module load cpe/23.12
13# An architecture
14module load craype-accel-amd-gfx90a craype-x86-trento
15# A compiler to target the architecture
16module load PrgEnv-cray
17# Some architecture related libraries and tools
18module load amd-mixed
19
20module list
21
22export MPICH_GPU_SUPPORT_ENABLED=1
23
24# export OMP_<ICV=XXX>
25
26srun --ntasks-per-node=8 --cpus-per-task=8 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>

Here is an example of a simple batch script for the CPU (GENOA) partition:

 1#!/bin/bash
 2#SBATCH --account=<account_to_charge>
 3#SBATCH --job-name="<job_name>"
 4#SBATCH --constraint=GENOA
 5#SBATCH --nodes=1
 6#SBATCH --exclusive
 7#SBATCH --time=1:00:00
 8
 9module purge
10
11# A CrayPE environment version
12module load cpe/23.12
13# An architecture
14module load craype-x86-genoa
15# A compiler to target the architecture
16module load PrgEnv-cray
17
18module list
19
20
21
22
23
24# export OMP_<ICV=XXX>
25
26srun --ntasks-per-node=24 --cpus-per-task=8 --threads-per-core=1 -- <executable> <arguments>

Assuming the file is called job.sh on the disk, you would launch it like so: sbatch job.sh.

Options encountered after the first non-comment line will not be read by SLURM. In the example script, the lines are:

Line

Description

1

Shell interpreter line.

2

GENCI/DARI project to charge. More on that below.

3

Job name.

4

Type of Adastra node requested (here, the GPU MI250 or CPU GENOA partition).

5

Number of compute nodes requested.

6

Ask SLURM to reserve whole nodes. If this is not wanted, see Shared mode vs exclusive mode.

7

Wall time requested (HH:MM:SS).

9-20

Set up the module environment, always starting with a purge.

22

(For the MI250 partition script) Enable GPU aware MPI. You can pass GPU buffers directly to the MPI APIs.

24

Potentially, setup some OpenMP environment variables.

26

Implicitly ask to use all of the nodes allocated. Then we distribute the work over 8 (MI250) or 24 (GENOA) tasks per node. We also specify that each task should be bound to 8 cores, without Simultaneous Multithreading (SMT), and, for the MI250 script, to the GPU closest to these 8 cores.

The SLURM submission options are preceded by #SBATCH, making them appear as comments to a shell (since comments begin with #). SLURM will look for submission options from the first line through the first non-comment line. The mandatory SLURM flags are: the account identifier (also called project ID or project name, specified via --account=, more on that later), the type of node (via --constraint=), the maximum job duration (via --time=) and the number of nodes (via --nodes=).

Some more advanced scripts are available in this document and this repository (though the scripts in this repository are quite old).

Warning

A proper binding is often critical for HPC applications. We strongly recommend that you either make sure your binding is correct (say, using this tool hello_cpu_binding) or that you take a look at the binding scripts presented in Proper binding, why and how.

Note

The binding srun performs can only restrict a rank to a set of hardware threads (process affinity). It does not do what is called thread pinning/affinity. To exploit thread pinning, you may want to check OpenMP’s ${OMP_PROC_BIND} and ${OMP_PLACES} Internal Control Variables (ICVs)/environment variables. Bad thread pinning can be detrimental to performance. Check this document for more details.

The typical OpenMP ICVs used to prevent and diagnose thread affinity issues rely on the following environment variables:

# Log the thread affinity so you can check that the rank to core/thread
# placement is correct.
export OMP_DISPLAY_AFFINITY=TRUE
export OMP_PROC_BIND=CLOSE
export OMP_PLACES=THREADS
# This should be redundant because srun already restricts the rank's CPU
# access.
export OMP_NUM_THREADS=<N>

Common SLURM submission options

The table below summarizes commonly-used SLURM job submission options:

Command (long or short)

Description

--account=<account_to_charge> or -A <account_to_charge>

Account identifier (also called project ID) to use and charge for the compute resources consumption. More on that below.

--constraint=<node_type>

Type of Adastra node. The accepted values are MI250, GENOA and HPDA. The first two values represent the two main partitions of Adastra.

--time=<maximum_duration> or -t <maximum_duration>

Maximum duration as wall clock time HH:MM:SS.

--nodes=<number_of_nodes> or -N <number_of_nodes>

Number of compute nodes.

--job-name="<job_name>" or -J <job_name>

Name of job.

--output=<file_name> or -o <file_name>

Standard output file name.

--error=<file_name> or -e <file_name>

Standard error file name.

For more information about these or other options, please see the sbatch man page.

Resource consumption and charging

French computing site resources are expressed in hours of use of a given resource type. For instance, at CINES, if you have been given 100’000 hours on Adastra’s MI250X partition, it means that you could use a single unit of MI250X resource for 100’000 hours. It also means that you could use 400 units of MI250X resource for 250 hours. The units are given below:

Computing resource

Unit description

MI250X partition

2 GCDs (GPUs) of an MI250X card, that is, a whole MI250X.

GENOA partition

1 core (2 logical threads).

The resources you consume have to be charged to a project. We have used the --account=<account_to_charge> SLURM flag multiple times in this document. Before submitting a job, make sure you have set a valid <account_to_charge>. You can obtain the list of accounts you are attached to by running the myproject -l command. The values representing the account names you can charge appear on the last line of the command output (i.e.: Liste des projets de calcul associés au user someuser : ['bae1234', 'eat4567', 'afk8901']). More on myproject in the Login unique section.

We do not charge for HPDA resources.

In addition, the <constraint> in --constraint=<constraint> should be set to a proper value, as it is this SLURM flag that describes the kind of resource you request and thus what CINES will charge for.

Note

To monitor your compute hours consumption, use the myproject --state [project] command or visit https://reser.cines.fr/.

Warning

The charging gets a little bit less simple when you use the shared nodes.

Shared mode vs exclusive mode

Some nodes are reserved for what we call shared mode, which differs from the exclusive mode found in many of the batch scripts presented in this documentation (observe the --exclusive SLURM flag). The role of these shared nodes is the following: when a resource allocation, be it through salloc or sbatch, asks for less than what a node offers, your allocation will automatically be rerouted to the pool of shared nodes (the [genoa|mi250]-shared SLURM partitions). On these nodes, you may have to share the node with other users. That said, we maintain isolation of the resources: one job cannot access the resources (GPUs/cores/memory) allocated to another job (or another user), even if both jobs reside on the same node. For now, the shared mechanism is available on some nodes of both the MI250 and GENOA partitions and active on the whole HPDA partition.

On the GENOA and MI250 partitions, the smallest number of cores you can be charged for is 8. This allows us to map an allocation onto a hardware resource (an L3 cache), limiting the impact on users sharing the same node.

The memory and core charging computation is roughly defined as follows:

CEIL(MAX(MEMORY_PER_NODE_ASKED/MEMORY_PER_NODE*CORE_PER_NODE; CORE_PER_NODE_ASKED); 8)
With:
    MEMORY_PER_NODE_ASKED and CORE_PER_NODE_ASKED being the amounts of resource you wish to reserve.
    MEMORY_PER_NODE and CORE_PER_NODE being the amounts of resource a node has.
    CEIL(X; N) = ROUND_AWAY_FROM_ZERO(X/N)*N

As an example, if you ask for, say, 2/4 of a node’s CPU cores and for 3/4 of a node’s memory, we will charge you for 3/4 of a node’s CPU cores. Note that you will not have access to 3/4 of the CPU cores even though you are charged for them. To specify the amount of memory per node, use --mem=<N>G where <N> is an amount in GiB.
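
As a sketch of the rule above (the core count of 192 matches the GENOA example below; the memory size per node is a placeholder to adapt), the charged core count for this example can be computed like so:

# Illustrative node characteristics and request.
CORE_PER_NODE=192
MEMORY_PER_NODE=768              # GiB, placeholder value
CORE_PER_NODE_ASKED=96           # 2/4 of the cores
MEMORY_PER_NODE_ASKED=576        # 3/4 of the memory, in GiB

# Convert the memory request into its equivalent core count (rounded up).
MEMORY_AS_CORES=$(( (MEMORY_PER_NODE_ASKED * CORE_PER_NODE + MEMORY_PER_NODE - 1) / MEMORY_PER_NODE ))
# Take the maximum of both requests, then round up to a multiple of 8 cores.
MAX=$(( MEMORY_AS_CORES > CORE_PER_NODE_ASKED ? MEMORY_AS_CORES : CORE_PER_NODE_ASKED ))
echo "Charged cores: $(( (MAX + 7) / 8 * 8 ))"   # prints 144, that is, 3/4 of the node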

Let’s take an example: say you want to schedule a very small job running 2 MPI ranks, each using 8 cores of the GENOA partition. You could reserve a whole node via --exclusive, but CINES would then charge you for all 192 cores of the node, effectively wasting 176 cores. Instead, to benefit from the shared nodes, do not use the --exclusive flag:

$ salloc --account=<account_to_charge> --constraint=GENOA --job-name="interactive" --nodes=1 --ntasks-per-node=2 --cpus-per-task=8 --time=1:00:00
salloc: INFO : Considering its requirements, this job is treated in SHARED mode.
salloc: INFO : We cannot guarantee the performance reproducibility of such small jobs in this mode,
salloc: INFO : but they are only charged for the needed resources.
salloc:
salloc: INFO : As you didn't ask threads_per_core in your request: 2 was taken as default
salloc: INFO : This job requests 8 cores. Due to shared usage this job will be charged for 8 cores (of the 192 cores of the node)
salloc: Pending job allocation 41328
salloc: job 41328 queued and waiting for resources
salloc: job 41328 has been allocated resources
salloc: Granted job allocation 41328

Note the messages written to your shell. We see that 8 cores will be allocated and charged, instead of 192. Also, the shared mechanism informs us that we will get 2 threads per core (SMT, also called hyperthreading), for a total of 16 threads available to the program. This may not be what one wants; indeed, cores and threads are different! In the example above we wanted 8 cores per rank, not 8 threads. We thus have to add the --threads-per-core=1 SLURM flag, giving:

$ salloc --account=<account_to_charge> --constraint=GENOA --job-name="interactive" --nodes=1 --ntasks-per-node=2 --cpus-per-task=8 --time=1:00:00 --threads-per-core=1
salloc: INFO : Considering its requirements, this job is treated in SHARED mode.
salloc: INFO : We cannot guarantee the performance reproducibility of such small jobs in this mode,
salloc: INFO : but they are only charged for the needed resources.
salloc:
salloc: INFO : This job requests 16 cores. Due to shared usage this job will be charged for 16 cores (of the 192 cores of the node)
salloc: Pending job allocation 41342
salloc: job 41342 queued and waiting for resources
salloc: job 41342 has been allocated resources
salloc: Granted job allocation 41342

Had we used the following command (note the --exclusive), CINES would have had to charge for the whole 192 cores:

$ salloc --account=<account_to_charge> --constraint=GENOA --job-name="interactive" --nodes=1 --ntasks-per-node=2 --cpus-per-task=8 --time=1:00:00 --exclusive
salloc: Pending job allocation 41382
salloc: job 41382 queued and waiting for resources
salloc: job 41382 has been allocated resources
salloc: Granted job allocation 41382

Warning

For the shared mode to work properly, you should not use the --exclusive flag, and you should explicitly specify your resource requirements in the sbatch header or in the salloc command arguments.
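
As an illustration (the values are placeholders to adapt), a shared-mode batch header for the 2-rank GENOA example above could look like this; note the absence of --exclusive and the explicit per-task resources:

#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=GENOA
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH --threads-per-core=1
#SBATCH --mem=16G              # illustrative memory request, adapt to your needs
#SBATCH --time=1:00:00
# No --exclusive flag: the job is eligible for the shared node pool.

srun -- <executable> <arguments>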

Warning

If a shared partition is completely exhausted, your job will stay pending. Switching to exclusive mode may let your job start earlier if the pool of non-shared nodes is less used. This differs from, say, IDRIS’ Jean Zay machine where all the GPU nodes are shared. We assume HPC is synonymous with large jobs, spanning and scaling over multiple nodes. The shared mode is meant for debugging purposes, codes that do not scale, and short computations (scripts or post-processing). If you run many small jobs that can, put together, fill a whole node, you should use a whole node, not a shared one; check this document to learn how.

Quality Of Service (QoS) queues

On Adastra, queues are transparent: CINES does not disclose the QoS. The user should not try to specify anything related to this (such as --qos=). The SLURM scheduler will automatically place your job in the right QoS depending on the duration and the amount of resources requested.

Queue priority rules are harmonized between the three national computing centres (CINES, IDRIS and TGCC). A higher priority is given to large jobs, as Adastra is primarily dedicated to running large HPC jobs. The SLURM fairshare mechanism is active: assuming a linear consumption of the allocated hours over time, a user who is above this line will have a lower priority than a user who is below it. We may artificially lower a user’s priority if we notice bad practices (such as launching thousands of small jobs on an HPC machine). Priorities are calculated over a sliding window of one week. With a little patience, your job will eventually be processed.

The best advice we can give you is to correctly size your jobs. First, check which node configuration works best: adjust the number of MPI ranks, OpenMP threads and the binding on a single node. Then do some scaling tests. Finally, do not specify a SLURM --time argument larger than what you really need; this is the most common scheduler misconfiguration on the user’s side.

srun

The default job launcher for Adastra is srun. The srun command is used to execute an MPI enabled binary on one or more compute nodes in parallel. It is responsible for distributing the resources allocated by a salloc or sbatch command onto MPI ranks.

 $ srun  [OPTIONS... [executable [args...]]]
 $ srun --ntasks-per-node=24 --cpus-per-task=8 --threads-per-core=1 -- <executable> <arguments>
<output printed to terminal>

The output options have been removed since stdout and stderr are typically desired in the terminal window in this usage mode.

srun accepts the following common options:

-N, --nodes

Number of nodes

-n, --ntasks

Total number of MPI tasks (default is 1).

-c, --cpus-per-task=<ncpus>


Logical cores per MPI task (default is 1).
When used with --threads-per-core=1: -c is equivalent to physical cores per task.
We do not advise that you use this option when using --cpu-bind=none.
--cpu-bind=threads

Bind tasks to CPUs.
threads - (default, recommended) Automatically generate masks binding tasks to threads.
--threads-per-core=<threads>


In task layout, use the specified maximum number of hardware threads per core.
(default is 1; there are 2 hardware threads per physical CPU core).
Must also be set in salloc or sbatch if using --threads-per-core=2 in your srun command.

--kill-on-bad-exit=1

Terminate the whole job step if any task exits with a non-zero exit code.

-m, --distribution=<value>:<value>:<value>


Specifies the distribution of MPI ranks across compute nodes, sockets (L3 regions), and cores, respectively.
The default values are block:cyclic:cyclic, see man srun for more information.
Currently, the distribution setting for cores (the third <value> entry) has no effect on Adastra.
--ntasks-per-node=<ntasks>

If used without -n: requests that a specific number of tasks be invoked on each node.
If used with -n: treated as a maximum count of tasks per node.

--gpus

Specify the number of GPUs required for the job (total GPUs across all nodes).

--gpus-per-node

Specify the number of GPUs per node required for the job.

--gpu-bind=closest

Binds each task to the GPU which is on the same NUMA domain as the CPU core the MPI rank is running on.
See the --gpu-bind=closest example in Proper binding, why and how for more details.
--gpu-bind=map_gpu:<list>





Bind tasks to specific GPUs by setting GPU masks on tasks (or ranks) as specified where
<list> is <gpu_id_for_task_0>,<gpu_id_for_task_1>,.... If the number of tasks (or
ranks) exceeds the number of elements in this list, elements in the list will be reused as
needed starting from the beginning of the list. To simplify support for large task
counts, the lists may follow a map with an asterisk and repetition count. (For example
map_gpu:0*4,1*4).

--ntasks-per-gpu=<ntasks>

Request that there are ntasks tasks invoked for every GPU.

--label

Prefix every line written to stderr or stdout with <rank index>:, where <rank index> starts at zero
and matches the MPI rank of the writing process.
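
As an illustration combining several of the options above (the GPU list given to map_gpu is purely illustrative; refer to Proper binding, why and how for the mapping that actually matches the node topology):

$ srun --ntasks-per-node=8 --cpus-per-task=8 --threads-per-core=1 \
      --gpu-bind=map_gpu:0,1,2,3,4,5,6,7 --label \
      -- <executable> <arguments>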

Interactive jobs

Most users will find batch jobs an easy way to use the system, as they allow you to hand off a job to the scheduler and focus on other tasks while the job waits in the queue and eventually runs. Occasionally, it is necessary to run interactively, especially when developing, testing, modifying or debugging a code.

Since all compute resources are managed and scheduled by SLURM, it is not possible to simply log into the system and immediately begin running parallel codes interactively. Rather, you must request the appropriate resources from SLURM and, if necessary, wait for them to become available. This is done through an interactive batch job. Interactive batch jobs are submitted with the salloc command. Resources are requested via the same options that are passed via #SBATCH in a regular batch script (but without the #SBATCH prefix). For example, to request an interactive batch job with the same resources that the batch script above requests, you would use salloc --account=<account_to_charge> --constraint=MI250 --job-name="<job_name>" --nodes=1 --time=1:00:00 --exclusive. Note that there is no option for an output file: you are running interactively, so standard output and standard error will be displayed on the terminal.

You can then run the command you would generally put in the batch script: srun --nodes=1 --ntasks-per-node=2 --cpus-per-task=8 --gpu-bind=closest -- <executable> <arguments>.

If you want to connect to a node, you can directly ssh to it, assuming it is part of your allocation.
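
Putting the pieces together, an interactive session could look like the following sketch (the job identifier and node name are illustrative):

$ salloc --account=<account_to_charge> --constraint=MI250 --job-name="<job_name>" --nodes=1 --time=1:00:00 --exclusive
salloc: Granted job allocation 41400
$ srun --ntasks-per-node=8 --cpus-per-task=8 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>
$ squeue --me        # lists the node(s) of the allocation, say g1057
$ ssh g1057          # connect directly to the allocated node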

Small job

Allocating a single GPU

The line below will allocate 1 GPU and 8 cores (no SMT), for 60 minutes.

$ srun \
      --account=<account_to_charge> \
      --constraint=MI250 \
      --nodes=1 \
      --time=1:00:00 \
      --gpus-per-node=1 \
      --ntasks-per-node=1 \
      --cpus-per-task=8 \
      --threads-per-core=1 \
      -- <executable> <arguments>

Note

This is more of a hack than a serious usage of SLURM concepts or of HPC resources.

Packing

Note

We strongly advise that you get familiar with Adastra’s SLURM’s queuing concepts.

If your workflow consists of many small jobs, you may rely on the shared mode. That said, if you run many small jobs that can, put together, fill a whole node, you should use a whole node, not a shared one. This may shorten your queue time, as we have, and want to keep, a small shared node count.

This is how we propose you use a whole node:

#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=GENOA
#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --time=1:00:00

set -eu
set -x

# How many runs your logic needs.
STEP_COUNT=128

# Due to the parallel nature of the SLURM steps launched below, we need a
# way to properly log each one of them, hence the:
# 2>&1 | tee "StepLogs/${SLURM_JOBID}.${I}"
mkdir -p StepLogs

for ((I = 0; I < STEP_COUNT; I++)); do
    srun --exclusive --nodes=2 --ntasks-per-node=3 --cpus-per-task="4" --threads-per-core=1 --label \
        -- ./work.sh 2>&1 | tee "StepLogs/${SLURM_JOBID}.${I}" &
done

# We started STEP_COUNT steps AKA srun processes, wait for them.
wait

In the script above, all the steps are initiated immediately, but each step only starts when enough resources are available within the allocation (here we asked for 4 nodes). Here, work.sh represents your workload. This workload command will be executed STEP_COUNT*nodes_per_step*ntasks_per_node = 128*2*3 = 768 times, each time with 4 cores. SLURM will automatically fill the allocated resources (here 4 nodes), queueing and starting steps as needed.
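
The work.sh script is yours; as a purely illustrative sketch, it could look like this:

#!/bin/bash
# Hypothetical work.sh: each execution reports which step, rank and node it
# runs on, then performs the actual per-task workload.
echo "step ${SLURM_STEP_ID:-?}, rank ${SLURM_PROCID:-?}, on $(hostname)"
# ... replace with your real workload ...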

Chained job

SLURM offers a feature allowing the user to chain jobs. The user can, in fact, define a dependency graph of jobs.

As an example, say we want to start a job represented by my_first_job.sh and another job, my_second_job.sh, which should start only when my_first_job.sh has finished:

$ sbatch my_first_job.sh
Submitted batch job 189562
$ sbatch --dependency=afterok:189562 my_second_job.sh
Submitted batch job 189563
$ sbatch --dependency=afterok:189563 my_other_job.sh
Submitted batch job 189564

In this example we use the afterok trigger, meaning that a dependent job will start only if its parent job ends successfully (exit code 0).

You will then see something like this in squeue:

$ squeue --me
JOBID  PARTITION NAME USER ST TIME NODES NODELIST(REASON)
189562 mi250     test bae  R  0:04 1     g1057
189563 mi250     test bae  PD 0:00 1     (Dependency)
189564 mi250     test bae  PD 0:00 1     (Dependency)

Note the Dependency status.

You can replace afterok by after, afterany, afternotok or singleton. More information here: https://slurm.schedmd.com/sbatch.html#OPT_dependency
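
When chaining many jobs, capturing the job identifiers by hand becomes tedious. A small sketch relying on sbatch’s --parsable flag (which prints only the job identifier) could look like this, reusing the same scripts:

#!/bin/bash
# Submit a chain of jobs, each one starting only if the previous one succeeded.
FIRST_ID="$(sbatch --parsable my_first_job.sh)"
SECOND_ID="$(sbatch --parsable --dependency=afterok:${FIRST_ID} my_second_job.sh)"
sbatch --dependency=afterok:${SECOND_ID} my_other_job.sh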

Job array

Warning

If you launch job arrays, ensure that they do not contain more than 128 jobs or you will get an error related to AssocMaxSubmitJobLimit.
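
Job arrays are requested through the --array sbatch option. A minimal sketch (the argument passed to the executable is a placeholder) staying at the 128-job limit mentioned above could be:

#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=GENOA
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --time=1:00:00
#SBATCH --array=0-127          # 128 array tasks, the limit mentioned above

# Each array task gets its own index through SLURM_ARRAY_TASK_ID.
srun --ntasks-per-node=24 --cpus-per-task=8 --threads-per-core=1 \
    -- <executable> "case_${SLURM_ARRAY_TASK_ID}"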

Other common SLURM commands

The table below summarizes commonly-used SLURM commands:

sinfo

Used to view partition and node information.
e.g., to view user-defined details about the batch queue:
sinfo -p batch -o "%15N %10D %10P %10a %10c %10z"

squeue

Used to view job and job step information for jobs in the scheduling queue.
e.g., to see your own jobs:
squeue -l --me

sacct

Used to view accounting data for jobs and job steps in the job accounting log (currently in the queue or recently completed).
e.g., to see a list of specified information about all jobs submitted/run by a user since 1 PM on January 4, 2023:
sacct -u <login> -S 2023-01-04T13:00:00 -o "jobid%5,jobname%25,user%15,nodelist%20" -X

scancel

Used to signal or cancel jobs or job steps.
e.g., to cancel a job:
scancel <job_id>

We describe some of the usage of these commands below in Monitoring and modifying batch jobs.

Job state

A job will transition through several states during its lifetime. Common ones include:

State Code

State

Description

CA

Canceled

The job was canceled (could’ve been by the user or an administrator).

CD

Completed

The job completed successfully (exit code 0).

CG

Completing

The job is in the process of completing (some processes may still be running).

PD

Pending

The job is waiting for resources to be allocated.

R

Running

The job is currently running.

Job reason codes

In addition to state codes, jobs that are pending will have a reason code to explain why the job is pending. Completed jobs will have a reason describing how the job ended. Some codes you might see include:

Reason

Meaning

Dependency

Job has dependencies that have not been met.

JobHeldUser

Job is held at user’s request.

JobHeldAdmin

Job is held at system administrator’s request.

Priority

Other jobs with higher priority exist for the partition/reservation.

Reservation

The job is waiting for its reservation to become available.

AssocMaxJobsLimit

The job is being held because the user/project has hit the limit on running jobs.

AssocMaxSubmitJobLimit

The limit on the number of jobs a user is allowed to have running or pending at a given time has been met for the requested association (array).

ReqNodeNotAvail

The user requested a particular node, but it is currently unavailable (it is in use, reserved, down, draining, etc.).

JobLaunchFailure

Job failed to launch (could be due to system problems, an invalid program name, etc.).

NonZeroExitCode

The job exited with some code other than 0.

Many other states and job reason codes exist. For a more complete description, see the squeue man page (either on the system or online).

More reasons are given in the official SLURM documentation.

Monitoring and modifying batch jobs

scancel: Cancel or signal a job

SLURM allows you to signal a job with the scancel command. Typically, this is used to remove a job from the queue. In this use case, you do not need to specify a signal and can simply provide the jobid. For example, scancel 12345.

In addition to removing a job from the queue, the command gives you the ability to send other signals to the job with the -s option. For example, if you want to send SIGUSR1 to a job, you would use scancel -s USR1 12345.

squeue: View the job queue

The squeue command is used to show the batch queue. You can filter the level of detail through several command-line options. For example:

squeue --long

Show all jobs currently in the queue.

squeue --long --me

Show all of your jobs currently in the queue.

squeue --me --start

Show all of your jobs that have yet to start and show their expected start time.

sacct: Get job accounting information

The sacct command gives detailed information about jobs currently in the queue and recently-completed jobs. You can also use it to see the various steps within a batch job.

sacct -a -X

Show all jobs (-a) in the queue, but summarize the whole allocation instead of showing individual steps (-X).

sacct -u ${USER}

Show all of your jobs, and show the individual steps (since there was no -X option).

sacct -j 12345

Show all job steps that are part of job 12345.

sacct -u ${USER} -S 2022-07-01T13:00:00 -o "jobid%5,jobname%25,nodelist%20" -X

Show all of your jobs since 1 PM on July 1, 2022 using a particular output format.

scontrol show job: Get Detailed Job Information

In addition to holding, releasing, and updating the job, the scontrol command can show detailed job information via the show job subcommand. For example, scontrol show job 12345.

Note

scontrol show job can only report information on a job that is in the queue. That is, pending or running (but there are more states). A finished job is not in the queue and not queryable with scontrol show job.

Obtaining the energy consumption of a job

On Adastra, we enable users to monitor the energy their jobs consume.

$ sacct --format=JobID,ElapsedRaw,ConsumedEnergyRaw,NodeList --jobs=<job_id>
JobID          ElapsedRaw ConsumedEnergyRaw        NodeList
-------------- ---------- ----------------- ---------------
<job_id>              104          12934230 c[1000-1003,10+
<job_id>.batch        104             58961           c1000
<job_id>.0             85          12934230 c[1000-1003,10+

The user obtains, for a given <job_id>, the elapsed time in seconds and the energy consumption in joules for the whole job, for the execution of the batch script and for each job step. The job steps are suffixed with \.[0-9]+ (in regex form).

Each time you execute the srun command in a batch script, it creates a new job step. Here, there is only one srun step, which took 85 seconds and consumed 12934230 joules.
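
Dividing the energy by the elapsed time of a step gives an average power figure. A one-liner reusing the sacct fields shown above (the awk post-processing is merely a sketch) could be:

$ sacct --noheader --format=JobID,ElapsedRaw,ConsumedEnergyRaw --jobs=<job_id> | \
      awk '$2 > 0 { printf "%s: %.0f W on average\n", $1, $3 / $2 }'

For the step above, 12934230 joules over 85 seconds amounts to roughly 152’000 W aggregated over all the nodes of the step.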

Note

The duration of a step as reported by SLURM is not reliable for short steps. There may be an additional ~10 seconds.

Note

You will only get meaningful values regarding a job step once the job step has ended.

Note

The energy returned represents the aggregated node consumption. We do not include the network and storage costs, as these are trickier to obtain and represent a near fixed cost anyway (that is, whether or not you run your code).

Note

Some compute nodes may not return an energy consumption value. This leads to a value of 0, or an empty value, under ConsumedEnergyRaw. To work around the issue, one can use the following command: scontrol show node | grep -e "CurrentWatts=n/s" -e "CurrentWatts=0" -B15 | grep "NodeName=" | cut -d '=' -f 2 | awk '{print $1}' | tr '\n' ',' and feed the result to the SLURM commands’ --exclude= option. For instance: sbatch --exclude="$(scontrol show node | grep -e "CurrentWatts=n/s" -e "CurrentWatts=0" -B15 | grep "NodeName=" | cut -d '=' -f 2 | awk '{print $1}' | tr '\n' ',')" job.sh.

Note

The counters SLURM uses to compute the energy consumption are visible in the following files: /sys/cray/pm_counters/*.

Coredump files

If you start a program through our batch scheduler (SLURM) and your program crashes, you will find your coredump files in the ${SCRATCHDIR}/COREDIR/<job_id>/<hostname> directory. The ${SCRATCHDIR} corresponds to the scratch directory associated with your user and with the project specified in the #SBATCH --account=<account_to_charge> batch script option. The files are stored in different folders depending on the <job_id>. Additionally, if your job ran on multiple nodes, it is useful to be able to tell which coredump file originates from which node; thus, the <hostname> of the node is part of the coredump file path.

The coredump filename has the following structure: core_<signal>_<timestamp>_<process_name>.dump.<process_identifier> (the equivalent core pattern being core_%s_%t_%e.dump). As an example, you could obtain a coredump filename such as:

core_11_1693986879_segfault_testco.dump.2745754

You can then exploit a coredump file by using tools such as GDB like so:

$ gdb ./path/to/program.file ./path/to/coredump.file

You can find more information on GDB and coredump files here.
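
Once GDB has loaded the coredump, a few standard commands help locate the crash:

(gdb) bt              # backtrace of the crashing thread
(gdb) info threads    # list all the threads of the process
(gdb) thread 2        # switch to another thread
(gdb) frame 3         # select a specific stack frame
(gdb) info locals     # print the local variables of the selected frame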

Warning

Be careful not to fill your scratch space quota with coredump files, notably if you run a large job that crashes.

Note

On Adastra, ulimit -c unlimited is the default. The coredump placement to scratch works on the HPDA, MI250 and GENOA partitions. To deactivate the core dumping, run the following command in, say, your batch script: ulimit -c 0.

Note

Use gcore <pid> to explicitly generate a core file of a running program.

Warning

For the placement of the coredump files into the scratch to work, one needs to use either a batch script or the salloc + srun commands. Simply allocating (salloc) and ssh-ing to the node will not properly configure the coredump placement mechanism. Also, one needs to request nodes in exclusive mode for the placement to work (in shared mode it will not work).