Libraries & toolchains

CINES Spack modules

CINES provides libraries built using Spack. To access these products, check this document and/or look at the product catalog.

Example script for CINES Spack modules libraries

Warning

Work in progress.

Apex

Note

You may benefit from using a Python virtual environment; check this document on how to set one up.

Requirements:

You can compile Apex on Adastra using the commands below.

#!/bin/bash

set -eu

# Cloning the code on a login node.
git clone --branch="release/1.2.0" https://github.com/ROCm/apex || true
cd apex

# Build the code on a COMPUTE NODE and install it in the current Python
# environment.
module purge

module load cpe/23.12
module load craype-accel-amd-gfx90a craype-x86-trento
module load PrgEnv-gnu
module load amd-mixed
module load cray-python

module list

# Your VEnv may have been deactivated due to the modules loaded above.
# Source a VEnv if needed.
# source ./python_environment/bin/activate
# Ensure you have installed PyTorch for the correct GPUs.

pip3 install ninja
pip3 install --verbose --no-cache-dir --no-build-isolation --config-settings="--build-option=--cpp_ext" --config-settings="--build-option=--cuda_ext" ./

Warning

Apex has strong PyTorch and ROCm version dependencies. Check this document for supported PyTorch versions for a given Apex version.

DeepSpeed

When it comes to training large models, it becomes crucial to efficiently distribute the weights, gradients and optimizer states across devices. DeepSpeed provides first-class AMD support for such functionality.

Note

You may benefit from using a Python virtual environment; check this document on how to set one up.

Requirements:

JITed kernels

The default and simplest way to install DeepSpeed is to pip3 install deepspeed and let torch’s JIT C++ extension loader compile the kernels at runtime.
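For reference, a minimal sketch of this JIT route (assuming a Python virtual environment with a ROCm PyTorch already installed):

$ pip3 install deepspeed
$ # Nothing is compiled at install time; torch's C++ extension loader will build
$ # the required ops the first time they are used, on the compute node.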

Prebuilt kernels

You can compile DeepSpeed on Adastra using the commands below.

#!/bin/bash

set -eu

# Cloning the code on a login node.
git clone --branch="v0.14.1" https://github.com/microsoft/DeepSpeed || true
cd DeepSpeed

# Build the code on a COMPUTE NODE and install it in the current Python
# environment.
module purge

module load cpe/23.12
module load craype-accel-amd-gfx90a craype-x86-trento
module load PrgEnv-gnu
module load amd-mixed
module load cray-python

module list

# Your VEnv may have been deactivated due to the modules loaded above.
# source ./python_environment/bin/activate
# Ensure you have installed PyTorch for the correct GPUs.

# Ensure we don't have deepspeed already. Indeed it is a common error to first
# install deepspeed and then try to re-install with prebuilt kernels, it ends up
# doing nothing! To ensure we rebuild, the "--no-cache-dir" is also helpful.
pip3 uninstall --yes deepspeed

export DS_ACCELERATOR="cuda" # Compile to HIP even on login node.
export PYTORCH_ROCM_ARCH="gfx90a"

export DS_BUILD_OPS=1
export DS_BUILD_AIO=0                    # 0 KO missing libaio
export DS_BUILD_CCL_COMM=0               # 0 KO missing oneapi
export DS_BUILD_CPU_ADAGRAD=1
export DS_BUILD_CPU_ADAM=1
export DS_BUILD_CPU_LION=1
export DS_BUILD_EVOFORMER_ATTN=0         # 0 KO, reason unknown
export DS_BUILD_FP_QUANTIZER=0           # 0 KO cuda_bf16. not hip/hip_bf16.
export DS_BUILD_FUSED_ADAM=1
export DS_BUILD_FUSED_LAMB=1
export DS_BUILD_FUSED_LION=1
export DS_BUILD_INFERENCE_CORE_OPS=0     # 0 KO blas_utils_hip.h:60:1: error: ‘rocblas_operation’ does not name a type
export DS_BUILD_CUTLASS_OPS=0            # 0 KO missing deepspeed-kernels
export DS_BUILD_QUANTIZER=1
export DS_BUILD_RAGGED_DEVICE_OPS=0      # 0 KO missing deepspeed-kernels
export DS_BUILD_RAGGED_OPS=1
export DS_BUILD_RANDOM_LTD=1
export DS_BUILD_SPARSE_ATTN=0            # 0 KO
export DS_BUILD_SPATIAL_INFERENCE=1
export DS_BUILD_STOCHASTIC_TRANSFORMER=1
export DS_BUILD_TRANSFORMER_INFERENCE=0  # 0 KO inference_cublas_wrappers_hip.h:24:20: error: ‘rocblas_operation’ was not declared in this scope
export DS_BUILD_TRANSFORMER=1

pip3 install --verbose --no-cache-dir --no-build-isolation ./

Warning

Regardless of the DS_BUILD_* options you choose, do run the ds_report command on a compute node to check if the library is working properly. You should see a bunch of OKAY in the right column (compatible) and, if you chose to build the “ops” (DS_BUILD_*), a bunch of YES in the left column (installed).
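As an illustration, a hedged sketch of such a check on a compute node (the account and constraint placeholders follow the batch scripts used elsewhere in this document):

$ srun --account=<account_to_charge> --constraint=MI250 --nodes=1 --time=0:10:00 \
      -- ds_report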

Some environment variables

Feature/flag/environment variable

Explanation

export DS_ENV_FILE=<path_to_spack_configuration>

Can be used to factorize a configuration. See this document on what should/can go into this file.
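As an illustration only, a hypothetical sketch of such a file; the variable names are mere examples of what one may want to propagate to every rank:

$ cat ./deepspeed_env_file
HF_HUB_OFFLINE=1
MIOPEN_USER_DB_PATH=/tmp/my-miopen-cache
$ export DS_ENV_FILE="$(pwd)/deepspeed_env_file"

The file is assumed to contain NAME=VALUE lines that DeepSpeed exports in the environment of the launched ranks.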

Fast attention

When it comes to training transformer-based LLMs, it becomes crucial to compute the attention mechanism efficiently. For Nvidia GPUs, a reference implementation is flash-attn. To ensure seamless interoperation of a PyTorch code on Nvidia or AMD GPUs, AMD forked the project. That being said, the AMD package is not available in the official PyPI repository; you have to compile it yourself.

Note

You may benefit from using a Python virtual environment; check this document on how to set one up.

Requirements:

You can compile flash-attn on Adastra using the commands below.

#!/bin/bash

set -eu

# Cloning the code on a login node.
git clone --branch="flash_attention_for_rocm" https://github.com/ROCm/flash-attention || true
cd flash-attention
git submodule init
git submodule update

# Build the code on a COMPUTE NODE and install it in the current Python
# environment.
module purge

module load cpe/23.12
module load craype-accel-amd-gfx90a craype-x86-trento
module load PrgEnv-gnu
module load amd-mixed
module load cray-python

module list

# Your VEnv may have been deactivated due to the modules loaded above.
# Source a VEnv if needed.
# source ./python_environment/bin/activate
# Ensure you have installed PyTorch for the correct GPUs.

pip3 install ninja einops packaging

export GPU_ARCHS="gfx90a"

pip3 install --verbose --no-cache-dir --no-build-isolation ./

HDF5

If you bypass the Cray compiler wrappers, you need to do more work and risk mixing incompatible libraries or ABIs. Use the information given below at your own risk:

Implementation

Module

Compiler or wrapper

Header Files & Linking

HDF5

cray-hdf5

cc, CC, ftn (Cray compiler wrappers)

The wrappers will automatically add the header flags and the linker flags.

Raw compilers.

First, ensure you have the module loaded then,
to compile: -isystem "${CRAY_HDF5_DIR}/<compiler>/<version>/include";
to link: -L"${CRAY_HDF5_DIR}/<compiler>/<version>/lib" -lhdf5_hl -lhdf5 -lhdf5_hl_cpp -lhdf5_cpp -lhdf5hl_fortran -lhdf5_fortran.
Or (more concisely):
to compile: -isystem "${CRAY_HDF5_PREFIX}/include";
to link: -L"${CRAY_HDF5_PREFIX}/lib" -lhdf5_hl -lhdf5 -lhdf5_hl_cpp -lhdf5_cpp -lhdf5hl_fortran -lhdf5_fortran.

Parallel HDF5

cray-hdf5-parallel

cc, CC, ftn (Cray compiler wrappers)

The wrappers will automatically add the header flags and the linker flags.

Raw compilers.

First, ensure you have the module loaded then,
to compile: -isystem "${CRAY_HDF5_PARALLEL_DIR}/<compiler>/<version>/include";
to link: -L"${CRAY_HDF5_PARALLEL_DIR}/<compiler>/<version>/lib" -lhdf5_hl_parallel -lhdf5_parallel.
Or (more concisely):
to compile: -isystem "${CRAY_HDF5_PARALLEL_PREFIX}/include";
to link: -L"${CRAY_HDF5_PARALLEL_PREFIX}/lib" -lhdf5_hl_parallel -lhdf5_parallel.

Note

The cray-hdf5[-parallel] modules define the ${HDF5_ROOT} environment variable. It is equivalent to ${CRAY_HDF5_PARALLEL_PREFIX}.

Where <compiler> can be [AMD|AOCC|CRAYCLANG|GNU|INTEL|NVIDIA]; <version> is left as a choice for the user.

Generally, one relies on the compiler wrappers like so:

$ module load craype-x86-genoa
$ module load PrgEnv-gnu
$ # The cc, CC, ftn wrappers will automatically add the header flags and the
$ # linker flags.
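Should you really need the raw compilers, a hedged sketch using the variables described above (the source file name is hypothetical):

$ module load PrgEnv-gnu
$ module load cray-hdf5
$ gcc -isystem "${CRAY_HDF5_PREFIX}/include" -c my_hdf5_io.c
$ gcc my_hdf5_io.o -o my_hdf5_io -L"${CRAY_HDF5_PREFIX}/lib" -lhdf5_hl -lhdf5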

BLAS, LAPACK & ScaLAPACK

Note

LibSci is based on OpenBLAS (GotoBLAS). You probably do NOT need to compile your own BLAS.

Note

We have observed that applications using LibSci can achieve better performance. We recommend LibSci if your code allows it.

LibSci

LibSci is a collection of numerical routines. It contains, notably:

  • a Basic Linear Algebra Subroutines (BLAS);

  • a C interface to BLAS (CBLAS);

  • a Basic Linear Algebra Communication Subprograms (BLACS);

  • a Linear Algebra PACKage (LAPACK);

  • and a Scalable LAPACK (ScaLAPACK).

Two flavors of this library exist, one for CPUs exposed through the cray-libsci module and a CPU/GPU hybrid one, exposed through the cray-libsci_acc module.

CPU

If you bypass the Cray compiler wrappers, you need to do more work and risk mixing incompatible libraries or ABIs. Use the information given below at your own risk:

Implementation

Module

Compiler or wrapper

Header Files & Linking

LibSci

cray-libsci

cc, CC, ftn (Cray compiler wrappers)

The wrappers will automatically add the header flags and the linker flags.

Raw compilers.

First, ensure you have the module loaded then,
to compile: -isystem "${CRAY_PE_LIBSCI_BASE_DIR}/<compiler>/<version>/<cpu>/include";
to link: -L"${CRAY_PE_LIBSCI_BASE_DIR}/<compiler>/<version>/<cpu>/lib" -lsci_<compiler> -lsci_<compiler>_<variant>.
Or (more concisely, use ${CRAY_PE_LIBSCI_PREFIX} for craype >= 2.7.30, else ${CRAY_LIBSCI_PREFIX_DIR}):
to compile: -isystem "${CRAY_PE_LIBSCI_PREFIX}/include";
to link: -L"${CRAY_PE_LIBSCI_PREFIX}/lib" -lsci_<compiler> -lsci_<compiler>_<variant>.

Where compiler can be [AMD|AOCC|CRAYCLANG|GNU|INTEL|NVIDIA]; cpu is x86_64. sci_<compiler>_<variant> is left as a choice for the user:

Library name                  Variant   Link type        Threading                          MPI      Service provided
libsci_<compiler>.a           -         Static           Serial non-threaded                non-MPI  BLAS/CBLAS/LAPACK
libsci_<compiler>.so          -         Dynamic shared   Serial non-threaded                non-MPI  BLAS/CBLAS/LAPACK
libsci_<compiler>_mp.a        mp        Static           Parallel multi-threaded (OpenMP)   non-MPI  BLAS/CBLAS/LAPACK
libsci_<compiler>_mp.so       mp        Dynamic shared   Parallel multi-threaded (OpenMP)   non-MPI  BLAS/CBLAS/LAPACK
libsci_<compiler>_mpi.a       mpi       Static           Serial non-threaded                MPI      BLACS/ScaLAPACK
libsci_<compiler>_mpi.so      mpi       Dynamic shared   Serial non-threaded                MPI      BLACS/ScaLAPACK
libsci_<compiler>_mpi_mp.a    mpi_mp    Static           Parallel multi-threaded (OpenMP)   MPI      BLACS/ScaLAPACK
libsci_<compiler>_mpi_mp.so   mpi_mp    Dynamic shared   Parallel multi-threaded (OpenMP)   MPI      BLACS/ScaLAPACK

The BLACS/ScaLAPACK libraries require a BLAS and LAPACK; you should link one of the above mentioned libraries providing such facilities.

Note

BLAS and LAPACK routines do not require MPI and are contained in the non-MPI version of the libraries. The ScaLAPACK routines are imported by linking with the MPI version of the library. The MPI version of the library implies a dependency on the non-MPI version (i.e., libsci_<compiler>_mpi.so depends on the routines of libsci_<compiler>.so). Obviously, this dependency chain does NOT work for static libraries; in that case you have to explicitly add libsci_<compiler>.a.
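For instance, a hedged sketch of an explicit link line with the GNU toolchain and the raw compilers (object file names are hypothetical); when linking statically, both libraries must appear, with the ScaLAPACK one (-lsci_gnu_mpi) before the BLAS/LAPACK one (-lsci_gnu):

$ gfortran main.o solver.o -o app \
      -L"${CRAY_PE_LIBSCI_PREFIX}/lib" -lsci_gnu_mpi -lsci_gnu \
      -L"${CRAY_MPICH_PREFIX}/lib" -lmpi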

Warning

The static versions of LibSci for the AMD (ROCm) compiler need the following additional linker flags: -L${ROCM_PATH}/llvm/lib -lflang -lpgmath -lflangrti -lm -ldl.

Regarding the OpenMP variants: if you call a LibSci routine inside an OpenMP parallel region, the routine will always run serially. If you call a routine outside of a parallel region and the LibSci linked is OpenMP enabled, the routine will be threaded; otherwise it will be serial.

Generally, one relies on the compiler wrappers like so:

$ module load craype-x86-genoa
$ module load PrgEnv-gnu
$ # The cc, CC, ftn wrappers will automatically add the header flags and the
$ # linker flags.

By default, the compiler wrappers will link with a single-threaded BLAS, LAPACK and ScaLAPACK. If you add the -fopenmp flag, the multi-threaded version will be used instead.

$ # Implies (as needed) lsci_<compiler>_mpi.so    and lsci_<compiler>.so
$ CC          main.cc
$ # Implies (as needed) lsci_<compiler>_mpi_mp.so and lsci_<compiler>_mp.so
$ CC -fopenmp main.cc

More details in:

$ module load cray-libsci
$ man intro_libsci

Note

The serial LibSci should be compatible across toolchains (thanks to proper C interfaces). This is not the case with the multi-threaded versions because the OpenMP backend differs across toolchains.

Warning

If using a mixed PrgEnv (say, PrgEnv-cray-amd), the ${CRAY_PE_LIBSCI_PREFIX} variable points towards a build for the C/C++ compiler. For PrgEnv-cray-amd it would be the CCE. For PrgEnv-gnu-amd it would be GNU.

CPU/GPU hybrid

The accelerated LibSci is augmented with GPU capabilities. It can decide for itself whether to use the GPU or CPU version of a routine.

Before calling such routines, one needs to initialize the library:

Fortran API:
    subroutine libsci_acc_init()
    subroutine libsci_acc_finalize()
C/C++ API:
    void libsci_acc_init(void);
    void libsci_acc_finalize(void);

If you bypass the Cray compiler wrappers, you need to do more work and risk mixing incompatible libraries or ABIs. Use the information given below at your own risk:

Implementation

Module

Compiler or wrapper

Header Files & Linking

LibSci_accel

cray-libsci_acc

cc, CC, ftn (Cray compiler wrappers)

The wrappers will automatically add the header flags and the linker flags.

Raw compilers.

First, ensure you have the module loaded then,
to compile: -isystem "${LIBSCI_ACC_BASE_DIR}/<compiler>/<version>/<cpu>/include";
to link: -L"${LIBSCI_ACC_BASE_DIR}/<compiler>/<version>/<cpu>/lib" -lsci_acc_<compiler>_<variant>.
Or (more concisely, using ${CRAY_LIBSCI_ACC_PREFIX_DIR}):
to compile: -isystem "${CRAY_LIBSCI_ACC_PREFIX_DIR}/include";
to link: -L"${CRAY_LIBSCI_ACC_PREFIX_DIR}/lib" -lsci_acc_<compiler>_<variant>.

Where compiler can be [AMD|CRAYCLANG|GNU]; cpu is x86_64. sci_acc_<compiler>_<variant>’s <variant> is conditioned by the architecture (gfx90a for MI250X).

Generally, one relies on the compiler wrappers like so:

$ module load craype-x86-trento craype-accel-amd-gfx90a
$ module load PrgEnv-cray
$ module load cray-libsci_acc
$ # The cc, CC, ftn wrappers will automatically add the header flags and the
$ # linker flags.

Warning

To properly link, the compiler wrappers need the architecture modules to be loaded!

More details in:

$ module load cray-libsci_acc
$ man intro_libsci_acc

Some environment variables

Feature/flag/environment variable

Explanation

export CRAYBLAS_PROFILING_VERBOSITY=2

Log profiling data about the Cray BLAS (libsci).

export CRAYBLAS_LEVEL1_LEGACY=1

As we mentioned above, LibSci is based on OpenBLAS. These environment variables make routine calls into LibSci be forwarded directly to OpenBLAS. In that case, LibSci should behave as if it were OpenBLAS.

export CRAYBLAS_LEVEL2_LEGACY=1

export CRAYBLAS_LEVEL3_LEGACY=1

OpenBLAS

CINES provides OpenBLAS Spack products for the codes that strictly require it and forbid any other implementation.

Note

You should first try to adapt your code so that it is flexible enough to accept a wrapper-provided BLAS/LAPACK, then use the Cray LibSci.

Check the catalog for the exact module names or use module spider openblas.

Low level primitives

Not all the libraries shown below are guaranteed to be available on Adastra. If you have a strong need for a particular library to be installed by CINES, please notify svp@cines.fr.

For each operation, the typical libraries are listed per target (AMD GPU, Nvidia GPU, x86 CPU):

  • BLAS/CBLAS (level 1, 2, 3):
    AMD GPU: rocBLAS (hipBLAS)/cray-libsci_acc;
    Nvidia GPU: cuBLAS (hipBLAS)/cray-libsci_acc;
    x86 CPU: cray-libsci/Intel MKL/OpenBLAS/AMD BLIS.

  • Dense system solvers:
    AMD GPU: rocSOLVER (hipSOLVER)/cray-libsci_acc/Magma/SLATE;
    Nvidia GPU: cuSOLVER (hipSOLVER)/cray-libsci_acc/Magma/SLATE;
    x86 CPU: cray-libsci/Magma/Intel MKL/OpenBLAS/AMD libFLAME/LAPACK/ScaLAPACK/SLATE.

  • Sparse BLAS (level 1, 2, 3):
    AMD GPU: rocSPARSE (hipSPARSE);
    Nvidia GPU: cuSPARSE (hipSPARSE);
    x86 CPU: Intel MKL.

  • Sparse system solvers (preconditioners and geometric or algebraic multigrid solvers):
    AMD GPU: rocALUTION/PETSc/Ginkgo;
    Nvidia GPU: Nvidia AMGx/PETSc/Ginkgo/HYPRE;
    x86 CPU: Intel MKL/PETSc/HYPRE.

  • Discrete Fourier transform:
    AMD GPU: rocFFT (hipFFT)/vkFFT;
    Nvidia GPU: cuFFT (hipFFT)/vkFFT;
    x86 CPU: cray-fftw/FFTW/Intel MKL.

  • Deep learning primitives:
    AMD GPU: MIOpen;
    Nvidia GPU: cuDNN;
    x86 CPU: Caffe/Uncountable.

  • Algorithmic building blocks:
    AMD GPU: rocThrust;
    Nvidia GPU: Thrust;
    x86 CPU: Thrust/C++ standard library.

  • Communication primitives:
    AMD GPU: MPI/RCCL;
    Nvidia GPU: MPI/NCCL;
    x86 CPU: MPI/PVM.

  • Pseudo random number generation:
    AMD GPU: rocRAND (hipRAND);
    Nvidia GPU: cuRAND (hipRAND);
    x86 CPU: Uncountable.

  • Low level GPU primitives:
    AMD GPU: rocPRIM (hipCUB);
    Nvidia GPU: CUB (hipCUB);
    x86 CPU: -.

Note

AMD provides compatibility layers prefixed with hip, like hipBLAS. These libraries provide a stable interface allowing the user to change the backend from rocBLAS to cuBLAS easily. These compatibility layers should be used instead of the platform-specific rocBLAS/cuBLAS.

Note

Not all the technologies on a row are strictly equivalent.

MIOpen

If using MIOpen, we strongly recommend that you use the environment variables shown below. They serve to mitigate the contention that can occur because MIOpen writes an auto-tuning cache database in the home directory.

$ export MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"
$ export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"

Or (disable auto-tuning caching entirely, less recommended):

$ export MIOPEN_DEBUG_DISABLE_FIND_DB=1

In addition to making large distributed training more scalable, it can prevent issues of the following form:

Failed to store record to find-db at </lus/home/<user_path>/.config/miopen//....ufdb.txt>
Ill-formed record: key not found: /lus/home/<user_path>/.config/miopen//....ufdb.txt

MPI

The provided, and thus default and preferred, MPI implementation on Cray systems such as Adastra is Cray’s MPICH.

Cray MPICH

You will find information regarding Cray MPICH in the INTRO_MPI(3) man page or, as Cray mirrors some of its documentation online, at this URL: https://cpe.ext.hpe.com/docs/mpt/mpich/index.html#mpich.

Some features of the Cray MPICH implementation are that:

  • it supports the MPI 3.1 standard;

  • it is GPU aware, meaning it gives the ability to pass a GPU buffer to the MPI API;

  • it is GPU direct, meaning it can transparently (with few visible artifacts) use RDMA techniques between GPUs. This can elide copies.

The Cray compiler wrappers explain the absence of the usual MPI compiler wrappers such as mpicc (which often only add header and link flags): you implicitly link with Cray’s MPICH when you use the Cray compiler wrappers (cc, CC and ftn) instead of the common mpicc, mpicxx and mpif90 wrappers. The cray-mpich module is automatically loaded when you load a PrgEnv-*.

If you bypass the Cray compiler wrappers, you need to do more work and risk mixing incompatible libraries or ABIs. Use the information given below at your own risk:

Implementation

Module

Compiler or wrapper

Header Files & Linking

Cray MPICH

cray-mpich

cc, CC, ftn (Cray compiler wrappers)

The wrappers will automatically add the header flags and the linker flags.

Raw compilers and GPU-direct MPI.

First, ensure you have the module loaded. This is implicit to the usage of a PrgEnv-* then,
to compile: -isystem ${CRAY_MPICH_PREFIX}/include;
to link: -L${CRAY_MPICH_PREFIX}/lib -lmpi
${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}.

Raw compilers and CPU only MPI.

First, ensure you have the module loaded. This is implicit to the usage of a PrgEnv-* then,
to compile: -isystem ${CRAY_MPICH_PREFIX}/include;
to link: -L${CRAY_MPICH_PREFIX}/lib -lmpi.
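For example, a hedged sketch of a GPU-direct MPI build with a raw Clang based compiler (the compiler and source file names are illustrative):

$ amdclang++ -isystem "${CRAY_MPICH_PREFIX}/include" -c my_mpi_app.cc
$ amdclang++ my_mpi_app.o -o my_mpi_app \
      -L"${CRAY_MPICH_PREFIX}/lib" -lmpi \
      ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}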

GPU aware & GPU direct

To use Cray MPICH’s GPU-aware functionality, the user must load some modules and set an environment variable. You need to load cray-mpich, but know that any PrgEnv-* loads it for you. As an example, with PrgEnv-cray:

$ module purge
$ module load cpe/23.12
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ module load PrgEnv-cray
$ module load amd-mixed
$ # This is the environment variable that tells Cray MPICH to enable GPU aware features.
$ export MPICH_GPU_SUPPORT_ENABLED=1

If you choose to use the Cray compiler wrappers, make sure you correctly load the modules used to target an architecture. If the craype-accel-amd-gfx90a module is not loaded, the Cray compiler wrappers won’t link with the mpi_gtl_hsa library and, at runtime, if you set export MPICH_GPU_SUPPORT_ENABLED=1, you will get an error like this:

MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked

Warning

The export MPICH_GPU_SUPPORT_ENABLED=1 environment variable is required for Cray MPICH to provide GPU support through the GPU Transport Layer (GTL) plugin. We will not always add it in the scripts below but do not forget it if needed!

Some environment variables

Feature/flag/environment variable

Explanation

export MPICH_VERSION_DISPLAY=1

If enabled, causes MPICH to display the Cray MPI version number as well as build date information.

export MPICH_ENV_DISPLAY=1

If set, causes rank 0 to display all MPICH environment variables and their current settings at MPI initialization time.

export MPICH_MEMORY_REPORT=3

If set to 1, print a summary of the min/max high water mark and associated rank to stderr. If set to 2, output each rank’s high water mark to a file as specified using ${MPICH_MEMORY_REPORT_FILE}. If set to 3, do both 1 and 2.

export MPICH_ABORT_ON_ERROR=1

If enabled, causes MPICH to abort and produce a core dump when MPICH detects an internal error. Note that the core dump size limit (usually 0 bytes by default) must be reset to an appropriate value in order to enable coredumps.

export MPICH_CPUMASK_DISPLAY=1

Dump the cpuset associated to each rank (verbose).

export MPICH_MPIIO_HINTS_DISPLAY=1

If enabled, causes rank 0 in the participating communicator to display the names and values of all MPI-IO hints that are set for the file being opened with the MPI_File_open call.

export MPICH_OFI_VERBOSE=1

If set, more verbose output will be displayed during MPI_Init to verify which libfabric provider has been selected, along with the name and address of the NIC being used. This may be helpful for debugging errors encountered during MPI_Init.

export MPICH_OFI_NIC_VERBOSE=[0-2]

If set to 1, verbose information pertaining to NIC selection is printed at the start of the job. All available NIC domain names, addresses and index values are displayed. Setting this variable to 2 displays additional details, including the specific NIC each rank has been assigned, which is based on ${MPICH_OFI_NIC_POLICY}.

export MPICH_OFI_CXI_COUNTER_REPORT=[0-5]

Determines if Cassini (CXI) counters are collected during the application and the verbosity of the counter data report displayed during MPI_Finalize.

export MPICH_OFI_CXI_COUNTER_VERBOSE=1

If set to a non-zero value, this enables more verbose output about the Cassini counters being collected. Can be helpful for debugging and/or identifying which counters are being collected. Only applicable to Slingshot 11.

export FI_LOG_LEVEL=[debug|info|trace|warn]

Controls the amount of logging data that is output.

export FI_CXI_RX_MATCH_MODE=hybrid

Use this for flow control or queue exhaustion problems. We see a lot of these when a burst of communication occurs.

export MPICH_SINGLE_HOST_ENABLED=0

When all the ranks are on the same node, forces the traffic through the NIC instead of using node local optimization (shared memory).

export MPICH_SMP_SINGLE_COPY_MODE=NONE

Disable P2P IPC for CPU to CPU, CPU to GPU and GPU to GPU.

Further reading: man intro_mpi.

Detecting if MPI is GPU aware

As far as we know, there is no standard way to detect if an MPI implementation is GPU aware, even across MPI extensions (MPIX). We propose the following GPU-aware MPI check implementation that should work for recent Open MPI and Cray MPICH.

Note

There is an issue on the official MPI standardization committee forum: https://github.com/mpi-forum/mpi-issues/issues/580; but note that it will take quite some time before it is implemented universally.

The client code would look like this:

#include <mpi.h>

#include <iostream>

#include "mpi_gpu_aware.h"

int main(int argc, char** argv) {
    ::MPI_Init(&argc, &argv);

#if defined(MPI_GPU_AWARE_API_SUPPORT)
    std::cout << "[compiletime] Is MPI GPU aware API supported: true\n";
#else
    std::cout << "[compiletime] Is MPI GPU aware API supported: false\n";
#endif

    std::cout << "[runtime]     Is MPI GPU aware:               " << ((::mpi_gpu_aware() < 0) ? "false" : "true") << "\n";

    ::MPI_Finalize();
    return 0;
}

The mpi_gpu_aware.h code is:

#ifndef MPI_GPU_AWARE_H
#define MPI_GPU_AWARE_H

#include <mpi.h>

#if defined(CRAY_MPICH_VERSION) && defined(MPIX_GPU_SUPPORT_CUDA)
    #define MPI_GPU_AWARE_CRAYMPICH_API_SUPPORT       1
    #define MPI_GPU_AWARE_CRAYMPICH_POTENTIAL_SUPPORT 1

#elif defined(OPEN_MPI)

    #include <mpi-ext.h>

    #if(defined(OMPI_HAVE_MPI_EXT_ROCM) && OMPI_HAVE_MPI_EXT_ROCM) || \
        (defined(OMPI_HAVE_MPI_EXT_CUDA) && OMPI_HAVE_MPI_EXT_CUDA)

        #define MPI_GPU_AWARE_OPENMPI_API_SUPPORT 1

        #if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
            #define MPI_GPU_AWARE_OPENMPI_POTENTIAL_SUPPORT 1
        #endif
    #endif
#endif

#if defined(MPI_GPU_AWARE_CRAYMPICH_API_SUPPORT) || \
    defined(MPI_GPU_AWARE_OPENMPI_API_SUPPORT)

    /// The MPI implementation supports a runtime API we can use to ensure GPU
    /// awareness.
    ///
    #define MPI_GPU_AWARE_API_SUPPORT 1
#endif

#if defined(MPI_GPU_AWARE_CRAYMPICH_POTENTIAL_SUPPORT) || \
    defined(MPI_GPU_AWARE_OPENMPI_POTENTIAL_SUPPORT)

    /// We assume it is likely that the MPI implementation is GPU aware.
    ///
    #define MPI_GPU_AWARE_POTENTIAL_SUPPORT 1
#endif

/// GPU aware MPI is a runtime decision on Cray MPICH. It relies on a library
/// called GPU Transport Layer (GTL). This library is linked by the wrapper when
/// a GPU architecture module is loaded (e.g. craype-accel-amd-gfx90a).
/// It is not possible to determine at compile time if the MPI implementation
/// will be GPU aware. We can know if the MPI could have that feature enabled,
/// but not if it will be enabled.
/// Returns: < 0 if we could not determine if the MPI is GPU aware.
///
extern "C" int mpi_gpu_aware(void);

#endif

The mpi_gpu_aware.cc code is:

#include <cstdlib>

#include "mpi_gpu_aware.h"

namespace detail {

    static inline int
    EnsureMPIIsInitialized() {
        int is_mpi_initialized = 0;

        if((::MPI_Initialized(&is_mpi_initialized) != MPI_SUCCESS) ||
        (is_mpi_initialized == 0)) {
            return -1;
        }

        // At this point MPI should be initialized. Except if another thread
        // called MPI_Finalize() in between.

        return 0;
    }

#if defined(MPI_GPU_AWARE_CRAYMPICH_API_SUPPORT)
    static inline int
    getenv() {
        const char* environment_string = std::getenv("MPICH_GPU_SUPPORT_ENABLED");
        if(environment_string == NULL) {
            return -1;
        }

        if(environment_string[0] != '1') {
            return -1;
        }

        return 0;
    }

    /// Rely on MPIX_GPU_query_support. It requires that we have
    /// MPIX_GPU_SUPPORT_CUDA defined to plug into the MPIX_GPU_query_support.
    /// https://www.mpich.org/static/docs/v4.0.x/www3/MPIX_GPU_query_support.html
    ///
    static inline int
    MPIx() {
        int result = 0;

        const auto CheckGPUKindSupport = [](int offset, int gpu_kind) -> int {
            int is_gpu_kind_supported = 0;

            if(::MPIX_GPU_query_support(gpu_kind, &is_gpu_kind_supported) != MPI_SUCCESS) {
                return -1;
            }

            return is_gpu_kind_supported << offset;
        };

    #if defined(MPIX_GPU_SUPPORT_CUDA)
        result |= CheckGPUKindSupport(0, MPIX_GPU_SUPPORT_CUDA);
    #endif

    #if defined(MPIX_GPU_SUPPORT_ZE)
        result |= CheckGPUKindSupport(1, MPIX_GPU_SUPPORT_ZE);
    #endif

    #if defined(MPIX_GPU_SUPPORT_HIP)
        result |= CheckGPUKindSupport(2, MPIX_GPU_SUPPORT_HIP);
    #endif
        return result == 0 ? -1 : result;
    }

    static inline int
    DoCheck() {
        if(getenv() < 0) {
            return -1;
        }

        if(EnsureMPIIsInitialized() < 0) {
            return -1;
        }

        if(MPIx() < 0) {
            return -1;
        }

        return 0;
    }

#elif defined(MPI_GPU_AWARE_OPENMPI_API_SUPPORT)
    static inline int
    MPIx() {
        int result = 0;

    #if defined(MPI_GPU_AWARE_OPENMPI_API_SUPPORT) && defined(OMPI_HAVE_MPI_EXT_CUDA)
        if(::MPIX_Query_cuda_support() != 1) {
            return -1;
        }
        result |= 1 << 0;
    #endif

        // #if defined(MPI_GPU_AWARE_OPENMPI_API_SUPPORT) && defined(OMPI_HAVE_MPI_EXT_ZERO)
        //         if(::MPIX_Query_zero_support() != 1) {
        //             return -1;
        //         }
        //         result |= 1 << 2;
        // #endif

    #if defined(MPI_GPU_AWARE_OPENMPI_API_SUPPORT) && defined(OMPI_HAVE_MPI_EXT_ROCM)
        if(::MPIX_Query_rocm_support() != 1) {
            return -1;
        }
        result |= 1 << 2;
    #endif

        return result == 0 ? -1 : result;
    }

    static inline int
    DoCheck() {
        if(EnsureMPIIsInitialized() < 0) {
            return -1;
        }

        if(MPIx() < 0) {
            return -1;
        }

        return 0;
    }
#elif !defined(MPI_GPU_AWARE_API_SUPPORT)
    static inline int
    DoCheck() {
        return -1;
    }
#endif
} // namespace detail

extern "C" int mpi_gpu_aware(void) {
    static const int result = detail::DoCheck();
    return result;
}
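As a usage sketch, the two files above can be built with the Cray compiler wrappers, following the GPU aware module selection described earlier (the binary name is arbitrary):

$ CC -c mpi_gpu_aware.cc
$ CC main.cc mpi_gpu_aware.o -o mpi_gpu_aware_check
$ export MPICH_GPU_SUPPORT_ENABLED=1
$ srun ... -- ./mpi_gpu_aware_check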

OpenMP

CPU

Implementation

Compiler or wrapper

Header Files & Linking

Compiler specific OpenMP

cc, CC, ftn (Cray compiler wrappers)

To compile: -fopenmp;
to link: -fopenmp.

Raw compilers. For CCE, AMD, GCC.

Note

You may observe the -openmp flag in CMake logs or other documentation. In the CPE context, it is an alias to -fopenmp.

Note

As of 2023/06/01, the Cray compiler provides full OpenMP 4.5 support and near complete OpenMP 5.0 and 5.1 support (as it depends on Clang’s OpenMP support). The OpenMP version reported to C and C++ codes through the _OPENMP preprocessor definition is 201811, which means version 5.0.

GPU offload

This section shows how to compile with OpenMP Offload using the different compilers covered above.

Note

Make sure the craype-accel-amd-gfx90a module is loaded when using OpenMP offload.

Implementation

Compiler or wrapper

Header Files & Linking

Compiler specific OpenMP

cc, CC, ftn (Cray compiler wrappers)

To compile: -fopenmp;
to link: -fopenmp.

Raw compilers. For CCE, AMD.

To compile: -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<hardware>;
to link: -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<hardware>.

With <hardware> set to gfx90a (MI200).
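As an illustration, a hedged sketch with the raw AMD compiler (the source file name is hypothetical):

$ module load craype-accel-amd-gfx90a craype-x86-trento
$ module load PrgEnv-amd
$ amdclang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa \
      -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a \
      offload_saxpy.c -o offload_saxpy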

Warning

(When using CCE + craype-accel-amd-gfx90a) CMake is known to have some issues with Cray’s GPU offload implementation’s special linker (cce_omp_offload_linker). This causes link time issues or false positives (regarding running on the GPU). Check this document to work around this problem.

Warning

(When using CrayPE + AMD + craype-accel-amd-gfx90a) The Cray wrapper is known to produce errors like: clang++: fatal error: The option -fopenmp-targets= requires additional options -Xopenmp-target= and -march=. A simple fix is to always specify the whole, explicit command -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<hardware>.

Some environment variables

Feature/flag/environment variable

Explanation

export OMP_DISPLAY_ENV=VERBOSE

Instructs the runtime to display the OpenMP version number and the value of the ICVs.

export OMP_DISPLAY_AFFINITY=TRUE

Instructs the runtime to display formatted affinity information for all OpenMP threads in the parallel region upon entering the first parallel region.

Further reading: man intro_openmp.

For the Cray OpenMP and OpenACC backend for GPU offloading

Feature/flag/environment variable

Explanation

export CRAY_ACC_DEBUG=[0-3]

(Only for OpenACC and OpenMP offload) When the runtime environment variable ${CRAY_ACC_DEBUG} is set to 1, 2, or 3, CCE writes runtime commentary of accelerator activity to STDERR for debugging purposes; every accelerator action on every PE generates output prefixed with “ACC:”. This may produce a large volume of output and it may be difficult to associate messages with certain routines and/or certain PEs.

Further reading: man intro_openmp, man intro_openacc.

PyTorch

(Py)Torch is the de facto library for neural network based artificial intelligence.

To ensure competitive performance, take a look at these documents: Installing PyTorch for AMD GPUs, RCCL and IDRIS’ accelerate wrapper.

Installing PyTorch for AMD GPUs

By default, pip will install PyTorch for Nvidia GPUs. To use PyTorch for AMD GPUs (ROCm stack), you will have to specify a repository like so:

$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7

Warning

You may have issues due to Adastra’s firewall policies preventing access to download.pytorch.org. To work around the issue, you can download the packages on your local machine and scp them to Adastra.

Warning

Be careful with the torch package. Ensure you still have the ROCm version at the end of every install.
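A quick sanity check, assuming the ROCm wheels populate torch.version.hip (a non-ROCm build would print None):

$ python3 -c "import torch; print(torch.__version__, torch.version.hip)"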

Note

You may benefit from using a Python virtual environment; check this document on how to set one up.

Torch elastic

Note

If you only use a tiny amount of resources, this document will give you recommendations on how to use the compute resources.

To make use of PyTorch’s distributed features, you can take example from the script below.

#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=MI250
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=1:00:00

module purge

module load cray-python
module load aws-ofi-rccl

module list

# Use a Python VEnv unless you install everything in your home.
# source ./python_environment/bin/activate

# NOTE: We recommend that you prefetch your datasets and models on login
# nodes and use the HF_HOME or the DS_DIR environment variable.
# export HF_HOME="${SCRATCHDIR}/USER_DSDIR"
export HF_HUB_OFFLINE=1

export MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"

################################################################################
# Run
################################################################################

srun --ntasks-per-node=1 --gpus-per-task=8 --cpu-bind=none --mem-bind=none --label \
    -- torchrun --nnodes="${SLURM_NNODES}" --nproc_per_node="8" \
    --rdzv-id="${SLURM_JOBID}" \
    --rdzv-backend=c10d \
    --rdzv-endpoint="$(scontrol show hostname ${SLURM_NODELIST} | head -n 1):29400" \
    --max-restarts="0" \
    -- ./script.py

Note

nproc_per_node should be equal to gpus-per-task. This is because torchrun will spawn nproc_per_node child processes, each with its own GPU. This explains why we have ntasks-per-node=1. We only run one torchrun task per node.

Note

The LOGLEVEL="INFO" environment variable may clutter the log. You can comment it out if desired.

IDRIS’ accelerate wrapper

IDRIS provides a wrapper abstracting the task scheduler (say, SLURM) and the parallelization framework (PyTorch’s DDP or FSDP, Microsoft’s DeepSpeed) through HuggingFace’s Accelerate. A typical installation can be done like so:

$ pip3 install transformers deepspeed accelerate datasets torchmetrics idr-torch idr_accelerate

Warning

Be careful with the torch package. Ensure you still have the ROCm version at the end of every install.

Note

You may benefit from using a Python virtual environment; check this document on how to set one up.

To use idr_accelerate, we provide an example script below. It will run an example script given in: https://github.com/idriscnrs/idr_accelerate/blob/main/examples/multi-node/train_accelerate.py, provided you have downloaded the dataset and model configuration.

#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=MI250
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=1:00:00

module purge

module load cray-python
module load aws-ofi-rccl

module list

# Use a Python VEnv unless you install everything in your home.
# source ./python_environment/bin/activate

# NOTE: We recommend that you prefetch your datasets and models on login
# nodes and use the HF_HOME or the DS_DIR environment variable.
# export HF_HOME="${SCRATCHDIR}/USER_DSDIR"
export HF_HUB_OFFLINE=1

export MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"

################################################################################
# Run
################################################################################

srun --ntasks-per-node=1 --gpus-per-task=8 --cpu-bind=none --mem-bind=none --label \
    -- idr_accelerate train_accelerate.py \
    --lr 0.5 \
    --model_dir="..." \
    --data_path="..."

Note

The LOGLEVEL="INFO" environment variable may clutter the log. You can comment it out if desired.

Some environment variables

Feature/flag/environment variable

Explanation

export LOGLEVEL="INFO"

PyTorch logging level.

export MIOPEN_USER_DB_PATH=/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH}

Prevents the ranks from starving the filesystem by trying to atomically update an SQLite database.

RCCL

RCCL, which stands for ROCm Communication Collectives Library, is the AMD ROCm stack’s fork of Nvidia’s NCCL.

Adastra is built on top of a fast interconnect dubbed Slingshot by Cray. To use this interconnect when doing distributed training, you should load the aws-ofi-rccl module, which introduces a libfabric Slingshot shim. Also, you should remove all NCCL or RCCL environment variables that could interfere with the libraries’ inner workings (or use them at your own risk). As an example:

#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=MI250
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=1:00:00

module purge

module load cray-python
module load aws-ofi-rccl # <- This matters !

module list

srun ..

Warning

If you are using RCCL as part of PyTorch or any other product that comes with its own ROCm libraries (libamdhip64), you do not need to load a rocm or amd-mixed module. Otherwise, you should load either amd-mixed or rocm; more detail is given in this document.

Debugging networking issues

In the scripts you may use the following logging environment variables. The logs can get very verbose:

export NCCL_DEBUG="TRACE"
export NCCL_DEBUG_SUBSYS="ALL"

Note

To ensure you are using the Slingshot interconnect, assuming you have defined the verbose logging environment variable above, you should find the following lines at the start of the logs: NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0 and NCCL INFO NET/OFI Selected Provider is cxi

ROCm

Warning

ROCm is generally backward compatible, which means you can use a recent ROCm with a binary built using an older ROCm. It is not forward compatible, meaning that a binary built with a recent ROCm may not work with an older ROCm.

Modules and paths

The ROCm toolchain can be used by loading a module providing ROCm.

$ # A standalone ROCm.
$ module load rocm
$ # A standalone ROCm with a Cray compiler in the mix.
$ module load amd-mixed
$ # Or a programming environment.
$ module load PrgEnv-amd

The libraries can be found in the ${ROCM_PATH}/lib directory. Note that the ${ROCM_PATH} environment variable is defined by the modules given above.

The ROCm libraries provided are:

amd_comgr
amd-dbgapi
AMDDeviceLibs
hip
hip-lang
hsa-runtime64

rocm_smi

miopen
rccl

rocalution
rocblas & hipblas
rocfft & hipfft
rocrand & hiprand
rocsolver & hipsolver
rocsparse & hipsparse

rocprim & hipcub
rocthrust

AMD is extensively improving its documentation. You can find a trove of examples on mixing CMake with ROCm here.

Starting with ROCm 5.1.0, here are the CMake prefixes for the provided libraries:

set(amd_comgr_DIR     "$ENV{ROCM_PATH}/lib/cmake/amd_comgr")
set(AMDDeviceLibs_DIR "$ENV{ROCM_PATH}/lib/cmake/AMDDeviceLibs")
set(hip_DIR           "$ENV{ROCM_PATH}/lib/cmake/hip")
set(hsa-runtime64_DIR "$ENV{ROCM_PATH}/lib/cmake/hsa-runtime64")

set(rocm_smi_DIR      "$ENV{ROCM_PATH}/lib/cmake/rocm_smi")

set(miopen_DIR        "$ENV{ROCM_PATH}/lib/cmake/miopen")
set(rccl_DIR          "$ENV{ROCM_PATH}/lib/cmake/rccl")

set(rocalution_DIR    "$ENV{ROCM_PATH}/lib/cmake/rocalution")
set(rocblas_DIR       "$ENV{ROCM_PATH}/lib/cmake/rocblas")
set(hipblas_DIR       "$ENV{ROCM_PATH}/lib/cmake/hipblas")
set(rocfft_DIR        "$ENV{ROCM_PATH}/lib/cmake/rocfft")
set(hipfft_DIR        "$ENV{ROCM_PATH}/lib/cmake/hipfft")
set(rocrand_DIR       "$ENV{ROCM_PATH}/lib/cmake/rocrand")
set(hiprand_DIR       "$ENV{ROCM_PATH}/lib/cmake/hiprand")
set(rocsolver_DIR     "$ENV{ROCM_PATH}/lib/cmake/rocsolver")
set(hipsolver_DIR     "$ENV{ROCM_PATH}/lib/cmake/hipsolver")
set(rocsparse_DIR     "$ENV{ROCM_PATH}/lib/cmake/rocsparse")
set(hipsparse_DIR     "$ENV{ROCM_PATH}/lib/cmake/hipsparse")

set(rocprim_DIR       "$ENV{ROCM_PATH}/lib/cmake/rocprim")
set(hipcub_DIR        "$ENV{ROCM_PATH}/lib/cmake/hipcub")
set(rocthrust_DIR     "$ENV{ROCM_PATH}/lib/cmake/rocthrust")

The CMake packages and components one should link against are:

Library      CMake package name   CMake component
HIP          hip                  hip::host
rocPRIM      rocprim              roc::rocprim
rocThrust    rocthrust            roc::rocthrust
hipCUB       hipcub               hip::hipcub
rocRAND      rocrand              roc::rocrand
rocSOLVER    rocsolver            roc::rocsolver
hipBLAS      hipblas              roc::hipblas
rocFFT       rocfft               roc::rocfft
hipFFT       hipfft               hip::hipfft
rocSPARSE    rocsparse            roc::rocsparse
hipSPARSE    hipsparse            roc::hipsparse
rocALUTION   rocalution           roc::rocalution
RCCL         rccl                 rccl
MIOpen       miopen               MIOpen
MIGraphX     migraphx             migraphx::migraphx, migraphx::migraphx_c, migraphx::migraphx_cpu, migraphx::migraphx_gpu, migraphx::migraphx_onnx, migraphx::migraphx_tf

Build integration

Using CMake to properly consume a ROCm library

cmake_minimum_required(VERSION 3.5)
cmake_policy(VERSION 3.5...3.27)
project(example LANGUAGES CXX)

add_library(example_library ...)

find_package(miopen)
target_link_libraries(example_library PUBLIC MIOpen)

HIP as a CMake language

cmake_minimum_required(VERSION 3.21) # HIP language support requires 3.21
cmake_policy(VERSION 3.21.3...3.27)
project(example LANGUAGES HIP)

add_executable(example_binary main.hip)

# If you have .cu files, you can tell CMake to treat them as HIP like so:
add_library(example_library library.cu)
set_source_files_properties(library.cu PROPERTIES LANGUAGE HIP)

Note

CMAKE_HIP_ARCHITECTURES is a semicolon-separated list that defines the targeted devices. Example: gfx801;gfx900;gfx90a.
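For instance, a minimal configure sketch targeting the MI250X:

$ cmake -S . -B build -DCMAKE_HIP_ARCHITECTURES="gfx90a"
$ cmake --build build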

HIP using raw compilers or wrappers

Warning

Compiling HIP code is different from just using and linking against the HIP runtime. See Using CMake to properly consume the HIP runtime.

Warning

If you use hipcc, you may want to use the ${HIPCC_COMPILE_FLAGS_APPEND} and ${HIPCC_LINK_FLAGS_APPEND} environment variables to specify your flags. Else, the hipcc wrapper may ignore them.

This section shows how to compile HIP codes using the Cray compiler wrappers and hipcc (itself an amdclang wrapper).

Note

Make sure the craype-accel-amd-gfx90a module is loaded when compiling HIP with the Cray compiler wrappers.

Compiler

Compile/Link Flags, Header Files, and Libraries

CC
Only with:
PrgEnv-cray
PrgEnv-amd
First load the rocm or amd-mixed module then,
to compile: --rocm-path="${ROCM_PATH}" -xhip;
to link: --rocm-path="${ROCM_PATH}" -L"${ROCM_PATH}/lib" -lamdhip64.

hipcc


Can be used directly to compile and link with HIP source/object files.
To see what is being invoked within this compiler wrapper, issue the command, hipcc --verbose.
To explicitly target AMD MI250X, use --offload-arch=gfx90a.
CC and hipcc
With:
CC as a non HIP
compiler like
PrgEnv-gnu.


The GNU compilers cannot be used to compile HIP code, so all HIP kernels must be separated from CPU code.
During compilation, all non-HIP files must be compiled with CC while HIP kernels must be compiled with hipcc.
Then linking can be performed as done above for the AMD/Cray vendors.
First load the rocm or amd-mixed module then,
to compile non HIP code: CC <file.cpp>;
to compile HIP code: hipcc --offload-arch=gfx90a;
to link: CC --rocm-path="${ROCM_PATH}" -L"${ROCM_PATH}/lib" -lamdhip64.
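To make the last row concrete, a hedged sketch of such a mixed build (file names are hypothetical):

$ module load PrgEnv-gnu
$ module load craype-accel-amd-gfx90a amd-mixed
$ hipcc --offload-arch=gfx90a -c kernels.hip
$ CC -c main.cpp
$ CC main.o kernels.o -o app --rocm-path="${ROCM_PATH}" -L"${ROCM_PATH}/lib" -lamdhip64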

The ROCm modules (rocm or amd-mixed) introduce several environment variables for HIP programming. The command hipconfig --full can be used to check some environment related information.

Warning

By default, if you do not specify the --rocm-path="${ROCM_PATH}", Clang will take the ROCm install present in /opt/rocm or pointed to by the ${ROCM_PATH} environment variable.

Warning

By default, amdclang (HIP-Clang) and CrayClang (actually, all Clang based GPU code compilers) assume -ffp-contract=fast-honor-pragmas (less volatile than -Ofast). Users can use #pragma clang fp contract(on|off|fast) to control fp contraction of a block of code. For x86_64, FMA is off by default since the generic x86_64 target does not support FMA by default. To turn on FMA on x86_64, either use -mfma or -march=native on CPUs supporting FMA. When contractions are enabled and the CPU has not enabled FMA instructions, the GPU can produce different numerical results than the CPU for expressions that can be contracted. Tolerance should be used for floating point comparisons. We recommend that you read the HIP Programming Manual.

Warning

We do NOT recommend that you use amdclang directly; always try to use the hipcc wrapper instead. That said, if you must use amdclang, know that you have to pass the -isystem "${ROCM_PATH}/include" --hip-device-lib-path="${ROCM_PATH}/amdgcn/bitcode" --rocm-path=${ROCM_PATH} flags.

Another way to check which flags to use for compilation and linking is to call hipconfig. On ROCm 5.7.1 it could return:

$ hipconfig
HIP version  : 5.7.31921-1949b1621

== hipconfig
HIP_PATH     : /lus/home/softs/rocm/5.7.1
ROCM_PATH    : /opt/rocm-5.7.1
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME  : rocclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/lus/home/softs/rocm/5.7.1/include -I/lus/home/softs/rocm/5.7.1/llvm/lib/clang/17.0.0


== hip-clang
HIP_CLANG_PATH   : /opt/rocm-5.7.1/llvm/bin
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.7.1 23382 f3e174a1d286158c06e4cc8276366b1d4bc0c914)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.7.1/llvm/bin
AMD LLVM version 17.0.0git
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: znver4

Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags :  -isystem "/lus/home/softs/rocm/5.7.1/include" -O3
hip-clang-ldflags  :  -O3 --hip-link --rtlib=compiler-rt -unwindlib=libgcc

=== Environment Variables
PATH=/opt/rocm-5.7.1/bin:/lus/home/BCINES/dci/user/.local/bin:/lus/home/BCINES/dci/user/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/c3/bin:/sbin:/bin
LD_LIBRARY_PATH=/opt/rocm-5.7.1/lib/roctracer:/opt/rocm-5.7.1/lib/rocprofiler:/opt/rocm-5.7.1/lib
HIP_LIB_PATH=/opt/rocm-5.7.1/lib

== Linux Kernel
Hostname     : login5
Linux login5 4.18.0-477.10.1.el8_8.x86_64 #1 SMP Wed Apr 5 13:35:01 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: RedHatEnterprise
Description:    Red Hat Enterprise Linux release 8.8 (Ootpa)
Release:        8.8
Codename:       Ootpa

HIP + OpenMP CPU threading using raw compilers or wrappers

This section shows how to compile HIP + OpenMP CPU threading hybrid codes. Note that HIP kernels may not appear in translation units with OpenMP support enabled. However, HIP runtime API calls may be used in such translation units.

Note

Make sure the craype-accel-amd-gfx90a module is loaded when compiling HIP with the Cray compiler wrappers.

Compiler/wrapper

Compile/Link Flags, Header Files, and Libraries

CC
Only with:
PrgEnv-cray
PrgEnv-amd
First load the rocm or amd-mixed module then,
to compile: -fopenmp --rocm-path="${ROCM_PATH}" -xhip;
to link: -fopenmp --rocm-path="${ROCM_PATH}" -L"${ROCM_PATH}/lib" -lamdhip64.

hipcc



First load the rocm or amd-mixed module then,
to compile: -fopenmp;
to link: -fopenmp.
To explicitly target AMD MI250X, use --offload-arch=gfx90a at compile time.
CC and hipcc
With:
CC as a non HIP
compiler like
PrgEnv-gnu.


The GNU compilers cannot be used to compile HIP code, so all HIP kernels must be separated from OpenMP CPU code.
During compilation, all non-HIP files must be compiled with CC while HIP kernels must be compiled with hipcc.
Then linking can be performed as done above for the AMD/Cray vendors.
First load the rocm or amd-mixed module then,
to compile non HIP code: CC -fopenmp;
to compile HIP code: hipcc --offload-arch=gfx90a;
to link: CC -fopenmp --rocm-path="${ROCM_PATH}" -L"${ROCM_PATH}/lib" -lamdhip64.

Using CMake to properly consume the HIP runtime

Above, we introduced a way to compile .hip files. Sometimes you only want to have access to the HIP runtime (say, hipMalloc) through the C API. This does not require a special device compiler; we only need the headers and libraries. Under CMake it is done like so:

cmake_minimum_required(VERSION 3.5)
cmake_policy(VERSION 3.5...3.27)
project(example LANGUAGES CXX)

add_executable(example_binary ...)

find_package(hip REQUIRED)
target_link_libraries(example_binary PRIVATE hip::host)

HIP runtime using raw compilers or wrappers

Compiler/wrapper

Compile/Link Flags, Header Files, and Libraries

Virtually all.



First load the rocm or amd-mixed module then,
to compile non HIP code but to include the HIP headers: --rocm-path="${ROCM_PATH}" -isystem "${ROCM_PATH}/include";
to link, the simplest for LLVM based compiler is: --rocm-path="${ROCM_PATH}" --hip-link,
to link, the most versatile (e.g. with GNU) is: --rocm-path="${ROCM_PATH}" -L"${ROCM_PATH}/lib" -lamdhip64.

Warning

See below about the RDC pitfall.

Relocatable Device Code (RDC)

If you require RDC directly via -fgpu-rdc or implicitly through libraries such as Kokkos (Kokkos_ENABLE_HIP_RELOCATABLE_DEVICE_CODE) and you do not use hipcc, you will have to link using the additional --hip-link flag.

For now, the only HIP compilers are derivatives of Clang. As such, RDC is only possible if the linker used comes from either the Cray toolchain, the AMD (ROCm) toolchain or any Clang-based compiler with HIP support.
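For instance, a hedged sketch with the Cray wrapper under PrgEnv-cray or PrgEnv-amd (file names are hypothetical):

$ CC --rocm-path="${ROCM_PATH}" -xhip -fgpu-rdc -c a.cpp b.cpp
$ CC --rocm-path="${ROCM_PATH}" -fgpu-rdc --hip-link a.o b.o -o app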

Warning

Failure to specify -fgpu-rdc --hip-link at link time on an RDC-compiled set of object files could result in the following error: error: undefined symbol: __hip_fatbin.

C preprocessor definitions

HIP and CUDA equivalent definitions

Meaning

__CUDA__/__HIP__

(For LLVM based compilers ONLY) It tells you which language the file is to be interpreted as. Respectively defined if -x cu or -x hip is given or if the file extensions are .cu or .hip.

__CUDA_ARCH__/__HIP_DEVICE_COMPILE__

It tells you that you are in the GPU code generation pass. It is useful when one wants to use conditional compilation to distinguish GPU and CPU code paths.

__CUDACC__/__HIPCC__

Which compiler is being used to build the code.

__HIP_PLATFORM_NVIDIA__/__HIP_PLATFORM_AMD__

Should not be used, as the semantics are often not what one wants. Defined in hip/hip_common.h if __HIP__ is defined.

We recommend that you use __CUDACC__ and __HIPCC__ to discriminate CUDA and HIP implementations. This will work for any known CUDA and HIP compiler.

Some environment variables

Feature/flag/environment variable

Explanation

export AMD_LOG_LEVEL=[0-4]

0: Disable log. 1: Enable log on error level. 2: Enable log on warning and below levels. 3: Enable log on information and below levels. 4: Decode and display AQL packets.

export AMD_SERIALIZE_COPY=<N>

1: Wait for completion before enqueue. 2: Wait for completion after enqueue. 3: Both.

export AMD_SERIALIZE_KERNEL=[0-3]

1: Wait for completion before enqueue. 2: Wait for completion after enqueue. 3: Both.

export CUDA_VISIBLE_DEVICES=<N>

Restrict the devices that your CUDA/HIP application sees. Very important to avoid MPI ranks walking on each other. The AMD runtime supports this environment variable.

export HIP_VISIBLE_DEVICES=<N>

Same as ${CUDA_VISIBLE_DEVICES} but for HIP.

export ROCR_VISIBLE_DEVICES=<N>

Same as ${HIP_VISIBLE_DEVICES} but works at the HSA level. Recommended for highly tuned applications. It restricts the visibility for ${HIP_VISIBLE_DEVICES} and ${CUDA_VISIBLE_DEVICES}. So ROCR_VISIBLE_DEVICES=2 HIP_VISIBLE_DEVICES=2 is the same as hiding all the devices, while ROCR_VISIBLE_DEVICES="0,1,2" HIP_VISIBLE_DEVICES="2" would only show the third device.

export GPU_DEVICE_ORDINAL=<N>

Same as ${CUDA_VISIBLE_DEVICES} but for OpenCL and HIP.

export GPU_MAX_HW_QUEUES=<N>

The variable controls how many independent hardware queues the HIP runtime can create per process, per device. If an application allocates more HIP streams than this number, then the HIP runtime will reuse the same hardware queues for the new streams in a round-robin manner.

export AMD_DIRECT_DISPATCH=0

Disable direct dispatch (do not use a special thread to launch the kernels).

export HSA_ENABLE_SDMA=[0-1]

It causes host-to-device and device-to-host copies to use compute shader blit kernels rather than the dedicated DMA copy engines. It has been shown that high HtoD and DtoH copy throughput can be achieved this way, at the cost of less compute power being available for computation.

export HIPCC_VERBOSE=[0-1]

Ask hipcc to dump the compiler command it launches (there is also the --cxxflags and --ldflags).

More details on the AMD GPU debugging options in this document.

Note

While not specific to ROCm, the --save-temps option is of the utmost importance when one seeks to dive deeper into the machine and assembly code.

HIP compiler flag details

Feature/flag/environment variable

Explanation

--rocm-path=<value>/export ROCM_PATH=<value>

ROCm installation path, used for deriving the HIP_PATH.

--hip-path=<value>/export HIP_PATH=<value>

(Generally not used) (Is automatically derived from the ROCM_PATH or the equivalent flag) HIP runtime installation path, used for the HIP include path, and library path (-L).

--hip-link

Instructs LLVM-based compilers with HIP support to handle advanced HIP linking. This is mandatory for RDC. It implicitly links against the current ROCm’s amdhip64.

The source of truth remains at this location.

Sycl

OneAPI

On Adastra, OneAPI (containing the Intel DPC++ compiler) is exposed through the PrgEnv-intel and intel Cray modules. A standalone version is available through the intel-oneapi module (it does not pull CrayPE). The intel and amd modules serve the same role, and similarly for intel-oneapi and rocm. The version of DPC++ provided is quite old (2022.1); you may need to compile a more recent one yourself, see below.

Note

Make sure, if you compile your own toolchain, that your ${PATH} and ${LD_LIBRARY_PATH} environment variables point towards it!

You can test if a Sycl toolchain is working as expected on MI250 GPUs like so; compile a Sycl hello world:

$ cat << EOF > hello.cc
#include <mpi.h>

#include <iostream>
#include <sycl/sycl.hpp>

int main(int argc, char** argv) {
    ::MPI_Init(&argc, &argv);

    size_t sz = 1000;

    sycl::buffer<int> buf(sz);

    std::cout << "Device name : "
            << sycl::queue{}.get_device().get_info<sycl::info::device::name>()
            << std::endl;

    sycl::queue{}.submit([&](sycl::handler& cgh) {
                    sycl::accessor acc{buf, cgh, sycl::write_only, sycl::no_init};

                    cgh.parallel_for(sycl::range<1>{sz}, [=](sycl::item<1> id) {
                        acc[id] = id.get_linear_id();
                    });
                })
        .wait();

    sycl::host_accessor acc{buf, sycl::read_only};

    std::cout << "expected : 999 | found : " << acc[sz - 1] << "\n";

    ::MPI_Finalize();
    return 0;
}
EOF
$ # NOTE: We use clang++, but your compiler may be called another name, even if
$ # it is clang based.
$ # Our example uses MPI; we have to add the include and linker flags.
$ clang++ -O3 \
      -isystem "${CRAY_MPICH_PREFIX}/include" \
      -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx90a \
      --rocm-path="${ROCM_PATH}" \
      hello.cc -c
$ clang++ -O3 \
      -L"${CRAY_MPICH_PREFIX}/lib" -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a} \
      -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx90a \
      --rocm-path="${ROCM_PATH}" \
      hello.o -o hello

Run the example:

$ srun --account=<account_to_charge> \
      --nodes=1 \
      --constraint=MI250 \
      --job-name="interactive" \
      --time=1:00:00 \
      --exclusive \
      --ntasks-per-node=8 \
      -- ./hello | head
srun: job 319271 queued and waiting for resources
srun: job 319271 has been allocated resources
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
expected : 999 | found : 999
expected : 999 | found : 999

Compiling Intel LLVM

To enjoy the latest release of Intel LLVM you will have to build it yourself. It is embarrassingly easy, as all software builds should be! (right?):

#!/bin/bash

set -eu

# Cloning the code on a login node.
git clone --branch="sycl" https://github.com/intel/llvm intel-llvm-src || true
cd intel-llvm-src

module purge

module load cpe/23.12
module load cray-python
module load rocm/5.7.1

module list

INSTALL_PATH="$(pwd)/install"

rm -rf ./build ./install

python3 buildbot/configure.py \
    --hip \
    --cmake-opt="-DCMAKE_C_COMPILER=amdclang" \
    --cmake-opt="-DCMAKE_CXX_COMPILER=amdclang++" \
    --cmake-opt="-DSYCL_BUILD_PI_HIP_ROCM_DIR=${ROCM_PATH}" \
    --cmake-opt="-DCMAKE_INSTALL_PREFIX=${INSTALL_PATH}" \
    --cmake-gen="Ninja"

cd build

time ninja -k0 all lib/all tools/libdevice/libsycldevice
time ninja -k0 install

echo "INSTALL_PATH=\"${INSTALL_PATH}\""

Additional detail is given here: https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md

Ensure your environment points towards the toolchain and that a PrgEnv-amd, amd, amd-mixed or rocm module is loaded (a quick sanity check is sketched after the exports):

$ export PATH="${INSTALL_PATH}/bin:${PATH}"
$ export LD_LIBRARY_PATH="${INSTALL_PATH}/lib:${LD_LIBRARY_PATH}"
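A quick way to validate the setup, assuming the sycl-ls tool was installed alongside the toolchain (it usually is with Intel LLVM builds):

$ # Check that the freshly built compiler is the one found first in the PATH.
$ which clang++
$ # sycl-ls lists the visible SYCL backends/devices; the HIP backend should
$ # report the MI250X GPUs when run on a compute node.
$ sycl-ls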

Note

Ensure that you have recent CMake and Ninja versions. We provide guides on how to set up these tools here for CMake and here for Ninja.

Note

In the example compilation above we explicitly tell LLVM’s CMake machinery to use the amdclang compiler; you can choose another one.

AdaptiveCPP

Note

hipsycl was rebranded as AdaptiveCPP for legal reasons.

CINES provides a hipsycl installation built using Spack.

$ module spider hipsycl
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
hipsycl: hipsycl/0.9.4-mpi-python3
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    You will need to load all module(s) on any one of the lines below before the "hipsycl/0.9.4-mpi-python3" module is available to load.

    CCE-GPU-2.0.0
    CCE-GPU-2.1.0
    develop  CCE-GPU-3.0.0

We do not recommend using this toolchain to compile Sycl code on Adastra, as hipsycl 0.9.4 is quite old. Prefer Intel OneAPI as detailed above.

The transformers & datasets libraries

The Python transformers and datasets libraries are a simple way to download datasets (that are not already available in Adastra’s DSDIR), download models and tokenizers, create inference pipelines and fine-tune models.

  • We believe the datasets and models should preferably reside on the scratch or the work storage areas. You can tell the library to cache the datasets and models in a user-defined directory like so:

$ export HF_HOME="${SCRATCHDIR}/USER_DSDIR"

Alternatively, you can programmatically tell the HuggingFace libraries where to cache the data using the cache_dir argument.

import os

from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

# Store the cached data in the scratch storage area.
cache_dir = os.path.join(os.environ['SCRATCHDIR'], 'USER_DSDIR')

tokenizer = AutoTokenizer.from_pretrained('Equall/Saul-Instruct-v1', cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained('Equall/Saul-Instruct-v1', cache_dir=cache_dir)

squad_dataset = load_dataset('squad', cache_dir=cache_dir)

If not specified, HF_HOME or cache_dir will likely point to your home directory’s .cache/huggingface folder.

Warning

Using HF_HOME is not the same as using cache_dir=${HF_HOME}: the resulting cache directory layouts differ.

  • You will run your computations on compute nodes which do not have internet access. Use HF_HUB_OFFLINE as shown below to avoid timeouts during your code’s start-up sequence. This implies you have downloaded your data (using, say, from_pretrained) on a login node prior to running your computations. A job-script sketch combining these variables is given after the table below.

Some environment variables

Feature/flag/environment variable

Explanation

export HF_HOME="${SCRATCHDIR}/USER_DSDIR"

Specifies where to put the cached data. Defaults to ~/.cache/huggingface. We recommend something like ${SCRATCHDIR}/USER_DSDIR.

export HF_HUB_OFFLINE=1

This prevents having to wait for a network connection timeout when using HuggingFace’s libraries (say, transformers or datasets).

export TRANSFORMERS_OFFLINE=1

Fine-grained version of HF_HUB_OFFLINE for transformers’ Auto***.from_pretrained and the like.

export HF_DATASETS_OFFLINE=1

Fine-grained version of HF_HUB_OFFLINE for datasets’ load_dataset.

export TRANSFORMERS_VERBOSITY="info"

Sets the transformers library logging level.
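Putting it together, a minimal sketch of a job-script preamble for fully offline runs (my_training_script.py is a hypothetical script; this assumes the models and datasets were downloaded beforehand on a login node with HF_HOME pointing at the same directory):

export HF_HOME="${SCRATCHDIR}/USER_DSDIR"
export HF_HUB_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
export TRANSFORMERS_VERBOSITY="info"

# Scheduler options (account, constraint, etc.) are omitted for brevity.
srun -- python3 ./my_training_script.py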

VLLM

Note

You may benefit from using a Python virtual environment, check this document on how to setup one.

Requirements:

You can compile VLLM on Adastra using the commands below.

Copy this patch into a file named rocm.patch.

--- /include/hip/hcc_detail/amd_hip_bf16.h  2023-11-24 19:20:12.000000000 +0100
+++ include/hip/hcc_detail/amd_hip_bf16.h   2024-03-06 14:18:03.000000000 +0100
@@ -133,7 +133,7 @@
  * \ingroup HIP_INTRINSIC_BFLOAT16_CONV
  * \brief Converts float to bfloat16
  */
-__HOST_DEVICE__ __hip_bfloat16 __float2bfloat16(float f) {
+__HOST_DEVICE__ inline __hip_bfloat16 __float2bfloat16(float f) {
   __hip_bfloat16 ret;
   union {
     float fp32;
@@ -177,7 +177,7 @@
  * \ingroup HIP_INTRINSIC_BFLOAT162_CONV
  * \brief Converts and moves bfloat162 to float2
  */
-__HOST_DEVICE__ float2 __bfloat1622float2(const __hip_bfloat162 a) {
+__HOST_DEVICE__ inline float2 __bfloat1622float2(const __hip_bfloat162 a) {
   return float2{__bfloat162float(a.x), __bfloat162float(a.y)};
 }

@@ -205,7 +205,7 @@
  * \ingroup HIP_INTRINSIC_BFLOAT162_CONV
  * \brief Convert double to __hip_bfloat16
  */
-__HOST_DEVICE__ __hip_bfloat16 __double2bfloat16(const double a) {
+__HOST_DEVICE__ inline __hip_bfloat16 __double2bfloat16(const double a) {
   return __float2bfloat16((float)a);
 }

@@ -213,7 +213,7 @@
  * \ingroup HIP_INTRINSIC_BFLOAT162_CONV
  * \brief Convert float2 to __hip_bfloat162
  */
-__HOST_DEVICE__ __hip_bfloat162 __float22bfloat162_rn(const float2 a) {
+__HOST_DEVICE__ inline __hip_bfloat162 __float22bfloat162_rn(const float2 a) {
   return __hip_bfloat162{__float2bfloat16(a.x), __float2bfloat16(a.y)};
 }

@@ -243,7 +243,7 @@
  * \ingroup HIP_INTRINSIC_BFLOAT162_CONV
  * \brief Converts high 16 bits of __hip_bfloat162 to float and returns the result
  */
-__HOST_DEVICE__ float __high2float(const __hip_bfloat162 a) { return __bfloat162float(a.y); }
+__HOST_DEVICE__ inline float __high2float(const __hip_bfloat162 a) { return __bfloat162float(a.y); }

 /**
  * \ingroup HIP_INTRINSIC_BFLOAT162_CONV
@@ -271,7 +271,7 @@
  * \ingroup HIP_INTRINSIC_BFLOAT162_CONV
  * \brief Converts low 16 bits of __hip_bfloat162 to float and returns the result
  */
-__HOST_DEVICE__ float __low2float(const __hip_bfloat162 a) { return __bfloat162float(a.x); }
+__HOST_DEVICE__ inline float __low2float(const __hip_bfloat162 a) { return __bfloat162float(a.x); }

 /**
  * \ingroup HIP_INTRINSIC_BFLOAT162_CONV

With the working directory set to the one containing rocm.patch, execute:

#!/bin/bash

set -eu

# Cloning the code on a login node.
git clone --branch="v0.3.3" https://github.com/vllm-project/vllm || true

# Build the code on a COMPUTE NODE and installing it in the current Python
# environment.
module purge

module load cpe/23.12
module load craype-accel-amd-gfx90a craype-x86-trento
module load PrgEnv-gnu
module load amd-mixed
module load cray-python

module list

# Use a Python VEnv unless you install everything in your home.
# Source a VEnv if needed.
# source ./python_environment/bin/activate
# Ensure you have installed PyTorch for the correct GPUs.

# VLLM shenanigans require that we modify some ROCm headers.
# Unless this trick is done, we get duplicated symbols such as
# __float2bfloat16 (in fact, they use device functions that are neither inline
# nor static).
NEW_ROCM_PATH="$(pwd)/$(basename -- "${ROCM_PATH}")"

if [[ ! -e "${NEW_ROCM_PATH}" ]]; then
    cp -rv "$(
        cd -- "${ROCM_PATH}" >/dev/null 2>&1
        pwd -P
    )" "${NEW_ROCM_PATH}" || true
fi

export ROCM_PATH="${NEW_ROCM_PATH}"
export PATH="${ROCM_PATH}/bin:${PATH}"
export HIP_LIB_PATH="${ROCM_PATH}/lib"
export LD_LIBRARY_PATH="${HIP_LIB_PATH}:${LD_LIBRARY_PATH}"
export CMAKE_PREFIX_PATH="${ROCM_PATH}/lib/cmake:${CMAKE_PREFIX_PATH}"

cd -- "${ROCM_PATH}"

patch --strip=1 --backup -f < ../rocm.patch || true

cd ../vllm

pip3 install --no-deps xformers==0.0.23

bash patch_xformers.rocm.sh

pip3 install --upgrade -r requirements-rocm.txt

export PYTORCH_ROCM_ARCH="gfx90a"

pip3 install --verbose --no-cache-dir --no-build-isolation ./

Note

You may have to change the CPE and ROCm (amd-mixed) versions as Adastra evolves.