Libraries & toolchains
CINES Spack modules
CINES provides libraries built using Spack. To access these products, check this document and or look at this product catalog.
Example script for CINES Spack modules libraries
Warning
Work in progress.
Apex
Note
You may benefit from using a Python virtual environment, check this document on how to setup one.
Requirements:
A recent install of PyTorch for ROCm (say in a VEnv).
You can compile Apex on Adastra using the commands below.
#!/bin/bash
set -eu
# Cloning the code on a login node.
git clone --branch="release/1.2.0" https://github.com/ROCm/apex || true
cd apex
# Build the code on a COMPUTE NODE and installing it in the current Python
# environment.
module purge
module load cpe/23.12
module load craype-accel-amd-gfx90a craype-x86-trento
module load PrgEnv-gnu
module load amd-mixed
module load cray-python
module list
# Your VEnv may hav been deactivated due to the modules loaded above.
# Source a VEnv if needed.
# source ./python_environment/bin/activate
# Ensure you have installed PyTorch for the correct GPUs.
pip3 install ninja
pip3 install --verbose --no-cache-dir --no-build-isolation --config-settings="--build-option=--cpp_ext" --config-settings="--build-option=--cuda_ext" ./
Warning
Apex has strong PyTorch and ROCm version dependencies. Check this document for supported PyTorch versions for a given Apex version.
Deepspeed
When it comes to training large models, it becomes crucial to efficiently distribute the weights, gradient and optimizer state across devices. Deepspeed provides first class AMD support for such functionalities.
Note
You may benefit from using a Python virtual environment, check this document on how to setup one.
Requirements:
A recent install of PyTorch for ROCm (say in a VEnv).
JITed kernels
The default and most simple way to install Deepspeed is to pip3 install deepspeed
, and let torch’s JIT C++ extension loader do the compiling at runtime.
Prebuilt kernels
You can compile DeepSpeed on Adastra using the commands below.
#!/bin/bash
set -eu
# Cloning the code on a login node.
git clone --branch="v0.14.1" https://github.com/microsoft/DeepSpeed || true
cd DeepSpeed
# Build the code on a COMPUTE NODE and installing it in the current Python
# environment.
module purge
module load cpe/23.12
module load craype-accel-amd-gfx90a craype-x86-trento
module load PrgEnv-gnu
module load amd-mixed
module load cray-python
module list
# Your VEnv may hav been deactivated due to the modules loaded above.
# source ./python_environment/bin/activate
# Ensure you have installed PyTorch for the correct GPUs.
# Ensure we don't have deepspeed already. Indeed it is a common error to first
# install deepspeed and then try to re-install with prebuilt kernels, it ends up
# doing nothing! To ensure we rebuild, the "--no-cache-dir" is also helpful.
pip3 uninstall --yes deepspeed
export DS_ACCELERATOR="cuda" # Compile to HIP even on login node.
export PYTORCH_ROCM_ARCH="gfx90a"
export DS_BUILD_OPS=1
export DS_BUILD_AIO=0 # 0 KO missing libaio
export DS_BUILD_CCL_COMM=0 # 0 KO missing oneapi
export DS_BUILD_CPU_ADAGRAD=1
export DS_BUILD_CPU_ADAM=1
export DS_BUILD_CPU_LION=1
export DS_BUILD_EVOFORMER_ATTN=0 # 0 KO, idk why
export DS_BUILD_FP_QUANTIZER=0 # 0 KO cuda_bf16. not hip/hip_bf16.
export DS_BUILD_FUSED_ADAM=1
export DS_BUILD_FUSED_LAMB=1
export DS_BUILD_FUSED_LION=1
export DS_BUILD_INFERENCE_CORE_OPS=0 # 0 KO blas_utils_hip.h:60:1: error: ‘rocblas_operation’ does not name a type
export DS_BUILD_CUTLASS_OPS=0 # 0 KO missing deepspeed-kernels
export DS_BUILD_QUANTIZER=1
export DS_BUILD_RAGGED_DEVICE_OPS=0 # 0 KO missing deepspeed-kernels
export DS_BUILD_RAGGED_OPS=1
export DS_BUILD_RANDOM_LTD=1
export DS_BUILD_SPARSE_ATTN=0 # 0 KO
export DS_BUILD_SPATIAL_INFERENCE=1
export DS_BUILD_STOCHASTIC_TRANSFORMER=1
export DS_BUILD_TRANSFORMER_INFERENCE=0 # 0 KO inference_cublas_wrappers_hip.h:24:20: error: ‘rocblas_operation’ was not declared in this scope
export DS_BUILD_TRANSFORMER=1
pip3 install --verbose --no-cache-dir --no-build-isolation ./
Warning
Regardless of the DS_BUILD_*
options you choose, do run the ds_report
command on a compute node to check if the library is working properly. You should see a bunch of OKAY
in the right column (compatible), and if you choose to build the “ops” (DS_BUILD_*
), into the left column (installed), a bunch of YES
should be present.
Some environment variables
Feature/flag/environnement variable |
Explanation |
---|---|
|
Can be used to factorize a configuration. See this document on what should/can go into this file. |
Fast attention
When it comes to training transformer based LLM, it becomes crucial to efficiently compute the attention mechanism. For Nvidia GPUs, a reference implementation is flash-attn. To ensure a seamless inter operation of a PyTorch code on Nvidia or AMD GPUs, AMD forked the project. That being said, the AMD package is not available in the official PyPi repository, you have to compile it yourself.
Note
You may benefit from using a Python virtual environment, check this document on how to setup one.
Requirements:
A recent install of PyTorch for ROCm (say in a VEnv).
You can compile flash-attn on Adastra using the commands below.
#!/bin/bash
set -eu
# Cloning the code on a login node.
git clone --branch="flash_attention_for_rocm" https://github.com/ROCm/flash-attention || true
cd flash-attention
git submodule init
git submodule update
# Build the code on a COMPUTE NODE and installing it in the current Python
# environment.
module purge
module load cpe/23.12
module load craype-accel-amd-gfx90a craype-x86-trento
module load PrgEnv-gnu
module load amd-mixed
module load cray-python
module list
# Your VEnv may hav been deactivated due to the modules loaded above.
# Source a VEnv if needed.
# source ./python_environment/bin/activate
# Ensure you have installed PyTorch for the correct GPUs.
pip3 install ninja einops packaging
export GPU_ARCHS="gfx90a"
pip3 install --verbose --no-cache-dir --no-build-isolation ./
HDF5
If you bypass the Cray compiler wrappers, you need to do more work and risk mixing incompatible libraries or ABI. Use at your own risk the information given below:
Implementation |
Module |
Compiler or wrapper |
Header Files & Linking |
---|---|---|---|
HDF5 |
|
|
The wrappers will automatically add the header flags and the linker flags. |
Raw compilers. |
First, ensure you have the module loaded then,
to compile:
-isystem "${CRAY_HDF5_DIR}/<compiler>/<version>/include" ;to link:
-L"${CRAY_HDF5_DIR}/<compiler>/<version>/lib" -lhdf5_hl -lhdf5 -lhdf5_hl_cpp -lhdf5_cpp -lhdf5hl_fortran -lhdf5_fortran .Or (more concisely):
to compile:
-isystem "${CRAY_HDF5_PREFIX}/include" ;to link:
-L"${CRAY_HDF5_PREFIX}/lib" -lhdf5_hl -lhdf5 -lhdf5_hl_cpp -lhdf5_cpp -lhdf5hl_fortran -lhdf5_fortran . |
||
Parallel HDF5 |
|
|
The wrappers will automatically add the header flags and the linker flags. |
Raw compilers. |
First, ensure you have the module loaded then,
to compile:
-isystem "${CRAY_HDF5_PARALLEL_DIR}/<compiler>/<version>/include" ;to link:
-L"${CRAY_HDF5_PARALLEL_DIR}/<compiler>/<version>/lib" -lhdf5_hl_parallel -lhdf5_parallel .Or (more concisely):
to compile:
-isystem "${CRAY_HDF5_PARALLEL_PREFIX}/include" ;to link:
-L"${CRAY_HDF5_PARALLEL_PREFIX}/lib" -lhdf5_hl_parallel -lhdf5_parallel . |
Note
The cray-hdf5[-parallel]
modules define the ${HDF5_ROOT}
environment variable. It is equivalent to ${CRAY_HDF5_PARALLEL_PREFIX}
.
Where compiler
can be [AMD|AOCC|CRAYCLANG|GNU|INTEL|NVIDIA]
. variant
is left as a choice for the user:
Generally, one relies on the compiler wrappers like so:
$ module load craype-x86-genoa
$ module load PrgEnv-gnu
$ # The cc, CC, ftn wrappers will automatically add the header flags and the
$ # linker flags.
BLAS, LAPACK & ScaLAPACK
Note
LibSci is based on OpenBLAS (GotoBLAS). You probably do NOT need to compile your own BLAS.
Note
We have know that applications using LibSci can observe better performance. We recommend LibSci if your code allows it.
LibSci
LibSci is a collection of numerical routines. It contains, notably:
a Basic Linear Algebra Subroutines (BLAS);
a C interface to BLAS (CBLAS);
a Basic Linear Algebra Communication Subprograms (BLACS);
a Linear Algebra PACKage (LAPACK);
and a Scalable LAPACK (ScaLAPACK).
Two flavors of this library exist, one for CPUs exposed through the cray-libsci
module and a CPU/GPU hybrid one, exposed through the cray-libsci_acc
module.
CPU
If you bypass the Cray compiler wrappers, you need to do more work and risk mixing incompatible libraries or ABI. Use at your own risk the information given below:
Implementation |
Module |
Compiler or wrapper |
Header Files & Linking |
---|---|---|---|
LibSci |
|
|
The wrappers will automatically add the header flags and the linker flags. |
Raw compilers. |
First, ensure you have the module loaded then,
to compile:
-isystem "${CRAY_PE_LIBSCI_BASE_DIR}/<compiler>/<version>/<cpu>/include" ;to link:
-L"${CRAY_PE_LIBSCI_BASE_DIR}/<compiler>/<version>/<cpu>/lib" -lsci_<compiler> -lsci_<compiler>_<variant> .Or (more concisely, use
${CRAY_PE_LIBSCI_PREFIX} for craype >= 2.7.30, else ${CRAY_LIBSCI_PREFIX_DIR} ):to compile:
-isystem "${CRAY_PE_LIBSCI_PREFIX}/include" ;to link:
-L"${CRAY_PE_LIBSCI_PREFIX}/lib" -lsci_<compiler> -lsci_<compiler>_<variant> . |
Where compiler
can be [AMD|AOCC|CRAYCLANG|GNU|INTEL|NVIDIA]
; cpu
is x86_64
. sci_<compiler>_<variant>
is left as a choice for the user:
Library name |
Variant |
Link type |
Threading |
MPI |
Service provided |
---|---|---|---|---|---|
libsci_<compiler>.a |
- |
Static |
Serial non-threaded |
non-MPI |
BLAS/CBLAS/LAPACK |
libsci_<compiler>.so |
Dynamic shared |
Serial non-threaded |
non-MPI |
||
libsci_<compiler>_mp.a |
|
Static |
Parallel multi-threaded (OpenMP) |
non-MPI |
|
libsci_<compiler>_mp.so |
Dynamic shared |
Parallel multi-threaded (OpenMP) |
non-MPI |
||
libsci_<compiler>_mpi.a |
|
Static |
Serial non-threaded |
MPI |
BLACS/ScaLAPACK
|
libsci_<compiler>_mpi.so |
Dynamic shared |
Serial non-threaded |
MPI |
||
libsci_<compiler>_mpi_mp.a |
|
Static |
Parallel multi-threaded (OpenMP) |
MPI |
|
libsci_<compiler>_mpi_mp.so |
Dynamic shared |
Parallel multi-threaded (OpenMP) |
MPI |
Note
BLAS and LAPACK routines do not require MPI and are contained in the non-MPI version of the libraries. The ScaLAPACK routines are imported by linking with the MPI version of the library. The MPI version of the library implies a dependency on the non-MPI version (i.e., libsci_<compiler>_mpi.so
depends on the routines of libsci_<compiler>.so
). Obviously, this dependency chain does NOT work for static libraries; then you have to explicitly add the libsci_<compiler>.a
.
Warning
The static versions of LibSci for the AMD (ROCm) compiler needs the following additional linker flags: -L${ROCM_PATH}/llvm/lib -lflang -lpgmath -lflangrti -lm -ldl
.
Regarding the OpenMP variants; if you call a LibSci routine inside of an OpenMP parallel region, the routine will always be serial. If you call a routine outside of a parallel region and if the LibSci linked is OpenMP enabled, the routine will be threaded, else it will be serial.
Generally, one relies on the compiler wrappers like so:
$ module load craype-x86-genoa
$ module load PrgEnv-gnu
$ # The cc, CC, ftn wrappers will automatically add the header flags and the
$ # linker flags.
By default, the compiler wrappers will link with a single threaded BLAS, LAPACK and ScaLAPACK. If you add the -fopenmp
flag, the multi-threaded version we be used instead.
$ # Implies (as needed) lsci_<compiler>_mpi.so and lsci_<compiler>.so
$ CC main.cc
$ # Implies (as needed) lsci_<compiler>_mpi_mp.so and lsci_<compiler>_mp.so
$ CC -fopenmp main.cc
More details in:
$ module load cray-libsci
$ man intro_libsci
Note
The serial LibSci should be compatible across toolchain (thanks to proper C interfaces). This is not the case with the multi-threaded versions because the OpenMP backend differs across toolchains.
Warning
If using a mixed PrgEnv (say, PrgEnv-cray-amd
), the ${CRAY_PE_LIBSCI_PREFIX}
variable points towards a build for the C/C++ compiler. For PrgEnv-cray-amd
it would be the CCE. For PrgEnv-gnu-amd
it would be GNU.
CPU/GPU hybrid
The accelerated LibSci is augmented with GPU capabilities. It can decide for itself wether to use the GPU or CPU version of a routine.
Before calling such routines, one needs to initialize the library:
Fortran API:
subroutine libsci_acc_init()
subroutine libsci_acc_finalize()
C/C++ API:
void libsci_acc_init(void);
void libsci_acc_finalize(void);
If you bypass the Cray compiler wrappers, you need to do more work and risk mixing incompatible libraries or ABI. Use at your own risk the information given below:
Implementation |
Module |
Compiler or wrapper |
Header Files & Linking |
---|---|---|---|
LibSci_accel |
|
|
The wrappers will automatically add the header flags and the linker flags. |
Raw compilers. |
First, ensure you have the module loaded then,
to compile:
-isystem "${LIBSCI_ACC_BASE_DIR}/<compiler>/<version>/<cpu>/include" ;to link:
-L"${LIBSCI_ACC_BASE_DIR}/<compiler>/<version>/<cpu>/lib" -lsci_acc_<compiler>_<variant> .Or (more concisely, use
${CRAY_PE_LIBSCI_PREFIX} for craype >= 2.7.30, else ${CRAY_LIBSCI_PREFIX_DIR} ):to compile:
-isystem "${CRAY_LIBSCI_ACC_PREFIX_DIR}/include" ;to link:
-L"${CRAY_LIBSCI_ACC_PREFIX_DIR}/lib" -lsci_acc_<compiler>_<variant> . |
Where compiler
can be [AMD|CRAYCLANG|GNU]
; cpu
is x86_64
. sci_acc_<compiler>_<variant>
’s <variant>
is conditioned by the architecture (gfx90a
for MI250X).
Generally, one relies on the compiler wrappers like so:
$ module load craype-x86-trento craype-accel-amd-gfx90a
$ module load PrgEnv-cray
$ module load cray-libsci_acc
$ # The cc, CC, ftn wrappers will automatically add the header flags and the
$ # linker flags.
Warning
To properly link, the compiler wrappers need loaded architecture modules!
More details in:
$ module load cray-libsci_acc
$ man intro_libsci_acc
Some environment variables
Feature/flag/environnement variable |
Explanation |
---|---|
|
Log profiling data about the Cray BLAS (libsci). |
|
As we mentioned above, LibSci is based on OpenBLAS. These environment variables make the routine calls into LibSci be directly forwarded to OpenBLAS. In such case, LibSci should behave as if it was OpenBLAS. |
|
|
|
OpenBLAS
CINES provides OpenBLAS Spack products, for the codes that demand, require, necessitate, beg, cry for, expect it and prescribe or forbid any other implementation.
Note
You should first try to adapt your code so that it is flexible enough to accept a wrapper provided BLAS, LAPACK. Then use the Cray LibSci.
Check the catalog for the exact module names or use module spider openblas
.
Low level primitives
All the libraries show below are not guaranteed to be available on Adastra. If you have a strong need for a particular library to be installed by CINES, please notify svp@cines.fr.
Operation |
AMD GPU |
Nvidia GPU |
x86 CPU |
---|---|---|---|
BLAS/CBLAS (level 1, 2, 3) |
rocBLAS (hipBLAS)/cray-libsci_acc |
cuBLAS (hipBLAS)/cray-libsci_acc |
cray-libsci/Intel MKL/OpenBLAS/AMD BLIS |
Dense system solvers |
rocSOLVER (hipSOLVER)/cray-libsci_acc/Magma/SLATE |
cuSOLVER (hipSOLVER)/cray-libsci_acc/Magma/SLATE |
cray-libsci/Magma/Intel MKL/OpenBLAS/AMD libFLAME/LAPACK/ScaLAPACK/SLATE |
Sparse BLAS (level 1, 2, 3) |
rocSPARSE (hipSPARSE) |
cuSPARSE (hipSPARSE) |
Intel MKL |
Sparse system solvers (preconditioners and geometric or algebraic multigrid solver) |
rocALUTION/PETSc/Ginkgo |
Nvidia AMGx/PETSc/Ginkgo/HYPRE |
Intel MKL/PETSc/HYPRE |
Discrete Fourier transform |
rocFFT (hipFFT)/vkFFT |
cuFFT (hipFFT)/vkFFT |
cray-fftw/FFTW/Intel MKL |
Deep learning primitives |
MIOpen |
cuDNN |
Caffe/Uncountable |
Algorithmic building blocks |
rocThrust |
Thrust |
Thrust/C++ standard library |
Communication primitives |
MPI/RCCL |
MPI/NCCL |
MPI/PVM |
Pseudo number generation |
rocRAND (hipRAND) |
cuRAND (hipRAND) |
Uncountable |
Low level GPU primitives |
rocPRIM (hipCUB) |
CUB (hipCUB) |
- |
Note
AMD provides compatibility layers prefixed with hip
like hipBLAS. These libraries provide a stable interface allowing the user to change the backend from rocBLAS to cuBLAS easily. Theses compatibility layers should be used instead of the platform specific roc/cuBLAS.
Note
Not all the technologies on a row are strictly equivalent.
MIOpen
If using MIOpen we strongly recommend that you use the environment variables show below. These serves to mitigate the contention that can occur due to the usage of an auto-tunning caching database MIOpen writes in the home directory.
$ export MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"
$ export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"
Or (disable auto-tunning caching entirely, less recommended):
$ export MIOPEN_DEBUG_DISABLE_FIND_DB=1
In addition to making large distributed training more scalable, it can prevent issues of the following form:
Failed to store record to find-db at </lus/home/<user_path>/.config/miopen//....ufdb.txt>
Ill-formed record: key not found: /lus/home/<user_path>/.config/miopen//....ufdb.txt
MPI
The provided and thus, default and preferred MPI implementation on Cray systems such as Adastra is Cray’s MPICH.
Cray MPICH
You will find information regarding Cray MPICH in the INTRO_MPI(3)
man page or, as Cray tries to mirror some of it’s documentation online, on this URL: https://cpe.ext.hpe.com/docs/mpt/mpich/index.html#mpich.
Some features of the Cray MPICH implementation are that:
it supports the MPI 3.2 standard;
its GPU aware, meaning it gives the ability to pass a GPU buffer to the MPI API;
its GPU direct, meaning it has the transparent (few visible artifacts) ability to use RDMA technics between GPUs. This can elide copies.
The Cray compiler wrappers justify the lack of the usual MPI compiler wrappers like mpicc
or similar (which often only add link and header flags). You implicitly link with Cray’s MPICH when you use the Cray compiler wrappers (cc
, CC
and ftn
) instead of the common mpicc
, mpicxx
and mpif90
compiler wrappers. The cray-mpich
module is automatically loaded when you load a PrgEnv-*
.
If you bypass the Cray compiler wrappers, you need to do more work and risk mixing incompatible libraries or ABI. Use at your own risk the information given below:
Implementation |
Module |
Compiler or wrapper |
Header Files & Linking |
---|---|---|---|
Cray MPICH |
|
|
The wrappers will automatically add the header flags and the linker flags. |
Raw compilers and GPU-direct MPI. |
First, ensure you have the module loaded. This is implicit to the usage of a
PrgEnv-* then,to compile:
-isystem ${CRAY_MPICH_PREFIX}/include ;to link:
-L${CRAY_MPICH_PREFIX}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a} . |
||
Raw compilers and CPU only MPI. |
First, ensure you have the module loaded. This is implicit to the usage of a
PrgEnv-* then,to compile:
-isystem ${CRAY_MPICH_PREFIX}/include ;to link:
-L${CRAY_MPICH_PREFIX}/lib -lmpi . |
GPU aware & GPU direct
To use the Cray MPICH’s GPU-aware functionality, the user must load some modules and set an environment variable. You need to load the cray-mpich
but know that any PrgEnv
loads it for you. As an example, with PrgEnv-cray
:
$ module purge
$ module load cpe/23.12
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ module load PrgEnv-cray
$ module load amd-mixed
$ # This is the environment variable that tells Cray MPICH to enable GPU aware features.
$ export MPICH_GPU_SUPPORT_ENABLED=1
If you chose to use the Cray compiler wrappers, make sure you correctly load the modules used to target an architecture. If the craype-accel-amd-gfx90a
module is not loaded, the Cray compiler wrappers wont link with the mpi_gtl_hsa
library and at runtime, if you set export MPICH_GPU_SUPPORT_ENABLED=1
you will get an error like that:
MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
Warning
The export MPICH_GPU_SUPPORT_ENABLED=1
environment variable is required for Cray MPICH to provide GPU support through the GPU Transport Layer (GTL) plugin. We will not always add it in the scripts below but do not forget it if needed!
Some environment variables
Feature/flag/environnement variable |
Explanation |
---|---|
|
If enabled, causes MPICH to display the Cray MPI version number as well as build date information. |
|
If set, causes rank 0 to display all MPICH environment variables and their current settings at MPI initialization time. |
|
If set to 1, print a summary of the min/max high water mark and associated rank to stderr. If set to 2, output each rank’s high water mark to a file as specified using |
|
If enabled, causes MPICH to abort and produce a core dump when MPICH detects an internal error. Note that the core dump size limit (usually 0 bytes by default) must be reset to an appropriate value in order to enable coredumps. |
|
Dump the cpuset associated to each rank (verbose). |
|
If enabled, causes rank 0 in the participating communicator to display the names and values of all MPI-IO hints that are set for the file being opened with the |
|
If set, more verbose output will be displayed during |
|
If set to 1, verbose information pertaining to NIC selection is printed at the start of the job. All available NIC domain names, addresses and index values are displayed. Setting this variable to 2 displays additional details, including the specific NIC each rank has been assigned, which is based on |
|
Determines if Cassini (CXI) counters are collected during the application and the verbosity of the counter data report displayed during |
|
If set to a non-zero value, this enables more verbose output about the Cassini counters being collected. Can be helpful for debugging and/or identifying which counters are being collected. Only applicable to Slingshot 11. |
|
Controls the amount of logging data that is output. |
|
Use this for flow control or queue exhaustion problems. We are seeing a lot of these but a burst of communication occurs. |
|
When all the ranks are on the same node, forces the traffic through the NIC instead of using node local optimization (shared memory). |
|
Disable P2P IPC for CPU to CPU, CPU to GPU and GPU to GPU. |
Further reading: man intro_mpi
.
Detecting if MPI is GPU aware
As far as we know, there is no standard way to detect if a MPI implementation is GPU aware, even across MPI extension (MPIx). We propose the following GPU aware MPI check implementation that should work for recent OpenMPI and Cray MPICH.
Note
There is an issue on the official MPI standardization committee forum: https://github.com/mpi-forum/mpi-issues/issues/580; but note that it will take quite come time before it is implemented universally.
The client code would look like that:
#include <mpi.h>
#include <iostream>
#include "mpi_gpu_aware.h"
int main(int argc, char** argv) {
::MPI_Init(&argc, &argv);
#if defined(MPI_GPU_AWARE_API_SUPPORT)
std::cout << "[compiletime] Is MPI GPU aware API supported: true\n";
#else
std::cout << "[compiletime] Is MPI GPU aware API supported: false\n";
#endif
std::cout << "[runtime] Is MPI GPU aware: " << ((::mpi_gpu_aware() < 0) ? "false" : "true") << "\n";
::MPI_Finalize();
return 0;
}
The mpi_gpu_aware.h
code is:
#ifndef MPI_GPU_AWARE_H
#define MPI_GPU_AWARE_H
#include <mpi.h>
#if defined(CRAY_MPICH_VERSION) && defined(MPIX_GPU_SUPPORT_CUDA)
#define MPI_GPU_AWARE_CRAYMPICH_API_SUPPORT 1
#define MPI_GPU_AWARE_CRAYMPICH_POTENTIAL_SUPPORT 1
#elif defined(OPEN_MPI)
#include <mpi-ext.h>
#if(defined(OMPI_HAVE_MPI_EXT_ROCM) && OMPI_HAVE_MPI_EXT_ROCM) || \
(defined(OMPI_HAVE_MPI_EXT_CUDA) && OMPI_HAVE_MPI_EXT_CUDA)
#define MPI_GPU_AWARE_OPENMPI_API_SUPPORT 1
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
#define MPI_GPU_AWARE_OPENMPI_POTENTIAL_SUPPORT 1
#endif
#endif
#endif
#if defined(MPI_GPU_AWARE_CRAYMPICH_API_SUPPORT) || \
defined(MPI_GPU_AWARE_OPENMPI_API_SUPPORT)
/// The MPI implementation supports a runtime API we can use to ensure GPU
/// awareness.
///
#define MPI_GPU_AWARE_API_SUPPORT 1
#endif
#if defined(MPI_GPU_AWARE_CRAYMPICH_POTENTIAL_SUPPORT) || \
defined(MPI_GPU_AWARE_OPENMPI_POTENTIAL_SUPPORT)
/// We assume it is likely that the MPI implementation is GPU aware ?
///
#define MPI_GPU_AWARE_POTENTIAL_SUPPORT 1
#endif
/// GPU aware MPI is a runtime decision on Cray MPICH. It relies on a library
/// called GPU Transport Layer (GTL). This library is linked by the wrapper when
/// a GPU architecture module is loaded (i.e.: craype-x86-trento).
/// It is not possible to determine at compile time if the MPI implementation
/// will be GPU aware. We can know if the MPI could have that feature enabled,
/// but not if it will be enabled.
/// Returns: < 0 if we could not determine if the MPI is GPU aware.
///
extern "C" int mpi_gpu_aware(void);
#endif
The mpi_gpu_aware.cc
code is:
#include <cstdlib>
#include "mpi_gpu_aware.h"
namespace detail {
static inline int
EnsureMPIIsInitialized() {
int is_mpi_initialized = 0;
if((::MPI_Initialized(&is_mpi_initialized) != MPI_SUCCESS) ||
(is_mpi_initialized == 0)) {
return -1;
}
// At this point MPI should be initialized. Except if an other thread
// called MPI_Finalize() in between.
return 0;
}
#if defined(MPI_GPU_AWARE_CRAYMPICH_API_SUPPORT)
static inline int
getenv() {
const char* environment_string = std::getenv("MPICH_GPU_SUPPORT_ENABLED");
if(environment_string == NULL) {
return -1;
}
if(environment_string[0] != '1') {
return -1;
}
return 0;
}
/// Rely on MPIX_GPU_query_support. It requires that we have
/// MPIX_GPU_SUPPORT_CUDA defined to plug into the MPIX_GPU_query_support.
/// https://www.mpich.org/static/docs/v4.0.x/www3/MPIX_GPU_query_support.html
///
static inline int
MPIx() {
int result = 0;
const auto CheckGPUKindSupport = [](int offset, int gpu_kind) -> int {
int is_gpu_kind_supported = 0;
if(::MPIX_GPU_query_support(gpu_kind, &is_gpu_kind_supported) != MPI_SUCCESS) {
return -1;
}
return is_gpu_kind_supported << offset;
};
#if defined(MPIX_GPU_SUPPORT_CUDA)
result |= CheckGPUKindSupport(0, MPIX_GPU_SUPPORT_CUDA);
#endif
#if defined(MPIX_GPU_SUPPORT_ZE)
result |= CheckGPUKindSupport(1, MPIX_GPU_SUPPORT_ZE);
#endif
#if defined(MPIX_GPU_SUPPORT_HIP)
result |= CheckGPUKindSupport(2, MPIX_GPU_SUPPORT_HIP);
#endif
return result == 0 ? -1 : result;
}
static inline int
DoCheck() {
if(getenv() < 0) {
return -1;
}
if(EnsureMPIIsInitialized() < 0) {
return -1;
}
if(MPIx() < 0) {
return -1;
}
return 0;
}
#elif defined(MPI_GPU_AWARE_OPENMPI_API_SUPPORT)
static inline int
MPIx() {
int result = 0;
#if defined(MPI_GPU_AWARE_OPENMPI_API_SUPPORT) && defined(OMPI_HAVE_MPI_EXT_CUDA)
if(::MPIX_Query_cuda_support() != 1) {
return -1;
}
result |= 1 << 0;
#endif
// #if defined(MPI_GPU_AWARE_OPENMPI_API_SUPPORT) && defined(OMPI_HAVE_MPI_EXT_ZERO)
// if(::MPIX_Query_zero_support() != 1) {
// return -1;
// }
// result |= 1 << 2;
// #endif
#if defined(MPI_GPU_AWARE_OPENMPI_API_SUPPORT) && defined(OMPI_HAVE_MPI_EXT_ROCM)
if(::MPIX_Query_rocm_support() != 1) {
return -1;
}
result |= 1 << 2;
#endif
return result == 0 ? -1 : result;
}
static inline int
DoCheck() {
if(EnsureMPIIsInitialized() < 0) {
return -1;
}
if(MPIx() < 0) {
return -1;
}
return 0;
}
#elif !defined(MPI_GPU_AWARE_API_SUPPORT)
static inline int
DoCheck() {
return -1;
}
#endif
} // namespace detail
extern "C" int mpi_gpu_aware(void) {
static const int result = detail::DoCheck();
return result;
}
OpenMP
CPU
Implementation |
Compiler or wrapper |
Header Files & Linking |
---|---|---|
Compiler specific OpenMP |
|
To compile:
-fopenmp ;to link:
-fopenmp . |
Raw compilers. For CCE, AMD, GCC. |
Note
You may observe the -openmp
flag in CMake logs or other documentation. In the CPE context, it is an alias to -fopenmp
.
Note
As of 2023/06/01, the Cray compiler provides full OpenMP 4.5 support and and near complete OpenMP 5.0 and 5.1 support (as it depends on Clang’s OpenMP support). The OpenMP version reported to C and C++ codes thanks to the _OPENMP
preprocessor definition is 201811
which means version 5.0.
GPU offload
This section shows how to compile with OpenMP Offload using the different compilers covered above.
Note
Make sure the craype-accel-amd-gfx90a
module is loaded when using OpenMP offload.
Implementation |
Compiler or wrapper |
Header Files & Linking |
---|---|---|
Compiler specific OpenMP |
|
To compile:
-fopenmp ;to link:
-fopenmp . |
Raw compilers. For CCE, AMD. |
To compile:
-fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<hardware> ;to link:
-fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<hardware> . |
With <hardware>
set to gfx90a
(MI200).
Warning
(When using CCE + craype-accel-amd-gfx90a
) CMake is known to have some issues with Cray’s GPU offload implementation’s special linker (cce_omp_offload_linker
). This causes link time issues or false positive (running on GPU). Check this document work around this problem.
Warning
(When using CrayPE + AMD + craype-accel-amd-gfx90a
) The Cray wrapper is know to produce bugs like: clang++: fatal error: The option -fopenmp-targets= requires additional options -Xopenmp-target= and -march= .
. A simple fix is to always specify the whole, explicit command -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<hardware>
.
Some environment variables
Feature/flag/environnement variable |
Explanation |
---|---|
|
Instructs the runtime to display the OpenMP version number and the value of the ICVs. |
|
Instructs the runtime to display formatted affinity information for all OpenMP threads in the parallel region upon entering the first parallel region. |
Further reading: man intro_openmp
.
For the Cray OpenMP and OpenACC backend for GPU offloading
Feature/flag/environnement variable |
Explanation |
---|---|
|
(Only for OpenACC and OpenMP offload) When the runtime environment variable |
Further reading: man intro_openmp
, man intro_openacc
.
PyTorch
(Py)Torch is the defacto library for neural network based artificial intelligence.
To ensure competitive performance, take a lock at theses documents: Installing PyTorch for AMD GPUs, RCCL and IDRIS’ accelerate wrapper.
Installing PyTorch for AMD GPUs
By default pip
will install PyTorch for Nvidia GPU. To use PyTorch for AMD GPUs (ROCm stack), you will have to specify a repository like so:
$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
Warning
You may have issue due to Adastra’s firewall policies preventing access to download.pytorch.org
. To workaround the issue, you can download the packages on your local machine and scp
it to Adastra.
Warning
Be careful with the torch
package. Ensure, you still have the ROCm version at the end of every install.
Note
You may benefit from using a Python virtual environment, check this document on how to setup one.
Torch elastic
Note
If you only use a tiny amount of resource, this document will give you recommendations on how to use the compute resources.
To make use of PyTorch’s distributed features, you can take example from the script below.
#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=MI250
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=1:00:00
module purge
module load cray-python
module load aws-ofi-rccl
module list
# Use a Python VEnv unless you install everything in your home.
# source ./python_environment/bin/activate
# NOTE: We recommend that you prefetch your datasets and models on login
# nodes and use the HF_HOME or the DS_DIR environment variable.
# export HF_HOME="${SCRATCHDIR}/USER_DSDIR"
export HF_HUB_OFFLINE=1
export MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"
################################################################################
# Run
################################################################################
srun --ntasks-per-node=1 --gpus-per-task=8 --cpu-bind=none --mem-bind=none --label \
-- torchrun --nnodes="${SLURM_NNODES}" --nproc_per_node="8" \
--rdzv-id="${SLURM_JOBID}" \
--rdzv-backend=c10d \
--rdzv-endpoint="$(scontrol show hostname ${SLURM_NODELIST} | head -n 1):29400" \
--max-restarts="0" \
-- ./script.py
Note
nproc_per_node
should be equal to gpus-per-task
. This is because torchrun
will spawn nproc_per_node
child processes, each with its own GPU. This explains why we have ntasks-per-node=1
. We only run one torchrun
task per node.
Note
The LOGLEVEL="INFO"
environment variable may clutter the log. You can comment them if wanted.
IDRIS’ accelerate
wrapper
IDRIS provides a wrapper abstracting the task scheduler (say, SLURM) and the parallelization framework (PyTorch’s DDP or FSDP, Microsoft’s DeepSpeed) through HuggingFace’s Accelerate. A typical installation can be done like so:
$ pip3 install transformers deepspeed accelerate datasets torchmetrics idr-torch idr_accelerate
Warning
Be careful with the torch
package. Ensure, you still have the ROCm version at the end of every install.
Note
You may benefit from using a Python virtual environment, check this document on how to setup one.
To use idr_accelerate
, we provide a example script below. It will run an example script given in: https://github.com/idriscnrs/idr_accelerate/blob/main/examples/multi-node/train_accelerate.py, provided you have downloaded the dataset and model configuration.
#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=MI250
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=1:00:00
module purge
module load cray-python
module load aws-ofi-rccl
module list
# Use a Python VEnv unless you install everything in your home.
# source ./python_environment/bin/activate
# NOTE: We recommend that you prefetch your datasets and models on login
# nodes and use the HF_HOME or the DS_DIR environment variable.
# export HF_HOME="${SCRATCHDIR}/USER_DSDIR"
export HF_HUB_OFFLINE=1
export MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"
################################################################################
# Run
################################################################################
srun --ntasks-per-node=1 --gpus-per-task=8 --cpu-bind=none --mem-bind=none --label \
-- idr_accelerate train_accelerate.py \
--lr 0.5 \
--model_dir="..." \
--data_path="..."
Note
The LOGLEVEL="INFO"
environment variable may clutter the log. You can comment them if wanted.
Some environment variables
Feature/flag/environnement variable |
Explanation |
---|---|
|
PyTorch logging level. |
export MIOPEN_USER_DB_PATH=/tmp/${USER}-miopen-cache-${SLURM_JOB_ID} export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH} |
Prevent the ranks from starving the filesystem, trying to atomically update a SQLite database.
|
RCCL
RCCL is AMD’s ROCm stack’s fork of Nvidia’s NCCL. This stands for ROCm Communication Collectives Library (RCCL).
Adastra is built on top of a fast interconnect dubbed Slingshot by Cray. To use this interconnect when doing distributed training, you should load the aws-ofi-rcc
module which introduces a LibFabric Slingshot shim. Also, you should remove all NCCL or RCCL environment variables that could interfere with the libraries’ inner working (or use them at your own risk). As an example:
#!/bin/bash
#SBATCH --account=<account_to_charge>
#SBATCH --job-name="<job_name>"
#SBATCH --constraint=MI250
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=1:00:00
module purge
module load cray-python
module load aws-ofi-rccl # <- This matters !
module list
srun ..
Warning
If you are using RCCL
as part of PyTorch or any other product that comes with its own ROCm libraries (libamdhip64
), you do not need to load a rocm
or amd-mixed
module. Else, you should load either amd-mixed
or rocm
on which, more detail is given in this document.
Debugging networking issues
In the scripts you may use the following logging environment variables. The logs get be very verbose:
export NCCL_DEBUG="TRACE"
export NCCL_DEBUG_SUBSYS="ALL"
Note
To ensure you are using the Slingshot interconnect, assuming you have defined the verbose logging environment variable above, you should find the following lines at the start of the logs: NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0
and NCCL INFO NET/OFI Selected Provider is cxi
ROCm
Warning
ROCm is generally backward compatible which means you can use a recent ROCm with a binary built using an older ROCm. It is not forward compatible, meaning using a recent binary using an older ROCm may not work.
Modules and paths
The ROCm toolchain can be used by loading a module providing ROCm.
$ # A standalone ROCm.
$ module load rocm
$ # A standalone ROCm with a Cray compiler in the mix.
$ module load amd-mixed
$ # Or a programming environment.
$ module load PrgEnv-amd
The libraries can be found in the ${ROCM_PATH}/lib
directory. Note that the ${ROCM_PATH}
environment variable is defined by the modules given above.
The ROCm libraries provided are:
amd_comgr
amd-dbgapi
AMDDeviceLibs
hip
hip-lang
hsa-runtime64
rocm_smi
miopen
rccl
rocalution
rocblas & hipblas
rocfft & hipfft
rocrand & hiprand
rocsolver & hipsolver
rocsparse & hipsparse
rocprim & hipcub
rocthrust
AMD is extensively improving its documentation. You can find a trove of example on mixing CMake with ROCm here.
Starting with ROCm 5.1.0, here are the CMake prefixes for the provided libraries:
set(amd_comgr_DIR "$ENV{ROCM_PATH}/lib/cmake/amd_comgr")
set(AMDDeviceLibs_DIR "$ENV{ROCM_PATH}/lib/cmake/AMDDeviceLibs")
set(hip_DIR "$ENV{ROCM_PATH}/lib/cmake/hip")
set(hsa-runtime64_DIR "$ENV{ROCM_PATH}/lib/cmake/hsa-runtime64")
set(rocm_smi_DIR "$ENV{ROCM_PATH}/lib/cmake/rocm_smi")
set(miopen_DIR "$ENV{ROCM_PATH}/lib/cmake/miopen")
set(rccl_DIR "$ENV{ROCM_PATH}/lib/cmake/rccl")
set(rocalution_DIR "$ENV{ROCM_PATH}/lib/cmake/rocalution")
set(rocblas_DIR "$ENV{ROCM_PATH}/lib/cmake/rocblas")
set(hipblas_DIR "$ENV{ROCM_PATH}/lib/cmake/hipblas")
set(rocfft_DIR "$ENV{ROCM_PATH}/lib/cmake/rocfft")
set(hipfft_DIR "$ENV{ROCM_PATH}/lib/cmake/hipfft")
set(rocrand_DIR "$ENV{ROCM_PATH}/lib/cmake/rocrand")
set(hiprand_DIR "$ENV{ROCM_PATH}/lib/cmake/hiprand")
set(rocsolver_DIR "$ENV{ROCM_PATH}/lib/cmake/rocsolver")
set(hipsolver_DIR "$ENV{ROCM_PATH}/lib/cmake/hipsolver")
set(rocsparse_DIR "$ENV{ROCM_PATH}/lib/cmake/rocsparse")
set(hipsparse_DIR "$ENV{ROCM_PATH}/lib/cmake/hipsparse")
set(rocprim_DIR "$ENV{ROCM_PATH}/lib/cmake/rocprim")
set(hipcub_DIR "$ENV{ROCM_PATH}/lib/cmake/hipcub")
set(rocthrust_DIR "$ENV{ROCM_PATH}/lib/cmake/rocthrust")
The CMAke package and component one should link against are:
Library |
CMake package name |
CMake component |
---|---|---|
HIP |
|
|
rocPRIM |
|
|
rocThrust |
|
|
hipcub |
|
|
rocRAND |
|
|
rocSOLVER |
|
|
hipBLAS |
|
|
rocFFT |
|
|
hipFFT |
|
|
rocSPARSE |
|
|
hipSPARSE |
|
|
rocALUTION |
|
|
RCCL |
|
|
MIOpen |
|
|
MIGraphX |
|
|
Build integration
Using CMake to properly consuming a ROCm library
cmake_minimum_required(VERSION 3.5)
cmake_policy(VERSION 3.5...3.27)
project(example LANGUAGES CXX)
add_library(example_library ...)
find_package(miopen)
target_link_libraries(example_library PUBLIC MIOpen)
HIP as a CMake language
cmake_minimum_required(VERSION 3.21) # HIP language support requires 3.21
cmake_policy(VERSION 3.21.3...3.27)
project(example LANGUAGES HIP)
add_executable(example_binary main.hip)
# If you have .cu files, you can tell CMake to treat them as HIP like so:
add_library(example_library library.cu)
set_source_files_properties(library.cu PROPERTIES LANGUAGE HIP)
Note
CMAKE_HIP_ARCHITECTURES
is a semicolon variable that defines the targeted devices. Example: gfx801;gfx900;gfx90a
.
HIP using raw compilers or wrappers
Warning
Compiling HIP code is different than just using and linking against the HIP runtime. See Using CMake to properly consume the HIP runtime.
Warning
If you use hipcc
, you may want to use the ${HIPCC_COMPILE_FLAGS_APPEND}
and ${HIPCC_LINK_FLAGS_APPEND}
environment variables to specify your flags. Else, the hipcc
wrapper may ignore them.
This section shows how to compile HIP codes using the Cray compiler wrappers and hipcc
(itself an amdclang
wrapper).
Note
Make sure the craype-accel-amd-gfx90a
module is loaded when compiling HIP with the Cray compiler wrappers.
Compiler |
Compile/Link Flags, Header Files, and Libraries |
---|---|
CC Only with:
PrgEnv-cray PrgEnv-amd |
First load the
rocm or amd-mixed module then,to compile:
--rocm-path="${ROCM_PATH}" -xhip ;to link:
--rocm-path="${ROCM_PATH}" -L"${ROCM_PATH}/lib" -lamdhip64 . |
hipcc |
Can be used directly to compile and link with HIP source/object files.
To see what is being invoked within this compiler wrapper, issue the command,
hipcc --verbose .To explicitly target AMD MI250X, use
--offload-arch=gfx90a . |
CC and hipcc With:
CC as a non HIPcompiler like
PrgEnv-gnu . |
The GNU compilers cannot be used to compile HIP code, so all HIP kernels must be separated from CPU code.
During compilation, all non-HIP files must be compiled with
CC while HIP kernels must be compiled with hipcc .Then linking can be performed as done above for the AMD/Cray vendors.
First load the
rocm or amd-mixed module then,to compile non HIP code:
CC <file.cpp> ;to compile HIP code:
hipcc --offload-arch=gfx90a ;to link:
CC --rocm-path="${ROCM_PATH}" -L"${ROCM_PATH}/lib" -lamdhip64 . |
The ROCm modules (rocm
or amd-mixed
) introduce several environment variables for HIP programming. The command hipconfig --full
can be used to check some environment related information.
Warning
By default, if you do not specify the --rocm-path="${ROCM_PATH}"
, Clang will take the ROCm install present in /opt/rocm
or pointed to by the ${ROCM_PATH}
environment variable.
Warning
By default amdclang
(HIP-Clang) and CrayClang (actually all Clang based GPU code compilers) assumes -ffp-contract=fast-honor-pragmas
(less volatile than -Ofast
). Users can use #pragma clang fp contract(on|off|fast)
to control fp contraction of a block of code. For x86_64, FMA is off by default since the generic x86_64 target does not support FMA by default. To turn on FMA on x86_64, either use -mfma
or -march=native
on CPUs supporting FMA. When contractions are enabled and the CPU has not enabled FMA instructions, the GPU can produce different numerical results than the CPU for expressions that can be contracted. Tolerance should be used for floating point comparisons. We recommend that you read the HIP Programming Manual.
Warning
We do NOT recommend that you use amdclang
directly, always try to use the hipcc
wrapper instead. that said, if you must use amdclang
, know that you have to pass the -isystem "${ROCM_PATH}/include" --hip-device-lib-path="${ROCM_PATH}/amdgcn/bitcode" --rocm-path=${ROCM_PATH}
flags.
An other way to check which flags to use for compilation and linking is to call hipconfig
. On a ROCm 5.7.1 it could return:
$ hipconfig
HIP version : 5.7.31921-1949b1621
== hipconfig
HIP_PATH : /lus/home/softs/rocm/5.7.1
ROCM_PATH : /opt/rocm-5.7.1
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME : rocclr
CPP_CONFIG : -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/lus/home/softs/rocm/5.7.1/include -I/lus/home/softs/rocm/5.7.1/llvm/lib/clang/17.0.0
== hip-clang
HIP_CLANG_PATH : /opt/rocm-5.7.1/llvm/bin
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.7.1 23382 f3e174a1d286158c06e4cc8276366b1d4bc0c914)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.7.1/llvm/bin
AMD LLVM version 17.0.0git
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: znver4
Registered Targets:
amdgcn - AMD GCN GPUs
r600 - AMD GPUs HD2XXX-HD6XXX
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -isystem "/lus/home/softs/rocm/5.7.1/include" -O3
hip-clang-ldflags : -O3 --hip-link --rtlib=compiler-rt -unwindlib=libgcc
=== Environment Variables
PATH=/opt/rocm-5.7.1/bin:/lus/home/BCINES/dci/user/.local/bin:/lus/home/BCINES/dci/user/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/c3/bin:/sbin:/bin
LD_LIBRARY_PATH=/opt/rocm-5.7.1/lib/roctracer:/opt/rocm-5.7.1/lib/rocprofiler:/opt/rocm-5.7.1/lib
HIP_LIB_PATH=/opt/rocm-5.7.1/lib
== Linux Kernel
Hostname : login5
Linux login5 4.18.0-477.10.1.el8_8.x86_64 #1 SMP Wed Apr 5 13:35:01 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: RedHatEnterprise
Description: Red Hat Enterprise Linux release 8.8 (Ootpa)
Release: 8.8
Codename: Ootpa
HIP + OpenMP CPU threading using raw compilers or wrappers
This section shows how to compile HIP + OpenMP CPU threading hybrid codes. Note that HIP kernels may not appear in translation units with OpenMP support enabled. However, HIP runtime API calls may be used in such previously described translation units.
Note
Make sure the craype-accel-amd-gfx90a
module is loaded when compiling HIP with the Cray compiler wrappers.
Compiler/wrapper |
Compile/Link Flags, Header Files, and Libraries |
---|---|
CC Only with:
PrgEnv-cray PrgEnv-amd |
First load the
rocm or amd-mixed module then,to compile:
-fopenmp --rocm-path="${ROCM_PATH}" -xhip ;to link:
-fopenmp --rocm-path="${ROCM_PATH}" -L"${ROCM_PATH}/lib" -lamdhip64 . |
hipcc |
First load the
rocm or amd-mixed module then,to compile:
-fopenmp ;to link:
-fopenmp .To explicitly target AMD MI250X, use
--offload-arch=gfx90a at compile time. |
CC and hipcc With:
CC as a non HIPcompiler like
PrgEnv-gnu . |
The GNU compilers cannot be used to compile HIP code, so all HIP kernels must be separated from OpenMP CPU code.
During compilation, all non-HIP files must be compiled with
CC while HIP kernels must be compiled with hipcc .Then linking can be performed as done above for the AMD/Cray vendors.
First load the
rocm or amd-mixed module then,to compile non HIP code:
CC -fopenmp ;to compile HIP code:
hipcc --offload-arch=gfx90a ;to link:
CC -fopenmp --rocm-path="${ROCM_PATH}" -L"${ROCM_PATH}/lib" -lamdhip64 . |
Using CMake to properly consume the HIP runtime
Above, we introduced a way to compile .hip
files. Sometimes you only want to have access to the HIP runtime (say, hipMalloc
) through the C API. This does not require a special device compiler, we only need the header and libraries. Under CMake it is done like so:
cmake_minimum_required(VERSION 3.5)
cmake_policy(VERSION 3.5...3.27)
project(example LANGUAGES CXX)
add_executable(example_binary ...)
find_package(hip REQUIRED)
target_link_libraries(example_binary PRIVATE hip::host)
HIP runtime using raw compilers or wrappers
Compiler/wrapper |
Compile/Link Flags, Header Files, and Libraries |
---|---|
Virtually all.
|
First load the
rocm or amd-mixed module then,to compile non HIP code but to include the HIP headers:
--rocm-path="${ROCM_PATH}" -isystem "${ROCM_PATH}/include" ;to link, the simplest for LLVM based compiler is:
--rocm-path="${ROCM_PATH}" --hip-link ,to link, the most versatile (e.g. with GNU) is:
--rocm-path="${ROCM_PATH}" -L"${ROCM_PATH}/lib" -lamdhip64 . |
Warning
See below about the RDC pitfall.
Relocatable Device Code (RDC)
If you require RDC directly via -fgpu-rdc
or implicitly through libraries such as Kokkos (Kokkos_ENABLE_HIP_RELOCATABLE_DEVICE_CODE
) and you do not use hipcc
, you will have to link using the additional --hip-link
flag.
For now, the only HIP compilers are derivatives of Clang. As such, RDC is only possible if the linker used comes from either the Cray toolchain, the AMD (ROCm) toolchain or any Clang with HIP support based compiler.
Warning
Failure to specify -fgpu-rdp --hip-link
at link time, on a RDC compiled set of object file could result in the following error: error: undefined symbol: __hip_fatbin
.
C preprocessor definitions
HIP and CUDA equivalent definitions |
Meaning |
---|---|
|
(For LLVM based compiler ONLY) It tels you which language the file is to be interpreted as. Respectively defined if |
|
It tels you that you are in the GPU code generation pass. It is useful when one want to make use of conditional compilation to distinguish GPU and CPU code paths. Respectively defined if |
|
Which compiler is being used to build the code. |
|
Should not be used used as the semantic often is not what one wants. Defined in |
We recommend that you use __CUDACC__
and __HIPCC__
to discriminate CUDA and HIP implementations. This will work for any known CUDA and HIP compiler.
Some environment variables
Feature/flag/environnement variable |
Explanation |
---|---|
|
|
|
|
|
|
|
Restrict the devices that your CUDA/HIP application sees. Very important to avoid MPI ranks walking on each other. The AMD runtime supports this environment variable. |
|
Same as |
|
Same as |
|
Same as |
|
The variable controls how many independent hardware queues HIP runtime can create per process, per device. If application allocates more HIP streams than this number, then HIP runtime will reuse the same hardware queues for the new streams in round robin manner. |
|
Disable direct dispatch (do not use a special thread to launch the kernels). |
|
It causes host-to-device and device-to-host copies to use compute shader blit kernels rather than the dedicated DMA copy engines. It has been shown that high HtoD and DtoH copy throughput can be achieved this way. At a cost of less compute power available for computation |
|
Ask |
More details on the AMD GPU debugging options in this document.
Note
While not specific to ROCm, the --save-temps
option is of the upmost importance when one seeks to dive deeper into the machine and assembly code.
HIP compiler flag details
Feature/flag/environnement variable |
Explanation |
---|---|
|
ROCm installation path, used for deriving the |
|
(Generally not used) (Is automatically derived from the |
|
Instruct LLVM based compilers with HIP support to handle advanced HIP linking. Thus is mandatory for RDC. This implicitly link against the current ROCm’s |
The source of truth remains at this location.
Sycl
OneAPI
On Adastra, OneAPI (containing the Intel DPC++ compiler) is exposed through the PrgEnv-intel
and intel
Cray modules. A standalone version is available though the intel-oneapi
module (it does not pull CrayPE). The intel
and amd
modules serve the same role. Similarly for intel-oneapi
and rocm
. The version of DPC++ provided is quite old (2022.1). You may need to recompile it, see below.
Note
Make sure, if you compile your own toolchain, that your ${PATH}
and ${LD_LIBRARY_PATH}
environment variables point towards it!
You can test if a Sycl toolchain is working as expected on MI250 GPUs like so; compile a Sycl hello world:
$ cat << EOF > hello.cc
#include <mpi.h>
#include <iostream>
#include <sycl/sycl.hpp>
int main(int argc, char** argv) {
::MPI_Init(&argc, &argv);
size_t sz = 1000;
sycl::buffer<int> buf(sz);
std::cout << "Device name : "
<< sycl::queue{}.get_device().get_info<sycl::info::device::name>()
<< std::endl;
sycl::queue{}.submit([&](sycl::handler& cgh) {
sycl::accessor acc{buf, cgh, sycl::write_only, sycl::no_init};
cgh.parallel_for(sycl::range<1>{sz}, [=](sycl::item<1> id) {
acc[id] = id.get_linear_id();
});
})
.wait();
sycl::host_accessor acc{buf, sycl::read_only};
std::cout << "expected : 999 | found : " << acc[sz - 1] << "\n";
::MPI_Finalize();
return 0;
}
EOF
$ # NOTE: We use clang++, but your compiler may be called an other name, even if
$ # its clang based.
$ # Our example uses MPI, we have to had the include and linker flags.
$ clang++ -O3 \
-isystem "${CRAY_MPICH_PREFIX}/include" \
-fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx90a \
--rocm-path="${ROCM_PATH}" \
hello.cc -c
$ clang++ -O3 \
-L"${CRAY_MPICH_PREFIX}/lib" -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a} \
-fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx90a \
--rocm-path="${ROCM_PATH}" \
hello.o -o hello
Run the example:
$ srun --account=<account_to_charge> \
--nodes=1 \
--constraint=MI250 \
--job-name="interactive" \
--time=1:00:00 \
--exclusive \
--ntasks-per-node=8 \
-- ./hello | head
srun: job 319271 queued and waiting for resources
srun: job 319271 has been allocated resources
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
Device name : AMD Instinct MI250X
expected : 999 | found : 999
expected : 999 | found : 999
Compiling Intel LLVM
To enjoy the latest release of Intel LLVM you will have to build it yourself. Its embarrassingly easy, as all software build should be ! (right ?):
#!/bin/bash
set -eu
# Cloning the code on a login node.
git clone --branch="sycl" https://github.com/intel/llvm intel-llvm-src || true
cd intel-llvm-src
module purge
module load cpe/23.12
module load cray-python
module load rocm/5.7.1
module list
INSTALL_PATH="$(pwd)/install"
rm -rf ./build ./install
python3 buildbot/configure.py \
--hip \
--cmake-opt="-DCMAKE_C_COMPILER=amdclang" \
--cmake-opt="-DCMAKE_CXX_COMPILER=amdclang++" \
--cmake-opt="-DSYCL_BUILD_PI_HIP_ROCM_DIR=${ROCM_PATH}" \
--cmake-opt="-DCMAKE_INSTALL_PREFIX=${INSTALL_PATH}" \
--cmake-gen="Ninja"
cd build
time ninja -k0 all lib/all tools/libdevice/libsycldevice
time ninja -k0 install
echo "INSTALL_PATH=\"${INSTALL_PATH}\""
Additional detail is given here: https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md
Ensure your environment points towards the toolchain and that a PrgEnv-amd
, amd
, amd-mixed
or rocm
module is loaded:
$ export PATH="${INSTALL_PATH}/bin:${PATH}"
$ export LD_LIBRARY_PATH="${INSTALL_PATH}/lib:${LD_LIBRARY_PATH}"
Note
Ensure that you have recent CMake and Ninja versions. We provide guides on how to setup theses tools here for CMake and here for Ninja.
Note
In the example compilation above we explicitly tell LLVM’s CMake machinery to use the amdclang compiler, you can choose another one.
AdaptiveCPP
Note
hipsycl
was rebrand as AdaptiveCPP for legal reasons.
CINES proposes a hipsycl
installation done using Spack.
$ module spider hipsycl
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
hipsycl: hipsycl/0.9.4-mpi-python3
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
You will need to load all module(s) on any one of the lines below before the "hipsycl/0.9.4-mpi-python3" module is available to load.
CCE-GPU-2.0.0
CCE-GPU-2.1.0
develop CCE-GPU-3.0.0
We do not propose that you use this toolchain to compile Sycl code on Adastra as hipsycl 0.9.4 is quite old. Prefer Intel OneAPI as detailed above.
The transformers
& datasets
libraries
The python transformers
and datasets
libraries are a simple way to download datasets (that are not already available on Adastra’s DSDIR), download models and tokenizers, create inference pipelines and fine tune models.
We believe the datasets and models should preferably be on the scratch or the work storage areas. You can tell the library to cache the datasets and models in a user defined directory like so:
$ export HF_HOME="${SCRATCHDIR}/USER_DSDIR"
Alternatively, you can programmatically, tell the HuggingFace libraries where to cache the data using the cache_dir
argument.
import os
# Store the data in the scratch.
cache_dir = os.path.join(os.environ['SCRATCHDIR'], 'USER_DSDIR')
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('Equall/Saul-Instruct-v1', cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained('Equall/Saul-Instruct-v1', cache_dir=cache_dir)
from datasets import load_dataset
squad_dataset = load_dataset('squad', cache_dir=cache_dir)
If not specified, HF_HOME
or cache_dir
will likely point to your home directory’s .cache/huggingface
folder.
Warning
Using HF_HOME
is not same as using cache_dir=${HF_HOME}
.
You will run your computation on compute nodes who do not have internet access. Use
HF_HUB_OFFLINE
as shown below to avoid timeout during your code start up sequence. This implies you have downloaded your data (using say,from_pretrained
) on a login node, prior to running your computations.
Some environment variables
Feature/flag/environnement variable |
Explanation |
---|---|
|
Specify where to put the cached data. Defaults to |
|
This prevent having to wait for a network connection timeout when using HuggingFace’s libraries (say, |
|
Fine grain version of |
|
Fine grain version of |
|
Transformer library logging level. |
VLLM
Note
You may benefit from using a Python virtual environment, check this document on how to setup one.
Requirements:
A recent install of PyTorch for ROCm (say in a VEnv);
a recent install of flash-attn for ROCm (say in a venv).
You can compile VLLM on Adastra using the commands below.
Copy this patch in a file named rocm.patch
.
--- /include/hip/hcc_detail/amd_hip_bf16.h 2023-11-24 19:20:12.000000000 +0100
+++ include/hip/hcc_detail/amd_hip_bf16.h 2024-03-06 14:18:03.000000000 +0100
@@ -133,7 +133,7 @@
* \ingroup HIP_INTRINSIC_BFLOAT16_CONV
* \brief Converts float to bfloat16
*/
-__HOST_DEVICE__ __hip_bfloat16 __float2bfloat16(float f) {
+__HOST_DEVICE__ inline __hip_bfloat16 __float2bfloat16(float f) {
__hip_bfloat16 ret;
union {
float fp32;
@@ -177,7 +177,7 @@
* \ingroup HIP_INTRINSIC_BFLOAT162_CONV
* \brief Converts and moves bfloat162 to float2
*/
-__HOST_DEVICE__ float2 __bfloat1622float2(const __hip_bfloat162 a) {
+__HOST_DEVICE__ inline float2 __bfloat1622float2(const __hip_bfloat162 a) {
return float2{__bfloat162float(a.x), __bfloat162float(a.y)};
}
@@ -205,7 +205,7 @@
* \ingroup HIP_INTRINSIC_BFLOAT162_CONV
* \brief Convert double to __hip_bfloat16
*/
-__HOST_DEVICE__ __hip_bfloat16 __double2bfloat16(const double a) {
+__HOST_DEVICE__ inline __hip_bfloat16 __double2bfloat16(const double a) {
return __float2bfloat16((float)a);
}
@@ -213,7 +213,7 @@
* \ingroup HIP_INTRINSIC_BFLOAT162_CONV
* \brief Convert float2 to __hip_bfloat162
*/
-__HOST_DEVICE__ __hip_bfloat162 __float22bfloat162_rn(const float2 a) {
+__HOST_DEVICE__ inline __hip_bfloat162 __float22bfloat162_rn(const float2 a) {
return __hip_bfloat162{__float2bfloat16(a.x), __float2bfloat16(a.y)};
}
@@ -243,7 +243,7 @@
* \ingroup HIP_INTRINSIC_BFLOAT162_CONV
* \brief Converts high 16 bits of __hip_bfloat162 to float and returns the result
*/
-__HOST_DEVICE__ float __high2float(const __hip_bfloat162 a) { return __bfloat162float(a.y); }
+__HOST_DEVICE__ inline float __high2float(const __hip_bfloat162 a) { return __bfloat162float(a.y); }
/**
* \ingroup HIP_INTRINSIC_BFLOAT162_CONV
@@ -271,7 +271,7 @@
* \ingroup HIP_INTRINSIC_BFLOAT162_CONV
* \brief Converts low 16 bits of __hip_bfloat162 to float and returns the result
*/
-__HOST_DEVICE__ float __low2float(const __hip_bfloat162 a) { return __bfloat162float(a.x); }
+__HOST_DEVICE__ inline float __low2float(const __hip_bfloat162 a) { return __bfloat162float(a.x); }
/**
* \ingroup HIP_INTRINSIC_BFLOAT162_CONV
With the working directory similar to rocm.patch
, execute:
#!/bin/bash
set -eu
# Cloning the code on a login node.
git clone --branch="v0.3.3" https://github.com/vllm-project/vllm || true
# Build the code on a COMPUTE NODE and installing it in the current Python
# environment.
module purge
module load cpe/23.12
module load craype-accel-amd-gfx90a craype-x86-trento
module load PrgEnv-gnu
module load amd-mixed
module load cray-python
module list
# Use a Python VEnv unless you install everything in your home.
# Source a VEnv if needed.
# source ./python_environment/bin/activate
# Ensure you have installed PyTorch for the correct GPUs.
# VLLM shenanigans require that we modify some ROCm headers.
# Unless this trick is done, we get duplicated symbols such as
# __float2bfloat16 (in fact, they use device function tht are not inline nor
# static).
NEW_ROCM_PATH="$(pwd)/$(basename -- "${ROCM_PATH}")"
if [[ ! -e "${NEW_ROCM_PATH}" ]]; then
cp -rv "$(
cd -- "${ROCM_PATH}" >/dev/null 2>&1
pwd -P
)" "${NEW_ROCM_PATH}" || true
fi
export ROCM_PATH="${NEW_ROCM_PATH}"
export PATH="${ROCM_PATH}/bin:${PATH}"
export HIP_LIB_PATH="${ROCM_PATH}/lib"
export LD_LIBRARY_PATH="${HIP_LIB_PATH}:${LD_LIBRARY_PATH}"
export CMAKE_PREFIX_PATH="${ROCM_PATH}/lib/cmake:${CMAKE_PREFIX_PATH}"
cd -- "${ROCM_PATH}"
patch --strip=1 --backup -f < ../rocm.patch || true
cd ../vllm
pip3 install --no-deps xformers==0.0.23
bash patch_xformers.rocm.sh
pip3 install --upgrade -r requirements-rocm.txt
export PYTORCH_ROCM_ARCH="gfx90a"
pip3 install --verbose --no-cache-dir --no-build-isolation ./
Note
You may have to change the CPE and ROCm (amd-mixed) versions as Adastra evolves.