Tools

CINES Spack modules

CINES provides tools built using Spack. To access these products, check this document or look at this product catalog.

Building software

CMake

CMake is a meta-build system. It is used to generate build system configuration files (Makefile, Ninja configuration file, etc.). Great documentation on modern CMake is offered by Henry Schreiner.

CINES provides the latest CMake release available in the OS’s repositories. To get the very latest release, you can build CMake yourself or rely on pip3:

$ pip3 install --user --upgrade cmake
$ cmake --version
3.28.1 # The latest release as of 2024/01/01

Note

Using pip this way assumes your .bashrc or .bash_profile correctly defines your PATH to include ~/.local/bin.
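
A minimal sketch of what such a PATH setup could look like in your ~/.bashrc (adapt to your own configuration):

# Make user-installed Python entry points (e.g. cmake, ninja) available.
export PATH="${HOME}/.local/bin:${PATH}"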

Note

CMake < 3.28 recognized CrayClang as a Clang compiler (which it is). With CMake >= 3.28, it is recognized as CrayClang.

CMake examples

CMake + OpenMP

Assuming you have the following Genoa environment (which is very typical):

$ module purge
$ module load cpe/23.12
$ module load craype-x86-genoa
$ module load PrgEnv-cray

Assuming the following example CMake script:

cmake_minimum_required(VERSION 3.1)
project(example VERSION 1.0 LANGUAGES CXX)

add_executable(example_binary example.cc)

find_package(OpenMP 4.0 REQUIRED)
target_link_libraries(example_binary PRIVATE OpenMP::OpenMP_CXX)

Use the following commands (do not forget the CMAKE_CXX_COMPILER):

$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC ..

When using the Cray wrappers and craype-accel-amd-gfx90a, CMake is known to have some issues with the special linker of Cray’s GPU offload implementation (cce_omp_offload_linker). This causes link-time issues or false positives (code seemingly running on the GPU when it does not).

Example linker issue:

FAILED: example_binary
: && /opt/cray/pe/craype/2.7.30/bin/CC  -dynamic CMakeFiles/example_binary.dir/example.cc.o -o example_binary   && :
ld.lld: error: undefined symbol: _cray$mt_kmpc_fork_call_with_flags
>>> referenced by example.cc
>>>               CMakeFiles/example_binary.dir/example.cc.o:(main)
clang++: error: linker command failed with exit code 1 (use -v to see invocation)

We propose a workaround, assuming you have the following MI250 environment (which is very typical):

$ module purge
$ module load cpe/23.12
$ module load craype-x86-trento craype-accel-amd-gfx90a # <- note the craype-accel-amd-gfx90a
$ module load PrgEnv-cray

Now, at least two situations arise: either you want OpenMP offloading (#pragma omp target), or you do not. In the second situation, you may have loaded craype-accel-amd-gfx90a for the Cray wrappers to link correctly with a library like MPI or LibSci, but you do not need OpenMP offloading.

If you are in the first situation, use CMake like this (force the -fopenmp flag at link time):

$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC -DCMAKE_EXE_LINKER_FLAGS="-fopenmp" ..

If you are in the second situation:

$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC -DCMAKE_LINKER="${CC_X86_64}/bin/cce_omp_offload_linker" ..

CMake + Cray MPICH

Assuming you have the following environment (which is very typical):

$ module purge
$ module load cpe/23.12
$ module load craype-x86-trento craype-accel-amd-gfx90a
$ module load PrgEnv-cray

Assuming the following example CMake script:

cmake_minimum_required(VERSION 3.1)
project(example VERSION 1.0 LANGUAGES CXX)

add_executable(example_binary example.cc)

find_package(MPI REQUIRED)
target_link_libraries(example_binary PRIVATE MPI::MPI_CXX)

Use the following commands (do not forget the CMAKE_CXX_COMPILER):

$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC ..

Had you not used the Cray wrappers (here we use the raw Cray C++ compiler crayCC instead), the following commands would have been an alternative:

$ mkdir build && cd build
$ cmake -DCMAKE_CXX_COMPILER=crayCC -DCMAKE_CXX_FLAGS="-isystem ${CRAY_MPICH_PREFIX}/include" -DCMAKE_EXE_LINKER_FLAGS="-L${CRAY_MPICH_PREFIX}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}" ..

Note

The ${PE_MPICH_GTL_DIR_amd_gfx90a} and ${PE_MPICH_GTL_LIBS_amd_gfx90a} variables are only needed if you target the GPU.

CMake + Cray HDF5 + Cray wrappers

Assuming you have the following environment (which is very typical):

$ module purge
$ module load cpe/23.12
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ module load PrgEnv-amd
$ module load cray-hdf5

Assuming the following example CMake script:

cmake_minimum_required(VERSION 3.1)
project(example VERSION 1.0 LANGUAGES CXX)

add_executable(example_binary example.cc)

# This variable does not affect the build when using the Cray wrappers!
# set(HDF5_PREFER_PARALLEL TRUE)
find_package(HDF5 REQUIRED COMPONENTS C)
target_link_libraries(example_binary PRIVATE ${HDF5_LIBRARIES})

Use the following commands (do not forget the CMAKE_CXX_COMPILER):

$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC ..

Note

The process is the same for cray-hdf5-parallel, just swap the HDF5 modules.

CMake + Cray HDF5 + raw compilers

Note

There is an issue with the HDF5 wrappers (h5fc, h5cc, h5c++, h5pcc, h5pfc, etc.) provided by Cray. Make sure you work around the issue as described below.

Assuming you have the following environment (which is very typical):

$ module purge
$ module load cpe/23.12
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ module load PrgEnv-amd
$ module load cray-hdf5

You need to add the following environment variables:

# Work around an issue in FindHDF5 and Cray's HDF5 compiler wrappers.
export HDF5_CC="amdclang"
export HDF5_CLINKER="amdclang"
export HDF5_CXX="amdclang++"
export HDF5_CXXLINKER="amdclang++"
export HDF5_FC="flang"
export HDF5_FLINKER="flang"

Assuming the following example CMake script:

cmake_minimum_required(VERSION 3.1)
project(example VERSION 1.0 LANGUAGES CXX)

add_executable(example_binary example.cc)

find_package(HDF5 REQUIRED COMPONENTS C)
target_link_libraries(example_binary PRIVATE ${HDF5_LIBRARIES})

Use the following commands (do not forget the CMAKE_CXX_COMPILER):

$ mkdir -p build && cd build
$ # The -isystem is to work around an issue in FindHDF5 and Cray's HDF5 compiler
$ # wrappers.
$ cmake -DCMAKE_CXX_COMPILER=amdclang++ -DCMAKE_CXX_FLAGS="-isystem ${CRAY_HDF5_PREFIX}/include" ..

Note

The process is the same for cray-hdf5-parallel, just swap the HDF5 modules.

CMake + Kokkos HIP + OpenMP

Compiling Kokkos with a HIP backend for MI250X and an OpenMP backend for CPUs:

$ module purge
$ module load cpe/23.12
$ module load craype-x86-trento craype-accel-amd-gfx90a
$ module load PrgEnv-cray
$ module load amd-mixed/6.0.0
$ module list

$ git clone -b 4.2.00 https://github.com/kokkos/kokkos.git
$ cd kokkos

$ cmake -B build -S . -DCMAKE_CXX_COMPILER=CC -DCMAKE_CXX_STANDARD=17 -DKokkos_ENABLE_HIP=ON -DKokkos_ARCH_AMD_GFX90A=ON -DKokkos_ENABLE_OPENMP=ON
$ cmake --build build --clean-first --parallel 128

CMake + LibSci

Assuming you have the following environment (which is very typical):

$ module purge
$ module load cpe/23.12
$ module load craype-x86-trento craype-accel-amd-gfx90a
$ module load PrgEnv-cray

Assuming the following example CMake script:

cmake_minimum_required(VERSION 3.1)
project(example VERSION 1.0 LANGUAGES CXX)

add_executable(example_binary example.cc)

find_package(BLAS REQUIRED)
target_link_libraries(example_binary PRIVATE BLAS::BLAS)

Use the following commands (do not forget the CMAKE_CXX_COMPILER):

$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC ..

Had you not used the Cray wrappers (here we use the raw Cray C++ compiler crayCC instead), the following commands would have been an alternative:

$ mkdir build && cd build
$ cmake -DCMAKE_CXX_COMPILER=crayCC -DCMAKE_CXX_FLAGS="-isystem ${CRAY_PE_LIBSCI_PREFIX}/include" -DBLAS_LIBRARIES="${CRAY_PE_LIBSCI_PREFIX}/lib/libsci_cray.so;" ..

Ninja

Ninja is yet another build system. It takes as input the interdependencies of files (typically source code and output executables) and orchestrates building them, quickly. It plays a similar role to GNU Make but without most of the clutter. It is simpler, faster, and more reliable. You should not write Ninja configuration files the way you may write Makefiles. Instead, rely on a meta-build system such as CMake, Meson, etc.

To install Ninja on any machine (assuming you have pip3 installed):

$ pip3 install --user --upgrade ninja

To ask CMake to use Ninja:

$ cmake -GNinja <your_other_flags>
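
You can then build as usual; for instance, assuming the build directory is named build:

$ ninja -C build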

Conda

CINES proposes a conda module:

$ module load conda

Note

CINES uses a package cache.
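
As an illustration, here is a minimal sketch creating and activating an environment (the environment name and Python version are arbitrary):

$ module load conda
$ conda create --name my_env python=3.11
$ conda activate my_env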

Debugging

GDB

GDB is the de facto system programming debugger. It requires quite a lot of skill to get anything out of it on real-world problems (notably if you do not have access to the source). If you plan to use it, make sure that, regardless of the optimization level you specify to your compiler, you ship the debug info too. This is generally done using -ggdb on GNU and Clang based compilers.
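
For example, a minimal sketch (my_program is a placeholder) keeping optimizations while adding debug info, then running under GDB:

$ CC -O2 -ggdb my_program.cc -o my_program
$ gdb --args ./my_program <arguments>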

Note

If you are a novice, you may be better off using ChatGDB with a prompt such as: What is the root cause of this error? Given the following source code context, propose a fix. In your response, always refer only to specific lines and filenames of the source code.

Valgrind

Valgrind is a suite of tools for debugging and profiling programs. It is mostly used to profile cache behavior (by emulation) and check for memory leaks.

To check for memory leaks:

$ valgrind --leak-check=full ./my_program

Sanitizers

Compiler toolchains such as GNU or LLVM provide sanitizers (advanced runtime diagnostics) for the C and C++ compilers. On Clang or GCC based compilers, you would use -fsanitize=address to check for memory errors and leaks. More sanitizers are available for undefined behavior, race conditions, or uninitialized memory usage. Check this document for more details.
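
A minimal sketch, assuming a Clang or GCC based compiler (here the CC wrapper; my_program.cc is a placeholder):

$ CC -O1 -g -fsanitize=address my_program.cc -o my_program
$ ./my_program # AddressSanitizer reports memory errors and leaks at runtime.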

ldd and nm

ldd prints shared library dependencies. Can be used to check if all shared object dependencies are satisfied. ldd is recursive and does not stop at the first level of the dependency tree.
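
For instance, to spot unresolved dependencies (my_program is a placeholder):

$ ldd ./my_program | grep "not found"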

nm lists symbols from object files. This is useful to diagnose undefined reference issues. You have to de-mangle C++ symbols; for that, you can pipe nm into the c++filt command.
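
For instance, to list the undefined (demangled) symbols of an object file (my_object.o is a placeholder):

$ nm --undefined-only ./my_object.o | c++filt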

gstack

Prints a stack trace of a running process. Can be used in conjunction with watch to observe how the stack evolves.

watch -n0 -- "gstack <pid>"

Monitoring

htop

An interactive process viewer. This is useful to monitor how your program uses the machine.

Warning

When you launch this tool for the first time, you might get surprised by the layout due to the unfortunately large number of cores on Adastra’s login and Genoa nodes. Some layout tuning is required to get a proper reading. Check the configuration below to work around the issue.

Example configuration to place into ~/.config/htop/htoprc

# Beware! This file is rewritten by htop when settings are changed in the interface.
# The parser is also very primitive, and not human-friendly.
fields=0 48 17 114 18 38 39 40 2 113 46 47 49 1
sort_key=46
sort_direction=1
tree_sort_key=49
tree_sort_direction=1
hide_kernel_threads=1
hide_userland_threads=1
shadow_other_users=0
show_thread_names=0
show_program_path=0
highlight_base_name=1
highlight_megabytes=1
highlight_threads=1
highlight_changes=0
highlight_changes_delay_secs=5
find_comm_in_cmdline=1
strip_exe_from_cmdline=1
show_merged_command=0
tree_view=1
tree_view_always_by_pid=0
header_margin=1
detailed_cpu_time=1
cpu_count_from_one=0
show_cpu_usage=1
show_cpu_frequency=0
update_process_names=0
account_guest_in_cpu_meter=0
color_scheme=0
enable_mouse=1
delay=15
left_meters=LeftCPUs8 Memory Swap NetworkIO DiskIO
left_meter_modes=1 1 1 2 2
right_meters=RightCPUs8 Hostname Tasks LoadAverage Uptime SELinux
right_meter_modes=1 2 2 2 2 2
hide_function_bar=0

rocm-smi

Note

It is somewhat equivalent to NVIDIA’s nvidia-smi.

One can load the amd-mixed module and use the rocm-smi command. A typical workflow would be to start a job using, say, sbatch, then run squeue --me to obtain the list of nodes associated with the job. Select a node from the list and connect to it using ssh <node_hostname>. Then run module load amd-mixed and watch -n0 -- rocm-smi.
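
A minimal sketch of this workflow (my_job.sh and the node hostname are placeholders):

$ sbatch ./my_job.sh
$ squeue --me # Note the nodes allocated to the job.
$ ssh <node_hostname>
$ module load amd-mixed
$ watch -n0 -- rocm-smi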

It will give you the following output:

$ watch -- rocm-smi
Every 0.1s: rocm-smi                            <node>: <date>

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    63.0c  444.0W  1700Mhz  1600Mhz  0%   auto  560.0W   30%   100%
1    58.0c  N/A     1700Mhz  1600Mhz  0%   auto  0.0W     30%   100%
2    63.0c  465.0W  1700Mhz  1600Mhz  0%   auto  560.0W   30%   100%
3    69.0c  N/A     1700Mhz  1600Mhz  0%   auto  0.0W     30%   100%
4    58.0c  443.0W  1700Mhz  1600Mhz  0%   auto  560.0W   30%   100%
5    57.0c  N/A     1700Mhz  1600Mhz  0%   auto  0.0W     30%   100%
6    68.0c  460.0W  1700Mhz  1600Mhz  0%   auto  560.0W   30%   100%
7    64.0c  N/A     1700Mhz  1600Mhz  0%   auto  0.0W     30%   99%
================================================================================
============================= End of ROCm SMI Log ==============================

In addition, it can be a trove of information regarding the hardware topology (bus, PCIe, etc.).

Profiling

Intel VTune & Advisor

CINES provides both tools under the following paths:

/opt/software/intel/oneapi/advisor/latest/bin64/advisor-gui
/opt/software/intel/oneapi/vtune/latest/bin64/vtune-gui

Their UIs can be used via X11 forwarding.

Cray-PAT’s perftools

Introducing Cray-PAT can be done in multiple ways. First, ensure the perftools-base module is loaded. Then load one of the perftools-lite* modules. They exist in several flavors, such as:

  • perftools-lite: Gives execution time, memory peak, most time-consuming functions, I/O read and write rates, and total energy consumed.

  • perftools-lite-events: Gives the time-consuming functions, MPI and OpenMP (CPU) information, as well as node observations and possible rank order suggestions.

  • perftools-lite-gpu: Gives kernel execution times, synchronization time, as well as data movement information (time and size of transfers).

  • perftools-lite-loops: Gives loop trip counts and execution times. The results of this profile can be used with Reveal. More on that later.

Note

Cray provides some documentation on using these tools.

When one of the modules given above is loaded, recompile your application using the Cray wrappers. After executing the test case, a folder is created in the working directory that was in use when you launched the application. At the end of the execution, perftools also prints partial results to the standard output. A sample of such output could look like this:

#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################
CrayPat/X:  Version 22.04.0 Revision 044b9fa85  03/10/22 21:59:07
Experiment:                  lite  lite-samples
Number of PEs (MPI ranks):      1
Numbers of PEs per Node:        1
Numbers of Threads per PE:      1
Number of Cores per Socket:    64
Accelerator Model: AMD MI100 Memory: 32.00 GB Frequency: 1.00 GHz
Execution start time:  Tue May 31 10:23:14 2022
System name and speed:  g1003  2.304 GHz (nominal)
AMD   Trento               CPU  Family: 25  Model: 48  Stepping:  1
Core Performance Boost:  1 PE has CPB capability
Avg Process Time:      57.62 secs
High Memory:         2,598.3 MiBytes     2,598.3 MiBytes per PE
I/O Write Rate:   108.928487 MiBytes/sec
Notes for table 1:
This table shows functions that have significant exclusive sample
hits, averaged across ranks.
For further explanation, see the "General table notes" below,
or use:  pat_report -v -O samp_profile ...
Table 1:  Profile by Function
Samp% |    Samp | Imb. |  Imb. | Group
|         | Samp | Samp% |  Function=[MAX10]
100.0% | 2,234.0 |   -- |    -- | Total
|---------------------------------------------------------
|  67.0% | 1,497.0 |   -- |    -- | ETC
||--------------------------------------------------------
||  36.5% |   815.0 |   -- |    -- | __sci_sgemv_
||  16.0% |   357.0 |   -- |    -- | openblas_sgemv_t_naples
||   8.6% |   192.0 |   -- |    -- | sgemv_kernel_4x2
||   3.4% |    76.0 |   -- |    -- | cblas_sgemv
||   1.9% |    42.0 |   -- |    -- | sgemv_kernel_4x1
||========================================================
|  32.9% |   735.0 |   -- |    -- | USER
||--------------------------------------------------------
||  26.3% |   588.0 |   -- |    -- | main
||   2.3% |    52.0 |   -- |    -- | matrix
||   2.2% |    49.0 |   -- |    -- | WriteHdf5MeshFile2D
||   1.2% |    27.0 |   -- |    -- | Compute_Epsi
|=========================================================

You can observe in which function the time is spent. The time presented here, in percentage and samples, is exclusive. That is, if you imagine the history of the call stack as a rooted tree (in the graph theory sense), the time spent in a function (a vertex) does not account for the time spent in its callees (child vertices).

In the example shown above, most of the time is spent in __sci_sgemv_, a BLAS routine computing a matrix-vector operation.

Instrumenting an already existing executable binary

Cray offers the pat_build command that can be used to instrument already compiled codes. Still, one must make sure that the perftools-base and perftools modules were loaded during compilation. Then, to instrument the binary, one can do the following:

$ pat_build -g <trace_group> -f <path_to_original_binary> -o <path_to_instrumented_binary>

You are to specify what you would like to profile using a trace group. The values it can take are given below (you can use multiple values separated by commas):

adios2
    Adaptable Input Output System Version 2
aio
    Functions that perform asynchronous I/O.
blacs
    Basic Linear Algebra communication subprograms
blas
    Basic Linear Algebra subprograms
caf
    Co-Array Fortran (CCE compiler only)
charm++
    Charm++ independent parallel programming system (deprecated)
comex
    Communications Runtime for Extreme Scale
converse
    Charm++ Converse low-level message passing API (deprecated)
craymem
    Cray memory allocation with attributes
cuda
    NVidia Compute Unified Device Architecture runtime and driver API
curl
    Multi-protocol file transfer API
dl
    functions that manage dynamic linking
dmapp
    Distributed Memory Application API
fabric
    Open network communication services API
ffio
    functions that perform Flexible File I/O (CCE compiler only)
fftw
    Fast Fourier Transform library (32- and 64-bit only)
ga
    Global Arrays API
gmp
    GNU MultiPrecision Arithmetic Library
gni
    Generic Network Interface API
hbw
    High Bandwidth Memory API
hdf5
    Hierarchical Data Format library
heap
    dynamic heap
hip
    AMD Heterogeneous-compute Interface for Portability runtime API
hsa
    AMD Heterogeneous System Architecture API
huge
    Linux huge pages
io
    functions and system calls that perform I/O
jemalloc
    heap allocation emphasizing fragmentation avoidance and scalable concurrency
lapack
    Linear Algebra Package
lustre
    Lustre User API
math
    POSIX.1 math functions
memkind
    User extensible heap manager API
memory
    memory management operations
mpfr
    GNU MultiPrecision Floating-Point Library
mpi
    Message Passing Interface library
netcdf
    Network Common Data Form
numa
    Non-uniform Memory Access API (see numa(3))
oacc
    OpenAccelerator API
omp
    OpenMP API
opencl
    Open Computing Language API
pblas
    Parallel Basic Linear Algebra Subroutines
petsc
    Portable Extensible Toolkit for Scientific Computation. Supported for "real" computations only.
pgas
    Parallel Global Address Space
pnetcdf
    Parallel Network Common Data Form
pthreads
    POSIX threads
pthreads_mutex
    POSIX threads concurrent process control
pthreads_spin
    POSIX threads low-level synchronization control
realtime
    POSIX realtime extensions
scalapack
    Scalable LAPACK
shmem
    One-sided Remote Direct Memory Access Parallel-Processing Interface library
signal
    POSIX signal handling and control
spawn
    POSIX realtime process creation
stdio
    all library functions that accept or return the FILE* construct
string
    String operations
syscall
    system calls
sysfs
    system calls that perform miscellaneous file management
sysio
    system calls that perform I/O
umpire
    Heterogeneous Memory Resources Management Library
upc
    Unified Parallel C (CCE compiler only)
xpmem
    cross-process memory mapping
zmq
    High-performance asynchronous messaging API

For CPU profiling, a typical set of trace groups would be mpi,syscall,io,omp and, for AMD GPUs: mpi,syscall,io,hip.
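
For instance, a CPU-oriented instrumentation could look like this (my_program is a placeholder, and the flag layout follows the pat_build command shown above):

$ pat_build -g mpi,syscall,io,omp -f ./my_program -o ./my_program+pat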

When the instrumented_binary is produced, you can use it similarly to the original binary. Profiling results will be output in a folder (results_path) named instrumented_binary+<PID>_<timestamp>s, placed in the working directory of the executable.

You can finally use pat_report -v <results_path> and, optionally, specify an option to orient what to report: pat_report -v -O <report_option> <results_path>. The report options are given below (an example follows the list). Note that not all report options are available for a given profile: you have to record the appropriate information first to be able to report it. Do not expect MPI information from an OpenMP profile.

accelerator
    Show calltree of accelerator performance data sorted by host time.
accpc
    Show accelerator performance counters.
acc_fu
    Show accelerator performance data sorted by host time.
acc_time_fu
    Show accelerator performance data sorted by accelerator time.
acc_time
    Show calltree of accelerator performance data sorted by accelerator time.
acc_show_by_ct
    (Deferred implementation) Show accelerator performance data sorted alphabetically.
affinity
    Shows affinity bitmask for each node. Can use -s pe=ALL and -s th=ALL to see affinity for each process and thread, and use -s filter_input=expression to limit the number of PEs shown.
profile
    Show data by function name only
callers (or ca)
    Show function callers (bottom-up view)
calltree (or ct)
    Show calltree (top-down view)
ca+src
    Show line numbers in callers
ct+src
    Show line numbers in calltree
hbm_ct
    Show memory bandwidth data by object, sorted by sample count.
hbm_details
    Show hbm data collection statistics, including counts of sampled addresses that could not be mapped to a registered object.
hbm_frees
    Show program locations at which objects are freed by explicit calls to free or delete.
hbm_wt
    Show memory bandwidth data by object, sorted by aggregate sample weight. The weight estimates the benefit of allocating the object in high bandwidth memory.
heap
Implies heap_program, heap_hiwater, and heap_leaks. Instrumented executables must be built using the pat_build -g heap option or executed with the pat_run -g heap option in order to show heap_hiwater and heap_leaks information.
heap_program
    Compare heap usage at the start and end of the program, showing heap space used and free at the start, and unfreed space and fragmentation at the end.
heap_hiwater
    If the pat_build -g heap option was used to instrument the program or the program was executed with the pat_run -g heap option, this report option shows the heap usage "high water" mark, the total number of allocations and frees, and the number and total size of objects allocated but not freed between the start and end of the program.
heap_leaks
    If the pat_build -g heap option was used to instrument the program or the program was executed with the pat_run -g heap option, this report option shows the largest unfreed objects by call site of allocation and PE number.
himem
    Memory high water mark by Numa Node. For nodes with multiple sockets, or nodes with Intel KNL processors, the default report should also have a table showing high water usage by numa node. That table is not shown if all memory was mapped to numa node 0, but can be explicitly requested with pat_report -O himem.
acc_kern_stats
    Show kernel-level statistics including average kernel grid size, average block size, and average amount of shared memory dynamically allocated for the kernel.
load_balance
    Implies load_balance_program, load_balance_group, and load_balance_function. Show PEs with maximum, minimum, and median times.
load_balance_program, load_balance_group, load_balance_function
    For the whole program, groups, or functions, respectively, show the imb_time (difference between maximum and average time across PEs) in seconds and the imb_time% (imb_time/max_time * NumPEs/(NumPEs - 1)). For example, an imbalance of 100% for a function means that only one PE spent time in that function.
load_balance_cm
    If the pat_build -g mpi option was used to instrument the program or the program was executed with the pat_run -g mpi option, this report option shows the load balance by group with collective-message statistics.
load_balance_sm
    If the pat_build -g mpi option was used to instrument the program or the program was executed with the pat_run -g mpi option, this report option shows the load balance by group with sent-message statistics.
load_imbalance_thread
    Shows the active time (average over PEs) for each thread number.
loop_times
    Inclusive and Exclusive Time in Loops. If the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used, then this table will be included in a default report and the following additional loop reporting options are also available.
loop_callers
    Loop Stats by Function and Caller. Available only if the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used.
loop_callers+src
    Loop Stats by Function and Callsites. Available only if the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used.
loop_calltree
    Function and Loop Calltree View. Available only if the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used.
loop_calltree+src
    Function and Loop Calltree with Line Numbers. Available only if the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used.
profile_loops
    Profile by Group and Function with Loops. Available only if the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used.
mcdram
    Display the MCDRAM configuration for each PE with pat_report -O mcdram ...
mesh_xyz
    Show the coordinates in the network mesh.
mpi_callers
    Show MPI sent- and collective-message statistics
mpi_sm_callers
    Show MPI sent-message statistics
mpi_coll_callers
    Show MPI collective-message statistics
mpi_dest_bytes
    Show MPI bin statistics as total bytes
mpi_dest_counts
    Show MPI bin statistics as counts of messages
mpi_sm_rank_order
    Calculate a suggested rank order based on MPI grid detection and MPI point-to-point message optimization. Uses sent-message data from tracing MPI functions to generate suggested MPI rank order information. Requires the program to be instrumented using the pat_build -g mpi option or executed with the pat_run -g mpi option.
mpi_rank_order
    Calculate a rank order to balance a shared resource such as USER time over all nodes. Uses time in user functions, or alternatively, any other metric specified by using the -s mro_metric options, to generate suggested MPI rank order information.
mpi_hy_rank_order
    Calculate a rank order based on a hybrid combination of mpi_sm_rank_order and mpi_rank_order.
nids
    Show PE to NID mapping.
nwpc
    Program network performance counter activity.
profile_nwpc
    Network performance counter data by Function Group and Function. Table shown by default if NWPCs are present in the .ap2 file.
profile_pe.th
    Show the imbalance over the set of all threads in the program.
profile_pe_th
    Show the imbalance over PEs of maximum thread times.
profile_th_pe
    For each thread, show the imbalance over PEs.
program_time
    Shows which PEs took the maximum, median, and minimum time for the whole program.
read_stats, write_stats
    If the pat_build -g io option was used to instrument the program or the program was executed with the pat_run -g io option, these options show the I/O statistics by filename and by PE, with maximum, median, and minimum I/O times. The -O io option is a shortcut for both read_stats and write_stats.
samp_profile+src
    Show sampled data by line number with each function.
thread_times
    For each thread number, show the average of all PE times and the PEs with the minimum, maximum, and median times.
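
For example, to display a call tree annotated with source line numbers from an instrumented run (results_path being the folder produced by the run):

$ pat_report -v -O ct+src <results_path>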

Omniperf

Warning

Ensure you use a profiler from a ROCm version equal to or more recent than the ROCm used to build the binary.

Note

We provide slides given in 2024 on using this tool.

Note

It can be used to go into the nitty-gritty details of the GPU kernels. It works at a lower level than Omnitrace.

MI Performance Profiler (Omniperf) is a system performance profiling tool for machine learning and HPC workloads running on AMD Instinct accelerators (MI200, MI300, etc.). It is currently built on top of rocProfiler to monitor hardware performance counters. Basically, it is a friendlier interface to rocPROF.

It offers the following analysis related features:

  • System information;

  • System Speed-of-Light (SOL);

  • Kernel statistic;

  • Memory chart analysis;

  • Roofline analysis (supported on MI200 only, SLES 15 SP3 or RHEL8);

  • Command Processor (CP);

  • Shader Processing Input (SPI);

  • Wavefront launch;

  • Compute Unit (CU) - instruction mix;

  • Compute Unit (CU) - pipeline;

  • Local Data Share (LDS);

  • Instruction cache;

  • Scalar L1D cache;

  • Texture addresser and data;

  • Vector L1D cache;

  • L2 cache;

  • L2 cache (per-channel).

On Adastra, we provide a module (currently hidden) which you can load like so:

$ module load rocm/5.5.1
$ module load .omniperf/1.0.10

To use the tool, please refer to the official documentation: https://amdresearch.github.io/omniperf/getting_started.html#quickstart. A good video introduction to the tool is given by Cole Ramos in GPU Profiling (Performance Profile: Omniperf).

GPU roofline

Omniperf uses rocPROF to build rooflines graphs for you. It can be used like so:

$ omniperf profile -n case.roofline --roof-only --device=0 -- ./test_binary

It will start the test_binary as many times as needed for all the counters to be retrieved, running exclusively on device 0. When the process finishes, you will find PDF files in <working_directory>/workload/case.roofline/. These documents are roofline graphs.

Note

Be careful, Omniperf will run your binary multiple times. Ensure that there is reproducibility between runs.

Omnitrace

Warning

Ensure you use a profiler from a ROCm version equal to or more recent than the ROCm used to build the binary.

Note

We provide slides given in 2024 on using this tool.

Note

Closer to a typical profiler (say VTune, perf) with additional AMD GPU kernel details.

MI Performance Tracer (Omnitrace) is designed for both high-level profiling and comprehensive tracing of applications running on the CPU or the CPU+GPU via dynamic binary instrumentation, call-stack sampling, and various other means for determining currently executing function and line information.

Visualization of the comprehensive Omnitrace results can be viewed in any modern web browser by visiting ui.perfetto.dev and loading the perfetto output (.proto files) produced by Omnitrace.

It offers the following analysis related features:

  • Dynamic instrumentation (runtime instrumentation, binary rewriting);

  • Statistical sampling;

  • Process-level sampling;

  • Causal profiling;

  • High-level summary profiles with mean/min/max/stddev statistics;

  • Application speedup predictions resulting from potential optimizations in functions and lines of code (causal profiling);

  • Critical trace generation;

  • Critical trace analysis;

  • HIP;

  • HSA;

  • Pthreads;

  • MPI;

  • Kokkos-Tools (KokkosP);

  • OpenMP-Tools (OMPT);

  • GPU hardware counters;

  • HIP API tracing;

  • HIP kernel tracing;

  • HSA API tracing;

  • HSA operation tracing;

  • System-level sampling (via rocm-smi);

  • CPU hardware counters sampling and profiles;

  • CPU frequency sampling;

  • Various timing metrics;

  • Various memory metrics;

  • Network statistics;

  • I/O metrics;

  • Third-party API support:
    • TAU;

    • LIKWID;

    • Caliper;

    • CrayPAT;

    • VTune;

    • NVTX;

    • ROCTX.

On Adastra, we provide a module (currently hidden) which you can load like so:

$ module load rocm/5.5.1
$ module load .omnitrace/1.10.4

To use the tool, please refer to the official documentation: https://amdresearch.github.io/omnitrace/getting_started.html.

perf

perf is the performance analysis tool for Linux. It is developed as part of the Linux kernel. This tool is simple to use, lightweight, and versatile. You can record traces of many kinds and exploit them using specific tools such as the hotspot visualizer.

Note

If you want to profile a specific rank of an MPI-based, multi-rank program, use a trivial shell wrapper to profile only if SLURM_PROCID is equal to the rank you wish to profile: if [ "${SLURM_PROCID}" == "0" ]; then exec -- perf record -- <my_program>; else exec -- <my_program>; fi.

Assuming you want to profile a program my_program:

$ perf record -o perf.data --call-graph dwarf --event instructions,cpu-cycles,cache-misses,branches --aio --sample-cpu -- ./my_program
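
The recording can then be inspected, for instance with perf’s built-in report view (or a GUI such as hotspot):

$ perf report -i perf.data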

rocPROF

Warning

Ensure you use a profiler from a ROCm version equal to or more recent than the ROCm used to build the binary.

Note

We provide slides given in 2024 on using this tool.

rocprof is the performance analysis tool for GPGPU oriented AMD GPUs.

Note

When doing an initial profiling, you may benefit from using export AMD_SERIALIZE_COPY=3 AMD_SERIALIZE_KERNEL=3 GPU_MAX_HW_QUEUES=1. This will force kernels to be serialized on one physical stream which may ease resource usage analysis of the kernels and generally gives a cleaner trace.

We recommend that you use a small script to wrap rocPROF:

#!/bin/bash

if [ "${SLURM_PROCID}" == "0" ]; then
    # Here we prepare a simple GPU profiling, nothing fancy.
    exec -- rocprof --stats --timestamp on -o stats_${SLURM_JOBID}-${SLURM_PROCID}.csv "${@}"
else
    exec -- "${@}"
fi

This script serves to select which rank is to be profiled, but also to properly label the trace data written to disk using SLURM_PROCID.

Warning

In the script above, only the first rank (0) is going to be profiled. Change the "${SLURM_PROCID}" == "XXX" if needed.
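
Assuming the wrapper above is saved as rocprof_wrapper.sh (a placeholder name) and made executable, it can be used like so:

$ chmod +x ./rocprof_wrapper.sh
$ srun <your_slurm_options> ./rocprof_wrapper.sh ./my_program <arguments>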

Simple GPU profiling

This tool can be used to gather GPU kernel statistics.

For rocPROF to be available in your environment, load the correct amd-mixed or rocm module (if you did not use PrgEnv-amd). We recommend the rocPROF built into ROCm versions equal to or more recent than 5.3.0. Earlier rocPROF versions had crippling bugs.
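
For instance, a minimal sketch to check that the tool is available:

$ module load amd-mixed
$ which rocprof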

The first step is to map the landscape. This can be done by asking rocPROF to produce simple statistics about all the kernels, like so:

$ rocprof --stats --timestamp on -o stats_${SLURM_JOBID}-${SLURM_PROCID}.csv <executable> <arguments>

This will produce a CSV output composed of the following significant files: stats_<slurm_job_id>-<slurm_mpi_rank>.csv and stats_<slurm_job_id>-<slurm_mpi_rank>.stats.csv. The first file lists all the information gathered for each kernel launch.

Sample of the first file (.csv) after formatting the CSV as a table:

Index | KernelName | gpu-id | queue-id | queue-index | pid | tid | grd | wgr | lds | scr | vgpr | sgpr | fbar | sig | obj | DispatchNs | BeginNs | EndNs | CompleteNs | DurationNs
0 | void thrust::hip_rocprim::__parallel_for::kernel<256u, 1u, thrust::hip_rocprim:: … [clone .kd] | 0 | 1 | 0 | 796946 | 796946 | 11796480 | 256 | 0 | 0 | 8 | 24 | 94656 | 0x0 | 0x7f1e35826300 | 1191607302876252 | 1191607302920469 | 1191607303043669 | 1191607303053971 | 123200
1 | void rocprim::detail::scan_batches_kernel<256u, 2u, 6u>(unsigned int*, unsigned int*, unsigned int) [clone .kd] | 0 | 1 | 48 | 796946 | 796946 | 16384 | 256 | 512 | 0 | 16 | 24 | 36864 | 0x0 | 0x7f1e35825bc0 | 1191607404624006 | 1191607404963922 | 1191607404970162 | 1191607404973843 | 6240
2 | __omp_offloading_30_40e0070__ZN5Field6put_toEd_l21_cce$noloop$form.kd | 0 | 0 | 4 | 796946 | 796946 | 68352 | 256 | 0 | 0 | 4 | 24 | 68224 | 0x0 | 0x7f1e2fdc8240 | 1191607515378464 | 1191607515391455 | 1191607515400735 | 1191607515402559 | 9280

Sample of the second file (.stats.csv) after formatting the CSV as a table:

Name | Calls | TotalDurationNs | AverageNs | Percentage
void hip::kernel::DepositCurrentDensity_2D_Order2<double, float, 128ul>(double*, … [clone .kd] | 600 | 8559183476 | 14265305 | 50.06434641750023
void rocprim::detail::partition_kernel<(rocprim::detail::select_method)1, true, … [clone .kd] | 600 | 1260501300 | 2100835 | 7.372919849172466
void rocprim::detail::partition_kernel<(rocprim::detail::select_method)1, true, … [clone .kd] | 600 | 1247304037 | 2078840 | 7.29572646402685
__omp_offloading_30_40e035b__ZN20Interpolator2D2Order13fieldsWrapperEP11ElectroMagnR9ParticlesP9MPIPiS6_iji_l186_cce$noloop$form.kd | 600 | 1245721159 | 2076201 | 7.2864678994977865
__omp_offloading_30_40e0060__ZN11PusherBorisclER9ParticlesP9MPIiiii_l57_cce$noloop$form.kd | 600 | 1040969461 | 1734949 | 6.088834974933595

From these two sample tables, we can note some interesting information such as:

Calls - How many times a given kernel has been launched.
DurationNs - Execution duration of a given kernel launch, in nanoseconds.
AverageNs - Average kernel duration in nanoseconds (=sum(DurationNs of a given kernel) / Calls of a given kernel).
Percentage - Percentage of the GPU time (not user, CPU or system time) spent on a given kernel (=sum(DurationNs of a given kernel) / sum(DurationNs of all kernels)).
grd - Size of the grid (in threads/work-items) (=wgr*grp_count).
wgr - Size of the workgroup (in threads/work-items) (=grd/grp_count).
lds - Amount of LDS used, in octets.
scr - Amount of scratch memory space used, in octets.
vgpr - Number of vector general purpose registers (a GPR is 4 octets wide).
sgpr - Number of scalar general purpose registers (a GPR is 4 octets wide).

Note

The rocPROF overhead is not very high (~ 5-10%) when only asking for basic statistics (--stats) but the amount of data can grow very fast!

Note

rocPROF gathers data on every kernel launched on a given GPU, that is, your kernels and others’ kernels. Make sure only your code is executing on it, else you will get noisy reports.

Precise GPU profiling

This topic can become very advanced quite quickly; here we only present a way to gather information, not how to interpret the results which, depending on your workload, could differ. rocPROF can be fed a -i <file_name> option. This file_name points to a configuration file inside which we can specify a list of hardware counters or metrics (based on hardware counters) sampled at runtime. One cannot ask to log all the hardware counters at once due to hardware limitations.

Here is such a configuration file with often relevant metrics:

pmc : VALUUtilization VALUBusy L2CacheHit LDSBankConflict ALUStalledByLDS

Where the metrics signify:

VALUUtilization - The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).
VALUBusy - The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
L2CacheHit - The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).
LDSBankConflict - The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
ALUStalledByLDS - The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).
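
Assuming the counter list above is saved in a file named rocprof_counters.txt (a placeholder name), it can be passed to rocPROF like so:

$ rocprof -i rocprof_counters.txt --timestamp on <executable> <arguments>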

More information can be found on the rocPROF GitHub page and by reading the documentation on the available metrics (also see $ rocprof --list-basic and $ rocprof --list-derived).

Note

A detailed list of available hardware counters and their respective meaning is available here: https://rocm.docs.amd.com/en/latest/conceptual/gpu-arch/mi200-performance-counters.html

Note

One can choose to restrict the set of kernels on which data should be collected by adding the line: kernel: <kernel_symbol_00>, where kernel_symbol_00 is the kernel name (you can get it from a rocPROF .stats.csv file if it is unknown due to, say, OpenMP name mangling). You can restrict the set to more than one kernel.

GPU roofline

The roofline model enables the developer to quickly assert how far from the theoretical peak performance a specific piece of code is.

Effectively, it plots the floating point throughput (Flop/s) as a function of the arithmetic intensity of kernels. The arithmetic intensity represents the amount of Flop per unit of data (Flop/octet) loaded from, typically, cache or main memory.

Then, one derives a theoretical peak performance based on the capabilities of the hardware. This upper bound on performance has the shape of a roofline, thus the name.

In practice we have multiple rooflines: one for Binary64 (double), one for Binary32 (float), etc.

In the case of the MI250X cards and as an example, we can derive, for a GCD, a Binary64 roofline based on the peak theoretical memory throughput \(\textrm{B} = 1.6\) Tio/s and a peak Flop/s \(\textrm{F} = 23.9\) TFlop/s. The memory throughput, like the peak Flop/s, can also be experimentally derived using benchmarks.

\[\begin{split}{\displaystyle \textrm{P} = \min {\begin{cases} \textrm{F} \\ \textrm{B} \times \textrm{I} \end{cases}}}\end{split}\]

Plotting \(\textrm{P}\) as a function of \(\textrm{I}\) gives:

../../_images/rocprof_roofline_0.png

Then using rocPROF and specific hardware counters, we can obtain, for each kernel, the arithmetic intensity and its associated floating point throughput. Under the assumption that one does not mix floating point precision inside a given kernel, we provide the following rocPROF configuration files for Binary64 (double):

pmc : TCC_EA_RDREQ_32B_sum TCC_EA_RDREQ_sum TCC_EA_WRREQ_sum TCC_EA_WRREQ_64B_sum SQ_INSTS_VALU_ADD_F64 SQ_INSTS_VALU_MUL_F64 SQ_INSTS_VALU_FMA_F64 SQ_INSTS_VALU_TRANS_F64 SQ_INSTS_VALU_MFMA_MOPS_F64

For Binary32 (float) we would have:

pmc : TCC_EA_RDREQ_32B_sum TCC_EA_RDREQ_sum TCC_EA_WRREQ_sum TCC_EA_WRREQ_64B_sum SQ_INSTS_VALU_ADD_F32 SQ_INSTS_VALU_MUL_F32 SQ_INSTS_VALU_FMA_F32 SQ_INSTS_VALU_TRANS_F32 SQ_INSTS_VALU_MFMA_MOPS_F32

And for Binary16 (half) and Brain16 we would have:

pmc : TCC_EA_RDREQ_32B_sum TCC_EA_RDREQ_sum TCC_EA_WRREQ_sum TCC_EA_WRREQ_64B_sum SQ_INSTS_VALU_ADD_F16 SQ_INSTS_VALU_MUL_F16 SQ_INSTS_VALU_FMA_F16 SQ_INSTS_VALU_TRANS_F16 SQ_INSTS_VALU_MFMA_MOPS_F16 SQ_INSTS_VALU_MFMA_MOPS_BF16

The data gathered using rocPROF and the counters above needs to be further processed to obtain the kernel’s arithmetic intensity \(\textrm{I}\) and the Flop/s \(\textrm{F}\).

\[\textrm{Load} = 32 \times \textrm{TCC_EA_RDREQ_32B_sum} + 64 \times (\textrm{TCC_EA_RDREQ_sum} - \textrm{TCC_EA_RDREQ_32B_sum})\]
\[\textrm{Store} = 64 \times \textrm{TCC_EA_WRREQ_64B_sum} + 32 \times (\textrm{TCC_EA_WRREQ_sum} - \textrm{TCC_EA_WRREQ_64B_sum})\]
\[\textrm{MFMA} = 256 \times \textrm{SQ_INSTS_VALU_MFMA_MOPS_F64}\]
\[\textrm{Scalar} = 64 \times (\textrm{SQ_INSTS_VALU_ADD_F64} + \textrm{SQ_INSTS_VALU_MUL_F64} + 2 \times \textrm{SQ_INSTS_VALU_FMA_F64} + \textrm{SQ_INSTS_VALU_TRANS_F64})\]
\[\textrm{Flop} = \textrm{MFMA} + \textrm{Scalar}\]
\[\textrm{I} = \textrm{Flop} / (\textrm{Load} + \textrm{Store})\]
\[\textrm{F} = \textrm{Flop} / (\textrm{EndNs} - \textrm{BeginNs})\]

If one seeks to analyze the Binary32 data, only the \(\textrm{Flop}\) computation needs to change (use the correct counters by replacing 64 by 32). Note that an MFMA operation represents 256 floating point operations for 32- and 64-bit floats, and 1024 operations for 16-bit floats. Also, for 16-bit floats, there is SQ_INSTS_VALU_MFMA_MOPS_BF16 to take into account in addition to SQ_INSTS_VALU_MFMA_MOPS_F16.

You can then plot the points for each kernel, giving:

../../_images/rocprof_roofline_1.png

Note

You could also use the LDS, L1 or L2 cache throughput to build the roofline, but you need to replace the HBM usage counters with the appropriate LDS, L1 or L2 counters. Do not compare a measured HBM throughput and AI to a roofline other than the HBM one.

rocPROFv2

rocprofv2 is a wrapper around rocprof. It was conceived to simplify trace generation. It can be used like so:

$ rocprofv2 --[hip-trace|hsa-trace|roctx-trace|kernel-trace|sys-trace] <executable>

Python

Virtual environment

To avoid sullying your home directory with unrelated Python packages, you can make use of Python’s virtual environments (VEnv). Here is a simple script showing you how to do that properly on Adastra:

#!/bin/bash

# enable only if you do not source this script.
# set -eu

module purge

module load cpe/23.12
module load cray-python

module list

python3 -m pip install --user --upgrade pip
pip3 install --user --upgrade virtualenv
python3 -m virtualenv ./python_environment
chmod +x ./python_environment/bin/activate
source ./python_environment/bin/activate
python3 -m pip install --upgrade pip

Then to activate the virtual environment, do:

$ source ./python_environment/bin/activate

When the virtual environment is activated you can freely install packages, for instance:

$ pip3 install transformers datasets

Note

To deactivate the virtual environment, simply enter deactivate in the command line.

Warning

Be careful with the cray-python module: it defines the ${PYTHONPATH}/${PYTHON_PATH} variables, which change some VEnv library paths. Generally, once the VEnv is set up, you DO NOT want to load the module anymore.

Trivia

ClangFormat

ClangFormat is a tool that enables automatic source code formatting. It is based on Clang (parsing).
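
For example, to reformat a file in place (the path is a placeholder; the style is taken from a .clang-format file if present):

$ clang-format -i ./src/example.cc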

ClangTidy

ClangTidy is a clang-based C++ “linter” tool. Its purpose is to provide an extensible framework for diagnosing and fixing typical programming errors, like style violations, interface misuse, or bugs that can be deduced via static analysis. clang-tidy is modular and provides a convenient interface for writing new checks.
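
For example, it can be run using the compilation database that CMake generates with -DCMAKE_EXPORT_COMPILE_COMMANDS=ON (paths are placeholders):

$ clang-tidy -p ./build ./src/example.cc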

lfs find

This tool provides a subset of the find command. This is the recommended alternative on Lustre based filesystems as it optimizes MDS/MDT usage.

$ lfs find <a_directory> -type f -print

numactl

Control NUMA policy for processes or shared memory. Basically, it allows you to bind ranks to a set of hardware threads. This is usually buried inside binding scripts such as this one.
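
A minimal sketch binding a program (my_program is a placeholder) to the first 8 hardware threads and to the memory of NUMA node 0:

$ numactl --physcpubind=0-7 --membind=0 ./my_program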

source

Read and execute commands from a file in the current shell environment and return the exit status of the last command executed from that file. This differs from running a shell script, which executes in a child shell process.

$ source ./environment.sh

xeyes

A simple tool used to check if X11 forwarding is working. If X11 forwarding is properly set up, you should see two eyes looking at your mouse cursor.

../../_images/xeyes.PNG

watch

Execute a program periodically, showing its output fullscreen. This is useful to watch how the result of a command evolves.

$ watch -n0 -- "rocm-smi"

wget

A non-interactive network downloader. It is useful to retrieve documents over a network and supports many protocols (including HTTP and FTP).

$ wget "https://github.com/ROCm/rccl/archive/refs/tags/rocm-6.0.0.zip"

Note

Put quotes around the URL to avoid issues with special characters being interpreted by the shell.