Tools
CINES Spack modules
CINES provides tools built using Spack. To access these products, check this document or look at the product catalog.
Building software
CMake
CMake is a meta-build system: it is used to generate build system configuration files (Makefiles, Ninja configuration files, etc.). Great documentation on modern CMake is offered by Henry Schreiner.
CINES provides the latest CMake release available in the OS's repositories. To get the very latest release, you can build CMake yourself or rely on pip3:
$ pip3 install --user --upgrade cmake
$ cmake --version
3.28.1 # The latest release as of 2024/01/01
Note
Using pip this way assumes your .bashrc or .bash_profile correctly defines your PATH to include ~/.local/bin.
Note
CMake < 3.28 recognized CrayClang as a Clang compiler (which it is based on). With CMake >= 3.28, it is recognized as CrayClang.
CMake examples
CMake + OpenMP
Assuming you have the following Genoa environment (which is very typical):
$ module purge
$ module load cpe/23.12
$ module load craype-x86-genoa
$ module load PrgEnv-cray
Assuming the following example CMake script:
cmake_minimum_required(VERSION 3.1)
project(example VERSION 1.0 LANGUAGES CXX)
add_executable(example_binary example.cc)
find_package(OpenMP 4.0 REQUIRED)
target_link_libraries(example_binary PRIVATE OpenMP::OpenMP_CXX)
Use the following commands (do not forget the CMAKE_CXX_COMPILER):
$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC ..
When using the Cray wrappers with craype-accel-amd-gfx90a, CMake is known to have issues with the special linker of Cray's GPU offload implementation (cce_omp_offload_linker). This causes link-time issues or false positives regarding GPU execution.
Example linker issue:
FAILED: example_binary
: && /opt/cray/pe/craype/2.7.30/bin/CC -dynamic CMakeFiles/example_binary.dir/example.cc.o -o example_binary && :
ld.lld: error: undefined symbol: _cray$mt_kmpc_fork_call_with_flags
>>> referenced by example.cc
>>> CMakeFiles/example_binary.dir/example.cc.o:(main)
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
We propose a workaround, assuming you have the following MI250 environment (which is very typical):
$ module purge
$ module load cpe/23.12
$ module load craype-x86-trento craype-accel-amd-gfx90a # <- note the craype-accel-amd-gfx90a
$ module load PrgEnv-cray
Now, at least two situations arise: either you want OpenMP offloading (#pragma omp target), or you do not. In the second situation, you may have loaded craype-accel-amd-gfx90a so that the Cray wrappers link correctly with a library like MPI or LibSci, but you do not need OpenMP offloading.
If you are in the first situation, use CMake like this (force the -fopenmp flag at link time):
$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC -DCMAKE_EXE_LINKER_FLAGS="-fopenmp" ..
If you are in the second situation:
$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC -DCMAKE_LINKER="${CC_X86_64}/bin/cce_omp_offload_linker" ..
CMake + Cray MPICH
Assuming you have the following environment (which is very typical):
$ module purge
$ module load cpe/23.12
$ module load craype-x86-trento craype-accel-amd-gfx90a
$ module load PrgEnv-cray
Assuming the following example CMake script:
cmake_minimum_required(VERSION 3.1)
project(example VERSION 1.0 LANGUAGES CXX)
add_executable(example_binary example.cc)
find_package(MPI REQUIRED)
target_link_libraries(example_binary PRIVATE MPI::MPI_CXX)
Use the following commands (do not forget the CMAKE_CXX_COMPILER):
$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC ..
Had you not used the Cray wrappers (here the raw Cray C++ compiler crayCC), the following commands would have been an alternative:
$ mkdir build && cd build
$ cmake -DCMAKE_CXX_COMPILER=crayCC -DCMAKE_CXX_FLAGS="-isystem ${CRAY_MPICH_PREFIX}/include" -DCMAKE_EXE_LINKER_FLAGS="-L${CRAY_MPICH_PREFIX}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}" ..
Note
The ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a} flags are only needed if you target GPUs.
CMake + Cray HDF5 + Cray wrappers
Assuming you have the following environment (which is very typical):
$ module purge
$ module load cpe/23.12
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ module load PrgEnv-amd
$ module load cray-hdf5
Assuming the following example CMake script:
cmake_minimum_required(VERSION 3.1)
project(example VERSION 1.0 LANGUAGES CXX)
add_executable(example_binary example.cc)
# This variable does not affect the build when using the Cray wrappers!
# set(HDF5_PREFER_PARALLEL TRUE)
find_package(HDF5 REQUIRED COMPONENTS C)
target_link_libraries(example_binary PRIVATE ${HDF5_LIBRARIES})
Use the following commands (do not forget the CMAKE_CXX_COMPILER):
$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC ..
Note
The process is the same for cray-hdf5-parallel; just swap the HDF5 modules.
CMake + Cray HDF5 + raw compilers
Note
There is an issue with the HDF5 compiler wrappers (h5cc, h5c++, h5fc, h5pcc, h5pfc, etc.) provided by Cray. Make sure you work around the issue as described below.
Assuming you have the following environment (which is very typical):
$ module purge
$ module load cpe/23.12
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ module load PrgEnv-amd
$ module load cray-hdf5
You need to add the following environment variables:
# Work around an issue in FindHDF5 and Cray's HDF5 compiler wrappers.
export HDF5_CC="amdclang"
export HDF5_CLINKER="amdclang"
export HDF5_CXX="amdclang++"
export HDF5_CXXLINKER="amdclang++"
export HDF5_FC="flang"
export HDF5_FLINKER="flang"
Assuming the following example CMake script:
cmake_minimum_required(VERSION 3.1)
project(example VERSION 1.0 LANGUAGES CXX)
add_executable(example_binary example.cc)
find_package(HDF5 REQUIRED COMPONENTS C)
target_link_libraries(example_binary PRIVATE ${HDF5_LIBRARIES})
Use the following commands (do not forget the CMAKE_CXX_COMPILER):
$ mkdir -p build && cd build
$ # The -isystem is to work around an issue in FindHDF5 and Cray's HDF5 compiler
$ # wrappers.
$ cmake -DCMAKE_CXX_COMPILER=amdclang++ -DCMAKE_CXX_FLAGS="-isystem ${CRAY_HDF5_PREFIX}/include" ..
Note
The process is the same for cray-hdf5-parallel; just swap the HDF5 modules.
CMake + Kokkos HIP + OpenMP
Compiling Kokkos with a HIP backend for MI250X and an OpenMP backend for CPUs:
$ module purge
$ module load cpe/23.12
$ module load craype-x86-trento craype-accel-amd-gfx90a
$ module load PrgEnv-cray
$ module load amd-mixed/6.0.0
$ module list
$ git clone -b 4.2.00 https://github.com/kokkos/kokkos.git
$ cd kokkos
$ cmake -B build -S . -DCMAKE_CXX_COMPILER=CC -DCMAKE_CXX_STANDARD=17 -DKokkos_ENABLE_HIP=ON -DKokkos_ARCH_AMD_GFX90A=ON -DKokkos_ENABLE_OPENMP=ON
$ cmake --build build --clean-first --parallel 128
CMake + LibSci
Assuming you have the following environment (which is very typical):
$ module purge
$ module load cpe/23.12
$ module load craype-x86-trento craype-accel-amd-gfx90a
$ module load PrgEnv-cray
Assuming the following example CMake script:
cmake_minimum_required(VERSION 3.1)
project(example VERSION 1.0 LANGUAGES CXX)
add_executable(example_binary example.cc)
find_package(BLAS REQUIRED)
target_link_libraries(example_binary PRIVATE BLAS::BLAS)
Use the following commands (do not forget the CMAKE_CXX_COMPILER):
$ mkdir -p build && cd build
$ cmake -DCMAKE_CXX_COMPILER=CC ..
Had you not used the Cray wrappers (here the raw Cray C++ compiler crayCC), the following commands would have been an alternative:
$ mkdir build && cd build
$ cmake -DCMAKE_CXX_COMPILER=crayCC -DCMAKE_CXX_FLAGS="-isystem ${CRAY_PE_LIBSCI_PREFIX}/include" -DBLAS_LIBRARIES="${CRAY_PE_LIBSCI_PREFIX}/lib/libsci_cray.so;" ..
Ninja
Ninja is yet another build system. It takes as input the interdependencies of files (typically source code and output executables) and orchestrates building them, quickly. It plays a similar role to GNU Make but without most of the clutter: it is simpler, faster, and more reliable. You should not write Ninja configuration files by hand like you may do for Makefiles; instead, rely on a meta-build system such as CMake, Meson, etc.
To install Ninja on any machine (assuming you have pip3 installed):
$ pip3 install --user --upgrade ninja
To ask CMake to use Ninja:
$ cmake -GNinja <your_other_flags>
Conda
CINES proposes a conda module:
$ module load conda
Note
CINES uses a package cache.
Debugging
GDB
GDB is the de facto system programming debugger. It requires quite a lot of skill to get anything out of it on real-world problems (notably if you do not have access to the source). If you plan to use it, make sure that, regardless of the optimization level you specify to your compiler, you ship the debug info too. This is generally done using -ggdb on GNU and Clang based compilers.
Note
If you are a novice, you may be better off using ChatGDB with a prompt such as: What is the root cause of this error, given the following source code context, and how would you fix it? In your response, always refer only to specific lines and filenames of the source code.
Valgrind
Valgrind is a suite of tools for debugging and profiling programs. It is mostly used to profile cache behavior (by emulation) and check for memory leaks.
To check for memory leaks:
$ valgrind --leak-check=full -- ./my_program
Sanitizers
Compiler toolchains such as GNU or LLVM provide sanitizers (advanced runtime diagnostics) for the C and C++ compilers. On Clang or GCC based compilers, you would use -fsanitize=address to check for memory leaks. There are more sanitizers available for undefined behavior, race conditions, or uninitialized memory usage. Check this document for more details.
ldd and nm
ldd prints shared library dependencies. It can be used to check whether all shared object dependencies are satisfied. ldd is recursive and does not stop at the first level of the dependency tree.
nm lists symbols from object files. This is useful to diagnose undefined reference issues. You have to demangle C++ symbols; for that, you can pipe nm into the c++filt command.
gstack
Prints a stack trace of a running process. It can be used in conjunction with watch to observe how the stack evolves:
watch -n0 -- "gstack <pid>"
Monitoring
htop
An interactive process viewer. This is useful to monitor how your program uses the machine.
Warning
When you launch this tool for the first time, you might be surprised by the layout due to the huge number of cores on Adastra's login and Genoa nodes. Some layout tuning is required to get a proper reading. Check the configuration below to work around the issue.
Example configuration to place into ~/.config/htop/htoprc
# Beware! This file is rewritten by htop when settings are changed in the interface.
# The parser is also very primitive, and not human-friendly.
fields=0 48 17 114 18 38 39 40 2 113 46 47 49 1
sort_key=46
sort_direction=1
tree_sort_key=49
tree_sort_direction=1
hide_kernel_threads=1
hide_userland_threads=1
shadow_other_users=0
show_thread_names=0
show_program_path=0
highlight_base_name=1
highlight_megabytes=1
highlight_threads=1
highlight_changes=0
highlight_changes_delay_secs=5
find_comm_in_cmdline=1
strip_exe_from_cmdline=1
show_merged_command=0
tree_view=1
tree_view_always_by_pid=0
header_margin=1
detailed_cpu_time=1
cpu_count_from_one=0
show_cpu_usage=1
show_cpu_frequency=0
update_process_names=0
account_guest_in_cpu_meter=0
color_scheme=0
enable_mouse=1
delay=15
left_meters=LeftCPUs8 Memory Swap NetworkIO DiskIO
left_meter_modes=1 1 1 2 2
right_meters=RightCPUs8 Hostname Tasks LoadAverage Uptime SELinux
right_meter_modes=1 2 2 2 2 2
hide_function_bar=0
rocm-smi
Note
It is somewhat equivalent to NVIDIA's nvidia-smi.
One can load the amd-mixed module and use the rocm-smi command. A typical workflow would be to start a job using, say, sbatch, then do an squeue --me to obtain the list of nodes associated with the job. Select a node from the list and connect to it using ssh <node_hostname>. Then do module load amd-mixed and watch -n0 -- rocm-smi.
It will give you the following output:
$ watch -- rocm-smi
Every 0.1s: rocm-smi <node>: <date>
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 63.0c 444.0W 1700Mhz 1600Mhz 0% auto 560.0W 30% 100%
1 58.0c N/A 1700Mhz 1600Mhz 0% auto 0.0W 30% 100%
2 63.0c 465.0W 1700Mhz 1600Mhz 0% auto 560.0W 30% 100%
3 69.0c N/A 1700Mhz 1600Mhz 0% auto 0.0W 30% 100%
4 58.0c 443.0W 1700Mhz 1600Mhz 0% auto 560.0W 30% 100%
5 57.0c N/A 1700Mhz 1600Mhz 0% auto 0.0W 30% 100%
6 68.0c 460.0W 1700Mhz 1600Mhz 0% auto 560.0W 30% 100%
7 64.0c N/A 1700Mhz 1600Mhz 0% auto 0.0W 30% 99%
================================================================================
============================= End of ROCm SMI Log ==============================
In addition, it can be a trove of information regarding the hardware topology (bus, PCIe, etc.).
Profiling
Intel VTune & Advisor
CINES provides both tools under the following paths:
/opt/software/intel/oneapi/advisor/latest/bin64/advisor-gui
/opt/software/intel/oneapi/vtune/latest/bin64/vtune-gui
Their UI can be used via X11 forwarding.
Cray-PAT’s perftools
Introducing Cray-PAT can be done in multiple ways. First, ensure the perftools-base module is loaded, then load one of the perftools-lite* modules. They exist in several flavors:
perftools-lite
: Gives execution time, memory peak, most time-consuming functions, I/O read and write rates, and total energy consumed.
perftools-lite-events
: Gives the time-consuming functions, MPI and OpenMP (CPU) information, as well as node observations and possible rank order suggestions.
perftools-lite-gpu
: Gives kernel execution times, synchronization time, as well as data movement information (time and size of transfers).
perftools-lite-loops
: Gives loop trip counts and execution times. The results of this profile can be used with Reveal. More on that later.
Note
Cray provides some documentation on using these tools.
When the modules given above are loaded, recompile your application using the Cray wrappers. After executing the test case, a folder is created in the working directory in use when you launched the application. At the end of the execution, perftools also prints partial results to the standard output. A sample of such output could look like this:
#################################################################
# #
# CrayPat-lite Performance Statistics #
# #
#################################################################
CrayPat/X: Version 22.04.0 Revision 044b9fa85 03/10/22 21:59:07
Experiment: lite lite-samples
Number of PEs (MPI ranks): 1
Numbers of PEs per Node: 1
Numbers of Threads per PE: 1
Number of Cores per Socket: 64
Accelerator Model: AMD MI100 Memory: 32.00 GB Frequency: 1.00 GHz
Execution start time: Tue May 31 10:23:14 2022
System name and speed: g1003 2.304 GHz (nominal)
AMD Trento CPU Family: 25 Model: 48 Stepping: 1
Core Performance Boost: 1 PE has CPB capability
Avg Process Time: 57.62 secs
High Memory: 2,598.3 MiBytes 2,598.3 MiBytes per PE
I/O Write Rate: 108.928487 MiBytes/sec
Notes for table 1:
This table shows functions that have significant exclusive sample
hits, averaged across ranks.
For further explanation, see the "General table notes" below,
or use: pat_report -v -O samp_profile ...
Table 1: Profile by Function
Samp% | Samp | Imb. | Imb. | Group
| | Samp | Samp% | Function=[MAX10]
100.0% | 2,234.0 | -- | -- | Total
|---------------------------------------------------------
| 67.0% | 1,497.0 | -- | -- | ETC
||--------------------------------------------------------
|| 36.5% | 815.0 | -- | -- | __sci_sgemv_
|| 16.0% | 357.0 | -- | -- | openblas_sgemv_t_naples
|| 8.6% | 192.0 | -- | -- | sgemv_kernel_4x2
|| 3.4% | 76.0 | -- | -- | cblas_sgemv
|| 1.9% | 42.0 | -- | -- | sgemv_kernel_4x1
||========================================================
| 32.9% | 735.0 | -- | -- | USER
||--------------------------------------------------------
|| 26.3% | 588.0 | -- | -- | main
|| 2.3% | 52.0 | -- | -- | matrix
|| 2.2% | 49.0 | -- | -- | WriteHdf5MeshFile2D
|| 1.2% | 27.0 | -- | -- | Compute_Epsi
|=========================================================
You can observe in which functions the time is spent. The time presented here, in percentages and samples, is exclusive, meaning that, if you imagine the history of the call stack as a rooted tree (in the graph theory sense), the time spent in a function (a vertex) does not account for the time spent in its callees (children vertices).
In the example shown above, most of the time is spent in __sci_sgemv_, a BLAS routine computing a matrix-vector operation.
Instrumenting an already existing executable binary
Cray offers the pat_build command that can be used to instrument already compiled codes. Still, one must make sure that the perftools-base and perftools modules were loaded during compilation. Then, to instrument the binary, one can do the following:
$ pat_build -g <trace_group> -f <path_to_original_binary> -o <path_to_instrumented_binary>
You are to specify what you would like to profile using a trace-group. The values it can take are given below (you can use multiple, separated by commas):
adios2
Adaptable Input Output System Version 2
aio
Functions that perform asynchronous I/O.
blacs
Basic Linear Algebra communication subprograms
blas
Basic Linear Algebra subprograms
caf
Co-Array Fortran (CCE compiler only)
charm++
Charm++ independent parallel programming system (deprecated)
comex
Communications Runtime for Extreme Scale
converse
Charm++ Converse low-level message passing API (deprecated)
craymem
Cray memory allocation with attributes
cuda
NVidia Compute Unified Device Architecture runtime and driver API
curl
Multi-protocol file transfer API
dl
functions that manage dynamic linking
dmapp
Distributed Memory Application API
fabric
Open network communication services API
ffio
functions that perform Flexible File I/O (CCE compiler only)
fftw
Fast Fourier Transform library (32- and 64-bit only)
ga
Global Arrays API
gmp
GNU MultiPrecision Arithmetic Library
gni
Generic Network Interface API
hbw
High Bandwidth Memory API
hdf5
Hierarchical Data Format library
heap
dynamic heap
hip
AMD Heterogeneous-compute Interface for Portability runtime API
hsa
AMD Heterogeneous System Architecture API
huge
Linux huge pages
io
functions and system calls that perform I/O
jemalloc
heap allocation emphasizing fragmentation avoidance and scalable concurrency
lapack
Linear Algebra Package
lustre
Lustre User API
math
POSIX.1 math functions
memkind
User extensible heap manager API
memory
memory management operations
mpfr
GNU MultiPrecision Floating-Point Library
mpi
Message Passing Interface library
netcdf
Network Common Data Form
numa
Non-uniform Memory Access API (see numa(3))
oacc
OpenAccelerator API
omp
OpenMP API
opencl
Open Computing Language API
pblas
Parallel Basic Linear Algebra Subroutines
petsc
Portable Extensible Toolkit for Scientific Computation. Supported for "real" computations only.
pgas
Parallel Global Address Space
pnetcdf
Parallel Network Common Data Form
pthreads
POSIX threads
pthreads_mutex
POSIX threads concurrent process control
pthreads_spin
POSIX threads low-level synchronization control
realtime
POSIX realtime extensions
scalapack
Scalable LAPACK
shmem
One-sided Remote Direct Memory Access Parallel-Processing Interface library
signal
POSIX signal handling and control
spawn
POSIX realtime process creation
stdio
all library functions that accept or return the FILE* construct
string
String operations
syscall
system calls
sysfs
system calls that perform miscellaneous file management
sysio
system calls that perform I/O
umpire
Heterogeneous Memory Resources Management Library
upc
Unified Parallel C (CCE compiler only)
xpmem
cross-process memory mapping
zmq
High-performance asynchronous messaging API
For CPU profiling, a typical set of trace-groups would be mpi,syscall,io,omp, and for AMD GPUs: mpi,syscall,io,hip.
When the instrumented_binary is produced, you can use it similarly to the original binary. Profiling results will be output in a results_path of the form instrumented_binary+<PID>_<timestamp>s. The folder is placed in the working directory of the executable.
You can finally use pat_report -v <results_path> and, optionally, specify an option to orient what to report: pat_report -v -O <report_option> <results_path>. The report options are given below. Note that not all report options are available for a given profile: you have to record the appropriate information first to be able to report it. Do not expect MPI information from an OpenMP profile.
accelerator
Show calltree of accelerator performance data sorted by host time.
accpc
Show accelerator performance counters.
acc_fu
Show accelerator performance data sorted by host time.
acc_time_fu
Show accelerator performance data sorted by accelerator time.
acc_time
Show calltree of accelerator performance data sorted by accelerator time.
acc_show_by_ct
(Deferred implementation) Show accelerator performance data sorted alphabetically.
affinity
Shows affinity bitmask for each node. Can use -s pe=ALL and -s th=ALL to see affinity for each process and thread, and use -s filter_input=expression to limit the number of PEs shown.
profile
Show data by function name only
callers (or ca)
Show function callers (bottom-up view)
calltree (or ct)
Show calltree (top-down view)
ca+src
Show line numbers in callers
ct+src
Show line numbers in calltree
hbm_ct
Show memory bandwidth data by object, sorted by sample count.
hbm_details
Show hbm data collection statistics, including counts of sampled addresses that could not be mapped to a registered object.
hbm_frees
Show program locations at which objects are freed by explicit calls to free or delete.
hbm_wt
Show memory bandwidth data by object, sorted by aggregate sample weight. The weight estimates the benefit of allocating the object in high bandwidth memory.
heap
Implies heap_program, heap_hiwater, and heap_leaks. Instrumented executables must be built using the pat_build -g heap option or executed with the pat_run -g heap option in order to show heap_hiwater and heap_leaks information.
heap_program
Compare heap usage at the start and end of the program, showing heap space used and free at the start, and unfreed space and fragmentation at the end.
heap_hiwater
If the pat_build -g heap option was used to instrument the program or the program was executed with the pat_run -g heap option, this report option shows the heap usage "high water" mark, the total number of allocations and frees, and the number and total size of objects allocated but not freed between the start and end of the program.
heap_leaks
If the pat_build -g heap option was used to instrument the program or the program was executed with the pat_run -g heap option, this report option shows the largest unfreed objects by call site of allocation and PE number.
himem
Memory high water mark by Numa Node. For nodes with multiple sockets, or nodes with Intel KNL processors, the default report should also have a table showing high water usage by numa node. That table is not shown if all memory was mapped to numa node 0, but can be explicitly requested with pat_report -O himem.
acc_kern_stats
Show kernel-level statistics including average kernel grid size, average block size, and average amount of shared memory dynamically allocated for the kernel.
load_balance
Implies load_balance_program, load_balance_group, and load_balance_function. Show PEs with maximum, minimum, and median times.
load_balance_program, load_balance_group, load_balance_function
For the whole program, groups, or functions, respectively, show the imb_time (difference between maximum and average time across PEs) in seconds and the imb_time% (imb_time/max_time * NumPEs/(NumPEs - 1)). For example, an imbalance of 100% for a function means that only one PE spent time in that function.
load_balance_cm
If the pat_build -g mpi option was used to instrument the program or the program was executed with the pat_run -g mpi option, this report option shows the load balance by group with collective-message statistics.
load_balance_sm
If the pat_build -g mpi option was used to instrument the program or the program was executed with the pat_run -g mpi option, this report option shows the load balance by group with sent-message statistics.
load_imbalance_thread
Shows the active time (average over PEs) for each thread number.
loop_times
Inclusive and Exclusive Time in Loops. If the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used, then this table will be included in a default report and the following additional loop reporting options are also available.
loop_callers
Loop Stats by Function and Caller. Available only if the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used.
loop_callers+src
Loop Stats by Function and Callsites. Available only if the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used.
loop_calltree
Function and Loop Calltree View. Available only if the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used.
loop_calltree+src
Function and Loop Calltree with Line Numbers. Available only if the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used.
profile_loops
Profile by Group and Function with Loops. Available only if the CCE Classic compiler -h profile_generate or CCE compiler -finstrument-loops option was used.
mcdram
Display the MCDRAM configuration for each PE with pat_report -O mcdram ...
mesh_xyz
Show the coordinates in the network mesh.
mpi_callers
Show MPI sent- and collective-message statistics
mpi_sm_callers
Show MPI sent-message statistics
mpi_coll_callers
Show MPI collective-message statistics
mpi_dest_bytes
Show MPI bin statistics as total bytes
mpi_dest_counts
Show MPI bin statistics as counts of messages
mpi_sm_rank_order
Calculate a suggested rank order based on MPI grid detection and MPI point-to-point message optimization. Uses sent-message data from tracing MPI functions to generate suggested MPI rank order information. Requires the program to be instrumented using the pat_build -g mpi option or executed with the pat_run -g mpi option.
mpi_rank_order
Calculate a rank order to balance a shared resource such as USER time over all nodes. Uses time in user functions, or alternatively, any other metric specified by using the -s mro_metric options, to generate suggested MPI rank order information.
mpi_hy_rank_order
Calculate a rank order based on a hybrid combination of mpi_sm_rank_order and mpi_rank_order.
nids
Show PE to NID mapping.
nwpc
Program network performance counter activity.
profile_nwpc
Network performance counter data by Function Group and Function. Table shown by default if NWPCs are present in the .ap2 file.
profile_pe.th
Show the imbalance over the set of all threads in the program.
profile_pe_th
Show the imbalance over PEs of maximum thread times.
profile_th_pe
For each thread, show the imbalance over PEs.
program_time
Shows which PEs took the maximum, median, and minimum time for the whole program.
read_stats, write_stats
If the pat_build -g io option was used to instrument the program or the program was executed with the pat_run -g io option, these options show the I/O statistics by filename and by PE, with maximum, median, and minimum I/O times. The -O io option is a shortcut for both read_stats and write_stats.
samp_profile+src
Show sampled data by line number with each function.
thread_times
For each thread number, show the average of all PE times and the PEs with the minimum, maximum, and median times.
Omniperf
Warning
Ensure you use a profiler from a ROCm version equal to or more recent than the ROCm used to build the binary.
Note
We provide slides given in 2024 on using this tool.
Note
It can be used to go into the nitty-gritty details of the GPU kernels. It operates at a lower level than Omnitrace.
MI Performance Profiler (Omniperf) is a system performance profiling tool for Machine Learning and HPC workloads running on AMD Instinct accelerators (MI200, MI300, etc.). It is currently built on top of rocProfiler to monitor hardware performance counters. Basically, it is a friendly interface to rocPROF.
It offers the following analysis related features:
System information;
System Speed-of-Light (SOL);
Kernel statistics;
Memory chart analysis;
Roofline analysis (supported on MI200 only, SLES 15 SP3 or RHEL8);
Command Processor (CP);
Shader Processing Input (SPI);
Wavefront launch;
Compute Unit (CU) - instruction mix;
Compute Unit (CU) - pipeline;
Local Data Share (LDS);
Instruction cache;
Scalar L1D cache;
Texture addresser and data;
Vector L1D cache;
L2 cache;
L2 cache (per-channel).
On Adastra, we provide a module (currently hidden) which you can load like so:
$ module load rocm/5.5.1
$ module load .omniperf/1.0.10
To use the tool, please refer to the official documentation: https://amdresearch.github.io/omniperf/getting_started.html#quickstart. A good video introduction to the tool is given by Cole Ramos in GPU Profiling (Performance Profile: Omniperf).
GPU roofline
Omniperf uses rocPROF to build roofline graphs for you. It can be used like so:
$ omniperf profile -n case.roofline --roof-only --device=0 -- ./test_binary
It will start the test_binary as many times as needed for all the counters to be retrieved, and will run exclusively on device 0. When the process finishes, you will find PDF files in <working_directory>/workload/case.roofline/. These documents are roofline graphs.
Note
Be careful, Omniperf will run your binary multiple times. Ensure that there is reproducibility between runs.
Omnitrace
Warning
Ensure you use a profiler from a ROCm version equal to or more recent than the ROCm used to build the binary.
Note
We provide slides given in 2024 on using this tool.
Note
Closer to a typical profiler (say, VTune or perf) but with additional AMD GPU kernel details.
MI Performance Tracer (Omnitrace) is designed for both high-level profiling and comprehensive tracing of applications running on the CPU or the CPU+GPU via dynamic binary instrumentation, call-stack sampling, and various other means for determining currently executing function and line information.
The comprehensive Omnitrace results can be visualized in any modern web browser by visiting ui.perfetto.dev and loading the Perfetto output (.proto files) produced by Omnitrace.
It offers the following analysis related features:
Dynamic instrumentation (runtime instrumentation or binary rewriting);
Statistical sampling;
Process-level sampling;
Causal profiling;
High-level summary profiles with mean/min/max/stddev statistics;
Application speedup predictions resulting from potential optimizations in functions and lines of code (causal profiling);
Critical trace generation;
Critical trace analysis;
HIP;
HSA;
Pthreads;
MPI;
Kokkos-Tools (KokkosP);
OpenMP-Tools (OMPT);
GPU hardware counters;
HIP API tracing;
HIP kernel tracing;
HSA API tracing;
HSA operation tracing;
System-level sampling (via rocm-smi);
CPU hardware counters sampling and profiles;
CPU frequency sampling;
Various timing metrics;
Various memory metrics;
Network statistics;
I/O metrics;
Third-party API support:
TAU;
LIKWID;
Caliper;
CrayPAT;
VTune;
NVTX;
ROCTX.
On Adastra, we provide a module (currently hidden) which you can load like so:
$ module load rocm/5.5.1
$ module load .omnitrace/1.10.4
To use the tool, please refer to the official documentation: https://amdresearch.github.io/omnitrace/getting_started.html.
perf
perf is the performance analysis tool for Linux. It is developed as part of the Linux kernel. This tool is simple to use, lightweight, and versatile. You can record traces of many kinds and exploit them using specific tools such as the hotspot visualizer.
Note
If you want to profile a specific rank of an MPI-based, multi-rank program, use a trivial shell wrapper to profile only if SLURM_PROCID is equal to the rank you wish to profile: if [ "${SLURM_PROCID}" == "0" ]; then exec -- perf record -- <my_program>; else exec -- <my_program>; fi.
Assuming you want to profile a program my_program
:
$ perf record -o perf.data --call-graph dwarf --event instructions,cpu-cycles,cache-misses,branches --aio --sample-cpu -- ./my_program
rocPROF
Warning
Ensure you use a profiler from a ROCm version equal to or more recent than the ROCm used to build the binary.
Note
We provide slides given in 2024
on using this tool.
rocprof
is the performance analysis tool for compute (GPGPU) workloads on AMD GPUs.
Note
When doing an initial profiling, you may benefit from using export AMD_SERIALIZE_COPY=3 AMD_SERIALIZE_KERNEL=3 GPU_MAX_HW_QUEUES=1
. This will force kernels to be serialized on one physical stream which may ease resource usage analysis of the kernels and generally gives a cleaner trace.
We recommend that you use a small script to wrap rocPROF:
#!/bin/bash
if [ "${SLURM_PROCID}" == "0" ]; then
# Here we prepare a simple GPU profiling, nothing fancy.
exec -- rocprof --stats --timestamp on -o stats_${SLURM_JOBID}-${SLURM_PROCID}.csv "${@}"
else
exec -- "${@}"
fi
This script serves two purposes: selecting which rank is to be profiled, and properly labeling the trace data written to disk using SLURM_PROCID
.
Warning
In the script above, only the first rank (0) is going to be profiled. Change the "${SLURM_PROCID}" == "XXX"
if needed.
Simple GPU profiling
This tool can be used to gather GPU kernel statistics.
For rocPROF to be available in your environment, load the correct amd-mixed
or rocm
module (if you did not use PrgEnv-amd
). We recommend the rocPROF shipped in ROCm versions equal to or more recent than 5.3.0; earlier rocPROF versions had crippling bugs.
The first step is to map the landscape. This can be done by asking rocPROF to produce simple statistics about all the kernels, like so:
$ rocprof --stats --timestamp on -o stats_${SLURM_JOBID}-${SLURM_PROCID}.csv <executable> <arguments>
This will produce a CSV output composed of the following significant files: stats_<slurm_job_id>-<slurm_mpi_rank>.csv
and stats_<slurm_job_id>-<slurm_mpi_rank>.stats.csv
. The first file lists all the information gathered for each kernel launched.
Sample of the first file (.csv
) after formatting the CSV as a table:
| Index | KernelName | gpu-id | queue-id | queue-index | pid | tid | grd | wgr | lds | scr | vgpr | sgpr | fbar | sig | obj | DispatchNs | BeginNs | EndNs | CompleteNs | DurationNs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | void thrust::hip_rocprim::__parallel_for::kernel<256u, 1u, thrust::hip_rocprim:: … [clone .kd] | 0 | 1 | 0 | 796946 | 796946 | 11796480 | 256 | 0 | 0 | 8 | 24 | 94656 | 0x0 | 0x7f1e35826300 | 1191607302876252 | 1191607302920469 | 1191607303043669 | 1191607303053971 | 123200 |
| 1 | void rocprim::detail::scan_batches_kernel<256u, 2u, 6u>(unsigned int*, unsigned int*, unsigned int) [clone .kd] | 0 | 1 | 48 | 796946 | 796946 | 16384 | 256 | 512 | 0 | 16 | 24 | 36864 | 0x0 | 0x7f1e35825bc0 | 1191607404624006 | 1191607404963922 | 1191607404970162 | 1191607404973843 | 6240 |
| 2 | __omp_offloading_30_40e0070__ZN5Field6put_toEd_l21_cce$noloop$form.kd | 0 | 0 | 4 | 796946 | 796946 | 68352 | 256 | 0 | 0 | 4 | 24 | 68224 | 0x0 | 0x7f1e2fdc8240 | 1191607515378464 | 1191607515391455 | 1191607515400735 | 1191607515402559 | 9280 |
Sample of the second file (.stats.csv
) after formatting the CSV as a table:
| Name | Calls | TotalDurationNs | AverageNs | Percentage |
|---|---|---|---|---|
| void hip::kernel::DepositCurrentDensity_2D_Order2<double, float, 128ul>(double*, … [clone .kd] | 600 | 8559183476 | 14265305 | 50.06434641750023 |
| void rocprim::detail::partition_kernel<(rocprim::detail::select_method)1, true, … [clone .kd] | 600 | 1260501300 | 2100835 | 7.372919849172466 |
| void rocprim::detail::partition_kernel<(rocprim::detail::select_method)1, true, … [clone .kd] | 600 | 1247304037 | 2078840 | 7.29572646402685 |
| __omp_offloading_30_40e035b__ZN20Interpolator2D2Order13fieldsWrapperEP11ElectroMagnR9ParticlesP9MPIPiS6_iji_l186_cce$noloop$form.kd | 600 | 1245721159 | 2076201 | 7.2864678994977865 |
| __omp_offloading_30_40e0060__ZN11PusherBorisclER9ParticlesP9MPIiiii_l57_cce$noloop$form.kd | 600 | 1040969461 | 1734949 | 6.088834974933595 |
From these two sample tables we can extract some interesting information, such as:
- the average duration of a kernel (AverageNs = sum(DurationNs of a given kernel) / Calls of a given kernel
);
- the fraction of the total kernel time spent in a given kernel (Percentage = sum(DurationNs of a given kernel) / sum(DurationNs of all kernels)
);
- the grid size, that is, the total number of work-items (grd = wgr * grp_count
);
- the workgroup size (wgr = grd / grp_count
).
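As a sanity check, the derived columns of the .stats.csv file can be recomputed from the raw Calls and TotalDurationNs columns. Here is a minimal sketch (the kernel_stats helper is hypothetical; the column names follow the sample table above):

```python
import csv
from io import StringIO

def kernel_stats(stats_csv_text):
    """Recompute the AverageNs and Percentage columns of a rocPROF
    .stats.csv from the Calls and TotalDurationNs columns."""
    rows = list(csv.DictReader(StringIO(stats_csv_text)))
    total_ns = sum(int(row["TotalDurationNs"]) for row in rows)
    result = {}
    for row in rows:
        duration_ns = int(row["TotalDurationNs"])
        result[row["Name"]] = {
            "AverageNs": duration_ns / int(row["Calls"]),  # mean duration per launch
            "Percentage": 100.0 * duration_ns / total_ns,  # share of total kernel time
        }
    return result
```

For a real run, read the stats_<slurm_job_id>-<slurm_mpi_rank>.stats.csv file content and pass it to the helper.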
).Note
The rocPROF overhead is not very high (~ 5-10%) when only asking for basic statistics (--stats
) but the amount of data can grow very fast!
Note
rocPROF gathers data on every kernel launched on a given GPU, that is, your kernels or others' kernels. Make sure only your code is executing on the GPU, else you will get noisy reports.
Precise GPU profiling
This topic can become very advanced quite quickly; here we only present a way to gather information, not how to interpret the results, which, depending on your load, could differ. rocPROF can be fed a -i <file_name>
option. This file_name
points to a configuration file inside which we can specify a list of hardware counters or metrics (based on hardware counters) sampled at runtime. One cannot ask to log all the hardware counters at once due to hardware limitations.
Here is such a configuration file with often relevant metrics:
pmc : VALUUtilization VALUBusy L2CacheHit LDSBankConflict ALUStalledByLDS
Where the metrics signify:
- VALUUtilization: the percentage of active vector ALU threads in a wave (a low value may indicate thread divergence);
- VALUBusy: the percentage of GPU time during which vector ALU instructions are processed;
- L2CacheHit: the percentage of fetch, write and atomic operations that hit the L2 cache;
- LDSBankConflict: the percentage of GPU time during which the LDS is stalled by bank conflicts;
- ALUStalledByLDS: the percentage of GPU time during which the ALU units are stalled by the LDS.
More information can be found on the rocPROF GitHub page and by reading the documentation on the available metrics (also see $ rocprof --list-basic
and $ rocprof --list-derived
).
Note
A detailed list of available hardware counters and their respective meaning is available here: https://rocm.docs.amd.com/en/latest/conceptual/gpu-arch/mi200-performance-counters.html
Note
One can choose to restrict the set of kernels on which data should be collected by adding the line kernel: <kernel_symbol_00>
, where kernel_symbol_00
is the kernel name (you can get it from a rocPROF .stats.csv
file if unknown due to, say, OpenMP name mangling). You can restrict the set to more than one kernel.
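For instance, a hypothetical input file combining a counter set with a kernel filter could look as follows (kernel_symbol_00 and kernel_symbol_01 are placeholder names):

```text
pmc : VALUUtilization VALUBusy L2CacheHit
kernel: kernel_symbol_00 kernel_symbol_01
```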
GPU roofline
The roofline model enables the developer to quickly assess how far from the theoretical peak performance a specific piece of code is.
Effectively, it plots the floating point throughput (Flop/s) of kernels as a function of their arithmetic intensity. The arithmetic intensity represents the amount of Flop per byte of data loaded from, typically, cache or main memory (Flop/byte).
Then, one derives a theoretical upper bound on performance from the capacities of the hardware. Plotted, this bound has the shape of a roofline, hence the name.
In practice we have multiple rooflines, one for Binary64 (double), one for Binary32 (float), etc.
In the case of the MI250X cards and as an example, we can derive, for a GCD, a Binary64 roofline based on the peak theoretical memory throughput \(\textrm{B} = 1.6\) TB/s and the peak compute throughput \(\textrm{F} = 23.9\) TFlop/s. The memory throughput, like the peak Flop/s, can also be derived experimentally using benchmarks.
The attainable performance is then \(\textrm{P} = \min(\textrm{F}, \textrm{B} \times \textrm{I})\), where \(\textrm{I}\) is the arithmetic intensity. Plotting \(\textrm{P}\) as a function of \(\textrm{I}\) gives:
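The roofline bound \(\textrm{P} = \min(\textrm{F}, \textrm{B} \times \textrm{I})\) can be sketched with the GCD figures above (units converted to byte/s and Flop/s; the fp64_roofline helper name is ours):

```python
def fp64_roofline(intensity, bandwidth=1.6e12, peak_flops=23.9e12):
    """Attainable Binary64 performance (Flop/s) for a given arithmetic
    intensity (Flop/byte), for one MI250X GCD.

    Below the ridge point (peak_flops / bandwidth, about 14.9 Flop/byte)
    a kernel is memory bound; above it, compute bound.
    """
    return min(peak_flops, bandwidth * intensity)
```

Feeding it a range of intensities reproduces the characteristic slanted-then-flat roofline shape.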
Then, using rocPROF and specific hardware counters, we can obtain, for each kernel, the arithmetic intensity and its associated floating point throughput. Under the assumption that one does not mix floating point precisions inside a given kernel, we provide the following rocPROF configuration file for Binary64 (double):
pmc : TCC_EA_RDREQ_32B_sum TCC_EA_RDREQ_sum TCC_EA_WRREQ_sum TCC_EA_WRREQ_64B_sum SQ_INSTS_VALU_ADD_F64 SQ_INSTS_VALU_MUL_F64 SQ_INSTS_VALU_FMA_F64 SQ_INSTS_VALU_TRANS_F64 SQ_INSTS_VALU_MFMA_MOPS_F64
For Binary32 (float) we would have:
pmc : TCC_EA_RDREQ_32B_sum TCC_EA_RDREQ_sum TCC_EA_WRREQ_sum TCC_EA_WRREQ_64B_sum SQ_INSTS_VALU_ADD_F32 SQ_INSTS_VALU_MUL_F32 SQ_INSTS_VALU_FMA_F32 SQ_INSTS_VALU_TRANS_F32 SQ_INSTS_VALU_MFMA_MOPS_F32
And for Binary16 (half) and Brain16 we would have:
pmc : TCC_EA_RDREQ_32B_sum TCC_EA_RDREQ_sum TCC_EA_WRREQ_sum TCC_EA_WRREQ_64B_sum SQ_INSTS_VALU_ADD_F16 SQ_INSTS_VALU_MUL_F16 SQ_INSTS_VALU_FMA_F16 SQ_INSTS_VALU_TRANS_F16 SQ_INSTS_VALU_MFMA_MOPS_F16 SQ_INSTS_VALU_MFMA_MOPS_BF16
The data gathered using rocPROF and the counters above needs to be further processed to obtain each kernel's arithmetic intensity \(\textrm{I}\) and floating point throughput \(\textrm{F}\).
If one seeks to analyze the Binary32 data, only the \(\textrm{Flop}\) computation needs to change (use the correct counters by replacing 64
with 32
). Note that each MFMA operation represents 256 floating point operations for 32 and 64 bit floats, and 1024 operations for 16 bit floats. Also, for 16 bit floats, there is SQ_INSTS_VALU_MFMA_MOPS_BF16
to take into account in addition to SQ_INSTS_VALU_MFMA_MOPS_F16
.
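To make the processing concrete, here is a minimal Binary64 sketch. The 256 Flop per MFMA event follows the text; the per-instruction Flop weights (64 lanes per wavefront, 2 Flop per FMA) and the byte-per-request mapping (32 bytes for 32B read requests and 64 for the others, 64 bytes for 64B write requests and 32 for the others), as well as the helper name, are our assumptions:

```python
def fp64_roofline_point(c, duration_ns):
    """Arithmetic intensity I (Flop/byte) and throughput F (Flop/s) for
    one kernel, from the raw counters of the Binary64 pmc line above."""
    # VALU instructions are wavefront-wide (64 lanes); an FMA is 2 Flop
    # per lane; each MFMA_MOPS event counts as 256 Flop.
    flop = 64 * (c["SQ_INSTS_VALU_ADD_F64"]
                 + c["SQ_INSTS_VALU_MUL_F64"]
                 + 2 * c["SQ_INSTS_VALU_FMA_F64"]
                 + c["SQ_INSTS_VALU_TRANS_F64"]) \
           + 256 * c["SQ_INSTS_VALU_MFMA_MOPS_F64"]
    # HBM traffic: split read/write requests by their request size.
    read_bytes = (32 * c["TCC_EA_RDREQ_32B_sum"]
                  + 64 * (c["TCC_EA_RDREQ_sum"] - c["TCC_EA_RDREQ_32B_sum"]))
    write_bytes = (64 * c["TCC_EA_WRREQ_64B_sum"]
                   + 32 * (c["TCC_EA_WRREQ_sum"] - c["TCC_EA_WRREQ_64B_sum"]))
    intensity = flop / (read_bytes + write_bytes)
    flops = flop / (duration_ns * 1e-9)
    return intensity, flops
```

The counter dictionary c and the kernel duration come from the rocPROF CSV output; each kernel yields one (I, F) point to place under the roofline.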
You can then plot the points for each kernel, giving:
Note
You could also use the LDS, L1 or L2 cache throughput to build the roofline, but you need to replace the HBM usage counters with the appropriate LDS, L1 or L2 counters. Do not compare a measured HBM throughput and AI to a roofline other than the HBM one.
rocPROFv2
rocprofv2
is a wrapper around rocprof
. It was conceived to simplify trace generation. It can be used like so:
$ rocprofv2 --[hip-trace|hsa-trace|roctx-trace|kernel-trace|sys-trace] <executable>
Python
Virtual environment
To avoid sullying your home directory with unrelated Python packages, you can make use of Python's virtual environments (VEnv). Here is a simple script showing how to do that properly on Adastra:
#!/bin/bash
# enable only if you do not source this script.
# set -eu
module purge
module load cpe/23.12
module load cray-python
module list
python3 -m pip install --user --upgrade pip
pip3 install --user --upgrade virtualenv
python3 -m virtualenv ./python_environment
chmod +x ./python_environment/bin/activate
source ./python_environment/bin/activate
python3 -m pip install --upgrade pip
Then to activate the virtual environment, do:
$ source ./python_environment/bin/activate
When the virtual environment is activated you can freely install packages, for instance:
$ pip3 install transformers datasets
Note
To deactivate the virtual environment, simply enter deactivate
in the command line.
Warning
Be careful with the cray-python
module: it defines the ${PYTHONPATH}
/${PYTHON_PATH}
variables, which change some VEnv library paths. Generally, once the VEnv is set up, you DO NOT want to load the module anymore.
Trivia
ClangFormat
ClangFormat is a tool that enables automatic source code formatting. It is based on Clang (parsing).
ClangTidy
ClangTidy is a clang-based C++ “linter” tool. Its purpose is to provide an extensible framework for diagnosing and fixing typical programming errors, like style violations, interface misuse, or bugs that can be deduced via static analysis. clang-tidy is modular and provides a convenient interface for writing new checks.
lfs find
This tool provides a subset of the find
command's features. It is the recommended alternative on Lustre-based filesystems as it optimizes MDS/MDT usage.
$ lfs find <a_directory> -type f -print
numactl
Controls the NUMA policy for processes or shared memory. Basically, it allows you to bind ranks to a set of hardware threads. This is usually buried into binding scripts such as in this one.
source
Reads and executes commands from a file in the current shell environment and returns the exit status of the last command executed from the file. This differs from running the file as a shell script, which executes in a child process and cannot modify your current environment.
$ source ./environment.sh
xeyes
A simple tool used to check if X11 forwarding is working. If X11 forwarding is properly set up, you should see two eyes looking at your mouse cursor.
watch
Executes a program periodically, showing its output fullscreen. This is useful to watch how the result of a command evolves.
$ watch -n0 -- "rocm-smi"
wget
A non-interactive network downloader. It is useful to retrieve documents over a network and supports many protocols (among which, HTTP and FTP).
$ wget "https://github.com/ROCm/rccl/archive/refs/tags/rocm-6.0.0.zip"
Note
Put quotes around the URLs to avoid nasty issues.