Porting & optimization

Many programmers appreciate shiny new software, but few are prepared to start over or refactor. This is an expensive task in terms of human resources and time (and therefore money). The process of porting from a scalar architecture (say, CPU) to an accelerated/vectorial architecture (say, GPU) bears similarities to the reverse of what was asked of the simulation communities decades ago, when they had to move from a vector programming model (towards the 90s) to an SPMD model (mainly based on MPI or PVM). Today, we want to retain the SPMD model to distribute calculations, but within each program instance (for example, an MPI rank), we expect the program’s calculations to be compatible with a vector processor. Tools exist to avoid having two versions of the same code, one for CPUs and one for GPGPUs, see below.

This part has various guidelines for porting and optimizing code for Adastra, but in the end, what matters most after the method is your team’s ability to rewrite or modify your code so that it can exploit the vectorial nature of GPGPUs; and that is experience no set of guidelines can provide.

Note

As a TL;DR: one can make a pretty bold yet educated assumption by saying that for most HPC codes, the time to solution is bound either by memory latency or by memory throughput. In the end, this makes them glorified, fancy versions of memcpy.

Porting to GPU

The fact is that there is no platform independent program and thus no such thing as a portable program, only ported implementations of an algorithm. What exists, though, is platform commonalities. The most ported algorithms generally emanate from, or lead to, a standard which tries to make use of these platform commonalities. Future proofing your implementations is impossible by definition but you can try to mitigate the problem by relying on a widely used and successful standard. One should never go off the rails and disrespect the standard unless one knows what one is doing. This often happens without one knowing it though! Thus the first step in porting an algorithm is making sure that, in the first place, the program respects the standard (read books, do tests, attend or watch talks and read the full manual).

In this chapter we present some advantages and disadvantages of GPUs without going into the details of the Instruction Set Architecture (ISA) nor into the microarchitecture. We present technologies one can use to manipulate the GPU. Then we present the minimal speedup acceptable to make use of the GPUs. Finally we describe a typical porting process.

GPUs, pros and cons

Nowadays, CPU manufacturers tend to stagnate on frequency but still succeed in packing more cores and transistors on the die. CPUs are, historically and for practical purposes, heavily optimized machines that treat a single thread as fast as possible, aiming to provide somewhat low latencies on the instructions they execute. The modern CPU tries to master two fields: single thread latency and throughput.

In a purely single threaded environment, say Single Instruction Single Data (SISD) in Flynn's taxonomy, latency is the inverse of throughput. That is not the case in a multithreaded environment and as such, the CPU tries to do two different things well. Note that modern processors are absolutely not SISD as they notably contain instruction pipelines.

The GPU, conceptually simpler than the CPU, is designed to execute a similar treatment on independent data. It maximizes operation throughput and is very efficient at doing so.

Important

If you seek to do greener computations, make good use of the GPUs, especially in large or national sites where the energy efficiency is high.

If your algorithm exposes parallelism of treatment on independent data, you may benefit significantly from using GPUs. The list below exposes potential bottlenecks that may make a GPU implementation impractical:

  • low volume of data;

  • too many data dependencies, data processing order constraints (loop carried dependencies, atomics, heavy synchronization);

  • unsupported data type (convoluted C++ object, large allocation inside kernels, too many temporary scratch buffers);

  • high code/algorithmic complexity;

  • too many copies between CPU and GPU memory;

  • low arithmetic density (time to solution dominated by memory accesses, too many random accesses);

  • need for bit reproducibility with CPU results;

  • the use of double precision IEEE 754 numbers (although high-end cards tend to lose this limitation). Also, GPUs do not necessarily provide a strict IEEE 754 standard implementation in all circumstances (see the respective ISA);

  • the organization of the data in Array Of Structures (AOS), as opposed to Structure Of Arrays (SOA) or Array Of Vectorized Structs (AOVS), see the sketch after this list;

  • the use of non-scalable algorithms (as opposed to embarrassingly parallel algorithms);

  • computational kernels that are too large (too many temporary variables and registers used leading to spilling);

  • highly coupled code or functions with large side effects (as opposed to the pure functions of functional programming).

For each element given above, if your algorithm and program are subject to it, you may have to engineer a new solution, which can range from a trivial loop interchange to a partial rewrite of the algorithm.
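As an illustration of the AOS versus SOA point in the list above, here is a minimal sketch with hypothetical particle data (the names are purely illustrative):

struct ParticleAoS {
    // Array Of Structures: reading only the positions of consecutive particles
    // also drags the velocities through the cache and prevents coalesced loads
    // on the GPU.
    double x, y, z;
    double vx, vy, vz;
};
// ParticleAoS particles_aos[N];

struct ParticlesSoA {
    // Structure Of Arrays: threads i and i + 1 read adjacent addresses of the
    // same array, so accesses are contiguous and coalesce well on the GPU.
    double* x;  double* y;  double* z;
    double* vx; double* vy; double* vz;
};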

Available programming technologies

First and foremost, when a developer seeks to solve a problem in a production code, he shouldn't have to reimplement what already exists. His goal is in fact to produce a code made of components which he can assemble to solve a larger problem. This is the spirit that led to, for instance, the Basic Linear Algebra Subroutines (BLAS) specification. This separation between the interface, that is, how the function modifies the visible state of the program, and the implementation of this state transition is key to reusable software. In this section, we introduce some technologies available on Adastra. Not all the technologies require the same level of involvement in understanding the underlying hardware and usage, nor do they provide the same feature set.

It is the job of the researcher and implementer to clearly define:

  1. What problem are you solving?

  2. How will you do it (which algorithm will you use)?

  3. How will it be implemented?

Specialized computation libraries

When creating a code and the algorithm is known or, when porting a code, we generally start with a quick look at the last bullet, how will it be implemented, and we search for the appropriate tool, decently available (ported) on the architectures we care about.

We provide a list of software that may satisfy your needs in this document. If your requirements are not met, you can install the software yourself or ask svp@cines.fr to provide it.

Intra-process parallelization technologies

We provide a list of compiler toolchains that may satisfy your needs in this document. If your requirements are not met, you can install the software yourself or ask svp@cines.fr to provide it.

Opinionated default choice

What would an ideal tool be like? It would do the job, and last long. Potentially evolve as the job changes. Doing the job is a matter of features offered and their state of implementation. Lasting long is a matter of how widely used the technology is, how the community support materializes and where the money comes from.

If your code is in Fortran, you have few viable choices when it comes to GPU programming. We would recommend OpenMP for the reasons given above. Notably, Clang LLVM is getting a significant push thanks to US exascale projects and AMD's open source contribution via AOMP. If your code is in C or C++, you have multiple choices ranging from handwritten GPU assembly to high level wrappers (or abstraction layers). Most of a scientific code, say 80-90 % (anecdotal) of a code base, can be written in terms of a high level technology like Kokkos or OpenMP. Most loops are simple (or should be if you are familiar with HPC notions).

OpenMP has a long and great track record when it comes to parallelizing codes on shared memory, multithreaded systems. Since version 4.0, OpenMP has started gaining accelerator (say, GPU/FPGA) offloading capabilities. Today, well established compilers like Clang LLVM, support a lot of OpenMP v5.0 features providing a significant framework for parallelizing tasks both on the CPU and GPU.

OpenACC is also interesting but most of its support is only available in the Nvidia proprietary compiler. The latest features are not implemented in the Cray compiler (currently there is full support for version 2 of the standard and mixed feature support up to version 3.1, for Fortran code only). As such, it may not be possible to run an OpenACC code developed for the Nvidia compiler on a machine such as Adastra. Some modifications are to be expected (this shouldn't be hard as long as you are true to the standard).

It is safe to say that for simple loops, OpenMP gives similar performance to simple HIP or CUDA code. On the other hand, for large loops, loops accessing the memory in a non trivial way or code that requires specific synchronization, OpenMP tends to underperform. This is due to the fact that OpenMP tries to satisfy a lot of use cases through heavy use of abstractions. It has the side effect of adding a lot of corner cases that the OpenMP implementation needs to account for to be compliant. These corner cases' complexity leaves a very heavy burden on the compiler's optimizer. A non optimized OpenMP code will be slow, especially on GPU-like accelerators. A lot of effort is going into optimizing OpenMP code (on the compiler side of things), notably for use on GPUs.

Important

Most of the offloading can be done using a high level technology. Keep the platform specific implementation as small as possible. OpenMP is a great tool for reducing this platform specific surface.

Important

When choosing to add a technology to your code base, ask yourself: am I using this tool just because it is trendy? Be wary of exotic technologies. Know that hardware portability is not performance portability.

Important

Try to minimize trivial dependencies.

Message passing library

MPI is the de facto standard for HPC message passing, followed maybe by Communication Collective Libraries like NCCL/RCCL.

CINES provides a document describing the use of the MPI implementations exposed on Adastra.

Minimal viable speedup

How to compare two machines ?

Comparing two different pieces of hardware is not trivial: multiple metrics exist, and different hardware may require different algorithms with inherently different floating point and memory usage profiles.

In the following example, we will compare a CPU and a GPU node of similar generation (2022).

The CPU node has the following specificities:

  • 192 cores able to reach a constant 2.4 GHz clock speed and an aggregated 7.37 TFlop64/s;

  • An aggregated 768 Mio of L3 with (loosely computed) 0.921 To/s of write throughput at 2.4 GHz (16 bytes/cycle/L3);

  • An aggregated 192 Mio of L2 with (loosely computed) 14.7 To/s of write throughput at 2.4 GHz (32 bytes/cycle/L2);

  • An aggregated 6.1 Mio of L1 with (loosely computed) 29.5 To/s of write throughput at 2.4 GHz (64 bytes/cycle/L1);

  • An aggregated 768 Gio of DRAM with 921 Go/s R/W throughput;

  • A peak power consumption at 945 W (peak) leading to 7.8 GFlop/J;

  • A node density of 4 such CPU nodes per Cray Shasta 1U blade.

The GPU node has the following specificities:

  • We disregard the CPUs;

  • 8 devices able to reach a constant 1.5 GHz clock speed and an aggregated 169 TFlop64/s;

  • We disregard the GPU caches as not practical;

  • An aggregated 512 Gio of DRAM with 12.8 To/s R/W throughput;

  • A peak power consumption at 2670 W (peak) leading to 63.2 GFlop/J;

  • A node density of 2 such GPU nodes per Cray Shasta 1U blade.

Hardware efficiency metrics

  • Floating point computation throughput (Flop/s): the floating point operation per second, the Top500 metric of choice which tends to be less pertinent than memory throughput.

  • Memory throughput (Go/s or Gio/s): the cache and global memory R/W throughput; most codes are memory bound, so this is crucial.

  • Memory throughput (Go/s or Gio/s) per node volume: same as above, adjusted by the space a node physically takes (CPU nodes are denser);

  • Flop/J: the computation energy efficiency (disregarding cooling, network and storage).

  • Space per node: here, do not understand space as memory, but as physical space taken in the data center;

  • Buying cost per node, Flop/$, maintenance cost.

We won't expand on I/O speeds (storage and network).

Application efficiency metrics

  • Flops to solution: assume that on machine A, we use an algorithm that is well suited to its hardware and reach, say, 10 Flops to solution. On machine B we use a different algorithm that is better suited to its architecture and reach 15 Flops to solution. Machine A has a better Flops to solution efficiency (note: neural network people call that Hardware Flop Utilization (HFU)). A simple reason why this efficiency differs between machines could be a recomputation (rematerialization) needed on machine B due to the algorithm used on it.

  • Load to solution: similar to the one above, but for memory loads. Loads refer to the global memory loads (LLC <-> HBM/DRAM), or L3 loads (L2 <-> L3), etc. A simple reason why this efficiency differs between machines could be the use of cache blocking on machine A and a streaming algorithm on machine B. Streaming here basically means disregarding the caches and going all out on saturating the HBM/DRAM throughput.

  • Energy (J) To Solution (ETS): assume that machine A can do 10 Flop/J and machine B can do 100 Flop/J. Or that you need 10 J to solution or 100 J to solution.

  • Time To Solution (TTS): assume that you need 10 seconds to get your result, or 100 seconds.

Mixed efficiency metrics

  • Power per node volume (power density): if a blade of machine B consumes x4 what a blade of machine A needs, expect at least a x4 time to solution speedup;

  • Buying cost per node: if node B costs x4 node A, expect at least a x4 time to solution speedup;

  • Energy to solution per node (running cost): if node B needs x4 the energy of node A, expect at least a x4 time to solution speedup (compared to power density given above, this metric does not assume you reach the peak power the node consumes).

Expecting a speedup

We’ll assume we come with a code working on a CPU node and want to know the expected speedup one should obtain on a GPU node.

If you naively focus on the hardware efficiency metrics, then you can trivially say, that in our case, one should expect:

  • A 169/7.37=22.9 Flop64/s speedup from CPU to GPU;

  • A 63.2/7.8=8.1 Flop64 energy efficiency speedup from CPU to GPU;

  • A 12.8/0.921=13.9 global memory R/W throughput speedup from CPU to GPU;

  • A (2*12.8)/(4*0.921)=6.9 global memory R/W throughput speedup from CPU to GPU adjusted by node volume;

  • A 12.8/14.7=0.87 R/W throughput speedup from CPU to GPU assuming nontrivial CPU cache blocking (see the aside below for details).

  • A (2*12.8)/(4*14.7)=0.435 R/W throughput speedup from CPU to GPU assuming nontrivial CPU cache blocking (see the aside below for details) adjusted by node volume;

  • A 2/4=0.5 space efficiency speedup from CPU to GPU (a GPU node take more physical space);

If you focus on the application efficiency metrics, things get more complicated. The R/W, Flop and integer operation counts will change depending on the algorithm you use, which itself should differ depending on the kind of device you use (GPUs are NOT CPUs). If you know these metrics for both versions of the code (which is possible using hardware counters), you can derive some fancy formula that models the TTS or ETS and extrapolate to new machine specificities.

If you assume the algorithm does not change and that you know the time spent in Flop/IntOp/memory read and write/IO so that you can model the TTS, you can simply divide each TTS contribution by the expected speedup. Say you spend 40% of the time in Flop, you could do 40%/22.9=1.74%. Once that's done, you reconstruct the TTS. Note that this assumes you don't change what you are bound by. It's possible that going from CPU to GPU you go from compute bound to memory bound (GPUs are over provisioned in Flop/s). Clearly this seems much too coarse to give reasonable insights except maybe for GEMM like operations.
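As a minimal sketch of this coarse extrapolation (the time fractions below are a hypothetical CPU profile, the speedups are the hardware ratios computed above):

#include <cstdio>

int main() {
    // Hypothetical CPU profile: fraction of the time to solution per contribution.
    const double flop_fraction = 0.40, memory_fraction = 0.50, io_fraction = 0.10;
    // Expected per contribution speedups (from the hardware metrics above);
    // I/O is assumed to not speed up at all.
    const double flop_speedup = 22.9, memory_speedup = 13.9, io_speedup = 1.0;

    const double tts_cpu = 100.0; // Seconds, hypothetical.
    const double tts_gpu = tts_cpu * (flop_fraction / flop_speedup +
                                      memory_fraction / memory_speedup +
                                      io_fraction / io_speedup);
    // Prints ~15.3 s, that is, a ~x6.5 speedup dominated by the contribution
    // that does not accelerate (Amdahl's law).
    std::printf("Extrapolated GPU TTS: %.1f s (x%.1f)\n", tts_gpu, tts_cpu / tts_gpu);
    return 0;
}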

Out of all these metrics, a few stand out:

  • Power per node volume;

  • Energy (J) To Solution.

These give a lower bound to the speedup you should get to legitimately use the device you want to port to.

For Power per node volume, we have two big assumptions. The first is that for each device we compare, we should have a code that is well enough tuned to be able to burn as much power as the device can allow. The second is that the blades are of the same integration quality (cream of the crop node density) and same generation (do not compare Haswell with Genoa). Said another way, use the best of the best that exists (in algorithm and node technology). Now we can compute (2670*2)/(4*945)=1.41. One should try to at least get a x1.41 speedup. Not taking node volume into account we get 2670/945=2.82. Taking only the CPU and GPU power (without the node overhead), we get (4*560)/(2*360)=3.11.

CPUs cost less to buy, but are less energy efficient (looking only at Flop/J); GPUs are the opposite. There is a crossing point at which the two cost lines cross. At this point, the GPU becomes better. This assumes that you can actually use the GPU correctly (if you don't burn power, this does not work), and that the machines are not replaced too often.

For Energy (J) To Solution, you just want to follow the new device's energy consumption and get the same speedup. If your ETS is 5x10^6 J on machine A and 10x10^6 J on machine B, you'll want the TTS on machine B to be halved (10x10^6/5x10^6=2).

A reasonable lower bound speedup based on the node specificities above and memory bound codes that are well tuned (both on the CPU and GPU) would be between x4 and x6 node to node. Higher speedups are possible but would probably mean an imbalance of optimization across code versions.

Warning

When you see someone, a paper or a presentation showing speedups of x20, x50 or more, always ask what they are comparing. It makes no sense (NO SENSE) to get a x20 speed increase in anything other than raw flops when going from a Genoa node to an MI250X node! If you encounter such numbers, it's likely that incomparable things are compared (comparing a single CPU core vs an H100...), or that one of the codes was not properly tuned for the hardware it was running on (comparing streaming instead of cache blocking on CPU vs streaming on GPU, see the aside below).

Aside on cache blocking methods

Caches' primary objective is to reduce latency by 2 orders of magnitude, from the +100 ns of DRAM down to +1 ns. It depends on the microarchitecture, but throughput may not increase a lot going from L3 to L2, though latency will be reduced (remember that latency is inversely proportional to throughput only when everything is serial, and a CPU core provides a lot of means to have parallelism, namely pipelining).

What is significant regarding the memory R/W throughput of CPU caches is that while the LLC (L3) is shared, the L2 and L1 are often private to a core. Thus we get a very high aggregated R/W throughput. If cores do not share cache lines, L2 and L1 serve as throughput multipliers, as a function of the core count.

Now an interesting fact must be noted: since at least 2008, computing centers have started spreading the gospel of cache blocking. In HPC, at least two kinds of cache blocking exist. The first one, often taught in courses, is related to stencils and allows the operation to reuse the values it brought into L1.

../_images/cacheblocking.jpg

On the diagram above, the crosses represent the loads done for each cell. If we assume that in the diagram above, the cache has space to store/remember 18 cells, we get the feeling that the cache blocking variant is going to reuse the top and bottom values, leading to a net amount of reads from global memory per block roughly equal to the block size. Whereas the naive streaming method (a simple omp parallel for collapse(2)) will always have to bring back the top and bottom values needed for the stencil into cache because they will have been evicted by the values read at the end of the preceding row. The streaming variant will read about x3 more from global memory.

This kind of cache blocking should always be done as it is simple to implement and provides a definite TTS advantage on both CPU and GPU. That said, on GPU the L1 and L2 caches are shared by many thread blocks/wavefronts which definitely reduces their efficiency compared to the CPU where each core has its own L1/L2.
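A minimal sketch of this stencil cache blocking, assuming a hypothetical flat, row-major nj x ni grid of floats; the tile width TI is to be tuned so that a few partial rows of a tile fit in the private caches:

#include <algorithm>

// Naive streaming variant: by the time row j + 1 is processed, the values of
// rows j - 1 and j have often been evicted and must be re-read from global memory.
void StencilStreaming(const float* in, float* out, int nj, int ni) {
    #pragma omp parallel for collapse(2)
    for(int j = 1; j < nj - 1; ++j) {
        for(int i = 1; i < ni - 1; ++i) {
            out[j * ni + i] = 0.25f * (in[(j - 1) * ni + i] + in[(j + 1) * ni + i] +
                                       in[j * ni + i - 1] + in[j * ni + i + 1]);
        }
    }
}

// Cache blocked variant: the grid is swept tile by tile (TI columns wide) so
// that the rows already loaded for a tile are still in cache when reused.
void StencilBlocked(const float* in, float* out, int nj, int ni, int TI) {
    #pragma omp parallel for
    for(int ib = 1; ib < ni - 1; ib += TI) {
        const int ie = std::min(ib + TI, ni - 1);
        for(int j = 1; j < nj - 1; ++j) {
            for(int i = ib; i < ie; ++i) {
                out[j * ni + i] = 0.25f * (in[(j - 1) * ni + i] + in[(j + 1) * ni + i] +
                                           in[j * ni + i - 1] + in[j * ni + i + 1]);
            }
        }
    }
}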

Now, a second usage of cache blocking expands on the technique shown above and goes further.

../_images/cacheblocking_better.jpg

On the diagram above, the top schematic represents the cache blocking mentioned previously. A block of data is brought into the caches, worked on, and written back to global memory when the thread goes to the next block it needs to process. The drawing on the bottom is a very powerful alternative consisting of bringing the block once into the cache, then applying multiple operators on it before its eviction. This second technique is implemented in different ways depending on the hardware but in all cases it takes the form of a subprocess domain decomposition into, say, blocks/patches.

On CPU, an omp parallel is placed at the bottom of the call stack (near main, this is a common practice), and further up the stack, omp for is used to distribute the patch processing onto the cores; each thread will process its associated patches, one at a time, applying many (often thousands of) lines of code onto each patch. Patches are often small, it depends on the application, but 8x8 in 2D is common (often small enough to fit in an L2 cache, ~1 Mio).

On GPU, due to the device queue concept that underlies the GPU programming model, each kernel/operator will have to read all the patches once and evict them all too. There is no viable cache reuse across kernels/operators (the programming model enforces the up>work>down mechanism shown above). The alternative is to create a bigger kernel that is the fusion of smaller ones. This is called operator fusion/kernel fusion/merging/fusing. This way, you avoid at least one "down" and one "up" (see the sketch below). That said, note that GPU caches are small (very small on AMD), and that the purpose of caches on such devices is not the same as on CPU. If you merge a vector add operator with a vector multiply operator, you can assuredly gain a lot of time but, if each operator is 400 lines of code, loading a lot of data for each patch, the gain from fusing them may be scarce, or fusing may lead to worse performance due to GPU core (SM/CU) resource constraints or to the frequent necessity of having larger patch sizes compared to CPU (you want all your GPU threads to work). Let's call this method advanced patch cache blocking.
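A minimal sketch of operator fusion on two hypothetical element-wise operators (a, b, c, tmp and out are assumed to be device-mapped arrays of size n), written with the OpenMP offloading constructs introduced later in this chapter:

// Unfused: two kernels; the intermediate array "tmp" makes a full round trip
// through global memory (one extra "down" and one extra "up").
#pragma omp target teams distribute parallel for
for(std::size_t i = 0; i < n; ++i) { tmp[i] = a[i] + b[i]; }
#pragma omp target teams distribute parallel for
for(std::size_t i = 0; i < n; ++i) { out[i] = tmp[i] * c[i]; }

// Fused: a single kernel; the intermediate value lives in a register and never
// touches global memory.
#pragma omp target teams distribute parallel for
for(std::size_t i = 0; i < n; ++i) { out[i] = (a[i] + b[i]) * c[i]; }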

Note

One can't reasonably write GPU code as if writing for CPUs. On GPU there are no system calls, no reasonable use of malloc, no 20k-line operators, no reasonably large stack allocations, and severe resource constraints (registers, LDS, wavefronts/warps).

To conclude, the CPU exposes caches that naturally fit its programming model. The GPU exposes smaller caches that are trickier to use and require code modification (or heavy template usage). The advanced patch cache blocking that made CPU codes of the last two decades shine is not as applicable to GPU. We should compare heavily cache blocked CPU codes to their streaming optimized GPU alternatives. Though stencil cache blocking is still viable on both devices.

Porting process

In this section we explore a porting process we use at CINES. We consider these cases where you possess a:

  • C/C++/Fortran code with some or no OpenMP CPU pragmas (say, pragma omp parallel for).

  • C/C++ code and OpenACC offloading.

  • C++ code with CUDA.

We start by presenting a typical porting process for the first two cases, that is, going from a purely CPU code (C/C++/Fortran with potentially some OpenMP loops) or going from OpenACC offloading to OpenMP offloading. We will then describe how a CUDA code can mostly be automatically translated into HIP.

From CPU only or OpenACC offloading to OpenMP offloading

If your code is in Fortran and uses OpenACC you may want to try compiling your code using the Cray compiler. Else, if your C or C++ code uses OpenACC, know that there are no viable compilers that support it on AMD GPUs. You will have to transcribe it to OpenMP.

The following diagram shows the stages one could go through to port a code. It is not set in stone but should provide a viable framework. We will go through each stage.

../_images/porting_process_en.jpg

An example of a porting process.

Phase 1: Planning

This stage can be anecdotal depending on the context. If you are to ask for an HPC site's help, please be careful in clearly defining a set of contacts and creating a proper environment for the developers. This includes access to the code repositories (prefer Git if possible) and proper branch and code maintenance.

By branch maintenance, understand usage of a git branching model like Gitflow. Remember to frequently, say each week (depending on the commit throughput), merge the develop branch into your own branch. That is, frequently retrieve the changes of the other developers to avoid future merge conflicts when you merge your changes back into the develop branch.
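For instance, with a hypothetical my-porting-branch and an origin remote, this boils down to:

$ git checkout my-porting-branch
$ git fetch origin
$ git merge origin/develop    # Or rebase, depending on your branching model.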

By code maintenance, understand code formatting, coding paradigms, sanitizers (leak, undefined behavior, address), Continuous Integration (CI), etc. (see Code quality).

It is crucial to define the concept of branch and code maintenance if not already done. It is better to have one, even imperfect, than having none.

Having OpenACC offloading coexist with OpenMP offloading is not recommended (they solve the same problem). Disabling OpenMP offloading is not supported (you either use OpenMP or you don't). For these reasons, we recommend an admittedly stylistically unsatisfactory but practical solution: the use of conditional compilation. One would wrap every OpenMP or OpenACC offloading pragma and API call like so:

#if defined( ACCELERATOR_GPU_OMP )
    // OpenMP offloading related API calls or pragmas
    #pragma omp <omp_offloading_construct>
#elif defined( ACCELERATOR_GPU_ACC )
    // OpenACC offloading related API calls or pragmas
    #pragma acc <acc_offloading_construct>
#else
    // OpenMP CPU related API calls or pragmas
    #pragma omp <omp_construct>
#endif
for(std::size_t i = 0; i < N; ++i) {
    // Computation
}

Note

It is a good practice to prefix your C definitions or preprocessor definitions with the name of your code so as to not risk mixing definitions from other codes. In C++ you should also use the concept of namespace. For a preprocessor definition ACCELERATOR_GPU_OMP defined by the SUPERCFD code you could have SUPERCFD_ACCELERATOR_GPU_OMP.

Finally, the researcher or implementer should define the subset of the features he wishes to be ported.

Phase 2: Test cases and porting the compilation

The second phase starts by making sure the software to port is able to compile on CPU (without any GPU offloading). At this point you may have to modify your makefiles, machine files, CMakeLists files, etc.

In practice the user should only have to load the appropriate modules, which in the case of porting the compilation towards the use of AMD GPUs could look like this:

$ module purge
$ # A CrayPE environment version
$ module load cpe/24.07
$ # An architecture
$ module load craype-accel-amd-gfx90a craype-x86-trento
$ # A compiler to target the architecture
$ module load PrgEnv-cray
$ # Some architecture related libraries and tools
$ module load amd-mixed

One would then make sure his compilation script uses the cc, CC or ftn compiler wrappers. One would define the preprocessor definition by adding this compiler flag -DACCELERATOR_GPU_OMP=1 (taken from the example above!) and would enable the OpenMP functionalities with this compiler flag -fopenmp.

Assuming a common makefile pattern, we could summarize a C++ compilation with:

$ export CXX=CC
$ export CXXFLAGS="-fopenmp -DACCELERATOR_GPU_OMP=1"
$ export LDFLAGS="-fopenmp"

You shall find the information regarding the compilers available for you to use in the Programming environment section. But know that the Cray and AMD compilers are Clang/Flang LLVM based and thus share most of their flags. For C and C++ code we strongly recommend using at least the following flags: -Wall -Wextra -pedantic -g (-g has no overhead except for the file size).

Once the compilation succeeds, one should validate that the program behaves as expected. In a software engineering Eden, a bunch of so called unit tests, integration tests and system tests would be run. The HPC culture is somewhat lacking on the subject and often, only system testing (in the form of regression testing) is done.

Going with the strict minimum, we recommend that the user provide multiple test cases covering the code that needs to be ported. We expect:

  • multiple short test cases (say, 2), on the order of tens of seconds in time to solution;

  • at least one medium case with a TTS of a few thousand seconds;

  • and a large test case with a TTS of a few hours.

Note

A test case in itself is useless as it only produces data. Please always provide test cases with the reference results one should expect and tools to compare the results.

Important

You may strongly consider investing a few hours into developing a validation script capable of compiling, running and comparing the result of a test case with its reference value.
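A minimal sketch of such a script, with hypothetical file names, comparison tool and launcher command:

#!/bin/bash
# validate.sh: compile, run and compare a test case against its reference.
set -eu
make -j                                        # 1. Compile.
srun ./my_code test_case_small.input \
    > test_case_small.output                   # 2. Run the test case.
./compare_results.py test_case_small.output \
                     test_case_small.reference # 3. Compare to the reference.
echo "Validation OK"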

These test cases shall now be run on CPU using the freshly added support for the Adastra machine. You must assure yourself that the CPU test case results match the reference.

Phase 3: CPU Profiling
Cray-PAT’s perftools

We now know that the compiler we will use to exploit the offloading capabilities of OpenMP did not break the CPU code.

We can proceed to do some profiling. This profiling may not be necessary if the implementer already knows which part of the code needs porting. For the unfamiliar though, it will give a direction on where to start.

Multiple profiling tools are available on Adastra. We will limit ourselves to the Cray Performance Analysis Tool (Cray-PAT). CrayPAT allows the collection of data on the whole application. One can restrict the profiling to a selection of predefined functions to be instrumented (MPI, HIP, BLAS, user only, etc.). An overall profile of the executable provides the total time consumed by the program and its functions. Samples can also be taken from the call stack. Sampling (that is, pausing the program, reading the stack and resuming the program) is typically the least intrusive profiling method. Unfortunately, in the case of Cray-PAT, most of the time the user will have to recompile his code.

We document its use in this document.
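As a rough, hedged sketch (the exact module names, pat_build flags and file names may differ, check the perftools man pages), a classic CrayPAT session looks like this:

$ module load perftools-base perftools
$ # Recompile the code through the cc/CC/ftn wrappers, then instrument it:
$ pat_build -u my_app                 # Produces an instrumented my_app+pat binary.
$ srun ./my_app+pat my_case.input     # Run; experiment data is written next to the binary.
$ pat_report my_app+pat+*             # Generate a text report from the experiment data.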

Using perf

perf is the performance analysis tool for Linux. It is the bread and butter of C/C++ low level profiling by sampling. We provide an example of how to use this tool in this document.
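For reference, a minimal sampling session looks like the following (my_app and its input being placeholders):

$ perf record -g -- ./my_app my_case.input    # Sample the call stacks during the run.
$ perf report                                 # Browse the hot spots interactively.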

Phase 4: Hot spot porting

Knowing which hot spots to start with, we can now introduce OpenMP pragmas. Remember, until now, our code is CPU only, potentially with a flavor of OpenMP CPU. The idea is to progressively port the hot spots without caring about optimized memory transfers.

First, let's introduce some General Purpose computation on GPUs (GPGPU) programming notions. A common practice amongst accelerator manufacturers is to stitch multiple compute units together. These compute units are to work on ideally independent work-items (with low to no amount of synchronization needed). A work-item is made of a treatment and the data the treatment manipulates. Independent work-items do not necessarily mean different treatments and for this reason, a similar data transformation can be used on different data; this is the concept of Single Instruction Multiple Data (SIMD). AMD GPUs are made of programmable SIMD Arithmetic and Logic Units (ALU) laid out in compute units. The compute units form a grid clearly visible on the chip's die. The treatments executed inside a workgroup may rely on extensive synchronization though, at a performance cost. Heavy synchronization is not viable between workgroups. From this partial definition of an AMD GPU, we can already gain a lot of programming and performance insight.

Straightaway, we would like to expose you to the terminology used to describe accelerator concepts. Their definitions will come progressively as we introduce OpenMP offloading concepts.

| OpenMP       | OpenACC                     | AMD hardware                      | HIP                               | CUDA                          |
|--------------|-----------------------------|-----------------------------------|-----------------------------------|-------------------------------|
| SIMD         | None                        | None (ambiguous)                  | None (ambiguous)                  | None (ambiguous)              |
| Thread       | None                        | SIMD lane/work-item/thread        | Thread/work-item                  | Thread                        |
| None         | None                        | 16 SIMD lanes                     | Wave                              | None                          |
| None         | None                        | Wavefront (64 threads/SIMD lanes) | Wavefront (64 threads/SIMD lanes) | Warp (32 threads)             |
| Team         | Gang                        | Workgroup                         | Workgroup                         | Thread block                  |
| None         | None                        | Compute unit (CU)                 | Compute unit (CU)                 | Streaming Multiprocessor (SM) |
| thread_limit | vector_length * num_workers | Workgroup size                    | Workgroup size                    | Thread block size             |
| num_teams    | num_gangs                   | Workgroup count                   | Workgroup count                   | Thread block count            |

Note

OpenACC's vector and worker are often misunderstood. In practice, for most uses, you may safely ignore worker and rely only on vector as an alias for Thread in OpenMP, HIP or CUDA terminology. Assuming a worker count defaulting to 1, the vector width represents the number of threads in a thread block.
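For instance, the two pragmas below request roughly the same thing, 256 threads per thread block, on a hypothetical loop:

#pragma acc parallel loop gang vector vector_length(256)
for(int i = 0; i < N; ++i) {
    // Computation
}

#pragma omp target teams distribute parallel for thread_limit(256)
for(int i = 0; i < N; ++i) {
    // Computation
}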

OpenMP is very declarative in its design; it is used to describe an end result (without much care for the implementation). It does not express technical details about the how, but a documentation of the what (expected result). One could argue that, as long as the end result is correct as per the documentation, the implementation is valid. Though sometimes, very specific behaviors are expected behind vague OpenMP constructs. As such, it is very valuable to see how an OpenMP implementation works; some details can be read in the GCC OpenMP implementation documentation and the source code of the GCC or the LLVM implementation.

Note

OpenMP runtime API apart, the standardization committee is introducing an increasing amount of tools in the form of directives which one can use to express assumptions (i.e., order(concurrent), schedule(nonmonotonic), omp loop, omp allocate, omp assume, etc.). Remember that OpenMP is declarative, so as long as the result is correct, the means can vary a lot; you can often only hint the compiler. Of course, some behaviors are so expected that you may rely on them, but at your own risk of depending on a specific implementation. The more declarative your pragmas are, the more subject to the implementation's quality you are.

OpenMP is quite elegant in its handling of accelerator architectures. By default, OpenMP manipulates the shared virtual memory environment of a machine called the host device. The target construct allows the user to specify that for a code block (scope or structured block) the execution environment is that of a machine called the target device. The change of environment is also called control flow transfer. During this control flow transfer, the accelerator's memory can be manipulated; memory copies can be done between the host device and the target device at the beginning and at the end of the code block. Copies of scalar-like types are automatic (even C arrays of compile time known size). All code contained in this block is executed on the target device (careful with functions not compilable on the accelerator, e.g. malloc).

Note

The target can be the host itself. In this case most copies, memory mapping and other behavior you would expect if the target was a GPU device are almost elided. The code shown below should work well on both GPUs and CPUs; the user only needs to ask his compiler to target either a GPU or a CPU (see Targeting an architecture). It's elegant, isn't it? The OMP_TARGET_OFFLOAD=[MANDATORY|DISABLED|DEFAULT] environment variable (in OpenMP 5.0) can be used to choose between the host or target device (though the host may be the target).

OpenMP introduces its notion of thread. A thread is “an execution entity with a stack and an associated static memory”. A thread has an index. The thread with index 0 is the master thread. A team is a set of threads. A league is a set of teams created by the teams clause.

All threads are part of a team. When a master thread encounters a parallelization clause, zero or more threads are generated from the master thread. The threads of a team can be associated to units of work defined by a worksharing clause. Such clauses also specify how to distribute these units of work on the threads of a team to provide static scheduling or load balancing capabilities. So long as a parallelization clause is not encountered we can imagine the child threads as dormant and not participating in the program execution.

A contention group is a set of threads that can synchronize. The threads of a team form a contention group, but two threads from two different teams are not in contention.

Leagues can work on CPU (typically the host machine) but we have no particular advantage in using them. On GPU on the other hand, to exploit the large number of threads and their hardware distribution in groups (workgroup or thread block in AMD and Nvidia terminology), OpenMP implementations with GPU support associate a team (a contention group) with a workgroup and create a league of these teams which matches the concept of grid on the GPUs.

Some examples may help to understand.

In this first example, we see a loop of N iterations. The OpenMP directive states that this loop should be executed on the target device. The control flow is transferred to the target device and a new master thread is created. This thread executes the loop.

const int N = 256;
#pragma omp target
for(int i = 0; i < N; ++i) {
    // Computation
}

In this second example, once the control flow is transferred, the new master thread meets the teams clause and a league (of teams) is created. The default number of teams (grid size) and number of threads per team (thread block size) are implementation dependent, but these parameters can be specified by the programmer. For each team, the master thread will execute the loop; there is no distribution of work, but redundancy of work.

const int N = 256;
#pragma omp target
#pragma omp teams
for(int i = 0; i < N; ++i) {
    // Computation
}

In this third example, once the league is created, the distribute clause declares that the iterations of the loop must be distributed to the teams. For example, for 32 teams, each team executes 8 iterations. For each team, only the master thread executes the iterations assigned to the team. The work is distributed to all the teams but not to all the threads of a team.

const int N = 256;
#pragma omp target
#pragma omp teams distribute
for(int i = 0; i < N; ++i) {
    // Computation
}

In this fourth example, note the presence of the parallel and for clauses. The parallel clause wakes up the dormant threads of a team. The for clause distributes the work of a team over its threads. Each team executes all the iterations of the loop. There is a redundancy of work.

const int N = 256;
#pragma omp target
#pragma omp teams
#pragma omp parallel for
for(int i = 0; i < N; ++i) {
    // Computation
}

In this last example, a syntactic shortcut is used to express parallelism. Note the distribute parallel for clause assembly. This time, the work is distributed over the teams of the league, and over the threads of a team. There is no redundancy of work.

const int N = 256;
#pragma omp target
#pragma omp teams distribute parallel for
for(int i = 0; i < N; ++i) {
    // Computation
}

The notions of team and league allow us to exploit the multi-level parallelism commonly seen in GPGPU. There is a last (debatable on GPU) level corresponding to the use of the simd clause. This level is debatable because the simd clause does not dictate to the compiler to generate SIMD instructions but rather states that a group of iterations (a simd chunk) can be processed by a single thread through the use of SIMD instructions without potential side effects on the program's correctness. On CPUs, this means telling the compiler that, for instance, there is no dependency between iterations. But on GPUs, a thread is already a lane of SIMD instructions, so it does not mean anything (later, it could be that for distributes over warps and simd over the lanes of the warps). We will not discuss the simd clause further in this document.

The #pragma omp target loop construct tells the compiler that the iterations can be run concurrently and may be scheduled as the compiler wants. This requires a recent OpenMP implementation, so use it with caution to avoid vendor/version locking.
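Rewritten with this more declarative construct (here the combined target teams loop form, which requires an OpenMP 5.0 capable compiler), the last example becomes:

const int N = 256;
#pragma omp target teams loop
for(int i = 0; i < N; ++i) {
    // Computation
}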

Note

One should try to concatenate the teams clause and the distribute parallel for construct into the teams distribute parallel for construct. Else, one exposes himself to bad code generation (what is called generic mode instead of SPMD (SIMT/SIMD) mode in LLVM terminology, see Flynn's taxonomy). Now, that issue is fading away with recent compiler optimizations (check out some external LLVM documents in Compiler infrastructure).

As an aside, one can try to mimic the OpenCL, HIP, CUDA and other accelerator programming models using the following OpenMP pattern:

const int                 N          = ...;
static constexpr unsigned blockDim_x = 256; // Left to the user to tune.
static constexpr unsigned gridDim_x  = ...; // Left to the user to tune.

#pragma omp target teams num_teams(gridDim_x) thread_limit(blockDim_x)
#pragma omp parallel
{
    const unsigned grid_dimension = ::omp_get_num_teams() * ::omp_get_num_threads();

    for(unsigned i = ::omp_get_team_num() * ::omp_get_num_threads() + ::omp_get_thread_num();
        i < N;
        i += grid_dimension) {
        // Computation
    }
}

In the above code, we use teams (thread blocks), and ask OpenMP to run all the threads of the teams in parallel; there is no distribution of work, we have to do it ourselves using omp_get_[thread|team]_num() et al.

If one makes sure that gridDim_x * blockDim_x >= N, the code takes another very common CUDA-like form:

const int                 N          = ...;
static constexpr unsigned blockDim_x = 256; // Left to the user to tune.
static constexpr unsigned gridDim_x  = (N + (blockDim_x - 1)) / blockDim_x;

#pragma omp target teams num_teams(gridDim_x) thread_limit(blockDim_x)
#pragma omp parallel
{
    const unsigned i = ::omp_get_team_num() * ::omp_get_num_threads() + ::omp_get_thread_num();

    if(i < N) {
        // Computation
    }
}

Note

In theory, on AMD GPUs, one should use the omp simd clause to use more than one thread per block (or per wavefront). This should be mandatory because AMD GPUs provide weak parallel forward-progress (for more detail, see this document). The #pragma omp simd implies weakly parallel forward-progress which the compiler may use to vectorize the code on SIMD ALUs. The OpenMP compiler may not transform an omp parallel which potentially implies locks, into an omp parallel simd which may livelock on AMD GPUs. On Nvidia GPUs, since Volta, the compiler should be able to transform omp parallel into omp parallel simd. For now, on AMD GPUs, compilers assume that OpenMP kernels only need weakly parallel forward-progress, so omp simd is optional for full code vectorization.

These two code snippets would not behave well on CPU, but when it comes to GPU, a very similar machine code should be produced for the OpenMP version and say, an equivalent HIP version of this code. The table below continues on drawing a parallel between OpenMP and accelerator specific languages.

| OpenMP snippet                                                           | HIP or CUDA equivalent                                                                  |
|--------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|
| #pragma omp allocate(scratch_space) allocator(omp_cgroup_mem_alloc) [1]  | __shared__ memory (LDS) on LLVM [1] based OpenMP implementations.                        |
| #pragma omp allocate(value) allocator(omp_thread_mem_alloc)              | Thread local value (typically in registers). What you expect when you define a variable. |
| #pragma omp barrier                                                       | __syncthreads()                                                                          |
| ::omp_get_num_teams();                                                    | gridDim.x                                                                                |
| ::omp_get_team_num();                                                     | blockIdx.x                                                                               |
| ::omp_get_num_threads();                                                  | blockDim.x                                                                               |
| ::omp_get_thread_num();                                                   | threadIdx.x                                                                              |


Now that you understand how to offload work with OpenMP, you need to understand how to copy data between the host device and the target device. On MI250X and Trento systems like Adastra, LUMI and Frontier, you will have access to a kind of unified memory that exempts you almost entirely from doing memory copies between the two spaces. Unified memory might or might not help you, but what is certain is that making use of it in 2023 will prevent you from running your code on systems not providing this capability (most systems); your code will work on fewer architectures, it will be less ported (less portable). Thus you must learn how to do memory copies with OpenMP. If in a few years unified memory becomes ubiquitous, the code modifications to make use of it will only consist in removing the explicit copies (easy).

Observe the example below without bothering about the practical justification.

void DoWork(int* input, size_t size) {
    for(size_t j = 0; j < 10; ++j) {
        #pragma omp target map(tofrom: input[42:size])
        #pragma omp teams distribute parallel for
        for(size_t i = 0; i < size; ++i) {
            input[i + 42] += 1;
        }
    }
}

We observe the map clause. It is made of two parts: a memory operation, tofrom in this case, and a memory chunk declaration, input[42:size]. First the memory chunk declaration: it should be interpreted as an interval of memory like so [input + 42, input + 42 + size). Note the ), meaning we are dealing with half-open ranges (like in C, C++ but not Fortran, see Dijkstra's note on that point). It is possible to use arithmetic like so: map(tofrom: input[something / 2:(size * 3) - 12]). Now, the tofrom. It is part of a set of operations that can be associated to the map clause; here is the list:

| Operation category | Operation | Before code-block execution | After code-block execution |
|--------------------|-----------|-----------------------------|----------------------------|
| Memory management | alloc | Increase the reference counter. If the reference counter is equal to one after the increment, allocate a memory chunk on the target device and associate it to the host memory chunk. | Behavior of release. |
| Memory management | release | - | Decrease the reference counter. |
| Memory management | delete | Sets the reference counter to zero. The device memory chunk is released. | - |
| Copy | to | Behavior of alloc followed by a copy of the host device data to the allocated target device memory chunk if and only if the reference counter is equal to one. | Behavior of release. |
| Copy | from | Behavior of alloc. | Retrieval of the data from the target device to the host device if and only if the reference counter is equal to one, followed by the behavior of release. |
| Copy | tofrom | Behavior of to. | Behavior of from. |

If you have never heard of the concept of reference counting in computer science, know that it is a technique used to keep track of the usage of a resource. It starts at zero when no resource is held, is incremented when resource allocation is requested (the resource is allocated only when the counter is incremented to one but not necessarily when it is equal to one), is decremented when resource release is requested (the resource is de-allocated when the counter equals zero).

OpenMP uses this very common concept to avoid redundant copies. This will be the subject of the following document.

Phase 5: Superfluous memory transfer removal

At this point, all the hot spots exposed by your test cases should have been offloaded on the GPU by doing explicit copies. Your code is probably slower than initially due to the redundant copies. We will now offer a way to progressively remove the redundant copies.

Now, the example given previously is clearly inefficient in multiple ways but note that for each call of DoWork, 18 unnecessary copies are done. Suppose we made it so only two copies are executed, one to, one from? OpenMP offers an elegant solution not even requiring a modification of the DoWork function. DoWorkLowCopy is a caller of DoWork; if you look closely, you will notice a new construct: target data. This allows us to decorrelate the memory operations from a control flow transfer.

void DoWork(int* input, size_t size) {
    for(size_t j = 0; j < 10; ++j) {
        #pragma omp target map(tofrom: input[42:size])
        #pragma omp teams distribute parallel for
        for(size_t i = 0; i < size; ++i) {
            input[i + 42] += 1;
        }
    }
}

void DoWorkLowCopy(int* input, size_t size) {
    #pragma omp target data map(tofrom: input[42:size])
    DoWork(input, size);
}

Assuming there are no OpenMP target memory related operations on input[42:size] prior to the DoWorkLowCopy call, what happens is that before calling DoWork, the to memory operation in the target data pragma of DoWorkLowCopy allocates the data and does the copy from the host to the device. The reference counter of the input[42:size] memory chunk is now equal to one. DoWork is called and when the thread encounters the map clause, OpenMP executes the to but will not allocate nor copy any data as the reference counter is now two. Then the control flow transfer happens, that is, a GPU kernel is launched. OpenMP waits for the kernel's completion then executes the from; the reference counter is decremented and is now equal to one. Nothing is done. So go the 9 other iterations on j inside of DoWork. When DoWork returns, the from in DoWorkLowCopy's body is executed and, the reference counter being equal to one, the copy is executed; then the decrement induces a release of the memory chunk from the target device's memory. We just elided 18 copies! One can proceed in the same way for all memory allocations, trying to minimize their number when appropriate. Sometimes, the user will benefit from copying data as needed without touching the reference count. This can be done using the target update construct with the to or from clauses like so:

#pragma omp target update                    to( byte_array [43:size] )
#pragma omp target update if(was_modified) from( byte_array [43:size] )

The meaning is quite obvious. These pragmas are standalone; they do not need to be attached to a code block, scope, function call, etc. Note the if clause, which is usable in multiple constructs but would take too long to present here. Just note that if was_modified is true (!= 0), the memory operation is executed.

Note

There is at least one major problem with the OpenMP paradigm. It strongly implies a buffer duplication, on the host and target device. This needs to be understood and taken care of because Adastra GPU nodes possess 512 Gio of RAM while the CPU part of this kind of nodes has 256 Gio. Now, there are technical tricks to avoid this limitation such as understanding that the virtual memory abstraction does not allocate anything physical as long as you do not touch a memory page.

Note

Ideally, you would allocate CPU memory arrays and never ever manipulate them on the CPU (not even zeroing), always using offloading. In some cases this is exactly what you should do; in other cases, such as when you want to write your results to disk, this is still doable but less practical.
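A minimal sketch of this device resident approach using the enter/exit data constructs (field and n are hypothetical):

#include <cstdlib>

void InitializeOnDevice(std::size_t n) {
    double* field = static_cast<double*>(std::malloc(n * sizeof(*field)));

    // Allocate the device copy; the host pages are never touched, so almost no
    // physical host memory is consumed beyond the virtual mapping.
    #pragma omp target enter data map(alloc: field[0:n])

    // Initialize and work on the array exclusively on the GPU.
    #pragma omp target teams distribute parallel for
    for(std::size_t i = 0; i < n; ++i) {
        field[i] = 0.0;
    }

    // ... more offloaded kernels working on field ...

    #pragma omp target exit data map(delete: field[0:n])
    std::free(field);
}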

Phase 6: Performance tuning

Refer to the Performance tuning and monitoring document.

From CUDA to HIP

Nvidia is known to offer great hardware coupled with great tooling. There is a hitch though: most of it is closed source. AMD is gaining ground by providing great hardware at lower cost while open sourcing a large part of the tooling. This partially pushes the burden of development onto the open source community. There are multiple key technologies that have enabled this. Here we evoke the Heterogeneous-compute Interface for Portability (HIP), a CUDA like API that aims to provide support for at least AMD and Nvidia GPUs. Most of the significant CUDA API is available. The key difference is that the API is prefixed with hip instead of cuda.

The APIs are so similar that a simple text replacement program like hipify-perl can be used. There is an alternative, more complex to set up, named hipify-clang, which we do not describe in this document but which is available here.

hipify-perl

hipify-perl’s notable flags are:

-examine          - Combines -no-output and -print-stats options
-inplace          - Backup the input file in .prehip file, modify the input file inplace

The first option gives you a preview of what is going to happen. For instance, given the code below in a file named test.cu:

cudaError_t error_code = cudaFreeHost(p);
error_code = cudaStreamCreateWithFlags(&shtns->xfer_stream, cudaStreamNonBlocking);

When using -examine we obtain:

$ hipify-perl -examine test.cu
[HIPIFY] info: file 'test.cu' statistics:
  CONVERTED refs count: 4
  TOTAL lines of code: 4
  WARNINGS: 0
[HIPIFY] info: CONVERTED refs by names:
  cudaError_t => hipError_t: 1
  cudaFreeHost => hipHostFree: 1
  cudaStreamCreateWithFlags => hipStreamCreateWithFlags: 1
  cudaStreamNonBlocking => hipStreamNonBlocking: 1

When using -inplace, we change test.cu and obtain:

$ hipify-perl -inplace test.cu
$ cat test.cu
hipError_t error_code = hipHostFree(p);
error_code = hipStreamCreateWithFlags(&shtns->xfer_stream, hipStreamNonBlocking);
$ cat test.cu.prehip
cudaError_t error_code = cudaFreeHost(p);
error_code = cudaStreamCreateWithFlags(&shtns->xfer_stream, cudaStreamNonBlocking);

Warning

Preprocessor directives that check for CUDA-specific functionality must be replaced with a portable HIP version of the preprocessor directive. For example, if the source code contains a preprocessor directive that checks for the presence of the CUDA FFT, it should be replaced with a check for the HIP FFT. When compiling in the HIP environment, HIP checks the available architecture (Nvidia or AMD) and, as a result, HIP API calls are transparently directed to cuFFT or rocFFT.

Warning

You might need to produce different source code depending on the architecture you are targeting (say, for wavefront details). This can be done using preprocessor definitions, see: C preprocessor definitions. For instance, the wavefront (or warp in Nvidia terminology) size is not the same on AMD and Nvidia GPUs. For AMD GPUs, it is 64 work-items and on Nvidia GPUs it is 32. You can obtain the warp size at runtime via hipGetDeviceProperties but this is undeniably less useful than knowing it at compile time.
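A hedged sketch of the compile time selection, relying on the platform macros defined by the HIP and CUDA toolchains:

#if defined(__HIP_PLATFORM_AMD__)
    // AMD GPUs such as the MI250X: 64 wide wavefronts.
    static constexpr int kWarpSize = 64;
#else
    // Nvidia GPUs: 32 wide warps.
    static constexpr int kWarpSize = 32;
#endif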

The resulting HIP code can be compiled using the appropriate CC compiler wrapper.

Good old C preprocessor

A more versatile and embeddable option is to rely on the kind of header shown below. One would include it into one's code and it would automatically replace CUDA names with their HIP equivalents. This is lightweight, understandable and extensible. This works only because HIP aims to copy CUDA (one could say it wants to wrap it). For many codes, this approach is enough; in fact, adding hipify as a stage in a build system, even if done correctly which is rare, is verbose and has a high risk of breakage.

/// @brief Does a compile time string replacement of acc_<xyz> into the proper
/// HIP or CUDA symbol. This can be used as a simpler hipify tool.
///

#ifndef HIPIFIER_H
#define HIPIFIER_H

#if defined(__CUDACC__)
    #include <cuda_runtime.h>

    // NOTE: We load the correct CUDA headers and do not do string replacement.
    // The user writes normal CUDA code.
#elif defined(__HIPCC__)
    #include <hip/hip_runtime.h>
    #include <hip/hip_runtime_api.h>

    // NOTE: We load the correct HIP  headers and do string replacement.
    // The user writes normal CUDA code and this header will replace the CUDA
    // names with the HIP equivalent.

    #define cudaDeviceProp               hipDeviceProp_t
    #define cudaDeviceReset              hipDeviceReset
    #define cudaDeviceSetCacheConfig     hipDeviceSetCacheConfig
    #define cudaDeviceSynchronize        hipDeviceSynchronize
    #define cudaError_t                  hipError_t
    #define cudaFree                     hipFree
    #define cudaFuncCachePreferL1        hipFuncCachePreferL1
    #define cudaFuncCachePreferNone      hipFuncCachePreferNone
    #define cudaGetDeviceCount           hipGetDeviceCount
    #define cudaGetDevice                hipGetDevice
    #define cudaGetDeviceFlags           hipGetDeviceFlags
    #define cudaGetDeviceProperties      hipGetDeviceProperties
    #define cudaGetErrorString           hipGetErrorString
    #define cudaGetLastError             hipGetLastError
    #define cudaMalloc3D                 hipMalloc3D
    #define cudaMalloc                   hipMalloc
    #define cudaMallocManaged            hipMallocManaged
    #define cudaMemAdvise                hipMemAdvise
    #define cudaMemAdviseSetReadMostly   hipMemAdviseSetReadMostly
    #define cudaMemAdviseUnsetReadMostly hipMemAdviseUnsetReadMostly
    #define cudaMemcpy2DAsync            hipMemcpy2DAsync
    #define cudaMemcpyAsync              hipMemcpyAsync
    #define cudaMemcpyDeviceToDevice     hipMemcpyDeviceToDevice
    #define cudaMemcpyDeviceToHost       hipMemcpyDeviceToHost
    #define cudaMemcpyHostToDevice       hipMemcpyHostToDevice
    #define cudaMemcpyHostToHost         hipMemcpyHostToHost
    #define cudaMemcpyKind               hipMemcpyKind
    #define cudaMemcpyToSymbol           hipMemcpyToSymbol
    #define cudaMemGetInfo               hipMemGetInfo
    #define cudaMemPrefetchAsync         hipMemPrefetchAsync
    #define cudaMemsetAsync              hipMemsetAsync
    #define cudaSetDevice                hipSetDevice
    #define cudaStreamCreate             hipStreamCreate
    #define cudaStreamDestroy            hipStreamDestroy
    #define cudaStream_t                 hipStream_t
    #define cudaSuccess                  hipSuccess
#else
    #error "Unknown compiler."
#endif
#endif
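
For illustration, assuming the header above is saved as hipifier.h (a hypothetical file name), a source file written purely with CUDA names compiles unchanged with nvcc or with hipcc/amdclang:

#include "hipifier.h"

int main() {
    float* device_buffer = nullptr;
    // Written as CUDA; on the HIP side the macros above turn this into
    // hipMalloc/hipFree/hipError_t/hipSuccess.
    cudaError_t error_code = cudaMalloc((void**)&device_buffer, 1024 * sizeof(float));
    if (error_code != cudaSuccess) {
        return 1;
    }
    cudaFree(device_buffer);
    return 0;
}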

Performance tuning and monitoring

Performance tuning is one of the darker corners of computer science, where you often rely on corner cases more than on general truths. In this sub-section we present the basic features available to diagnose a performance problem. We offer guidelines, but do know that in practice the golden words are: it depends. That being said, there are some safe bets for performance improvement, such as maximizing the efficiency of main memory manipulation.

"Surely there must be a less primitive way of making big changes in the store than by pushing vast numbers of words back and forth through the von Neumann bottleneck. […] programming is basically planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself, but where to find it." John Backus, in his 1977 ACM Turing Award lecture.

Proper binding, why and how

First and foremost, we provide programs to test whether the binding is correct.

Warning

You are strongly recommended to test your SLURM scripts using these tools before running in production.

  • For the GENOA partition: A simple MPI+OpenMP based program (hello_cpu_binding) can be used to clarify the mappings.

  • For the MI250 partition: An MPI+OpenMP+HIP based program (hello_gpu_binding) can be used to clarify the GPU mappings.

Then, we crudely define some often confusing terms:

Socket
    A place where a CPU’s chips reside. It potentially comes with its own memory buses/channels, power lanes, PCIe lanes, etc. It is possible to have more than one on a given motherboard.

CPU
    A set of cores (potentially of different kinds) arranged in one or multiple chips, see AMD’s CoreCompleX (CCX). There is typically one per socket.

Core
    A physical processing unit (PU). If SMT is disabled, or not used, a core is equivalent to a thread. There are generally multiple cores per CPU.

Simultaneous MultiThreading (SMT) thread, hyperthread, logical thread, hardware thread, thread
    The abstraction of a compute resource executing serialized instructions (in appearance). If SMT is enabled, the core’s physical resources can be used by two or more logical threads, through its super-scalar (pipelined) capabilities. SMT can be seen as a way to better utilize the compute resources or to hide memory latency.

Software thread, kernel thread, native thread, thread
    Yet another abstraction, above the core. It allows the operating system to switch contexts. Not to be confused with user-land threads such as Microsoft’s fibers.

Note

Historically, CPUs were single core and cores were called CPUs. We prefer to clearly split the two concepts.

The cost of accessing shared resources often depends on who is accessing. This uneven cost is called Non Uniform Memory Access (NUMA), though the behavior is not limited to memory. Accessors (say, processors) experiencing similar NUMA cost behaviors are grouped to form a NUMA node. The NUMA domain of a NUMA node represents the resources the NUMA node can access at a cheaper cost than other NUMA nodes. This introduces the notion of NUMA distance, which is a metric describing the cost of accessing a domain from a node. In practice, NUMA distances are mostly referred to when talking about RAM access latency between cores or sockets. In this case, this gives a symmetric NxN matrix where N is the NUMA node count. The distances can be obtained with commands such as numactl -H or with a core-to-core latency tool.

For instance on an Adastra compute node we have:

$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 0 size: 63273 MB
node 0 free: 59481 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 64462 MB
node 1 free: 61615 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 2 size: 64503 MB
node 2 free: 61405 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 3 size: 64487 MB
node 3 free: 61595 MB
node distances:
node   0   1   2   3
  0:  10  12  12  12
  1:  12  10  12  12
  2:  12  12  10  12
  3:  12  12  12  10

Note

Be careful with numactl and its usage of the CPU or physical CPU term. Here, CPU means hardware thread!

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           1
NUMA node(s):        4
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               48
Model name:          AMD EPYC 7A53 64-Core Processor
Stepping:            1
CPU MHz:             2000.000
CPU max MHz:         3541.0149
CPU min MHz:         1500.0000
BogoMIPS:            3992.66
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-15,64-79
NUMA node1 CPU(s):   16-31,80-95
NUMA node2 CPU(s):   32-47,96-111
NUMA node3 CPU(s):   48-63,112-127
Flags:               fpu vme de pse tsc msr pae mce [... redacted for brevity]

Note that SMT is enabled, as shown by the line Thread(s) per core:  2. This needs to be taken into account when binding and when choosing the number of threads per task and tasks per node for a given job.

Trans-NUMA memory transactions should be avoided. In practice, trans-cache memory transactions should also be mitigated. That poses a new question: how to control or avoid accidental trans-NUMA memory accesses? This leads us to the concept of memory placement policy, which requires some basic knowledge of virtual memory. One key point to understand is that until memory is touched (written to or read from), no physical memory may actually be allocated. When a thread touches a memory page (a chunk of virtual memory), the kernel will reserve some RAM for the page.

There are multiple memory placement policies, but the one you will see in use on machines such as Adastra is called first touch. The first touch policy gives predictable behavior: when a thread touches a page never touched until now, the kernel allocates physical memory from the NUMA domain of the touching thread (if enough memory is available).

To take advantage of the first touch policy, the programmer should make sure that the threads that initialize the memory, or touch it for the first time, belong to the same NUMA domain as the threads that will do the computation. There is a problem though: the underlying hardware threads of your process’s threads can change during the program’s execution and, potentially, you could end up running on the core of a different NUMA node at the operating system’s will (we could say that the process’s threads move). What we need is to bind, or define the affinity of, your process towards a set of hardware threads all close to the same hardware resource, say a NUMA node or a Last Level Cache (LLC).
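
As a minimal sketch of the first touch idea (assuming the OpenMP threads are pinned as described later in this section), the same static loop schedule is used to initialize and to compute, so each thread touches, and therefore places, the pages it will later work on:

#include <cstdio>

int main() {
    const long count = 1 << 26;
    // Raw allocation: the pages are not touched yet, so no physical memory is
    // committed and no NUMA placement is decided.
    double* data = new double[count];

    // First touch: each OpenMP thread touches the chunk it will later compute
    // on, placing the pages in its own NUMA domain (first touch policy).
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < count; ++i) {
        data[i] = 0.0;
    }

    // Compute phase: the same static schedule means each thread mostly reads
    // and writes pages that are local to its NUMA domain.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < count; ++i) {
        data[i] += 1.0;
    }

    std::printf("%f\n", data[0]);
    delete[] data;
    return 0;
}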

The NUMA behavior is also observable on Adastra’s accelerated nodes where, due to the physical layout of the hardware, the cost of accessing GPU memory, or even a network interface, is not always the same.

Note

Binding software threads to hardware threads is more important on AMD CPUs than on Intel CPUs (which does not mean you should not do it on Intel CPUs). This can be observed with the core-to-core latency tool and by being aware of AMD’s design of stitching together Core CompleXes (CCX) arranged in Core Chiplet Dies (CCD) for CPUs, and of stitching Graphics Compute Dies (GCD) for GPUs.

Note

A good binding on MI300A nodes is very important if you use USM. Otherwise, you expose yourself to performance regressions on the order of x20 to x30 (because there is no page migration).

Note

Making software NUMA aware is complex and intrusive in the algorithm. A good alternative is to use a mix of OpenMP and a message passing library, say MPI (which you are going to need anyway if you want to use multiple nodes). Then, being NUMA aware is as easy as running one process per NUMA node, L3 cache, etc. and exchanging between processes using MPI.

Warning

The binding makes the most sense only if whole nodes are used!

After the why comes the how. The problem more or less reduces to defining place partitions, that is, groups of hardware threads, and assigning processes to these places. Multiple tools are available:

  • numactl

  • hwloc-bind (from hwloc)

  • taskset

  • OpenMP

  • SLURM

  • MPI implementation (potentially)

To illustrate the binding tools and concepts, we will work with an example job supposing 8 tasks per node and 8 threads per task on an Adastra accelerated node. Each task gets its own L3 cache and GPU. We will assume SMT is not wanted. The NUMA domains we will use are given in the Adastra accelerated (MI250X) nodes subsections and summarized below:

# Node local rank 0 gets the GCD 0, is bound to the cores [48-55] of NUMA domain 3
# Node local rank 1 gets the GCD 1, is bound to the cores [56-63] of NUMA domain 3
# Node local rank 2 gets the GCD 2, is bound to the cores [16-23] of NUMA domain 1
# Node local rank 3 gets the GCD 3, is bound to the cores [24-31] of NUMA domain 1
# Node local rank 4 gets the GCD 4, is bound to the cores [ 0- 7] of NUMA domain 0
# Node local rank 5 gets the GCD 5, is bound to the cores [ 8-15] of NUMA domain 0
# Node local rank 6 gets the GCD 6, is bound to the cores [32-39] of NUMA domain 2
# Node local rank 7 gets the GCD 7, is bound to the cores [40-47] of NUMA domain 2

Warning

On the MI250 partition, the binding of a GPU to a set of cores is most significant if the user does a lot of CPU to GPU copies.

Note

Although we present binding on an accelerated node, the concepts are similar on a scalar node (and simpler too).

OpenMP, numactl or taskset have a common major issue for our use case: they are not task aware. We are going to have multiple tasks per node and we do not want these processes to use the same hardware resources. If all the processes had the same binding, they would use the same hardware threads, leading to an oversubscription of the hardware threads by process threads. This issue can be circumvented using a small script.

With OpenMP, you can specify a set of place partitions explicitly, by listing the hardware threads in each place partition, or implicitly, by invoking an abstract name (threads, cores, etc.) in the ${OMP_PLACES} environment variable. In practice, using only OpenMP places in a multi-process software tends to be clunky and not general enough to work well on a wide range of machines.

We will show a way of hardcoding the binding using numactl. Hardcoding is often a bad practice but, in this case, remember that we are tuning for a specific architecture: the Adastra nodes. Below is an example script giving you near total control over your GPUs and cores and over how they are bound to a process.

#!/bin/bash

set -eu

LOCAL_RANK_INDEX="${SLURM_LOCALID}"
LOCAL_RANK_COUNT="${SLURM_NTASKS_PER_NODE}"

function Adastra_MI250_8TasksWith8ThreadsAnd1GPU() {
    AFFINITY_NUMACTL=('48-55' '56-63' '16-23' '24-31' '0-7' '8-15' '32-39' '40-47')
    AFFINITY_GPU=('0' '1' '2' '3' '4' '5' '6' '7')
}

function Adastra_GENOA_24TasksWith8Threads() {
    AFFINITY_NUMACTL=('0-7' '8-15' '16-23' '24-31' '32-39' '40-47' '48-55' '56-63' '64-71' '72-79' '80-87' '88-95' '96-103' '104-111' '112-119' '120-127' '128-135' '136-143' '144-151' '152-159' '160-167' '168-175' '176-183' '184-191')
}

function Adastra_GENOA_48TasksWith4Threads() {
    AFFINITY_NUMACTL=('0-3' '4-7' '8-11' '12-15' '16-19' '20-23' '24-27' '28-31' '32-35' '36-39' '40-43' '44-47' '48-51' '52-55' '56-59' '60-63' '64-67' '68-71' '72-75' '76-79' '80-83' '84-87' '88-91' '92-95' '96-99' '100-103' '104-107' '108-111' '112-115' '116-119' '120-123' '124-127' '128-131' '132-135' '136-139' '140-143' '144-147' '148-151' '152-155' '156-159' '160-163' '164-167' '168-171' '172-175' '176-179' '180-183' '184-187' '188-191')
}

export MPICH_OFI_NIC_POLICY="NUMA"

Adastra_MI250_8TasksWith8ThreadsAnd1GPU
# Adastra_GENOA_24TasksWith8Threads
# Adastra_GENOA_48TasksWith4Threads

CPU_SET="${AFFINITY_NUMACTL[$((${LOCAL_RANK_INDEX} % ${#AFFINITY_NUMACTL[@]}))]}"
if [ ! -z ${AFFINITY_GPU+x} ]; then
    GPU_SET="${AFFINITY_GPU[$((${LOCAL_RANK_INDEX} % ${#AFFINITY_GPU[@]}))]}"
    export ROCR_VISIBLE_DEVICES="${GPU_SET}"
fi
exec numactl --localalloc --physcpubind="${CPU_SET}" -- "${@}"

Note

A more detailed script is given here. Simpler but more volatile SLURM based bindings are presented next.

A few notes:
  • We define a configuration depending on the hardware; here, for the accelerated nodes, it is recommended to use 8 cores per task, each task with its own GCD.

  • ${ROCR_VISIBLE_DEVICES} or ${HIP_VISIBLE_DEVICES} restrict the GPU visibility of the ROCm software stack (your software will see a subset of the node’s GPUs). ${ROCR_VISIBLE_DEVICES} operates at the Heterogeneous System Architecture (HSA) interface level while ${HIP_VISIBLE_DEVICES} operates at the HIP runtime level. The HSA is the interface used to drive the GPUs through the kernel’s AMD GPU driver. The AMD driver is called ROCk while the AMD GPU HSA implementation is called ROCt. Both environment variables should work though, to minimize resource allocation, it is generally more efficient to mask the GPUs that you do not want to use at the lowest level. A small visibility check is sketched after this list.

  • In case you deem SMT (hyperthreading) worth it, see this script.
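
As a minimal sanity check of the GPU masking (a much simplified stand-in for hello_gpu_binding), each rank can print how many devices the ROCm stack exposes to it; with the binding script above and ${ROCR_VISIBLE_DEVICES} set to a single GCD, the expected count is 1:

#include <hip/hip_runtime.h>

#include <cstdio>
#include <cstdlib>

int main() {
    int device_count = 0;
    if (hipGetDeviceCount(&device_count) != hipSuccess) {
        std::fprintf(stderr, "No usable HIP device.\n");
        return EXIT_FAILURE;
    }
    const char* visible = std::getenv("ROCR_VISIBLE_DEVICES");
    std::printf("ROCR_VISIBLE_DEVICES=%s, visible device count: %d\n",
                visible != nullptr ? visible : "(unset)", device_count);
    return EXIT_SUCCESS;
}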

Once the script above is stored on disk as adastra_acc_binding.sh, using it is as simple as doing:

1#!/bin/bash
2#SBATCH --account=<account_to_charge>
3#SBATCH --job-name="test_affinity"
4#SBATCH --constraint=MI250
5#SBATCH --nodes=1
6#SBATCH --exclusive
7#SBATCH --time=1:00:00
8
9srun --ntasks-per-node=8 --cpu-bind=none --mem-bind=none -- ./adastra_acc_binding.sh <executable> <arguments>

Line 1: Shell interpreter line.

Line 2: GENCI/DARI project to charge. More on that in Resource consumption and charging.

Line 3: Job name.

Line 4: Type of Adastra node requested (here, the MI250X GPU partition).

Line 5: Number of compute nodes requested (here, 1).

Line 6: Make sure we allocate whole nodes, as our binding requires it (not strictly necessary here, since we already request all the resources of a node).

Line 9: Implicitly ask to use all of the allocated node. Then we distribute the work over 8 tasks per node. We disable the SLURM implicit binding and use the binding script.

You can validate the binding using the tools given at the start of this section.

Now that you have read these lines, we must confess that most of these issues can also be solved with SLURM, but this solution may be less portable, less flexible, less obvious and more subject to subtle bugs or breakage. It is based on srun’s --gpu-bind=closest.

Note

Until further notice, it is not advisable to use --gpu-bind. Please use a binding script such as the one given above, and customize its functionalities to your needs.

Warning

Do not use --cpus-per-task=<N> when using the aforementioned binding script. Else you may encounter the following numactl error message: libnuma: Warning: cpu argument XXX is out of range.

Warning

--gpu-bind=closest relies on the CPU binding. If your CPU binding is not correct, your GPU binding will not be correct either. A commonly seen mistake is using 1 core per task and not spreading the tasks over, say, the L3 caches. For an 8 tasks per node job on the accelerated partition, you would then have all the tasks on the same L3 and SLURM would give the same GPU to all of them (the one closest to the L3 on which all the tasks reside). This should change for the better as the SLURM configuration gets more mature.

1#!/bin/bash
2#SBATCH --account=<account_to_charge>
3#SBATCH --job-name="test_affinity"
4#SBATCH --constraint=MI250
5#SBATCH --nodes=1
6#SBATCH --exclusive
7#SBATCH --time=1:00:00
8
9srun --ntasks-per-node=8 --cpus-per-task=8 --threads-per-core=1 --gpu-bind=closest -- <executable> <arguments>

The only difference with the previous sbatch configuration is that we replaced --cpu-bind=none --mem-bind=none with --cpus-per-task=8 --threads-per-core=1 --gpu-bind=closest. This will give you the following binding:

# Node local rank 0 gets the GCD 4, is bound to the cores [ 0- 7] of NUMA domain 0
# Node local rank 1 gets the GCD 5, is bound to the cores [ 8-15] of NUMA domain 0
# Node local rank 2 gets the GCD 2, is bound to the cores [16-23] of NUMA domain 1
# Node local rank 3 gets the GCD 3, is bound to the cores [24-31] of NUMA domain 1
# Node local rank 4 gets the GCD 6, is bound to the cores [32-39] of NUMA domain 2
# Node local rank 5 gets the GCD 7, is bound to the cores [40-47] of NUMA domain 2
# Node local rank 6 gets the GCD 0, is bound to the cores [48-55] of NUMA domain 3
# Node local rank 7 gets the GCD 1, is bound to the cores [56-63] of NUMA domain 3

This binding should provide performance similar to the one presented earlier, except that the GPU index an MPI rank sees is not the same as its ${SLURM_LOCALID}. If you need the GPU index to match the node local rank index, rely on the binding script given above.

Proper pinning, why and how

Once each rank uses a correct set of cores, you should already be getting a good portion of what your hardware can give you (all other things being equal). Indeed, the threads spawned by a process are ensured to be on the correct NUMA/L3/meaningful hardware partition thanks to the affinity/binding masks (CPU masks). We have an affinity of our process’s software threads towards a set of hardware threads.

The next step is to ensure that the software threads spawned by a process are pinned to a hardware thread and do not move inside the set of cores associated with the process. If not done, one is exposed to issues exacerbated by using a large number of software threads per rank (say, more than ~16) that cross hardware boundaries (say, L2/L3). When a software thread moves from core to core, context switching costs and cache misses are small penalties paid which, if repeated, can be detrimental to performance.

Note

Software thread pinning is a finer level of affinity control. It can entirely supersede binding cores to ranks, but it then becomes less portable.

OpenMP offers the OMP_PLACES and OMP_PROC_BIND environment variables (ICVs) to control this pinning behavior. Here is an example on Adastra’s Genoa CPUs with the GCC OpenMP implementation. The text below represents the place partitions for multiple OpenMP configurations; in every case, we assume SMT/hyperthreading is enabled. If SMT were disabled, the places in the range [192, 384) would not exist.

Assuming the following OMP_PLACES values:
  `THREADS`: '{0},{192},{1},{193},{2},{194},{3},{195},{4},{196},{5},{197},{6},{198},{7},{199},{8},{200},{9},{201},{10},{202},{11},{203},{12},{204},{13},{205},{14},{206},{15},{207},{16},{208},{17},{209},{18},{210},{19},{211},{20},{212},{21},{213},{22},{214},{23},{215},{24},{216},{25},{217},{26},{218},{27},{219},{28},{220},{29},{221},{30},{222},{31},{223},{32},{224},{33},{225},{34},{226},{35},{227},{36},{228},{37},{229},{38},{230},{39},{231},{40},{232},{41},{233},{42},{234},{43},{235},{44},{236},{45},{237},{46},{238},{47},{239},{48},{240},{49},{241},{50},{242},{51},{243},{52},{244},{53},{245},{54},{246},{55},{247},{56},{248},{57},{249},{58},{250},{59},{251},{60},{252},{61},{253},{62},{254},{63},{255},{64},{256},{65},{257},{66},{258},{67},{259},{68},{260},{69},{261},{70},{262},{71},{263},{72},{264},{73},{265},{74},{266},{75},{267},{76},{268},{77},{269},{78},{270},{79},{271},{80},{272},{81},{273},{82},{274},{83},{275},{84},{276},{85},{277},{86},{278},{87},{279},{88},{280},{89},{281},{90},{282},{91},{283},{92},{284},{93},{285},{94},{286},{95},{287},{96},{288},{97},{289},{98},{290},{99},{291},{100},{292},{101},{293},{102},{294},{103},{295},{104},{296},{105},{297},{106},{298},{107},{299},{108},{300},{109},{301},{110},{302},{111},{303},{112},{304},{113},{305},{114},{306},{115},{307},{116},{308},{117},{309},{118},{310},{119},{311},{120},{312},{121},{313},{122},{314},{123},{315},{124},{316},{125},{317},{126},{318},{127},{319},{128},{320},{129},{321},{130},{322},{131},{323},{132},{324},{133},{325},{134},{326},{135},{327},{136},{328},{137},{329},{138},{330},{139},{331},{140},{332},{141},{333},{142},{334},{143},{335},{144},{336},{145},{337},{146},{338},{147},{339},{148},{340},{149},{341},{150},{342},{151},{343},{152},{344},{153},{345},{154},{346},{155},{347},{156},{348},{157},{349},{158},{350},{159},{351},{160},{352},{161},{353},{162},{354},{163},{355},{164},{356},{165},{357},{166},{358},{167},{359},{168},{360},{169},{361},{170},{362},{171},{363},{172},{364},{173},{365},{174},{366},{175},{367},{176},{368},{177},{369},{178},{370},{179},{371},{180},{372},{181},{373},{182},{374},{183},{375},{184},{376},{185},{377},{186},{378},{187},{379},{188},{380},{189},{381},{190},{382},{191},{383}'
  `CORES`:   '{0,192},{1,193},{2,194},{3,195},{4,196},{5,197},{6,198},{7,199},{8,200},{9,201},{10,202},{11,203},{12,204},{13,205},{14,206},{15,207},{16,208},{17,209},{18,210},{19,211},{20,212},{21,213},{22,214},{23,215},{24,216},{25,217},{26,218},{27,219},{28,220},{29,221},{30,222},{31,223},{32,224},{33,225},{34,226},{35,227},{36,228},{37,229},{38,230},{39,231},{40,232},{41,233},{42,234},{43,235},{44,236},{45,237},{46,238},{47,239},{48,240},{49,241},{50,242},{51,243},{52,244},{53,245},{54,246},{55,247},{56,248},{57,249},{58,250},{59,251},{60,252},{61,253},{62,254},{63,255},{64,256},{65,257},{66,258},{67,259},{68,260},{69,261},{70,262},{71,263},{72,264},{73,265},{74,266},{75,267},{76,268},{77,269},{78,270},{79,271},{80,272},{81,273},{82,274},{83,275},{84,276},{85,277},{86,278},{87,279},{88,280},{89,281},{90,282},{91,283},{92,284},{93,285},{94,286},{95,287},{96,288},{97,289},{98,290},{99,291},{100,292},{101,293},{102,294},{103,295},{104,296},{105,297},{106,298},{107,299},{108,300},{109,301},{110,302},{111,303},{112,304},{113,305},{114,306},{115,307},{116,308},{117,309},{118,310},{119,311},{120,312},{121,313},{122,314},{123,315},{124,316},{125,317},{126,318},{127,319},{128,320},{129,321},{130,322},{131,323},{132,324},{133,325},{134,326},{135,327},{136,328},{137,329},{138,330},{139,331},{140,332},{141,333},{142,334},{143,335},{144,336},{145,337},{146,338},{147,339},{148,340},{149,341},{150,342},{151,343},{152,344},{153,345},{154,346},{155,347},{156,348},{157,349},{158,350},{159,351},{160,352},{161,353},{162,354},{163,355},{164,356},{165,357},{166,358},{167,359},{168,360},{169,361},{170,362},{171,363},{172,364},{173,365},{174,366},{175,367},{176,368},{177,369},{178,370},{179,371},{180,372},{181,373},{182,374},{183,375},{184,376},{185,377},{186,378},{187,379},{188,380},{189,381},{190,382},{191,383}'
  `SOCKET`:  '{0:96,192:96},{96:96,288:96}'

OMP_PROC_BIND would associate the thread to partitions like so:
  `CLOSE`:  for each thread, allocate a place in a partition, trying to minimize the largest  distance between picked partitions.
      THREADS: t0 -> {0},           t1 -> {192},         t2 -> {1},             t3 -> {193} ...
      CORES:   t0 -> {0,192},       t1 -> {1,193},       t2  -> {2,194},        t3  -> {3,195} ...
      SOCKET:  t0 -> {0:96,192:96}, t1 -> {0:96,192:96}, t22 -> {96:96,288:96}, t23 -> {96:96,288:96} ...
  `SPREAD`: for each thread, allocate a place in a partition, trying to maximize the smallest distance between picked partitions.
      With OMP_NUM_THREADS=24 & modulo partition count.
      THREADS: t0 -> {0},           t1 -> {8},           t2  -> {16},           t3  -> {24} ...
      CORES:   t0 -> {0,192},       t1 -> {8,200},       t2  -> {16,208},       t3  -> {24,216} ...
      SOCKET:  t0 -> {0:96,192:96}, t1 -> {0:96,192:96}, t22 -> {96:96,288:96}, t23 -> {96:96,288:96} ...

What the above slab of text shows us is that, depending on the chosen OMP_PLACES value, different place partitions are created. The threads setting defines one place per hardware thread; the cores setting gives one place per core, each containing the core’s two hyperthreads; finally, the sockets setting gives two places, one per socket, each containing all the hardware threads (SMT included) of the socket’s cores.

If you use cores or sockets, you give the OpenMP threads a chance to move. We do not want that. This leaves threads as the only valid choice.

We should now choose the OMP_PROC_BIND value. close picks one partition after the other while spread, knowing the number of threads the user asked for, tries to maximize the smallest distance between the picked partitions.

Generally, whether your code is memory bound or not, you will benefit from using all the cores of the hardware (assuming you can scale!). Then, the recommended settings are:

export OMP_PROC_BIND=CLOSE
export OMP_PLACES=THREADS

This pinning will automatically adapt to hyperthreading being enabled or disabled. You can use the --threads-per-core SLURM option to control hyperthreading.

Another pinning worth investigating (assuming you have fewer threads than the number of cores the node offers) is this one:

export OMP_PROC_BIND=SPREAD
export OMP_PLACES=THREADS

Scalar (CPU) vs vectorial (GPGPU)

Some people have expressed concern about not having a single “perfect” kernel. But note that in general you can’t write code in Fortran that will run at optimum speed on both vector and scalar/Cache machines without picking different code for the different architectures.

On vector machines (GPUs), you generally want to stream and maximize the memory operation throughput. On scalar machines (GENOA), equipped with large, multi-tiered caches, you may want to do the same, but the approach you need to take is different: streaming won’t get you far and, instead, you need to use blocked/tiled algorithms, with blocks fitting inside the cache.

If you want the most portable code, streaming kernels are the way to go. If you want maximal performance, you will have to discriminate architectures.
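
As an illustration of the CPU side of this discussion, below is a minimal cache blocking sketch (a matrix transpose; the 64x64 tile size is only an assumption and should be tuned to the cache level you target). On the GPU, the same operation would rather be written as a streaming kernel, possibly staged through the LDS.

#include <cstddef>
#include <vector>

// Transpose `input` (n x n, row major) into `output`, tile by tile, so that
// the working set of each tile fits in cache instead of streaming the whole
// matrices through it.
void transpose_blocked(const std::size_t n,
                       const std::vector<double>& input,
                       std::vector<double>& output) {
    constexpr std::size_t tile = 64; // 64 * 64 * 8 B = 32 KiB per tile.

    for (std::size_t i0 = 0; i0 < n; i0 += tile) {
        for (std::size_t j0 = 0; j0 < n; j0 += tile) {
            const std::size_t i_end = (i0 + tile < n) ? (i0 + tile) : n;
            const std::size_t j_end = (j0 + tile < n) ? (j0 + tile) : n;
            // Work on one cache resident tile at a time.
            for (std::size_t i = i0; i < i_end; ++i) {
                for (std::size_t j = j0; j < j_end; ++j) {
                    output[j * n + i] = input[i * n + j];
                }
            }
        }
    }
}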

GPU specific

Monitoring the GPUs

One could use a mix of htop and rocm-smi.

GPU profiling

One could reuse Cray’s perftools, but we prefer describing another tool dubbed rocprof. Check this document for more details.

Guidelines

Porting to a new, significantly different architecture might take long if you have not planned for it (on the order of months). That being said, if the tools you use allow it, say you use C++, you can rely on abstraction layers that can ease the painful work. Always remember to separate the data from its treatment (viewed as kernels/operators on the data) and use the Structure Of Array (SOA) or Array Of Vectorized Struct (AOVS) layouts. This will allow you to change at will the way your data is allocated and how it is manipulated, giving you great flexibility. You may also find the Algorithm and generic programming document interesting if you lack generic programming and algorithmic knowledge.

Now we present some guidelines which could help you diagnose a performance issue.

  • The thread concept is an illusion - What exists, underlying the AMD and Nvidia SIMT programming models, is SIMD units and SIMD lanes treated as threads. This holds even with Volta’s independent thread scheduling, where SIMD becomes an optimization (a PC on each lane, synchronization points and a beefier scheduler). Vector processors, and GPUs among them, get their performance from SIMD.

  • Cache discrepancy with Nvidia GPUs - While MI250X cards offer very large HBM throughput and a low amount of cache (with large throughput), Nvidia cards offer a large amount of cache (> 40 MiB). Kernels doing lots of random access operations may find AMD GPUs less performant. On CDNA3, kernels able to use the caches can reach very high L1 and L2 throughput.

  • NUMA bindings - MPI tasks, CPU cores and GPUs need to be appropriately mapped to each other to reduce inter-chip communication. See the Proper binding, why and how paragraph.

  • Beware of big kernels - It is often seen that scientific code developers produce very long functions that tend to be multipurpose. Try to split these functions.

  • Register spilling - When the compiler cannot afford to allocate more registers for a given thread, it will choose to dump a register’s content to the stack. Vaguely speaking, on GPU there is no stack memory, only registers: the stack is mapped onto registers. Non inlined function calls, for that matter, are not efficient. On AMD GPUs, a memory address space called scratch space is used to dump the data not fitting in the registers. This scratch space is part of the global memory and quite slow, although it may fit in the cache. Try to reduce the usage of what would live on the stack; most of the time, static arrays are costly. Try to bring variable usage and definition close together. Spilling can be controlled with the __launch_bounds__ compiler intrinsic (see the sketch after this list).

  • Atomic operations - Avoid global memory atomic operations (if you can). Be careful with LDS atomic operations. Atomic addition operations have hardware support on recent GPUs but it is not always used. See CDNA1 & CDNA2 Atomics.

  • Instruction Level Parallelism (ILP) - ILP takes at least two forms on GPU: wavefront scheduling (also called Thread Level Parallelism (TLP)) and instruction pipelining (but note that GPUs do not have ROB/OoO execution). Wavefront scheduling is linked to occupancy and is often glorified as THE way to improve kernel performance by hiding memory latencies. In practice, if your kernel can reuse its reserved resources (registers, LDS, etc.) to process multiple inputs/outputs and amortize setup and tear-down costs, it can achieve better performance at lower occupancy. The ILP concept of not stalling the pipeline due to dependency chains is similar to the one we know on CPUs and depends strongly on instruction latency and throughput, which depend on the microarchitecture. So if you have a kernel that has enough ILP and is bandwidth bound, you may want to experiment with lower occupancy.

  • Thread divergence - AMD GPUs are more susceptible to thread divergence due to having 64 threads per wavefront/warp compared to 32 for Nvidia. One can observe performance losses if threads take different execution paths.

  • Non coherent/nontemporal reads and writes (__ldg, __builtin_nontemporal_store/load) - Try to help the compiler generate better load/store instructions by marking your buffers as restricted. (On CDNA2, these loads have no special effect.)

  • Don’t forget about the LDS - If you need caching, do not expect the GPU to do it for you; even if it somewhat succeeds, you are probably losing performance compared to using the LDS.

  • Avoid LDS bank conflicts - Knowing that CDNA2 has 32 LDS banks per CU and that a bank holds values of 4 bytes, and assuming you have an array like __shared__ double scratch_pad[16][16]; with all the threads accessing it like so (or similarly): scratch_pad[threadIdx.x][0];scratch_pad[threadIdx.x][1];...;scratch_pad[threadIdx.x][15];, you will get poor performance (maybe a x16 slowdown). You can easily fix the issue by adding a padding value like so: __shared__ double scratch_pad[16][16 + 1];.

  • Avoid passing large arguments by-value to __global__ functions.

  • Avoid excessive unrolling via #pragma unroll.

  • Use GPU aware and direct communication libraries; Cray MPICH supports such behaviors.

  • Avoid excessive precision. Most GPUs can do Binary32 (float) operations at a higher rate per clock than Binary64 (double). This is the case for the MI250X, but also, and most significantly, for commodity hardware (gaming GPUs, for which the rate is on the order of 1:16 in favor of Binary32). If your algorithm does not require the dynamic range (exponent) of Binary64 or its precision (mantissa), use Binary32. For some applications, you can go as low as doing the computation in Binary32 and storing the results in Binary16 or BF16.

  • Do not assume the compiler will unroll.

  • Do not assume the compiler will avoid register reloads. Mark your pointers __restrict__ or store loaded values in scalars to avoid unnecessary reloading (see the sketch below).
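
Below is a minimal sketch combining two of the hints above on a trivial axpy kernel (names and sizes are illustrative): __launch_bounds__ caps the block size the kernel will be launched with, which lets the compiler budget registers and limit spilling, and __restrict__ tells it the buffers do not alias, so loaded values do not need to be reloaded after a store.

#include <hip/hip_runtime.h>

#include <cstddef>

__global__ void __launch_bounds__(256)
axpy(const std::size_t count,
     const float a,
     const float* __restrict__ x,
     float* __restrict__ y) {
    const std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const std::size_t count = 1 << 20;
    float* x = nullptr;
    float* y = nullptr;
    (void)hipMalloc((void**)&x, count * sizeof(float));
    (void)hipMalloc((void**)&y, count * sizeof(float));
    (void)hipMemset(x, 0, count * sizeof(float));
    (void)hipMemset(y, 0, count * sizeof(float));

    // The block size must not exceed the 256 promised to __launch_bounds__.
    const dim3 block(256);
    const dim3 grid((count + block.x - 1) / block.x);
    axpy<<<grid, block>>>(count, 2.0f, x, y);
    (void)hipDeviceSynchronize();

    (void)hipFree(x);
    (void)hipFree(y);
    return 0;
}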

HBM is not DRAM

Observe, on this SK hynix slide, the different tradeoffs that HBM and DRAM represent.

../_images/hbm_vs_dram.jpg

CDNA shenanigans

Peak Binary32 Flop/s on CDNA2 with v_pk_fma_f32

As mentioned in the Adastra architecture document, it is possible to use SIMD operations inside a GPU thread (SIMD lane) to reach more than 40 Binary32 TFlop/s. Unfortunately, as of ROCm 5.7.1, amdclang does not generate the v_pk_fma_f32 instruction allowing us to reach such throughput. Below is an inline assembly wrapper around this operation.

/// Compute x = x * y + y
///
__device__ static inline void
mad(float2& x, const float2& y) {
    asm volatile("v_pk_fma_f32 %0, %0, %2, %2"
                // %0
                : "=v"(x)
                // %1      %2
                : "0"(x), "v"(y));
}
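
For illustration, a hypothetical kernel using the wrapper could look like the following (it assumes <hip/hip_runtime.h> is included and that the mad() function above is visible in the same translation unit); each work-item then issues one packed FMA per float2 element:

// Hypothetical usage of mad(): x[i] = x[i] * y[i] + y[i] on float2 elements,
// one v_pk_fma_f32 per work-item and per element.
__global__ void packed_fma_kernel(const unsigned int count,
                                  float2* __restrict__ x,
                                  const float2* __restrict__ y) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        float2 value = x[i];
        mad(value, y[i]);
        x[i] = value;
    }
}
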
GPU Page Migration

Unified Virtual Addressing (UVA):

  • Historically, the GPU had its own address space and the CPU had its own.
    • Remember the theory of: memory is a big array ?

    • Well, we had two big arrays, one for the CPU and one more per GPU.

    • Same pointers meant different things on different devices.

  • Nowadays, they tend to be the same, they are Unified.

  • This is not Unified Memory (UM), but it is a requirement to implement Unified Memory (UM).

Then came Unified Shared Memory (USM), Unified Virtual Memory (UVM), Unified Memory (UM), same thing:

  • Assuming UVA, a device can read/write (RW) another device’s memory without pointer shenanigans.

  • UM allows an in memory value to be read everywhere as if it was resident on the device doing the access.

  • How to ensure that some value, say in device A’s memory, will be available on the reading device B?
    • Either you have only one memory (MI300A);

    • Or you do implicit copies.

  • UM does not imply page migration from one device to another.
    • Under the as if rule, UM does not require page migration, this is an implementation detail.

    • The MI300A does not rely on page migration.

On AMD’s CDNA2 and CDNA3

Some devices such as the MI250X and MI300A support hardware UM in two different ways. The MI250X does hardware backed page migration. The MI300A possesses one NUMA domain for both the GPU and CPU devices, thus no copies are needed.

AMD offers decent documentation on the subject of:

  • which kinds of memory exist (pinned, pageable, managed);

  • which HIP API one should use to allocate which kind of memory;

  • which kind of memory implies which performance/copies with and without XNACK (https://rocm.docs.amd.com/en/latest/conceptual/gpu-memory.html#access-behavior).

XNACK enables a device’s capability to retry memory accesses after a page fault. When it retries, the device (CPU or GPU) expects the page to have been moved onto the accessing device. XNACK on/off can be understood as enabling/disabling UM support.

One can enable XNACK by setting the HSA_XNACK environment variable to 1, or disable it by setting the variable to 0 or by unsetting it. By default, XNACK is disabled.

From https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#enabling-gpu-page-migration we have:

  • If HSA_XNACK=0, page faults in GPU kernels are not handled and will terminate the kernel. Therefore all memory locations accessed by the GPU must either be resident in the GPU HBM or mapped by the HIP runtime. Memory regions may be migrated between the host DDR4 and GPU HBM using explicit HIP library functions such as hipMemAdvise and hipMemPrefetchAsync (a minimal usage sketch is given after this list), but memory will not be automatically migrated based on access patterns alone.

  • If HSA_XNACK=1, page faults in GPU kernels will trigger a page table lookup. If the memory location can be made accessible to the GPU, either by being migrated to GPU HBM or being mapped for remote access, the appropriate action will occur and the access will be replayed. Page migration will happen between CPU DDR4 and GPU HBM according to page touch. The exceptions are if the programmer uses a HIP library call such as hipMemPrefetchAsync to request migration, or if a preferred location is set via hipMemAdvise, or if GPU HBM becomes full and the page must forcibly be evicted back to CPU DDR4 to make room for other data.

  • CPU accesses to migratable memory may behave differently than other platforms you’re used to. On Adastra, pages will not migrate from GPU HBM to CPU DDR4 based on access patterns alone. Once a page has migrated to GPU HBM it will remain there even if the CPU accesses it, and all accesses which do not resolve in the CPU cache will occur over the Infinity Fabric between the AMD “Optimized 3rd Gen EPYC” CPU and AMD MI250X GPU. Pages will only automatically migrate back to CPU DDR4 if they are forcibly evicted to free HBM capacity, although programmers may use HIP APIs to manually migrate memory regions.

  • Disabling XNACK will not necessarily result in an application failure, as most types of memory can still be accessed by the AMD “Optimized 3rd Gen EPYC” CPU and AMD MI250X GPU. In most cases, however, the access will occur in a zero-copy fashion over the Infinity Fabric. The exception is memory allocated through standard system allocators such as malloc, which cannot be accessed directly from GPU kernels without previously being registered via a HIP runtime call such as hipHostRegister. Access to malloc’ed and unregistered memory from GPU kernels will result in fatal unhandled page faults. The tables below show how common allocators behave with XNACK enabled and disabled.
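
As a minimal sketch of the explicit migration mentioned in the first bullet (useful with HSA_XNACK=0, or simply to avoid paying page faults inside the first kernel), one can advise a preferred location and prefetch a managed buffer before using it; the buffer size and the null stream are illustrative:

#include <hip/hip_runtime.h>

#include <cstddef>

int main() {
    int device = 0;
    (void)hipGetDevice(&device);

    const std::size_t count = 1 << 24;
    float* data = nullptr;
    (void)hipMallocManaged((void**)&data, count * sizeof(float));

    // First touch on the CPU.
    for (std::size_t i = 0; i < count; ++i) {
        data[i] = 0.0f;
    }

    // Ask the runtime to keep the pages on the GPU and migrate them now,
    // instead of relying on page faults during the first kernel.
    (void)hipMemAdvise(data, count * sizeof(float), hipMemAdviseSetPreferredLocation, device);
    (void)hipMemPrefetchAsync(data, count * sizeof(float), device, /* stream */ nullptr);
    (void)hipDeviceSynchronize();

    // ... launch kernels working on data ...

    (void)hipFree(data);
    return 0;
}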

Warning

On MI300A, HSA_XNACK should be set to 1 if one intends to use UM. The overhead of UM enabled on MI300A has been measured as a x0.90 to x0.93 “speedup” (that is, a slight slowdown) on DGEMM kernels using memory allocated by hipMalloc.

The accessibility of memory from GPU kernels, and whether pages may migrate, depends on three factors: how the memory was allocated, the XNACK operating mode of the GPU, and whether the kernel was compiled to support page migration.

HSA_XNACK=1 Automatic Page Migration Enabled

  • System allocator (malloc, new, allocate, etc.) - Initial physical location: determined by first touch. CPU access after GPU first touch: zero copy read/write. Default behavior for GPU access: migrate to GPU HBM on touch, then local read/write.

  • hipMallocManaged - Initial physical location: GPU HBM. CPU access after GPU first touch: zero copy read/write. Default behavior for GPU access: populate in GPU HBM, then local read/write.

  • hipHostMalloc - Initial physical location: CPU DDR4. CPU access after GPU first touch: local read/write. Default behavior for GPU access: zero copy read/write over Infinity Fabric.

  • hipMalloc - Initial physical location: GPU HBM. CPU access after GPU first touch: zero copy read/write over Infinity Fabric. Default behavior for GPU access: local read/write.


HSA_XNACK=0 Automatic Page Migration Disabled

  • System allocator (malloc, new, allocate, etc.) - Initial physical location: CPU DDR4. Default behavior for CPU access: local read/write. Default behavior for GPU access: fatal unhandled page fault.

  • hipMallocManaged - Initial physical location: CPU DDR4. Default behavior for CPU access: local read/write. Default behavior for GPU access: zero copy read/write over Infinity Fabric.

  • hipHostMalloc - Initial physical location: CPU DDR4. Default behavior for CPU access: local read/write. Default behavior for GPU access: zero copy read/write over Infinity Fabric.

  • hipMalloc - Initial physical location: GPU HBM. Default behavior for CPU access: zero copy read/write over Infinity Fabric. Default behavior for GPU access: local read/write.

In color this gives:

../_images/xnack.PNG

And out of this we can make some recommendations:

  • avoid accessing CPU memory from the GPU;

  • avoid accessing GPU memory from the CPU;

  • avoid managed memory as most HPC implementations are impractically slow;

  • use zero copy memory for mass loading of data (reloading a checkpoint into device memory, dumping data to disk); see the sketch below.
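
As a minimal sketch of the last recommendation (file name and sizes are illustrative, error handling is elided): the checkpoint is read straight into pinned, GPU visible host memory allocated with hipHostMalloc, then moved into HBM with a single full bandwidth copy, instead of having kernels repeatedly read host memory over the fabric.

#include <hip/hip_runtime.h>

#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t count = 1 << 24;

    float* staging     = nullptr; // Pinned, GPU visible host memory.
    float* device_data = nullptr;
    (void)hipHostMalloc((void**)&staging, count * sizeof(float));
    (void)hipMalloc((void**)&device_data, count * sizeof(float));

    // Mass load: read the checkpoint directly into the pinned buffer.
    if (std::FILE* file = std::fopen("checkpoint.bin", "rb")) {
        (void)std::fread(staging, sizeof(float), count, file);
        (void)std::fclose(file);
    }

    // One explicit copy into HBM; the kernels then work on device_data.
    (void)hipMemcpy(device_data, staging, count * sizeof(float), hipMemcpyHostToDevice);

    (void)hipFree(device_data);
    (void)hipHostFree(staging);
    return 0;
}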

The platform one should aim for is a mix of the less forgiving architectures and in practice, it could give something like so:

../_images/common_denominator_target_platform.png
Compiling HIP kernels for specific XNACK modes

Although XNACK is a capability of the MI250X GPU, it does require that kernels be able to recover from page faults. Both the ROCm and CCE HIP compilers will default to generating code that runs correctly with both XNACK enabled and disabled. Some applications may benefit from using the following compilation options to target specific XNACK modes. Note that when using the CCE compiler, do not try to fiddle with the --amdgpu-target=gfx90a:xnack, stay with the default --amdgpu-target=gfx90a. The examples below target gfx90a (MI250X).

hipcc --amdgpu-target=gfx90a or CC --offload-arch=gfx90a -x hip
Kernels are compiled to a single “xnack any” binary, which will run correctly with both XNACK enabled and XNACK disabled.
hipcc --amdgpu-target=gfx90a:xnack+ or CC --offload-arch=gfx90a:xnack+ -x hip
Kernels are compiled in “xnack plus” mode and will only be able to run on GPUs with HSA_XNACK=1 to enable XNACK. Performance may be better than “xnack any”, but attempts to run with XNACK disabled will fail.
hipcc --amdgpu-target=gfx90a:xnack- or CC --offload-arch=gfx90a:xnack- -x hip
Kernels are compiled in “xnack minus” mode and will only be able to run on GPUs with HSA_XNACK=0 and XNACK disabled. Performance may be better than “xnack any”, but attempts to run with XNACK enabled will fail.
hipcc --amdgpu-target=gfx90a:xnack- --amdgpu-target=gfx90a:xnack+ -x hip or CC --offload-arch=gfx90a:xnack- --offload-arch=gfx90a:xnack+ -x hip
Two versions of each kernel will be generated, one that runs with XNACK disabled and one that runs if XNACK is enabled. This is different from “xnack any” in that two versions of each kernel are compiled and HIP picks the appropriate one at runtime, rather than there being a single version compatible with both. A “fat binary” compiled in this way will have the same performance as “xnack+” with HSA_XNACK=1 and as “xnack-” with HSA_XNACK=0, but the final executable will be larger since it contains two copies of every kernel.
OpenMP offloading

As of 2025/04, the Cray OpenMP offload implementation prevents page migration on MI250X; instead, we get something close to pinned CPU memory behavior. You will need export CRAY_MALLOPT_OFF=1 to turn page migration on. This does not affect MI300A because there is no page migration.

That said, we have a similar problem with the ROCm (amdclang) OpenMP offloading implementation when USM is used. It seems to pin every page it encounters in a map clause! We are not aware of a flag/environment variable working around that.

CDNA1 & CDNA2 Atomics
../_images/atomicavai.PNG

Note that some atomic operations can be implemented as Compare And Swap (CAS) loops. This can be detrimental to performance, but not always, see below.

Floating point atomic on LDS

In uncontended situations, HIP’s atomicAdd on Binary32 will be slower than using a CAS loop. This does not apply to atomicAdd on Binary64.
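
For illustration, here is a minimal sketch of what such a CAS loop can look like for a Binary32 addition (the function name is illustrative; address is assumed to point into the LDS, i.e. __shared__ memory):

#include <hip/hip_runtime.h>

// Binary32 atomic addition implemented as a Compare And Swap loop, as an
// alternative to atomicAdd in uncontended cases.
__device__ static inline float
atomic_add_cas(float* address, const float value) {
    // atomicCAS works on integers, so the float is handled through its 32 bit
    // pattern.
    unsigned int* address_as_uint = reinterpret_cast<unsigned int*>(address);
    unsigned int  old_bits        = *address_as_uint;
    unsigned int  assumed_bits;

    do {
        assumed_bits = old_bits;
        const float new_value = __uint_as_float(assumed_bits) + value;
        old_bits = atomicCAS(address_as_uint, assumed_bits, __float_as_uint(new_value));
        // If another lane changed the value in between, retry with the
        // freshly observed bits.
    } while (assumed_bits != old_bits);

    return __uint_as_float(old_bits);
}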