Adastra’s architecture
System overview
Adastra is a French supercomputer hosted at CINES, a Tier 1 computing site located in Montpellier. The Adastra supercomputer is an HPE-Cray EX system, combined with two ClusterStor E1000 storage systems. A simple architectural diagram is shown below:

Adastra’s hardware
From the application developer’s point of view, an HPE-Cray system is a tightly integrated network of thousands of nodes. Some are dedicated to administrative or networking functions and are therefore off-limits to application programmers. Programmers typically use the following node types:
Login nodes: The node you access when you first log in to the system. Login nodes run a full operating system and offer the complete HPE-Cray Programming Environment (CrayPE or CPE); they are used for basic development tasks such as editing files and compiling code. The login nodes are a shared resource that may be used concurrently by multiple users. Login nodes are also sometimes called service nodes.
Compute nodes: The nodes on which production jobs are executed. Compute nodes can be accessed only by submitting jobs through a batch management system (e.g., SLURM, PBS, LSF). They generally have access to a high-performance parallel file system and can be dedicated resources, exclusively yours for the duration of the batch reservation. When new users first begin working on such a system, this difference between login and compute nodes can be confusing. Remember, when you first log in to the system, you are placed on a login node. You cannot execute parallel programs on the login node. Instead, use Adastra’s batch system to place parallel programs on the compute nodes, as in the sketch below.
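For illustration, here is a minimal batch script sketch showing how a parallel program ends up on the compute nodes; the account and resource values are placeholders to adapt to your project, not actual Adastra settings:

    #!/bin/bash
    #SBATCH --job-name=hello_adastra
    #SBATCH --account=<your_project>   # placeholder, use your project identifier
    #SBATCH --nodes=2                  # two compute nodes
    #SBATCH --ntasks-per-node=192      # one MPI rank per physical core of a scalar node
    #SBATCH --time=00:10:00
    # srun places the MPI ranks on the allocated compute nodes:
    srun ./my_mpi_program

You would submit this script from a login node with sbatch, and the batch system takes care of placing the work on the compute nodes.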
Below is a list of the hardware Adastra is made of:
544 scalar nodes (2 AMD Genoa EPYC 9654 96 cores 2.4 GHz processors (3.7 GHz boost), 768 Gio of DDR5-4800 MHz memory per node (4 Gio/core), 1 Slingshot 200 Gb/s Network Interface Card (NIC));
356 accelerated nodes specialized for General Purpose computation on GPUs (GPGPU) (1 AMD Trento EPYC 7A53 64 cores 2.0 GHz processor with 256 Gio of DDR4-3200 MHz CPU memory per node, 4 Slingshot 200 Gb/s NICs, 8 GPU devices (4 AMD MI250X accelerators, each with 2 GPUs) with a total of 512 Gio of HBM2 per node);
28 accelerated nodes specialized for General Purpose computation on GPUs (GPGPU) (4 Slingshot 200 Gb/s NICs, 4 APU devices (4 AMD MI300A accelerators) with a total of 512 Gio of HBM3 per node);
12 visualization and pre/post processing nodes (2 AMD Genoa 96 cores 2.4 GHz processors, 2048 Gio of DDR5-4800 MHz memory per node, 1 Slingshot 200 Gb/s NIC and 2 NVIDIA L40 graphics cards);
A Slingshot interconnection network;
10 front-end and transfer nodes (2 AMD Genoa EPYC 9654 96 cores 2.4 GHz processors (3.7 GHz boost), 512 Gio of DDR5-4800 MHz memory per node, 1 Slingshot 200 Gb/s NIC and 4 x 1.6 Tio SAS MU SSDs configured in RAID10);
1 ClusterStor E1000 SSD for the home LUSTRE storage space: 125 Tio capacity; 77 Gio/s read and 34 Gio/s write throughput;
1 ClusterStor E1000 SSD for the scratch LUSTRE storage space: 1.89 Pio capacity; 1086 Gio/s read and 786 Gio/s write throughput.
Note
Scalar node refers to a CPU-only node. An accelerated node contains accelerators (MI250X or MI300A in the case of Adastra) and potentially also a CPU (which may be part of the same chip as the accelerator, leading to something like an APU).
The compute nodes are housed in water-cooled HPE-Cray EX4000 cabinets. These cabinets carry the compute blades which, depending on the technology they contain, include either:
4 scalar compute nodes (CPU only nodes);
2 accelerated compute nodes (CPU host + accelerator).

The HPE-Cray EX4000 cabinet also includes network modules that connect the compute nodes to the Slingshot network. Each HPE-Cray EX4000 cabinet contains a maximum of 64 compute blades (i.e., a maximum of 256 scalar nodes or 128 accelerated nodes).
Adastra has four cabinets, each holding 64 accelerated nodes and 128 scalar nodes. The last cabinet contains the remaining accelerated and scalar nodes, giving a total of 356 accelerated nodes and 544 scalar nodes. Another cabinet houses the 28 APU nodes.
Adastra accelerated (MI250X) nodes
Each such Adastra accelerated compute node consists of one AMD Trento EPYC 7A53 64 cores 2.0 GHz processor and four AMD Instinct MI250X (code name gfx90a/Aldebaran, microarchitecture CDNA 2) accelerators, as shown in the figure below. The host CPU socket has access to 256 Gio of DDR4-3200 MHz memory and offers 2 logical threads per physical core. The MI250X accelerator is a Multi-Chip Module (MCM) and comes with 2 Graphics Compute Dies (GCDs), for a total of 8 GCDs per node. A GCD can be seen as a GPU. The user can think of the 8 GCDs as 8 somewhat separate GPUs, each having 64 Gio of High-Bandwidth Memory (HBM2E). This makes a total node memory of 768 Gio (256+512) for each accelerated node. On Adastra, the MI250X comes in the OAM package.

The IO die’s Global Memory Interface (GMI) links are used by the Core Chiplet Dies (CCDs) of the Trento CPU to communicate and maintain cache coherency. The eXternal GMI (or Inter-chip Global Memory Interconnect, xGMI) links are used for chip peer-to-peer communications (GPU/GPU, GPU/CPU or CPU/CPU). The GMI and xGMI links are the backbone of AMD’s Infinity Fabric (IF). A fabric can be seen as an abstraction of the underlying communication hardware. AMD’s IF consists of two separate communication planes, the Infinity Scalable Data Fabric (SDF) and the Infinity Scalable Control Fabric (SCF). xGMI can be seen as a pumped-up PCIe: at an equivalent PCIe generation, it provides slightly more transactions per second and (seemingly) better line code efficiency. xGMI2 runs at ~18 GT/s, xGMI3 at ~25 GT/s and xGMI4 at ~32 GT/s. In a similar vein, the PCIe Extended Speed Mode (ESM) allows for higher throughput over a similar PCIe 4 lane count and hardware thanks to a higher PCIe clock. In fact, ESM is a side effect of the Cache Coherent Interconnect for Accelerators (CCIX) protocol, which is used by AMD and HPE-Cray on Adastra’s nodes. The CCIX consortium standardized ESM to achieve up to 25 GT/s over traditional PCIe 4 (16 GT/s) lanes.
The CPU socket is connected to each GCD via Infinity Fabric over x16 xGMI2, allowing a theoretical host-to-device (H2D) and device-to-host (D2H) throughput of 36 Go/s (33.53 Gio/s). The 2 GCDs on the same MI250X are connected via Infinity Fabric over 4 x16 xGMI3 links, for a theoretical throughput of 200 Go/s (186.3 Gio/s) in each direction simultaneously. The GCDs on different MI250X are connected with Infinity Fabric GPU-GPU links in the arrangement shown in the Adastra MI250X node diagram below. Each MI250X is connected to one NIC via PCIe 4 + ESM.
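Assuming the ROCm tools are available on the node, one quick way to see the link topology described above (the GCD-to-GCD xGMI links and the NUMA node each GCD is closest to) is:

    # Prints the link type and weight between GPUs (GCDs) and their NUMA affinity.
    rocm-smi --showtopo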
Note
We use the term Graphics Processing Unit (GPU) where, strictly speaking, we should use the term accelerator or, in the specific case of the MI250X, the term GCD. Indeed, the MI250X accelerated nodes use accelerators that can be seen as GPUs without the graphics-specific parts (so not really a Graphics PU).
Adastra contains a total of 1424 AMD MI250X. The AMD MI250X has a theoretical peak scalar performance, that is without Matrix Fused Multiply Add (MFMA), of 47.9 TFlop/s in double or single precision, divided evenly between the two GCDs. With a base clock of 1.7 GHz, we get a theoretical peak performance of 23.9 TFlop/s per GCD in double or single precision (1.7 GHz * 1 FMA/cycle * 2 SIMD operations/FMA * 16 scalar Binary64 operations/SIMD operation * 4 SIMD ALUs * 110 CUs), leading to 191.6 TFlop/s double-precision per node. Each GCD has 110 Compute Units (CUs), and its memory can be accessed at a peak of 1.6 To/s. The detailed NUMA diagram shown below can be used to manage the CPU-GPU binding of MPI tasks on a node.
Note
If your code can use packed single-precision floating point representations (float2, float4), one can get double the compute performance, that is, 47.9 TFlop/s single-precision per GCD for a total of 383.2 TFlop/s single-precision per node.
Note
Do use Flop/s/W or Flop/J but not Flop/s/J.
Binding can be extremely significant on Adastra (especially if you do a lot of CPU/GPU copies or use Unified Shared Memory (USM)) and the user should understand how to properly define it. For this reason, understanding the diagram below is crucial to correctly using the hardware. We provide many rank-to-core-and-GPU binding configurations and explain how to make use of them in this document; a minimal example is sketched below. Also, see the Understanding the AMD MI250X GPU and Proper binding, why and how documents.
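As a minimal sketch (not the official Adastra recommendation; the exact core ranges closest to each GCD must be read from the NUMA diagram below), a one-rank-per-GCD placement could look like:

    # 8 MPI ranks per node, 8 cores per rank, each rank bound to the GCD closest to it.
    srun --nodes=1 \
         --ntasks-per-node=8 \
         --cpus-per-task=8 \
         --gpus-per-node=8 \
         --gpu-bind=closest \
         ./my_gpu_app
    # Alternative: select the GCD yourself in a small per-rank wrapper script, e.g.:
    #   export ROCR_VISIBLE_DEVICES=${SLURM_LOCALID}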


The CPU and GPU NUMA node latencies are given below. These numbers do not represent throughput; they can be compared relative to one another, but a number twice as large as another does not necessarily mean the latency is twice as large.

The throughput measured using the System Direct Memory Access (SDMA) engines (export HSA_ENABLE_SDMA=1).

The throughput measured without the SDMA engines (export HSA_ENABLE_SDMA=0).
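Assuming you have an H2D/D2H bandwidth benchmark at hand (my_copy_benchmark below is a placeholder; the rocm-bandwidth-test utility shipped with ROCm can serve this purpose if installed), the two configurations can be reproduced like this:

    # Copies performed by the SDMA engines (the default behaviour):
    export HSA_ENABLE_SDMA=1
    ./my_copy_benchmark
    # Copies performed by blit kernels running on the GPU's compute units:
    export HSA_ENABLE_SDMA=0
    ./my_copy_benchmark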

Adastra accelerated (MI300A) nodes
Each such Adastra accelerated compute node consists of four AMD Instinct MI300A (code name gfx942, microarchitecture CDNA 3) accelerators. These four devices are called Accelerated Processing Units (APUs), and each comes with both 24 CPU cores and 228 CDNA 3 Compute Units (CUs) in the same package. The memory on these nodes is exclusively HBM and is accessible directly, without copies, by both the CPU cores and the GPU CUs. The MI300A is a natural extension of the MCM approach taken by the MI250X. While for the MI250X it was not feasible to produce an IO die fast enough to fuse the two GCDs as if they were one big GPU, this was rectified on the MI300A/X. Each MI300A is shown as one GPU by rocm-smi.
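A few quick checks from an MI300A node confirm this layout (assuming ROCm and numactl are available):

    rocm-smi        # the 4 APUs of the node appear as 4 GPUs
    numactl -H      # the NUMA nodes (CPU cores and HBM) as seen by the operating system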
The AMD MI300A has a theoretical peak scalar performance, that is without Matrix Fused Multiply Add (MFMA), of 61.3 TFlop/s double-precision (2.1 GHz * 1 FMA/cycle * 2 SIMD operations/FMA * 16 scalar Binary64 operations/SIMD operation * 4 SIMD ALUs * 228 CUs) or 122.6 TFlop/s single-precision (implicit float2 packing). This makes 245.1 TFlop/s double-precision per node. Each MI300A has 228 Compute Units (CUs), and its memory can be accessed at a peak of 5.3 To/s. The detailed NUMA diagram shown below can be used to manage the CPU-GPU binding of MPI tasks on a node.
Binding can be extremely significant on Adastra (especially if you do a lot of CPU/GPU copies or use Unified Shared Memory (USM)) and the user should understand how to properly define it. For this reason, understanding the diagram below is crucial to correctly using the hardware. We provide many rank-to-core-and-GPU binding configurations and explain how to make use of them in this document. Also, see the Proper binding, why and how document.

The CPU and GPU NUMA node latencies are given below. These numbers do not represent throughput; they can be compared relative to one another, but a number twice as large as another does not necessarily mean the latency is twice as large.

Adastra scalar nodes (GENOA)
The scalar nodes are each equipped with two AMD Genoa EPYC 9654 96 cores 2.4 GHz processors. Compared to Zen 3, the Zen 4 architecture supports AVX-512, though not at full speed (AVX-512 instructions are executed by double pumping the 256-bit wide AVX2 datapaths). With a base clock of 2.4 GHz, we get a theoretical peak performance of 38.4 GFlop/s per core in double-precision (2.4 GHz * 2 FMA/cycle * 0.5 double pumping * 2 SIMD operations/FMA * 8 scalar Binary64 operations/SIMD operation), leading to 7.373 TFlop/s double-precision per node. The bandwidth is estimated at 460.8 Go/s per CPU (921.6 Go/s per node: 4800 MT/s * 12 channels * 8 bytes * 2 sockets ~= 921 Go/s, or 857 Gio/s). In practice we reach 704 Gio/s (82% of peak) on FMAs using at least 48 cores spread over the two sockets’ L3 caches. Each CPU is made up of 12 Core Chiplet Dies (CCDs) interconnected through a central IO die. Each CCD contains one Core CompleX (CCX) of 8 cores. Each core provides 2 logical threads. Each CCX gets its own 32 Mio L3 cache and each core gets its own 1024 Kio L2 cache.
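Assuming the hwloc tools are installed, the core/CCX/cache layout can be inspected directly from a node:

    lscpu                           # sockets, cores, threads and cache sizes
    lstopo-no-graphics --no-io      # textual view of the package/L3/core/PU hierarchy
    # Cores sharing the same L3 cache (i.e., belonging to the same CCX):
    cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list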
Note
Assuming 100 W for the DRAM, 360 W per socket and 125 W for other components, we have 100 + 125 + 360 * 2 = 945 W per scalar node. We obtain a maximum efficiency of 7.373 TFlop/s / 945 W = 7.8 GFlop/J. Assuming 70 W for the DRAM, 180 W for the socket, 560 W for each MI250X and 180 W for other components, we have 70 + 180 + 180 + 560 * 4 = 2670 W per accelerated node. We obtain a maximum efficiency of 191.6 TFlop/s / 2670 W = 71.8 GFlop/J. One could argue the accelerated nodes are 9 times more efficient than their same-generation scalar node equivalent.
The diagram below (click on the picture to see it more clearly) represents the arrangement of the cores and the cache hierarchy for the 2 AMD Genoa EPYC 9654 processors making up each of Adastra’s Genoa nodes. Note that the number of cores is not a power of two, which for some codes may lead to imbalance. We provide many rank-to-core binding configurations and explain how to make use of them in this document; a minimal example is sketched below. Also, see the Understanding the Zen 4 Genoa CPU and Proper binding, why and how documents.
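As a minimal, illustrative sketch (not the official recommendation), a hybrid MPI + OpenMP placement with one rank per CCX (24 ranks of 8 cores per node) could be launched as:

    # One MPI rank per CCX: 24 ranks of 8 cores on a 192-core node.
    # Note: depending on the SLURM configuration, --cpus-per-task may count logical
    # CPUs (SMT threads) rather than physical cores; adjust accordingly.
    export OMP_NUM_THREADS=8
    export OMP_PLACES=cores
    export OMP_PROC_BIND=close
    srun --nodes=1 --ntasks-per-node=24 --cpus-per-task=8 --cpu-bind=cores ./my_hybrid_app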

The relationship between cores, CCXs and CCDs is given in the figure below.

The preprocessing and postprocessing nodes (HPDA)
The 12 preprocessing and postprocessing nodes, also called High Performance Data Analytics (HPDA) nodes, are based on 2U HPE ProLiant DL385 Gen11 servers with 2 AMD Genoa processors. Each processor has 96 cores @ 2.4 GHz. Each node has 2 NVIDIA L40 graphics cards attached for handling the pre/post processing workloads. Every such node has 2048 Gio of RAM configured in 16 memory channels with 2 x 64 Gio DDR5-4800 MHz memory modules per channel. For inter-node communications, each node has 1 Slingshot 200 Gb/s network interface and a dual-port 10 Gb/s Ethernet card. These nodes will be connected to the rest of the machine via the Slingshot fabric and to the CINES facilities via Arista edge routers (see the diagram above).
System interconnect
The Adastra accelerated nodes are connected with 4 HPE-Cray Slingshot 200 Gb/s (25 Go/s, 23.3 Gio/s) NICs, providing a total node-injection bandwidth of 800 Gb/s (100 Go/s, 93.1 Gio/s). Each MI250X accelerator is thus connected to one NIC, which facilitates GPU-GPU Remote Direct Memory Access (RDMA). The GPUs are directly connected to HPE-Cray’s Slingshot fabric, which allows MPI communication operations such as send or receive to be executed directly from GPU memory and across the network without interaction with the host CPU, thus improving throughput and latency by reducing redundant copies for codes that communicate intensively with other nodes.
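With Cray MPICH, taking advantage of this GPU-aware communication path (passing GPU memory pointers directly to MPI calls) typically involves linking the GPU Transport Layer and enabling GPU support at run time; a minimal sketch, to adapt to your environment:

    # At compile/link time, load the accelerator target so CrayPE links the
    # GPU Transport Layer (GTL) into the application:
    module load craype-accel-amd-gfx90a
    # At run time, enable GPU-aware MPI in Cray MPICH:
    export MPICH_GPU_SUPPORT_ENABLED=1
    srun --ntasks-per-node=8 --gpus-per-node=8 ./my_gpu_aware_mpi_app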
Operating system
Adastra runs on the Red Hat Enterprise Linux (RHEL) 8.8 (Ootpa) operating system (cat /etc/os-release). We use the Linux 4.18.0 kernel (uname -a).
File systems
Adastra is connected to the site-wide scratch LUSTRE filesystem providing 1.9 Pio of storage capacity with a measured peak read speed of 1086 Gio/s. It also has access to a home LUSTRE filesystem, to a work LUSTRE filesystem to keep data between jobs, and to a store LUSTRE filesystem to keep data and programs for a longer time (between allocations). See Accessing the storage areas for more details.
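From a login node, the Lustre lfs tool can be used to check your usage and the striping of your directories (the paths below are placeholders; see Accessing the storage areas for the actual ones):

    lfs quota -h -u $USER <path_to_a_storage_area>   # your usage/quota on that filesystem
    lfs getstripe <path_to_a_directory>              # Lustre striping layout of a directory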
How many hours to ask for
Depending on your computation, you may be bound by the half-precision, Binary32 or Binary64 floating point ALU throughput, or by the HBM throughput. Below, we offer a comparison of commonly found GPU resources as of 2024/04.

For instance, we observe that a MI250X GCD’s Binary64 throughput is 23.95 TFlop/s and that it consumes 280 W. Also, for the node-normalized values, we assume we can pack 8 such MI250X GCDs in one node (which is what we have on Adastra).
As another example, for the node-normalized values, we observe that an A100-equipped node offers x1.24 the bandwidth of a node equipped with MI250Xs. Also, an H100 node’s Binary16 throughput is x3.16 that of an A100 node.
From this, assuming your code is memory bound, currently runs on V100s and that you want to run on MI250X, you can expect a x1.82 node-to-node memory throughput speedup. All things being equal, if you used to ask for 100000 V100 GPU hours, you would ask for 100000/1.82 ~= 55000 MI250X GPU hours.
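The conversion itself is a one-liner; the 1.82 factor here is the node-to-node memory throughput ratio quoted above:

    # Convert a V100 GPU-hour budget into an MI250X GPU-hour request for a memory-bound code.
    echo "100000 / 1.82" | bc    # ~= 54945 MI250X GPU hours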
Note
This computation does not take the DARI’s normalization factors into account as from a technical standpoint, they are debatable.
We recommend that you familiarize yourself with the concept of Minimal viable speedup. This could help you choose how to compare a CPU node versus a GPU node. This comparison is often done badly and biased toward the GPUs being much better than what they can actually provide (for a well-made CPU code).