Adastra’s architecture

System overview

Adastra is a French supercomputer hosted at CINES, a Tier 1 computing site located in Montpellier. The Adastra supercomputer is an HPE-Cray EX system, combined with two ClusterStor E1000 storage systems. A simple architectural diagram is shown below:

Adastra system architecture sketch

Adastra’s hardware

From the application developer’s point of view, an HPE-Cray system is a tightly integrated network of thousands of nodes. Some are dedicated to administrative or networking functions and are therefore off-limits to application programmers. Programmers typically use the following node types:

  • Login nodes: The node you access when you first log in to the system. Login nodes offer the full HPE-Cray Programming Environment (CrayPE or CPE) and are used for basic development tasks such as editing files and compiling code. The login nodes are a shared resource that may be used concurrently by multiple users. Login nodes are also sometimes called service nodes.

  • Compute nodes: The nodes on which production jobs are executed. Compute nodes can be accessed only by submitting jobs through a batch management system (e.g., SLURM, PBS, LSF). They generally have access to a high-performance parallel file system and can be dedicated resources, exclusively yours for the duration of the batch reservation. When new users first begin working on such a system, this difference between login and compute nodes can be confusing. Remember, when you first log in to the system, you are placed on a login node. You cannot execute parallel programs on the login node. Instead, use Adastra’s batch system to place parallel programs on the compute nodes.

Below is a list of the hardware Adastra is made of:

  • 544 scalar nodes (2 AMD Genoa EPYC 9654 96 cores 2.4 GHz processors (3.7 GHz boost), 768 Gio of DDR5-4800 MHz memory per node (4 Gio/core), 1 Slingshot 200 Gb/s Network Interface Card (NIC));

  • 356 accelerated nodes specialized for General Purpose computation on GPUs (GPGPU) (1 AMD Trento EPYC 7A53 64 cores 2.0 GHz processor with 256 Gio of DDR4-3200 MHz CPU memory per node, 4 Slingshot 200 Gb/s NICs, 8 GPU devices (4 AMD MI250X accelerators, each with 2 GPUs) with a total of 512 Gio of HBM2e per node);

  • 28 accelerated nodes specialized for General Purpose computation on GPUs (GPGPU) (4 Slingshot 200 Gb/s NICs, 4 APU devices (4 AMD MI300A accelerators) with a total of 512 Gio of HBM3 per node);

  • 12 visualization and pre/post processing nodes (2 Genoa 96 cores 2.4 GHz processors, 2048 Gio of DDR5-4800 MHz memory per node, 1 Slingshot 200 Gb/s NIC and 2 NVIDIA L40 graphics cards);

  • A Slingshot interconnection network;

  • 10 front-end and transfer nodes (2 AMD Genoa EPYC 9654 96 cores 2.4 GHz processors (3.7 GHz boost), 512 Gio of DDR5-4800 MHz memory per node, 1 Slingshot 200 Gb/s NIC and 4 x 1.6 Tio SAS MU SSDs configured in RAID10);

  • 1 ClusterStor E1000 SSD for LUSTRE storage space home: 125 Tio capacity; 77 Gio/s read and 34 Gio/s write throughput;

  • 1 ClusterStor E1000 SSD for LUSTRE storage space scratch: 1.89 Pio capacity; 1086 Gio/s read and 786 Gio/s write throughput.

Note

A scalar node refers to a CPU-only node. An accelerated node contains accelerators (MI250X or MI300A in the case of Adastra) and potentially also a CPU (which may be part of the same chip as the accelerator, leading to something like an APU).

The compute nodes are housed in water-cooled HPE-Cray EX4000 cabinets. These cabinets carry the compute blades which, depending on the technology they contain, include either:

  • 4 scalar compute nodes (CPU only nodes);

  • 2 accelerated compute nodes (CPU host + accelerator).

Adastra HPE-Cray EX4000 rack diagram

The HPE-Cray EX4000 cabinet also includes network modules that connect the compute nodes to the Slingshot network. Each HPE-Cray EX4000 cabinet contains a maximum of 64 modules (i.e., a maximum of 256 scalar nodes and a maximum of 128 accelerated nodes). The dragonfly topology of the Slingshot interconnect reduces the adverse effects of bad process placement that can occur due to SLURM allocation fragmentation.

Adastra has four cabinets, each holding 64 accelerated nodes and 128 scalar nodes. A fifth cabinet contains the remaining accelerated and scalar nodes, for a total of 356 accelerated nodes and 544 scalar nodes. Another cabinet houses the 28 APU nodes.

Adastra partitions summary

| Characteristic | Adastra Scalar (Genoa) | Adastra Accelerated (MI250X) | Adastra Accelerated (MI300A) |
| --- | --- | --- | --- |
| Processors per node | 2 AMD EPYC 9654 (Genoa, Zen 4) | 1 AMD EPYC 7A53 (Trento, Zen 3) + 4 AMD Instinct MI250X (CDNA 2) | 4 AMD Instinct MI300A APUs (Zen 4 cores and CDNA 3) |
| CPU cores per node | 192 (2 * 96 @ 2.4 GHz) | 64 (1 * 64 @ 2.0 GHz) | 96 (4 * 24 @ 2.6 GHz) |
| Non-host processing devices per node | (none) | 8 GCDs (4 MI250X cards, each with 2 accelerators) | 4 (1 integrated per MI300A APU) |
| Memory per node | 768 Gio DDR5-4800 | 256 Gio DDR4-3200 (CPU) + 512 Gio HBM2e (GPU) | 512 Gio HBM3 shared between CPU and GPU |
| Theoretical memory throughput per device (per node) | 0.429 Tio/s (0.858 Tio/s) | 1.526 Tio/s (12.2 Tio/s) | 4.959 Tio/s (19.8 Tio/s) |
| Interconnect links per node | 1 Slingshot 200 Gb/s NIC | 4 Slingshot 200 Gb/s NICs | 4 Slingshot 200 Gb/s NICs |
| Theoretical Binary64 Flop/s per node, host + device | 7.37 + 0.0 TFlop/s | 2.04 + 191.5 TFlop/s | 3.99 + 245.2 TFlop/s |
| Node count per partition | 544 | 356 | 28 |
| Theoretical Binary64 Flop/s per partition | 4.01 PFlop/s | 68.9 PFlop/s | 6.98 PFlop/s |

Adastra accelerated (MI250X) nodes

Each accelerated (MI250X) node consists of 1 AMD EPYC 7A53 (Trento) processor with 64 cores at 2.0 GHz and 4 AMD Instinct MI250X accelerators (code name gfx90a, Aldebaran, CDNA 2 microarchitecture), as shown in the figure below. This provides 64 cores (with 2 hardware threads per core) attached to 256 Gio of DDR4-3200 MHz memory and 8 Graphics Compute Dies (GCDs) per node. The MI250X accelerator is a Multi-Chip Module (MCM) and comes with 2 GCDs. A GCD can be seen as a GPU; the user can think of the 8 GCDs as 8 separate GPUs. On Adastra, the MI250X comes in the OAM package.

An MI250X accelerator, exposing the two GCDs.

Theoretical Flop performance

The theoretical Binary64 Flop/s per MI250X GCD is given using vector ALUs and no Matrix Fused Multiply Add (MFMA). Each MI250X CU has 4 SIMD ALUs processing 1 wavefront of 64 threads every 4 cycles.

  • Theoretical Binary64 Flop/s per MI250X GCD: 1.7 GHz * 1 FMA/cycle * 2 SIMD operations/FMA * 16 scalar Binary64 operations/SIMD operation * 4 SIMD ALUs * 110 CUs = 23.93 TFlop/s.

  • Theoretical MI250X GCD Binary64 Flop/s per node: 8 * 23.93 = 191.5 TFlop/s.

  • Theoretical Binary64 Flop/s per core: 2.0 GHz * 2 FMA/cycle * 2 SIMD operations/FMA * 4 scalar Binary64 operations/SIMD operation = 32 GFlop/s.
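The arithmetic above can be reproduced in a few lines of Python (a sanity-check sketch; every constant comes from the figures quoted in this section):

```python
# Sanity check of the theoretical Binary64 Flop/s figures above
# (vector ALUs only, no MFMA; constants taken from the text).

def peak_tflops(clock_ghz, fma_per_cycle, simd_width, simd_alus=1, units=1):
    # clock * FMA issue rate * 2 Flop per FMA * SIMD width * ALUs * units
    return clock_ghz * fma_per_cycle * 2 * simd_width * simd_alus * units / 1000.0

# One MI250X GCD: 1.7 GHz, 1 FMA/cycle, 16-wide Binary64 SIMD, 4 SIMD ALUs, 110 CUs.
gcd_tflops = peak_tflops(1.7, 1, 16, simd_alus=4, units=110)
node_tflops = 8 * gcd_tflops  # 8 GCDs per node

# One Trento core: 2.0 GHz, 2 FMA/cycle, 4-wide Binary64 SIMD (AVX2).
core_gflops = peak_tflops(2.0, 2, 4) * 1000

print(f"{gcd_tflops:.3f} TFlop/s per GCD, {node_tflops:.1f} per node, "
      f"{core_gflops:.0f} GFlop/s per core")
```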

Note

If your code can exploit packed single-precision floating-point operations (float2, float4), performance reaches 47.86 TFlop/s per GCD.

Memory and throughput

The host has a total of 256 Gio of main memory and each GCD has 64 Gio of HBM2e. The HBM2e is provided by 4 SK Hynix stacks visible in the picture below. HBM memories are essentially stacked DDR chips; as such, we should double the transactions per second. So if, say, rocm-smi reports a memory frequency of 1.6 GHz, the chips provide signaling at twice this rate.

  • Theoretical memory throughput per device: 3.2 GTransaction/second * 1024/8 bytes bus width * 4 HBM2e stack = 1638.4 Go/s or 1.526 Tio/s.

  • Per node: 12.2 Tio/s.
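As a quick check, the same HBM2e computation in Python (a sketch; the doubled signaling rate and the stack count come from the text):

```python
# HBM2e throughput per GCD: DDR signaling doubles the reported 1.6 GHz
# memory clock, hence 3.2 GTransaction/s per stack.
bus_bytes = 1024 // 8                 # 1024-bit interface per stack, in bytes
go_per_s = 3.2 * bus_bytes * 4        # 4 HBM2e stacks -> Go/s (decimal units)
gio_per_s = go_per_s * 1e9 / 2**30    # same figure in binary units
print(f"{go_per_s:.1f} Go/s (~{gio_per_s:.0f} Gio/s) per GCD")
```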

On the MI250X, the total amount of L2 cache is 8 MiB per GCD (8192 KiB / 4 memory controllers per GCD = 2 MiB for ~27 CUs), with an overall read/write throughput of around 7 TiB/s. The L1 cache provides 16 KiB per CU, for a total of about 1.7 MiB. There is no L3 cache on the MI250X (unlike the MI300A, which includes one). This small L2 cache is a significant characteristic of AMD GPUs in general, and contrasts with NVIDIA's slower but larger caches. This should be taken into consideration when writing kernels: memory access patterns that perform decently on NVIDIA may not perform as well on AMD datacenter GPUs.

Architecture and interconnect

The IO die’s Global Memory Interface (GMI) links are used by the Core Chiplet Dies (CCDs) of the Trento CPU to communicate and maintain cache coherency. The eXternal GMI (or Inter-chip Global Memory Interconnect, xGMI) links are used for chip-to-chip peer communications (GPU/GPU, GPU/CPU or CPU/CPU). The GMI and xGMI links are the backbone of AMD’s Infinity Fabric (IF). A fabric can be seen as an abstraction of the underlying communication hardware. AMD’s IF consists of two separate communication planes, the Infinity Scalable Data Fabric (SDF) and the Infinity Scalable Control Fabric (SCF). xGMI can be seen as a pumped-up PCIe: at an equivalent PCIe generation, it provides slightly higher transaction rates and (seemingly) better line code efficiency. xGMI2 runs at ~18 GT/s, xGMI3 at ~25 GT/s and xGMI4 at ~32 GT/s. Similarly, the PCIe Extended Speed Mode (ESM) allows for higher throughput over a similar PCIe 4 lane count and hardware thanks to a higher PCIe clock. In fact, ESM is a side effect of the Cache Coherent Interconnect for Accelerators (CCIX) protocol, which is used by AMD and HPE-Cray on Adastra’s nodes. The CCIX consortium standardized ESM to achieve up to 25 GT/s over traditional PCIe 4 (16 GT/s) lanes.

The CPU socket is connected to each GCD via Infinity Fabric over x16 xGMI2, allowing a theoretical host-to-device (H2D) and device-to-host (D2H) throughput of 36 Go/s (33.53 Gio/s). The 2 GCDs of the same MI250X are connected via Infinity Fabric over 4 x16 xGMI3 links for a theoretical throughput of 200 Go/s (186.3 Gio/s) in each direction simultaneously. The GCDs of different MI250X are connected with Infinity Fabric GPU-GPU links in the arrangement shown in the Adastra MI250X node diagram below. Each MI250X is connected to one NIC via PCIe4 + ESM.

Note

We use the term Graphics Processing Unit (GPU) where, strictly speaking, we should use the term accelerator or, in the specific case of the MI250X, the term GCD. Indeed, the MI250X accelerated nodes use accelerators that can be seen as GPUs without the graphics-specific parts (so not really a Graphics PU).

Binding can be extremely significant on Adastra (especially if you do a lot of CPU/GPU copies or use Unified Shared Memory (USM)) and the user should understand how to properly define it. For this reason, understanding the diagram below is crucial to correctly using the hardware. We provide many rank to core and GPU binding configurations in this document and explain how to make use of them. Also, see the Understanding the AMD MI250X GPU and Proper binding, why and how documents. The detailed NUMA diagram shown below can be used to manage the CPU-GPU binding of MPI tasks on a node.

Adastra MI250X node architecture diagram
../_images/adastra_mi250x_diagram.png

The CPU and GPU NUMA node latencies are given below. These numbers do not represent throughput; they can be compared relatively, but a number twice as large as another does not necessarily mean the latency is two times higher.

../_images/bardpeak_numa_latency.png

The throughput measured using the System Direct Memory Access (SDMA) engines (export HSA_ENABLE_SDMA=1).

../_images/bardpeak_throughput_sdma.png

The throughput measured without the SDMA engines (export HSA_ENABLE_SDMA=0).

../_images/bardpeak_throughput_no_sdma.png

Adastra accelerated (MI300A) nodes

Each accelerated (MI300A) node consists of four AMD Instinct MI300A accelerators (code name gfx942, CDNA 3 microarchitecture). Each MI300A is an Accelerated Processing Unit (APU) that integrates both:

  • 24 CPU cores (Zen 4-based);

  • 228 CDNA 3 Compute Units (CU).

The GPU and CPU share a unified HBM memory pool. The MI300A is a natural extension of the MCM design of the MI250X. While for the MI250X producing an IO die fast enough to fuse the two GCDs as if they were one big GPU was not feasible, this was rectified on the MI300A/X. Each MI300A is shown as one GPU by rocm-smi.

Theoretical Flop performance

The theoretical Binary64 Flop/s per MI300A is given using vector ALUs and no Matrix Fused Multiply Add (MFMA). Each MI300A CU has 4 SIMD ALUs processing 1 wavefront of 64 threads every 4 cycles.

  • Theoretical Binary64 Flop/s per MI300A GPU: 2.1 GHz * 1 FMA/cycle * 2 SIMD operations/FMA * 16 scalar Binary64 operations/SIMD operation * 4 SIMD ALUs * 228 CUs = 61.3 TFlop/s (x2 using float2).

  • Theoretical MI300A GPU Binary64 Flop/s per node: 4 * 61.3 = 245.2 TFlop/s.

  • Theoretical Binary64 Flop/s per core: 2.6 GHz * 2 FMA/cycle * 0.5 AVX2 emulation * 2 SIMD operations/FMA * 8 scalar Binary64 operations/SIMD operation = 41.6 GFlop/s.
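The same back-of-the-envelope method as for the MI250X, with the MI300A constants from the text (a sketch):

```python
# MI300A peak Binary64: GHz * FMA/cycle * 2 Flop/FMA * SIMD width * ALUs * CUs.
gpu_tflops = 2.1 * 1 * 2 * 16 * 4 * 228 / 1000
node_tflops = 4 * gpu_tflops     # the text's 245.2 rounds 61.3 * 4
# Zen 4 core at 2.6 GHz; AVX-512 runs on 256-bit datapaths, hence the 0.5 factor.
core_gflops = 2.6 * 2 * 0.5 * 2 * 8
print(f"{gpu_tflops:.1f} TFlop/s per MI300A, {node_tflops:.1f} per node, "
      f"{core_gflops:.1f} GFlop/s per core")
```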

Memory and throughput

The GPU and CPU share an HBM3 pool of 128 Gio per MI300A, or 512 Gio in total. No explicit copy is needed for the GPU to access CPU memory.

  • Theoretical memory throughput per device: 5.2 GTransaction/second * 1024/8 bytes bus width * 8 HBM3 stack = 5324.8 Go/s or 4.96 Tio/s.

  • Per node: 19.84 Tio/s.

Note that in practice, the MI300A’s IO die struggles to funnel all the traffic at full rate. The GPU chiplets are close to 6 out of the 8 stacks and, in practice, we get ~x0.75 of the peak throughput.
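The peak HBM3 figures and the ~x0.75 practical factor mentioned above can be reproduced as follows (a sketch; constants from the text):

```python
# Peak HBM3 throughput per MI300A and a rough practical ceiling.
go_per_s = 5.2 * (1024 // 8) * 8        # GT/s * bus bytes per stack * 8 stacks
node_go_per_s = 4 * go_per_s            # 4 MI300A per node
achievable_go_per_s = 0.75 * go_per_s   # ~x0.75 of peak observed in practice
print(f"{go_per_s:.1f} Go/s peak per MI300A ({node_go_per_s:.0f} Go/s per node), "
      f"~{achievable_go_per_s:.0f} Go/s achievable")
```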

Unified memory simplifies heterogeneous programming and can reduce overhead compared to discrete CPU/GPU architectures, but the user must be careful that the code would also work on discrete GPUs, as these are much more common.

Architecture

Binding can be extremely significant on Adastra (especially if you do a lot of CPU/GPU copies or use Unified Shared Memory (USM)) and the user should understand how to properly define it. For this reason, understanding the diagram below is crucial to correctly using the hardware. We provide many rank to core and GPU binding configurations in this document and explain how to make use of them. Also, see the Proper binding, why and how document.

../_images/adastra_mi300a_diagram.png

The CPU and GPU NUMA node latencies are given below. These numbers do not represent throughput; they can be compared relatively, but a number twice as large as another does not necessarily mean the latency is two times higher.

../_images/mi300a_numa_latency.png

Adastra scalar nodes (GENOA)

Each scalar (GENOA) node is equipped with two AMD EPYC 9654 processors, each providing 96 cores at 2.4 GHz (192 cores per node). Compared to Zen 3, the Zen 4 architecture supports AVX-512, though not at full rate (AVX-512 instructions are executed on the 256-bit AVX2 datapaths).

Theoretical Flop performance

  • Theoretical Binary64 Flop/s per core: 2.4 GHz * 2 FMA/cycle * 0.5 AVX2 emulation * 2 SIMD operations/FMA * 8 scalar Binary64 operations/SIMD operation = 38.4 GFlop/s.

  • Per node: 2 * 96 * 38.4 = 7.37 TFlop/s.
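The Genoa peak arithmetic above, in Python (a sketch; the 0.5 factor models AVX-512 executing on the 256-bit datapaths):

```python
# Genoa peak Binary64: GHz * FMA/cycle * derate * 2 Flop/FMA * SIMD width.
core_gflops = 2.4 * 2 * 0.5 * 2 * 8
node_tflops = 2 * 96 * core_gflops / 1000   # 192 cores per node
print(f"{core_gflops:.1f} GFlop/s per core, {node_tflops:.2f} TFlop/s per node")
```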

Memory and throughput

Each socket handles 12 DDR5-4800 memory channels each fitted with 32 Gio DIMMs for a total of 768 Gio of main memory.

  • Theoretical memory throughput per device: 4800 MTransaction/second * 12 channel/socket * 8 byte/transaction = 460800 Mo/s or 429.1 Gio/s.

  • Per node: 858 Gio/s.

In practice, we reach 704 Gio/s (82% of peak) using AVX-512 with at least 48 cores spread over the two sockets’ L3 caches.
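The DDR5 figures and the measured fraction of peak can be checked in a few lines (a sketch; constants from the text):

```python
# DDR5 throughput per socket and per node, plus measured/peak ratio.
socket_go_per_s = 4.8 * 12 * 8                  # GT/s * channels * bytes/transaction
socket_gio_per_s = socket_go_per_s * 1e9 / 2**30
node_gio_per_s = 2 * socket_gio_per_s
measured_gio_per_s = 704                        # measured figure from the text
print(f"{socket_gio_per_s:.1f} Gio/s per socket, {node_gio_per_s:.0f} Gio/s per node, "
      f"measured/peak = {measured_gio_per_s / node_gio_per_s:.0%}")
```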

Architecture

  • Each CPU integrates 12 Core Chiplet Dies (CCDs) interconnected through a central IO die.

  • Each CCD contains 1 Core CompleX (CCX) with 8 cores.

  • Each core supports 2 hardware threads (SMT).

  • Cache hierarchy:

    • 32 Mio of L3 cache per CCX.

    • 1024 Kio of private L2 cache per core.

Note

Assuming 100 W for the DRAM, 360 W per socket and 125 W for the other components, we have 100 + 125 + 360 * 2 = 945 W per scalar node. We obtain a maximum efficiency of 7.373 TFlop/s / 945 W = 7.8 GFlop/J. Assuming 70 W for the DRAM, 180 W for the socket, 560 W for each MI250X and 180 W for the other components, we have 70 + 180 + 180 + 560 * 4 = 2670 W per accelerated node. We obtain a maximum efficiency of 191.5 TFlop/s / 2670 W = 71.7 GFlop/J. One could argue the accelerated nodes are 9 times more efficient than their same-generation scalar node equivalent.
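Reproducing the note's efficiency estimate (all wattages are the note's assumptions, not measurements):

```python
# Node power budgets and resulting GFlop/J, per the note's assumptions.
scalar_w = 100 + 125 + 2 * 360        # DRAM + other components + 2 sockets
scalar_eff = 7372.8 / scalar_w        # GFlop/s per W, i.e. GFlop/J
accel_w = 70 + 180 + 180 + 4 * 560    # DRAM + socket + other components + 4 MI250X
accel_eff = 191_500 / accel_w
print(f"{scalar_eff:.1f} GFlop/J (scalar) vs {accel_eff:.1f} GFlop/J (accelerated), "
      f"ratio ~{accel_eff / scalar_eff:.1f}")
```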

Note

Do use Flop/J or Flop/s/W but not Flop/s/J.

The diagram below (click on the picture to see it more clearly) represents the arrangement of the cores and cache hierarchy for the 2 AMD Genoa EPYC 9654 making up each of Adastra’s Genoa nodes. Note that the number of cores is not a power of two, which for some codes may lead to imbalance. We provide many rank to core binding configurations in this document and explain how to make use of them. Also, see the Understanding the Zen 4 Genoa CPU and Proper binding, why and how documents.

Adastra Genoa node architecture diagram

The relationship between cores, CCXs and CCDs is given in the figure below.

../_images/core_ccx_ccd.png

The preprocessing and postprocessing nodes (HPDA)

The 12 preprocessing and postprocessing nodes, also called High Performance Data Analytics (HPDA) nodes, are based on 2U HPE ProLiant DL385 Gen11 servers with 2 AMD Genoa processors. Each processor has 96 cores @ 2.4 GHz. Each node has 2 NVIDIA L40 graphics cards attached for handling the pre/post processing workloads. Every such node has 2048 Gio of RAM configured as 16 memory channels with 2 x 64 Gio DDR5-4800 MHz DIMMs per channel. For inter-node communications, each node has 1 Slingshot 200 Gb/s network interface and a dual-port 10 Gb/s Ethernet card. These nodes are connected to the rest of the machine via the Slingshot fabric and to the CINES facilities via Arista edge routers (see the diagram above).

System interconnect

The Adastra accelerated nodes are connected with 4 HPE-Cray Slingshot 200 Gb/s (25 Go/s, 23.3 Gio/s) NICs providing a total node-injection bandwidth of 800 Gb/s (100 Go/s, 93.1 Gio/s). Each MI250X accelerator is connected to a NIC to facilitate GPU-GPU Remote Direct Memory Access (RDMA). The GPUs are directly connected to HPE-Cray’s Slingshot fabric, which allows MPI operations such as send or receive to be executed directly from GPU memory and across the network without interaction with the host CPU. This improves throughput and latency by removing redundant copies for codes that communicate intensively with other nodes.

Operating system

Adastra runs the Red Hat Enterprise Linux (RHEL) 8.8 (Ootpa) operating system (cat /etc/os-release), with the Linux 4.18.0 kernel (uname -a).

File systems

Adastra is connected to the site-wide scratch LUSTRE filesystem providing 1.9 Pio of storage capacity with a measured peak read speed of 1086 Gio/s. It also has access to a home LUSTRE filesystem, to a work LUSTRE filesystem to keep data between jobs, and to a store LUSTRE filesystem to keep data and programs for a longer time (between allocations). See Accessing the storage areas for more details.

How many hours to ask for

Depending on your computation, you may be bound by the half precision, Binary32 or Binary64 floating point ALU throughput or by the HBM throughput. Below, we offer a comparison of the commonly found GPU resources as of 2024/04.

../_images/node_normalized_performance.png

For instance, we observe that an MI250X GCD’s Binary64 throughput is 23.95 TFlop/s and that it consumes 280 W. Also, for the node-normalized values, we assume we can pack 8 such MI250X GCDs in one node (which is what we have on Adastra).

As another example, for the node-normalized values, we observe that an A100-equipped node offers x1.24 the bandwidth of a node equipped with MI250Xs. Also, an H100 node’s Binary16 throughput is x3.16 that of an A100 node.

From this, assuming your code is memory bound, runs on V100s and that you want to run on MI250X, you can expect a x1.82 node-to-node memory throughput speedup. All things being equal, if you used to ask for 100000 V100 GPU hours, you would now ask for 100000/1.82 ~= 55000 MI250X GPU hours.
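The conversion above can be sketched as a tiny helper (the x1.82 figure comes from the text; `convert_gpu_hours` is an illustrative name, not a CINES tool):

```python
# Scale a GPU-hour budget by a node-to-node speedup factor, as in the
# V100 -> MI250X example above (memory-bound code assumed).
def convert_gpu_hours(old_hours, node_speedup):
    """Hours needed on the new machine for the same amount of work."""
    return old_hours / node_speedup

new_hours = convert_gpu_hours(100_000, 1.82)
print(f"{new_hours:.0f} MI250X GPU hours")
```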

Note

This computation does not take the DARI’s normalization factors into account as from a technical standpoint, they are debatable.

We recommend that you familiarize yourself with the concept of Minimal viable speedup. This can help you choose how to compare a CPU node versus a GPU node. This comparison is often badly done and biased toward the GPUs being much better than what they can actually provide (compared to a well-made CPU code).