Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Benchmark NVIDIA GPUDirect RDMA with InfiniBand Write Bandwidth
Introduction
Performance benchmarking is the hallmark of HPC. Most modern supercomputers are clusters of computing nodes with a heterogeneous architecture; such a node contains both classical CPUs and specialized computing co-processors (GPUs). This tutorial describes an approach to benchmarking NVIDIA GPUDirect Remote Direct Memory Access (GPUDirect RDMA) with a customized script built on InfiniBand write bandwidth (ib_write_bw).
Benchmarking GPUDirect RDMA with the ib_write_bw.sh script provides an easy and effective mechanism to benchmark GPUDirect RDMA in an HPC cluster without worrying about software installation, dependencies, or configuration. This script is included with OCI HPC stack 2.10.2 and above. A consolidated test report with details of all interfaces is displayed on the OCI Bastion console and stored at /tmp for future reference.
Remote Direct Memory Access (RDMA) enables Peripheral Component Interconnect Express (PCIe) devices to access GPU memory directly. Designed specifically for the needs of GPU acceleration, NVIDIA GPUDirect RDMA provides direct communication between NVIDIA GPUs in remote systems. It is a technology that enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. Enabled on Tesla and Quadro-class GPUs, GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose portions of device memory in a PCI Express Base Address Register (BAR) region.
Figure: GPU RDMA
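As a quick sanity check before benchmarking, you can inspect how the GPUs and the RDMA NICs are laid out on the PCIe fabric of a node. The following commands are a minimal sketch using standard NVIDIA and libibverbs tooling; device names will differ by shape.
# Show the PCIe/NVLink topology between GPUs and Mellanox NICs;
# entries such as PIX or PXB indicate a shared PCIe switch, the preferred path for GPUDirect RDMA.
nvidia-smi topo -m
# List the RDMA devices visible to the verbs stack (for example, mlx5_0 through mlx5_17 on GPU shapes)
ibv_devices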
The perftest package is a collection of tests written over uverbs, intended for use as a performance micro-benchmark. It contains a set of bandwidth and latency benchmarks such as:
- Send: ib_send_bw and ib_send_lat
- RDMA Read: ib_read_bw and ib_read_lat
- RDMA Write: ib_write_bw and ib_write_lat
- RDMA Atomic: ib_atomic_bw and ib_atomic_lat
- Native Ethernet (when working with MOFED2): raw_ethernet_bw and raw_ethernet_lat
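All of these utilities follow the same client-server pattern: start the chosen binary on the server node first, then run it again on the client node with the server's hostname or IP appended. As a hedged illustration (the device name and hostname below are placeholders), a basic host-memory send bandwidth test looks like this:
# On the server node: listen on RDMA device mlx5_0 and sweep all message sizes (-a)
ib_send_bw -d mlx5_0 -a
# On the client node: same options, plus the server's hostname or IP
ib_send_bw -d mlx5_0 -a compute-permanent-node-1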
Note: The perftest package needs to be compiled with Compute Unified Device Architecture (CUDA) to utilize the GPUDirect feature.
In this tutorial, we focus on a GPUDirect RDMA bandwidth test using InfiniBand write bandwidth (ib_write_bw), which can be used to test bandwidth and latency using RDMA write transactions. We will automate the installation and testing process with a custom wrapper script for native ib_write_bw. This script can be used to check ib_write_bw between two GPU nodes in the cluster. If CUDA is installed on the node, execution will recompile the perftest package with CUDA.
Objectives
- Benchmark NVIDIA GPUDirect RDMA with a customized script built on ib_write_bw.
Prerequisites
- CUDA Toolkit 11.7 or later.
- NVIDIA Open-Source GPU Kernel Modules version 515 or later.
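You can confirm these prerequisites on each compute node before running the benchmark. The checks below are a minimal sketch; the exact versions reported depend on your image, and the license string is how the open-source kernel modules typically identify themselves.
# CUDA Toolkit version (expect 11.7 or later)
nvcc --version
# NVIDIA driver version (expect 515 or later)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# The open-source GPU kernel modules report a Dual MIT/GPL license, unlike the proprietary modules
modinfo nvidia | grep -i license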
Manual Installation of perftest with CUDA DMA-BUF
Before proceeding with manual installation, ensure all prerequisites are met and that you are installing on a supported GPU shape.
- Configuration/Usage.
- Export the following environment variables, LD_LIBRARY_PATH and LIBRARY_PATH:
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
- Clone the perftest repository and compile it with CUDA:
  git clone https://github.com/linux-rdma/perftest.git
- After cloning, use the following commands:
  cd perftest/
  ./autogen.sh && ./configure CUDA_H_PATH=<path to cuda.h> && make -j
  For example: ./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j
Thus, the --use_cuda= flag will be available to add to a command line:
  ./ib_write_bw -d ib_dev --use_cuda=<gpu index> -a
Note: Manual testing of GPUDirect RDMA with ib_write_bw requires uninstalling the existing package and recompiling it with CUDA. You need to verify the node shape, GPU count, and active RDMA interfaces on the nodes manually before proceeding with benchmarking.
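Once perftest has been recompiled with CUDA as above, a manual GPUDirect RDMA run follows the usual client-server pattern, with --use_cuda selecting the GPU whose memory backs the transfer. The device name and GPU index below are examples only; choose an RDMA interface and a GPU that sit close together on the PCIe topology.
# On the server node: register the test buffer in the memory of GPU 0, using RDMA device mlx5_0
./ib_write_bw -d mlx5_0 --use_cuda=0 -a -F
# On the client node: identical options, plus the server's hostname or IP
./ib_write_bw -d mlx5_0 --use_cuda=0 -a -F compute-permanent-node-1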
Solution Overview
ib_write_bw.sh is a script that simplifies the GPUDirect RDMA benchmark process by automating all of the manual tasks related to it. The script can be triggered directly from the bastion with all of its arguments; there is no need to run it in a traditional client-server model. The script performs the following checks during execution. If any of these checks fail, it exits with an error.
- Node shape.
- CUDA installation.
- Total number of GPUs installed and the GPU ID entered.
- Active RDMA interfaces on server and client.
- Supported Shapes
- BM.GPU.B4.8
- BM.GPU.A100-v2.8
- BM.GPU4.8
- Prerequisites
- Supported GPU shape.
- Installed CUDA Drivers and Toolkits.
If all checks pass, ib_write_bw.sh will generate and execute an Ansible playbook to perform the installation and configuration.
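For reference, the checks that the script automates can also be performed by hand on a compute node. The commands below are a rough sketch and assume a standard OCI image with Mellanox OFED tooling installed.
# Node shape, from the OCI instance metadata service (IMDSv2)
curl -s -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/shape
# Number of GPUs installed
nvidia-smi -L | wc -l
# RDMA devices, their network interfaces, and link state
ibdev2netdev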
Script
- Name: ib_write_bw.sh
- Location: /opt/oci-hpc/scripts/
- Stack: HPC
- Stack version: 2.10.3 and above
Usage
sh ib_write_bw.sh -h
Usage:
./ib_write_bw.sh -s <server> -n <node> -c <y> -g <gpu id>
Options:
s Server hostname
n Client hostname.
c Enable cuda (Default: Disabled)
g GPU id (Default: 0)
h Print this help.
Logs are stored at /tmp/logs
e.g., sh ./ib_write_bw.sh -s compute-permanent-node-1 -n compute-permanent-node-2 -c y -g 2
Supported shapes: BM.GPU.B4.8,BM.GPU.A100-v2.8,BM.GPU4.8
Sample Output
sh ib_write_bw.sh -s compute-permanent-node-14 -n compute-permanent-node-965 -c y -g 1
Shape: "BM.GPU4.8"
Server: compute-permanent-node-14
Client: compute-permanent-node-965
Cuda: y
GPU id: 1
Checking interfaces...
PLAY [all] *******************************************************************************************************************************
TASK [Gathering Facts] *******************************************************************************************************************
ok: [compute-permanent-node-14]
ok: [compute-permanent-node-965]
TASK [check cuda] ************************************************************************************************************************
ok: [compute-permanent-node-965]
ok: [compute-permanent-node-14]
.
.
Testing active interfaces...
mlx5_0
mlx5_1
mlx5_2
mlx5_3
mlx5_6
mlx5_7
mlx5_8
mlx5_9
mlx5_10
mlx5_11
mlx5_12
mlx5_13
mlx5_14
mlx5_15
mlx5_16
mlx5_17
ib_server.sh 100% 630 2.8MB/s 00:00
ib_client.sh 100% 697 2.9MB/s 00:00
Server Interface: mlx5_0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 0F:00
CUDA device 1: PCIe address is 15:00
CUDA device 2: PCIe address is 51:00
CUDA device 3: PCIe address is 54:00
CUDA device 4: PCIe address is 8D:00
CUDA device 5: PCIe address is 92:00
CUDA device 6: PCIe address is D6:00
CUDA device 7: PCIe address is DA:00
Picking device No. 1
[pid = 129753, dev = 1] device name = [NVIDIA A100-SXM4-40GB]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007f29df200000 pointer=0x7f29df200000
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x008b PSN 0xe4ad79 RKey 0x181de0 VAddr 0x007f29df210000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:19:196
remote address: LID 0000 QPN 0x008b PSN 0x96f625 RKey 0x181de0 VAddr 0x007f9c4b210000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:16:13
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] CPU_Util[%]
65536 407514 0.00 35.61 0.067920 0.78
Test Summary
The GPUDirect RDMA test summary for each interface of the compute nodes is displayed on the bastion, and the same summary is stored in the folder /tmp/ib_bw on the bastion.
The important parameter to look for is BW average[Gb/sec].
************** Test Summary **************
Server interface: mlx5_0
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] CPU_Util[%]
65536 407514 0.00 35.61 0.067920 0.78
---------------------------------------------------------------------------------------
Server interface: mlx5_1
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] CPU_Util[%]
65536 407569 0.00 35.61 0.067929 0.78
---------------------------------------------------------------------------------------
Server interface: mlx5_2
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] CPU_Util[%]
65536 407401 0.00 35.60 0.067901 0.78
---------------------------------------------------------------------------------------
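The BW average[Gb/sec] column is the headline number; as a sanity check, it should roughly equal MsgRate[Mpps] × message size in bytes × 8 / 1000 (here 0.067920 × 65536 × 8 / 1000 is approximately 35.6 Gb/sec). To pull that column for every interface out of the stored reports, a one-liner such as the following can be used; it assumes the files under /tmp/ib_bw contain summaries in the plain-text format shown above.
# Print each server interface and its BW average[Gb/sec] from the reports stored on the bastion
awk '/Server interface:/ {iface=$3} $1 ~ /^[0-9]+$/ && NF >= 5 {print iface, $4, "Gb/sec"}' /tmp/ib_bw/*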
Acknowledgment
- Author - Anoop Nair
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
Benchmark NVIDIA GPUDirect RDMA with InfiniBand Write Bandwidth
F90875-01
December 2023
Copyright © 2023, Oracle and/or its affiliates.