Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Benchmark NVIDIA GPUDirect RDMA with InfiniBand Write Bandwidth
Introduction
Performance benchmarking is the hallmark of HPC. Most modern supercomputers are clusters of computing nodes with a heterogeneous architecture; such a node contains both classical CPUs and specialized computing co-processors (GPUs). This tutorial describes an approach to benchmarking NVIDIA GPUDirect Remote Direct Memory Access (GPUDirect RDMA) with a customized script built on InfiniBand write bandwidth (ib_write_bw).
Benchmarking GPUDirect RDMA with the ib_write_bw.sh script provides an easy and effective mechanism to benchmark GPUDirect RDMA in an HPC cluster without worrying about software installation, dependencies, or configuration. This script is included with OCI HPC stack 2.10.2 and above. A consolidated test report with details of all interfaces is displayed on the OCI Bastion console and stored at /tmp for future reference.
Remote Direct Memory Access (RDMA) enables Peripheral Component Interconnect Express (PCIe) devices to access GPU memory directly. Designed specifically for the needs of GPU acceleration, NVIDIA GPUDirect RDMA provides direct communication between NVIDIA GPUs in remote systems. It is a technology that enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. Enabled on Tesla and Quadro-class GPUs, GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose portions of device memory in a PCI Express Base Address Register (BAR) region.
Figure: GPU RDMA
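As a quick sanity check before benchmarking, you can inspect how the GPUs and the RDMA NICs are laid out on the PCIe fabric of a node. The following commands are a minimal sketch using standard NVIDIA and libibverbs tooling; device names will differ by shape.
# Show the PCIe/NVLink topology between GPUs and Mellanox NICs;
# entries such as PIX or PXB indicate a shared PCIe switch, the preferred path for GPUDirect RDMA.
nvidia-smi topo -m
# List the RDMA devices visible to the verbs stack (for example, mlx5_0 through mlx5_17 on GPU shapes)
ibv_devices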
The perftest package is a collection of tests written over uverbs, intended for use as a performance micro-benchmark. It contains a set of bandwidth and latency benchmarks such as:
- Send: ib_send_bw and ib_send_lat
- RDMA Read: ib_read_bw and ib_read_lat
- RDMA Write: ib_write_bw and ib_write_lat
- RDMA Atomic: ib_atomic_bw and ib_atomic_lat
- Native Ethernet (when working with MOFED2): raw_ethernet_bw and raw_ethernet_lat
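All of these utilities follow the same client-server pattern: start the chosen binary on the server node first, then run it again on the client node with the server's hostname or IP appended. As a hedged illustration (the device name and hostname below are placeholders), a basic host-memory send bandwidth test looks like this:
# On the server node: listen on RDMA device mlx5_0 and sweep all message sizes (-a)
ib_send_bw -d mlx5_0 -a
# On the client node: same options, plus the server's hostname or IP
ib_send_bw -d mlx5_0 -a compute-permanent-node-1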
Note: The perftest package needs to be compiled with Compute Unified Device Architecture (CUDA) to utilize the GPUDirect feature.
In this tutorial, we focus on a GPUDirect RDMA bandwidth test using InfiniBand write bandwidth (ib_write_bw), which can be used to test bandwidth and latency using RDMA write transactions. We will automate the installation and testing process with a custom wrapper script for native ib_write_bw. This script can be used to check ib_write_bw between two GPU nodes in the cluster. If CUDA is installed on the node, execution will recompile the perftest package with CUDA.
Objectives
- Benchmark NVIDIA GPUDirect RDMA with a customized script built on ib_write_bw.
Prerequisites
- CUDA Toolkit 11.7 or later.
- NVIDIA Open-Source GPU Kernel Modules version 515 or later.
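You can confirm these prerequisites on each compute node before running the benchmark. The checks below are a minimal sketch; the exact versions reported depend on your image, and the license string is how the open-source kernel modules typically identify themselves.
# CUDA Toolkit version (expect 11.7 or later)
nvcc --version
# NVIDIA driver version (expect 515 or later)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# The open-source GPU kernel modules report a Dual MIT/GPL license, unlike the proprietary modules
modinfo nvidia | grep -i license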
Manual Installation of perftest with CUDA DMA-BUF
Before proceeding with manual installation, ensure all prerequisites are met and that you are installing on a supported GPU shape.
- Configuration/Usage.
- Export the following environment variables, LD_LIBRARY_PATH and LIBRARY_PATH:
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
- Clone the perftest repository and compile it with CUDA:
  git clone https://github.com/linux-rdma/perftest.git
- After cloning, use the following commands:
  cd perftest/
  ./autogen.sh && ./configure CUDA_H_PATH=<path to cuda.h> && make -j
  For example: ./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j
Thus, the --use_cuda= flag will be available to add to a command line:
  ./ib_write_bw -d ib_dev --use_cuda=<gpu index> -a
Note: Manual testing of GPUDirect RDMA with ib_write_bw requires uninstalling the existing package and recompiling it with CUDA. You need to verify the node shape, GPU count, and active RDMA interfaces on the nodes manually before proceeding with benchmarking.
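Once perftest has been recompiled with CUDA as above, a manual GPUDirect RDMA run follows the usual client-server pattern, with --use_cuda selecting the GPU whose memory backs the transfer. The device name and GPU index below are examples only; choose an RDMA interface and a GPU that sit close together on the PCIe topology.
# On the server node: register the test buffer in the memory of GPU 0, using RDMA device mlx5_0
./ib_write_bw -d mlx5_0 --use_cuda=0 -a -F
# On the client node: identical options, plus the server's hostname or IP
./ib_write_bw -d mlx5_0 --use_cuda=0 -a -F compute-permanent-node-1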
Solution Overview
ib_write_bw.sh is a script that simplifies the GPUDirect RDMA benchmark process by automating all of the manual tasks related to it. The script can be triggered directly from the bastion with all of its arguments; there is no need to run it in a traditional client-server model. The script performs the following checks during execution. If any of these checks fail, it exits with an error.
- Node shape.
- CUDA installation.
- Total number of GPUs installed and the GPU ID entered.
- Active RDMA interfaces on server and client.
- Supported Shapes
- BM.GPU.B4.8
- BM.GPU.A100-v2.8
- BM.GPU4.8
- Prerequisites
- Supported GPU shape.
- Installed CUDA Drivers and Toolkits.
If all checks pass, ib_write_bw.sh will generate and execute an Ansible playbook to perform the installation and configuration.
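For reference, the checks that the script automates can also be performed by hand on a compute node. The commands below are a rough sketch and assume a standard OCI image with Mellanox OFED tooling installed.
# Node shape, from the OCI instance metadata service (IMDSv2)
curl -s -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/shape
# Number of GPUs installed
nvidia-smi -L | wc -l
# RDMA devices, their network interfaces, and link state
ibdev2netdev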
Script
- Name: ib_write_bw.sh
- Location: /opt/oci-hpc/scripts/
- Stack: HPC
- Stack version: 2.10.3 and above
Usage
sh ib_write_bw.sh -h
Usage:
./ib_write_bw.sh -s <server> -n <node> -c <y> -g <gpu id>
Options:
s Server hostname
n Client hostname.
c Enable cuda (Default: Disabled)
g GPU id (Default: 0)
h Print this help.
Logs are stored at /tmp/logs
e.g., sh ./ib_write_bw.sh -s compute-permanent-node-1 -n compute-permanent-node-2 -c y -g 2
Supported shapes: BM.GPU.B4.8,BM.GPU.A100-v2.8,BM.GPU4.8
Sample Output
sh ib_write_bw.sh -s compute-permanent-node-14 -n compute-permanent-node-965 -c y -g 1
Shape: "BM.GPU4.8"
Server: compute-permanent-node-14
Client: compute-permanent-node-965
Cuda: y
GPU id: 1
Checking interfaces...
PLAY [all] *******************************************************************************************************************************
TASK [Gathering Facts] *******************************************************************************************************************
ok: [compute-permanent-node-14]
ok: [compute-permanent-node-965]
TASK [check cuda] ************************************************************************************************************************
ok: [compute-permanent-node-965]
ok: [compute-permanent-node-14]
.
.
Testing active interfaces...
mlx5_0
mlx5_1
mlx5_2
mlx5_3
mlx5_6
mlx5_7
mlx5_8
mlx5_9
mlx5_10
mlx5_11
mlx5_12
mlx5_13
mlx5_14
mlx5_15
mlx5_16
mlx5_17
ib_server.sh 100% 630 2.8MB/s 00:00
ib_client.sh 100% 697 2.9MB/s 00:00
Server Interface: mlx5_0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 0F:00
CUDA device 1: PCIe address is 15:00
CUDA device 2: PCIe address is 51:00
CUDA device 3: PCIe address is 54:00
CUDA device 4: PCIe address is 8D:00
CUDA device 5: PCIe address is 92:00
CUDA device 6: PCIe address is D6:00
CUDA device 7: PCIe address is DA:00
Picking device No. 1
[pid = 129753, dev = 1] device name = [NVIDIA A100-SXM4-40GB]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007f29df200000 pointer=0x7f29df200000
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x008b PSN 0xe4ad79 RKey 0x181de0 VAddr 0x007f29df210000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:19:196
remote address: LID 0000 QPN 0x008b PSN 0x96f625 RKey 0x181de0 VAddr 0x007f9c4b210000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:16:13
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] CPU_Util[%]
65536 407514 0.00 35.61 0.067920 0.78
Test Summary
The GPUDirect RDMA test summary for each interface of the compute nodes is displayed on the bastion, and the same summary is stored in the folder /tmp/ib_bw on the bastion.
The important parameter to look for is BW average[Gb/sec].
************** Test Summary **************
Server interface: mlx5_0
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] CPU_Util[%]
65536 407514 0.00 35.61 0.067920 0.78
---------------------------------------------------------------------------------------
Server interface: mlx5_1
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] CPU_Util[%]
65536 407569 0.00 35.61 0.067929 0.78
---------------------------------------------------------------------------------------
Server interface: mlx5_2
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] CPU_Util[%]
65536 407401 0.00 35.60 0.067901 0.78
---------------------------------------------------------------------------------------
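The BW average[Gb/sec] column is the headline number; as a sanity check, it should roughly equal MsgRate[Mpps] × message size in bytes × 8 / 1000 (here 0.067920 × 65536 × 8 / 1000 is approximately 35.6 Gb/sec). To pull that column for every interface out of the stored reports, a one-liner such as the following can be used; it assumes the files under /tmp/ib_bw contain summaries in the plain-text format shown above.
# Print each server interface and its BW average[Gb/sec] from the reports stored on the bastion
awk '/Server interface:/ {iface=$3} $1 ~ /^[0-9]+$/ && NF >= 5 {print iface, $4, "Gb/sec"}' /tmp/ib_bw/*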
Acknowledgment
- Author - Anoop Nair
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
Benchmark NVIDIA GPUDirect RDMA with InfiniBand Write Bandwidth
F90875-01
December 2023
Copyright © 2023, Oracle and/or its affiliates.