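Note:

This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.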

Benchmark NVIDIA GPUDirect RDMA with InfiniBand Write Bandwidth

Introduction

Performance benchmarking is the hallmark of HPC. Most modern supercomputers are clusters of compute nodes with heterogeneous architectures. Such a node contains both classical CPUs and specialized computing co-processors (GPUs). This tutorial describes how to benchmark NVIDIA GPUDirect Remote Direct Memory Access (GPUDirect RDMA) with a customized script built on InfiniBand write bandwidth (ib_write_bw).

Benchmarking GPUDirect RDMA with the ib_write_bw.sh script provides an easy and effective way to benchmark GPUDirect RDMA in an HPC cluster without worrying about software installation, dependencies, or configuration. This script is included with OCI HPC stack 2.10.2 and above. A consolidated test report with details of all interfaces is displayed on the OCI Bastion console and stored in /tmp for future reference.

Remote Direct Memory Access (RDMA) gives Peripheral Component Interconnect Express (PCIe) devices direct access to GPU memory. Designed specifically for the needs of GPU acceleration, NVIDIA GPUDirect RDMA provides direct communication between NVIDIA GPUs in remote systems. It is a technology that enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. It is enabled on Tesla and Quadro-class GPUs. GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose portions of device memory on a PCI Express Base Address Register region.

Figure: GPU RDMA

The perftest package is a collection of tests written over uverbs, intended for use as a performance micro-benchmark. It contains a set of bandwidth and latency benchmarks such as:

Send : ib_send_bw and ib_send_lat
RDMA Read : ib_read_bw and ib_read_lat
RDMA Write : ib_write_bw and ib_write_lat
RDMA Atomic : ib_atomic_bw and ib_atomic_lat
Native Ethernet (when working with MOFED2) : raw_ethernet_bw and raw_ethernet_lat

Note: To use the GPUDirect feature, the perftest package must be compiled with Compute Unified Device Architecture (CUDA).

In this tutorial, we focus on a GPUDirect RDMA bandwidth test using InfiniBand write bandwidth (ib_write_bw), which measures bandwidth using RDMA write transactions. A custom wrapper script for native ib_write_bw automates the installation and testing process. This script can be used to check ib_write_bw between two GPU nodes in the cluster. If CUDA is installed on the node, execution will recompile the perftest package with CUDA.

Objectives

Benchmark NVIDIA GPUDirect RDMA with a customized script built on ib_write_bw.

Prerequisites

CUDA Toolkit 11.7 or later.
NVIDIA Open-Source GPU Kernel Modules version 515 or later installed.

Manual Installation of `perftest` with CUDA DMA-BUF

Before proceeding with manual installation, ensure that all prerequisites are met and that you are installing on a supported GPU shape.

Configuration/Usage

Export the following environment variables, LD_LIBRARY_PATH and LIBRARY_PATH.

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH

Clone the perftest repository and compile it with CUDA.

git clone https://github.com/linux-rdma/perftest.git

After cloning, use the following commands.

cd perftest/
./autogen.sh && ./configure CUDA_H_PATH=<path to cuda.h> && make -j

For example:

./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j
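Before running configure, it can help to confirm that cuda.h actually exists at the path passed to CUDA_H_PATH. The following is a minimal sketch, assuming the default toolkit root /usr/local/cuda. The find_cuda_h helper and the CUDA_HOME variable are illustrative names and are not part of perftest.

```shell
# Hypothetical helper (not part of perftest): print the path to cuda.h under
# a CUDA install root, or return non-zero if the header is missing.
find_cuda_h() {
  cuda_root="${1:-/usr/local/cuda}"     # assumed default install root
  header="${cuda_root}/include/cuda.h"
  if [ -f "$header" ]; then
    printf '%s\n' "$header"
  else
    printf 'cuda.h not found under %s\n' "$cuda_root" >&2
    return 1
  fi
}

# Usage (run inside the perftest/ directory):
#   CUDA_H=$(find_cuda_h "${CUDA_HOME:-/usr/local/cuda}") &&
#     ./autogen.sh && ./configure CUDA_H_PATH="$CUDA_H" && make -j
```

Failing early on a missing header avoids a long autogen and configure run that would otherwise build perftest without CUDA support.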

The --use_cuda=<gpu index> flag is then available on the command line.

./ib_write_bw -d ib_dev --use_cuda=<gpu index> -a
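In a manual run, perftest tools follow a client-server model: the command is started first on the server node, and then on the client node with the server hostname appended. The sketch below only prints the two command lines and executes nothing. The print_ib_write_bw_cmds helper, the device name mlx5_0, and the hostname are illustrative assumptions.

```shell
# Hypothetical helper: print the server-side and client-side commands for one
# manual ib_write_bw run. Nothing is executed.
print_ib_write_bw_cmds() {
  ib_dev="$1"         # RDMA device name, for example mlx5_0
  gpu_index="$2"      # CUDA device index passed to --use_cuda
  server_host="$3"    # hostname of the node running the server side
  printf 'server: ./ib_write_bw -d %s --use_cuda=%s -a\n' "$ib_dev" "$gpu_index"
  printf 'client: ./ib_write_bw -d %s --use_cuda=%s -a %s\n' "$ib_dev" "$gpu_index" "$server_host"
}

print_ib_write_bw_cmds mlx5_0 0 compute-permanent-node-1
```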

Note: Manual testing of GPUDirect RDMA with ib_write_bw requires uninstalling the existing package and recompiling it with CUDA. Before benchmarking, you must manually verify the node shape, the GPU count, and the active RDMA interfaces on the nodes.

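Solution Overview

ib_write_bw.sh is a script that simplifies the GPUDirect RDMA benchmark process by automating all manual tasks related to it. The script can be triggered directly from the bastion itself with all arguments; there is no need to run it in a traditional client-server model. The script performs the following checks during execution. If any of these checks fails, it exits with an error.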

Node shape.
CUDA installation.
Total number of GPUs installed and GPU ID entered.
Active RDMA interfaces on server and client.

Supported shapes
- BM.GPU.B4.8
- BM.GPU.A100-v2.8
- BM.GPU4.8
Prerequisites
- Supported GPU shape.
- Installed CUDA Drivers and Toolkits.

If all checks pass, ib_write_bw.sh generates and runs an Ansible playbook to perform the installation and configuration.
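The shape and GPU ID checks described above can be sketched as two small shell functions. This is an illustrative sketch, not the source of ib_write_bw.sh. The function names are hypothetical, and how the real script obtains the node shape and GPU count is not shown here.

```shell
# Shapes listed as supported in this tutorial.
SUPPORTED_SHAPES="BM.GPU.B4.8 BM.GPU.A100-v2.8 BM.GPU4.8"

# Hypothetical check: return 0 if the given shape is in the supported list.
is_supported_shape() {
  for s in $SUPPORTED_SHAPES; do
    if [ "$s" = "$1" ]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical check: return 0 if gpu_id is a non-negative integer that is
# smaller than the number of GPUs installed on the node.
is_valid_gpu_id() {
  gpu_id="$1"
  gpu_count="$2"
  case "$gpu_id" in
    ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
  esac
  [ "$gpu_id" -lt "$gpu_count" ]
}

# Example: a BM.GPU4.8 node with 8 GPUs and GPU ID 2 passes both checks.
if is_supported_shape "BM.GPU4.8" && is_valid_gpu_id 2 8; then
  echo "checks passed"
else
  echo "checks failed" >&2
fi
```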

Script

Name: ib_write_bw.sh
Location: /opt/oci-hpc/scripts/
Stack: HPC
Stack version: 2.10.3 and above

Usage

sh ib_write_bw.sh -h

Usage:
./ib_write_bw.sh -s <server> -n <node> -c <y> -g <gpu id>
Options:
s     Server hostname
n     Client hostname.
c     Enable cuda (Default: Disabled)
g     GPU id (Default: 0)
h     Print this help.
Logs are stored at /tmp/logs
e.g.,  sh ./ib_write_bw.sh -s compute-permanent-node-1 -n compute-permanent-node-2 -c y -g 2
Supported shapes: BM.GPU.B4.8,BM.GPU.A100-v2.8,BM.GPU4.8
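The options in the help text map naturally onto a getopts loop. The following is a minimal sketch of such a parser, not the actual source of ib_write_bw.sh. The parse_args function, the variable names, the use of "n" for disabled CUDA, and the rule that both -s and -n are required are assumptions made for illustration.

```shell
# Hypothetical option parser mirroring the documented flags.
parse_args() {
  server=""; client=""; cuda="n"; gpu_id=0   # defaults based on the help text
  OPTIND=1
  while getopts "s:n:c:g:h" opt; do
    case "$opt" in
      s) server="$OPTARG" ;;
      n) client="$OPTARG" ;;
      c) cuda="$OPTARG" ;;
      g) gpu_id="$OPTARG" ;;
      h) echo "Usage: ./ib_write_bw.sh -s <server> -n <node> -c <y> -g <gpu id>"; return 2 ;;
      *) return 1 ;;
    esac
  done
  # Assumption: both hostnames are mandatory.
  if [ -z "$server" ] || [ -z "$client" ]; then
    echo "both -s and -n are required" >&2
    return 1
  fi
}

parse_args -s compute-permanent-node-1 -n compute-permanent-node-2 -c y -g 2 &&
  echo "server=$server client=$client cuda=$cuda gpu_id=$gpu_id"
```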

Sample Output

sh ib_write_bw.sh -s compute-permanent-node-14 -n compute-permanent-node-965 -c y -g 1
Shape: "BM.GPU4.8"
Server: compute-permanent-node-14
Client: compute-permanent-node-965
Cuda: y
GPU id: 1
Checking interfaces...
PLAY [all] *******************************************************************************************************************************
TASK [Gathering Facts] *******************************************************************************************************************
ok: [compute-permanent-node-14]
ok: [compute-permanent-node-965]
TASK [check cuda] ************************************************************************************************************************
ok: [compute-permanent-node-965]
ok: [compute-permanent-node-14]
.
.
Testing active interfaces...
mlx5_0
mlx5_1
mlx5_2
mlx5_3
mlx5_6
mlx5_7
mlx5_8
mlx5_9
mlx5_10
mlx5_11
mlx5_12
mlx5_13
mlx5_14
mlx5_15
mlx5_16
mlx5_17
ib_server.sh                                                                                            100%  630     2.8MB/s   00:00
ib_client.sh                                                                                            100%  697     2.9MB/s   00:00
Server Interface: mlx5_0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 0F:00
CUDA device 1: PCIe address is 15:00
CUDA device 2: PCIe address is 51:00
CUDA device 3: PCIe address is 54:00
CUDA device 4: PCIe address is 8D:00
CUDA device 5: PCIe address is 92:00
CUDA device 6: PCIe address is D6:00
CUDA device 7: PCIe address is DA:00
Picking device No. 1
[pid = 129753, dev = 1] device name = [NVIDIA A100-SXM4-40GB]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007f29df200000 pointer=0x7f29df200000
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
Dual-port       : OFF                    Device         : mlx5_0
Number of qps   : 1                   Transport type : IB
Connection type : RC                 Using SRQ      : OFF
PCIe relax order: ON
ibv_wr* API     : ON
TX depth        : 128
CQ Moderation   : 1
Mtu             : 4096[B]
Link type       : Ethernet
GID index       : 3
Max inline data : 0[B]
rdma_cm QPs  : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x008b PSN 0xe4ad79 RKey 0x181de0 VAddr 0x007f29df210000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:19:196
remote address: LID 0000 QPN 0x008b PSN 0x96f625 RKey 0x181de0 VAddr 0x007f9c4b210000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:16:13
---------------------------------------------------------------------------------------
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407514           0.00                                35.61                   0.067920           0.78
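When reading the result row, the MsgRate column can be cross-checked against the average bandwidth and the message size: messages per second equal bits per second divided by bits per message. The sketch below assumes that Gb/sec means 10^9 bits per second, which agrees with the sample row above. The msg_rate_mpps helper is a hypothetical name.

```shell
# Hypothetical helper: derive the message rate in Mpps from the average
# bandwidth (Gb/sec) and the message size (bytes).
# Assumes 1 Gb = 1000000000 bits.
msg_rate_mpps() {
  bw_gbps="$1"
  msg_bytes="$2"
  awk -v bw="$bw_gbps" -v size="$msg_bytes" \
    'BEGIN { printf "%.4f\n", (bw * 1000000000) / (size * 8) / 1000000 }'
}

msg_rate_mpps 35.61 65536   # prints 0.0679
```

A large gap between the computed and the reported rate would suggest that the row was misread or that a different unit was in effect.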

Test Summary

A GPUDirect RDMA test summary for each interface of the compute nodes is displayed on the bastion and is also stored in the /tmp/ib_bw folder on the bastion.

************** Test Summary **************
Server interface: mlx5_0
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407514           0.00                                35.61                   0.067920           0.78
---------------------------------------------------------------------------------------
Server interface: mlx5_1
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407569           0.00                                35.61                   0.067929           0.78
---------------------------------------------------------------------------------------
Server interface: mlx5_2
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407401           0.00                                35.60                   0.067901           0.78
---------------------------------------------------------------------------------------
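To compare interfaces at a glance, the summary can be reduced to one line per interface. The following sketch assumes the layout shown above: a "Server interface:" line, a header line starting with '#', and one result row whose fourth field is the average bandwidth. The summarize_bw function is a hypothetical helper and is not shipped with the HPC stack.

```shell
# Hypothetical helper: print "<interface> <BW average in Gb/sec>" for each
# block of a test summary read from the given files or from standard input.
summarize_bw() {
  awk '
    /^Server interface:/ { iface = $3; next }    # remember the interface name
    /^#/ || /^-/ || /^\*/ { next }               # skip headers and separators
    iface != "" && NF >= 4 { print iface, $4; iface = "" }
  ' "$@"
}

summarize_bw <<'EOF'
Server interface: mlx5_0
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407514           0.00                                35.61                   0.067920           0.78
---------------------------------------------------------------------------------------
Server interface: mlx5_1
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407569           0.00                                35.61                   0.067929           0.78
EOF
```

The same function accepts a file path as its argument, so it can be pointed at a summary file saved on the bastion.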

Acknowledgments

Author - Anoop Nair

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.

Title and Copyright Information

Benchmark NVIDIA GPUDirect RDMA with InfiniBand Write Bandwidth

F90875-01

December 2023
