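Note:

This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.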

Benchmark NVIDIA GPUDirect RDMA with InfiniBand Write Bandwidth

Introduction

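Performance benchmarking is the hallmark of HPC. The most modern supercomputers are clusters of computing nodes with a heterogeneous architecture. In such a node, we can see both classical CPUs and specialized computing coprocessors (GPUs). This tutorial describes an approach to benchmark NVIDIA GPUDirect Remote Direct Memory Access (GPUDirect RDMA) with a customized script built over InfiniBand write bandwidth (ib_write_bw).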

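Benchmarking GPUDirect RDMA with the ib_write_bw.sh script provides a simple and effective mechanism to run GPUDirect RDMA benchmarks in an HPC cluster without worrying about software installation, dependencies, or configuration. The script is included in OCI HPC stack 2.10.2 and later. A consolidated test report with details of all interfaces is displayed on the OCI bastion console and stored in /tmp for future reference.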

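Remote Direct Memory Access (RDMA) gives Peripheral Component Interconnect Express (PCIe) devices direct access to GPU memory. Designed specifically for the needs of GPU acceleration, NVIDIA GPUDirect RDMA provides direct communication between NVIDIA GPUs in remote systems. The technology enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. It is enabled on Tesla and Quadro-class GPUs. GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose portions of device memory on a PCI Express Base Address Register region.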

[Figure: GPU RDMA (gpurdma)]

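The perftest package is a collection of tests written over uverbs, intended for use as a performance micro-benchmark. It contains a set of bandwidth and latency benchmarks such as:

Send: ib_send_bw and ib_send_lat
RDMA Read: ib_read_bw and ib_read_lat
RDMA Write: ib_write_bw and ib_write_lat
RDMA Atomic: ib_atomic_bw and ib_atomic_lat
Native Ethernet (when working with MOFED2): raw_ethernet_bw and raw_ethernet_lat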

Note: The perftest package needs to be compiled with Compute Unified Device Architecture (CUDA) to use the GPUDirect feature.

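In this tutorial, we focus on a GPUDirect RDMA bandwidth test that uses InfiniBand write bandwidth (ib_write_bw), which can be used to test bandwidth and latency using RDMA write transactions. We automate the installation and testing process with a custom wrapper script around the native ib_write_bw. The script can be used to check ib_write_bw between two GPU nodes in the cluster. If CUDA is installed on the node, execution recompiles the perftest package with CUDA.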

Objectives

Benchmark NVIDIA GPUDirect RDMA with a customized script built over ib_write_bw.

Prerequisites

CUDA Toolkit 11.7 or later.
NVIDIA Open-Source GPU Kernel Modules version 515 or later installed.
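
As a rough pre-flight sketch, the version minimums above can be checked from a shell. This assumes `nvcc` and `nvidia-smi` are on `PATH` and that GNU `sort -V` is available; the helper name `version_ge` is hypothetical. It compares version numbers only and does not verify that the installed driver is the open-source kernel module flavor.

```shell
# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# CUDA Toolkit: nvcc prints a line containing "release <major>.<minor>".
if command -v nvcc >/dev/null 2>&1; then
    cuda_ver="$(nvcc --version | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')"
    if version_ge "$cuda_ver" "11.7"; then
        echo "CUDA Toolkit $cuda_ver: OK"
    else
        echo "CUDA Toolkit $cuda_ver: older than 11.7"
    fi
else
    echo "nvcc not found; is the CUDA Toolkit on PATH?"
fi

# NVIDIA driver: compare the reported driver version against 515.
if command -v nvidia-smi >/dev/null 2>&1; then
    drv_ver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if version_ge "$drv_ver" "515"; then
        echo "NVIDIA driver $drv_ver: OK"
    else
        echo "NVIDIA driver $drv_ver: older than 515"
    fi
else
    echo "nvidia-smi not found; is the NVIDIA driver installed?"
fi
```

On a node without these tools the sketch only prints the "not found" messages, so it is safe to run before anything is installed.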

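Manual Installation of `perftest` with CUDA DMA-BUF

Before proceeding with manual installation, make sure that all prerequisites are met and that you are installing on a supported GPU shape.

Configuration/Usage.

Export the LD_LIBRARY_PATH and LIBRARY_PATH environment variables.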

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH

Clone the perftest repository and compile it with CUDA.
```
git clone https://github.com/linux-rdma/perftest.git
```

After cloning, use the following commands.

cd perftest/
./autogen.sh && ./configure CUDA_H_PATH=<path to cuda.h> && make -j, e.g.:
./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j

The --use_cuda=<gpu index> flag then becomes available on the command line:

./ib_write_bw -d ib_dev --use_cuda=<gpu index> -a
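
perftest tools run as a pair: the same command is started on the server node first, then on the client node with the server's hostname appended as a positional argument. The sketch below only builds and prints the two command lines, so it can be run anywhere. The helper name `ib_write_bw_cmd` and the hostname `gpu-node-1` are hypothetical, and `mlx5_0` stands in for whichever RDMA device is being tested.

```shell
# ib_write_bw_cmd DEVICE GPU_INDEX [SERVER_HOST]
# Without SERVER_HOST the printed command waits for a client (server side);
# with SERVER_HOST it connects to that host (client side).
ib_write_bw_cmd() {
    cmd="./ib_write_bw -d $1 --use_cuda=$2 -a"
    if [ -n "${3:-}" ]; then
        cmd="$cmd $3"
    fi
    printf '%s\n' "$cmd"
}

ib_write_bw_cmd mlx5_0 0              # run the printed command on the server node first
ib_write_bw_cmd mlx5_0 0 gpu-node-1   # then run this one on the client node
```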

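Note: Testing GPUDirect RDMA manually with ib_write_bw requires uninstalling the existing package and recompiling it with CUDA. We must manually verify the node shape, the GPU count, and the active RDMA interfaces on the nodes before proceeding with benchmarking.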
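
One possible way to do the manual verification is sketched below. It assumes `ibv_devinfo` (from rdma-core) and `nvidia-smi` are installed; the helper name `active_rdma_devices` is hypothetical. The parser expects the usual `hca_id:` and `state: PORT_ACTIVE` lines of `ibv_devinfo` output, and the GPU count is simply the number of lines printed by `nvidia-smi -L`, which may overcount on systems that also list MIG instances.

```shell
# active_rdma_devices: read `ibv_devinfo` output on stdin and print the hca_id
# of every device that has at least one port in the PORT_ACTIVE state.
active_rdma_devices() {
    awk '
        $1 == "hca_id:" { dev = $2 }
        $1 == "state:" && $2 == "PORT_ACTIVE" && dev != "" { print dev; dev = "" }
    '
}

if command -v ibv_devinfo >/dev/null 2>&1; then
    echo "Active RDMA devices:"
    ibv_devinfo | active_rdma_devices
fi

if command -v nvidia-smi >/dev/null 2>&1; then
    echo "GPU count: $(nvidia-smi -L | wc -l)"
fi
```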

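Solution Overview

ib_write_bw.sh is a script that simplifies the GPUDirect RDMA benchmark process by automating all manual tasks related to it. The script can be triggered directly from the bastion itself with all arguments; there is no need to run it in a traditional client-server model. The script performs the following checks during execution. If any of these checks fail, it exits with an error.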

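Node shape.
CUDA installation.
Total number of GPUs installed and the GPU ID entered.
Active RDMA interfaces on the server and client.

Supported shapes
- BM.GPU.B4.8
- BM.GPU.A100-v2.8
- BM.GPU4.8

Prerequisites
- Supported GPU shape.
- CUDA drivers and toolkits installed.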

If all checks pass, ib_write_bw.sh generates and executes an Ansible playbook to perform the installation and configuration.
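
The fail-fast behavior of these checks can be illustrated with a small sketch. This is not the code of ib_write_bw.sh; the function names `is_supported_shape`, `valid_gpu_id`, and `preflight` are hypothetical, and the shape and GPU count are passed in as arguments rather than detected from the node.

```shell
# is_supported_shape SHAPE: succeeds for the shapes listed in this tutorial.
is_supported_shape() {
    case "$1" in
        BM.GPU.B4.8|BM.GPU.A100-v2.8|BM.GPU4.8) return 0 ;;
        *) return 1 ;;
    esac
}

# valid_gpu_id ID COUNT: succeeds when ID is a non-negative integer below COUNT.
valid_gpu_id() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -lt "$2" ]
}

# preflight SHAPE GPU_ID GPU_COUNT: stop with an error at the first failed check.
preflight() {
    is_supported_shape "$1" || { echo "Unsupported shape: $1" >&2; return 1; }
    valid_gpu_id "$2" "$3" || { echo "GPU id $2 is not in the range 0..$(( $3 - 1 ))" >&2; return 1; }
    echo "All checks passed"
}

preflight "BM.GPU4.8" 1 8
```

Returning a non-zero status from the first failing check lets a caller stop before any installation step runs.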

Script

Name: ib_write_bw.sh
Location: /opt/oci-hpc/scripts/
Stack: HPC
Stack version: 2.10.3 and above

Usage

sh ib_write_bw.sh -h

Usage:
./ib_write_bw.sh -s <server> -n <node> -c <y> -g <gpu id>
Options:
s     Server hostname
n     Client hostname.
c     Enable cuda (Default: Disabled)
g     GPU id (Default: 0)
h     Print this help.
Logs are stored at /tmp/logs
e.g.,  sh ./ib_write_bw.sh -s compute-permanent-node-1 -n compute-permanent-node-2 -c y -g 2
Supported shapes: BM.GPU.B4.8,BM.GPU.A100-v2.8,BM.GPU4.8
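
For readers who want to build a similar wrapper, the options in the help text map naturally onto POSIX `getopts`. The following is a minimal sketch, not the actual source of ib_write_bw.sh: the function name `parse_args` and the variable names are hypothetical, and treating both `-s` and `-n` as required is an assumption.

```shell
# parse_args: option parsing that mirrors the help text above.
# Sets the variables server, client, cuda, and gpu_id.
parse_args() {
    server="" client="" cuda="n" gpu_id=0
    OPTIND=1   # reset so the function can be called more than once
    while getopts "s:n:c:g:h" opt; do
        case "$opt" in
            s) server="$OPTARG" ;;
            n) client="$OPTARG" ;;
            c) cuda="$OPTARG" ;;
            g) gpu_id="$OPTARG" ;;
            h) echo "Usage: ./ib_write_bw.sh -s <server> -n <node> -c <y> -g <gpu id>"; return 2 ;;
            *) return 1 ;;
        esac
    done
    if [ -z "$server" ] || [ -z "$client" ]; then
        echo "Both -s and -n are required" >&2   # assumption, see lead-in
        return 1
    fi
}

parse_args -s compute-permanent-node-1 -n compute-permanent-node-2 -c y -g 2 &&
    echo "server=$server client=$client cuda=$cuda gpu_id=$gpu_id"
```

The defaults follow the help text: CUDA is disabled and the GPU id is 0 unless `-c` and `-g` are given.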

Sample Output

sh ib_write_bw.sh -s compute-permanent-node-14 -n compute-permanent-node-965 -c y -g 1
Shape: "BM.GPU4.8"
Server: compute-permanent-node-14
Client: compute-permanent-node-965
Cuda: y
GPU id: 1
Checking interfaces...
PLAY [all] *******************************************************************************************************************************
TASK [Gathering Facts] *******************************************************************************************************************
ok: [compute-permanent-node-14]
ok: [compute-permanent-node-965]
TASK [check cuda] ************************************************************************************************************************
ok: [compute-permanent-node-965]
ok: [compute-permanent-node-14]
.
.
Testing active interfaces...
mlx5_0
mlx5_1
mlx5_2
mlx5_3
mlx5_6
mlx5_7
mlx5_8
mlx5_9
mlx5_10
mlx5_11
mlx5_12
mlx5_13
mlx5_14
mlx5_15
mlx5_16
mlx5_17
ib_server.sh                                                                                            100%  630     2.8MB/s   00:00
ib_client.sh                                                                                            100%  697     2.9MB/s   00:00
Server Interface: mlx5_0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 0F:00
CUDA device 1: PCIe address is 15:00
CUDA device 2: PCIe address is 51:00
CUDA device 3: PCIe address is 54:00
CUDA device 4: PCIe address is 8D:00
CUDA device 5: PCIe address is 92:00
CUDA device 6: PCIe address is D6:00
CUDA device 7: PCIe address is DA:00
Picking device No. 1
[pid = 129753, dev = 1] device name = [NVIDIA A100-SXM4-40GB]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007f29df200000 pointer=0x7f29df200000
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
Dual-port       : OFF                    Device         : mlx5_0
Number of qps   : 1                   Transport type : IB
Connection type : RC                 Using SRQ      : OFF
PCIe relax order: ON
ibv_wr* API     : ON
TX depth        : 128
CQ Moderation   : 1
Mtu             : 4096[B]
Link type       : Ethernet
GID index       : 3
Max inline data : 0[B]
rdma_cm QPs  : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x008b PSN 0xe4ad79 RKey 0x181de0 VAddr 0x007f29df210000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:19:196
remote address: LID 0000 QPN 0x008b PSN 0x96f625 RKey 0x181de0 VAddr 0x007f9c4b210000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:16:13
---------------------------------------------------------------------------------------
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407514           0.00                                35.61                   0.067920           0.78

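Test Summary

The GPUDirect RDMA test summary for each interface of the compute nodes is displayed on the bastion and is also stored in the folder /tmp/ib_bw on the bastion.

The important parameter to look for is BW average[Gb/sec].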

************** Test Summary **************
Server interface: mlx5_0
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407514           0.00                                35.61                   0.067920           0.78
---------------------------------------------------------------------------------------
Server interface: mlx5_1
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407569           0.00                                35.61                   0.067929           0.78
---------------------------------------------------------------------------------------
Server interface: mlx5_2
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407401           0.00                                35.60                   0.067901           0.78
---------------------------------------------------------------------------------------
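
To pull the BW average out of a summary like the one above, a short awk filter is enough. This sketch assumes the layout shown here (a "Server interface:" line followed by a data row whose fourth field is the BW average); the helper names `summarize_bw` and `flag_slow` are hypothetical, and any threshold passed to `flag_slow` is the reader's own choice rather than a value from this tutorial.

```shell
# summarize_bw: read a test summary on stdin and print "<interface> <BW average>".
summarize_bw() {
    awk '
        /^Server interface:/ { iface = $3; next }
        # Data rows start with the message size; field 4 is BW average[Gb/sec].
        iface != "" && $1 ~ /^[0-9]+$/ && NF >= 4 { print iface, $4; iface = "" }
    '
}

# flag_slow MIN: read summarize_bw output and print interfaces below MIN Gb/sec.
flag_slow() {
    awk -v min="$1" '$2 + 0 < min + 0 { print $1 }'
}

summarize_bw <<'EOF'
Server interface: mlx5_0
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407514           0.00                                35.61                   0.067920           0.78
---------------------------------------------------------------------------------------
Server interface: mlx5_2
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
65536      407401           0.00                                35.60                   0.067901           0.78
EOF
```

Piping a saved summary through `summarize_bw | flag_slow <threshold>` lists only the interfaces whose average falls below the chosen threshold.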

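Acknowledgments

Author - Anoop Nair

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.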

Title and Copyright Information

Benchmark NVIDIA GPUDirect RDMA with InfiniBand Write Bandwidth

F90875-01

December 2023