Review and Validate the Configuration
Review the configuration, and validate GPU and network performance.
Review the Configuration
Log in to the bastion host and review the configuration.
- Connect to the bastion host as the opc user (the default on Oracle Linux instances) using its IP address and your private key.

    akua$ ssh -i ~/.ssh/cluster.key opc@139.87.214.247
    [opc@epsilon-bastion ~]$

  The df command shows the mounted file systems and their capacities:

    [opc@epsilon-bastion ~]$ df -h | grep -v tmp
    Filesystem                     Size  Used Avail Use% Mounted on
    /dev/sda3                       92G   14G   79G  15% /                <- boot (home) volume
    /dev/sda1                      200M  7.4M  193M   4% /boot/efi
    /dev/sdb                        20T   58M   20T   1% /export/cluster  <- additional volume
    172.16.0.75:/export/cluster     20T   57M   20T   1% /nfs/cluster
    172.16.6.4:/mnt/localdisk/nfs   13T   39G   13T   1% /nfs/scratch     <- worker node NVMe
- Edit the Slurm configuration.

  By default, Slurm removes containers automatically at the end of each job. Because you will likely want to reuse a container, it is more efficient to make containers persist across jobs with the container_scope argument. This greatly speeds up subsequent restarts that use the same container.

  In the file /etc/slurm/plugstack.conf, append container_scope=global so that it reads:

    [opc@epsilon-bastion ~]$ cat /etc/slurm/plugstack.conf
    required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
- Stop and restart Slurm on every GPU node and on the bastion host.

  The list of GPU node hostnames appears in the output of sinfo. Use it with the pdsh command to run systemctl on all nodes. A short verification sketch follows this list.

    [opc@epsilon-bastion ~]$ export PS1="\n$ "

    $ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    gpu*         up   infinite      2   idle gpu-permanent-node-[517,878]

    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl stop slurmd
    $ sudo systemctl restart slurmctld slurmdbd
    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl start slurmd
- Apply operating system updates.

  Consider updating the operating system to the latest packages. Use pdsh, as in the previous step, to update all nodes efficiently:

    # Oracle Linux 7:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo yum upgrade
    # Oracle Linux 8:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo dnf upgrade
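Before moving on, it can be worth confirming that the new setting is in place and that Slurm came back up cleanly. The following is a minimal verification sketch using only commands already introduced above; the node list is the example cluster's, so adjust it to match yours, and the plugstack.conf file is assumed to be present on the GPU nodes as well.

    # Confirm the Pyxis plugin line now carries container_scope=global (shown on the bastion).
    $ grep container_scope /etc/slurm/plugstack.conf
    required /usr/local/lib/slurm/spank_pyxis.so container_scope=global

    # Confirm slurmd is running again on the GPU nodes and that they report as idle.
    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl is-active slurmd
    $ sinfo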
Pull or Upload Containers
The NVIDIA Pyxis plugin for Slurm, together with the Enroot container utility, provides a containerized execution environment for the cluster that is integrated with the Slurm workload manager. These components were installed when the Pyxis and Enroot boxes were checked during software configuration.
For details about the srun --container options that Pyxis provides, see https://github.com/NVIDIA/pyxis/ or srun --help.
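If a node cannot reach the registry directly, or you want to stage an image once and reuse it from shared storage, Enroot can also import an image into a local squashfs file, and Pyxis accepts a path to that file in place of a registry reference. The following is a minimal sketch, not part of the original procedure: the output path under /nfs/cluster is an assumption based on the shared file systems mounted earlier, and the enroot CLI is assumed to be available wherever the import is run.

    # Import the image once into a squashfs file on shared storage
    # (the path is a placeholder; any shared location works).
    $ enroot import -o /nfs/cluster/tensorflow-22.11.sqsh \
        docker://nvcr.io#nvidia/tensorflow:22.11-tf2-py3

    # Run from the local file instead of pulling from the registry.
    $ srun -N 2 --ntasks-per-node 1 \
        --container-image=/nfs/cluster/tensorflow-22.11.sqsh \
        --container-name=tensorflow bash -c "hostname"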
- Verify that the container execution environment works as expected across the cluster.

  This example pulls a TensorFlow container from NVIDIA's nvcr.io registry and then runs a simple command, which confirms that the container environment behaves as expected. The first run downloads a large container from a remote location and can take 25 minutes or more to load and start executing.

    $ srun -N 2 --ntasks-per-node 1 \
      --container-image=nvcr.io#nvidia/tensorflow:22.11-tf2-py3 \
      --container-name=tensorflow bash -c "hostname; grep PRETTY /etc/os-release"
    pyxis: imported docker image: nvcr.io#nvidia/pytorch:21.09-py3
    pyxis: imported docker image: nvcr.io#nvidia/pytorch:21.09-py3
    gpu-permanent-node-517
    PRETTY_NAME="Ubuntu 20.04.3 LTS"
    gpu-permanent-node-878

  Subsequent jobs that use the named container require no download and start immediately:

    $ time srun -N 2 --ntasks-per-node 1 --container-name=tensorflow bash -c "hostname"
    gpu-permanent-node-878
    gpu-permanent-node-517

    real    0m0.394s
    user    0m0.006s
    sys     0m0.009s
- Optionally, load additional containers ahead of the jobs that will use them.

  Here, the NVIDIA NeMo Framework container is loaded in preparation for an LLM job. NVIDIA credentials in ~/.config/enroot/.credentials may be required to access GA or EA containers.

    $ cat .config/enroot/.credentials
    machine nvcr.io login $oauthtoken password vbmVtc2<snip>zU6YTFjNm

    $ time srun -N 2 --ntasks-per-node 1 \
      --container-image="nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03" \
      --container-name=nemo bash -c "hostname"
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    gpu-permanent-node-878
    gpu-permanent-node-517

    real    46m27.272s

  This larger container took nearly 47 minutes to import. A sketch for confirming that the named containers persisted on the nodes follows this list.
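Because container_scope=global keeps containers on the compute nodes after a job ends, you can optionally confirm that they persisted by listing Enroot containers on each node. This is a rough sketch; the exact names Enroot reports for Pyxis-managed containers may differ from the --container-name values.

    # List Enroot containers on each GPU node; the Pyxis-managed named
    # containers should appear in the listing.
    $ srun -N 2 --ntasks-per-node 1 bash -c "hostname; enroot list"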
Validate GPU and Network Performance
NVIDIA NCCL is a standalone library of standard communication routines for GPUs. nccl-tests reports the average NCCL operation time (in microseconds) along with the algorithm bandwidth and bus bandwidth in GB/s. These tests measure GPU and network performance and verify the correctness of the operations.
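As a sanity check on the numbers, the bus bandwidth that nccl-tests reports for AllReduce is derived from the algorithm bandwidth as busbw = algbw * 2 * (n-1) / n, where n is the number of ranks (see the nccl-tests performance documentation). The quick shell arithmetic below reproduces the single-node result shown later in this section; the input values are taken from that output.

    # AllReduce bus bandwidth: busbw = algbw * 2*(n-1)/n
    # With algbw = 134.00 GB/s and n = 8 GPUs (single-node run below):
    $ echo "134.00 * 2 * (8 - 1) / 8" | bc -l
    234.50000000000000000000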
- From the bastion host, get NVIDIA nccl-tests from GitHub and run the following command to build the executables:

    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      bash -c "cd /home/opc; git clone https://github.com/NVIDIA/nccl-tests.git; cd nccl-tests; make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu"
- Run nccl-test.

  The following runs the NCCL AllReduce operation on one cluster node using eight GPUs:

    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      --mpi pmi2 --gpus-per-node=8 bash -c "cd /home/opc/nccl-tests; \
      ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid 226178 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 226178 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 226178 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 226178 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 226178 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 226178 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 226178 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 226178 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #
    #                                                       out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
       10737418240    2684354560     float     sum      -1    80130  134.00  234.50      0    80171  133.93  234.38      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 234.439
    #
- Run NCCL AllReduce on two cluster nodes with 16 GPUs. This test exercises the inter-node cluster network.

    bastion$ srun -N 2 --ntasks-per-node 1 --container-name=tensorflow \
      --container-mounts "/home/opc:/home/opc" --mpi pmi2 --gpus-per-node=8 \
      bash -c "cd /home/opc/nccl-tests; export NCCL_IB_QPS_PER_CONNECTION=4; export NCCL_IB_GID_INDEX=3; ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid 231185 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 231185 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 231185 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 231185 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 231185 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 231185 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 231185 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 231185 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #  Rank  8 Group  0 Pid 221811 on gpu-permanent-node-878 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  9 Group  0 Pid 221811 on gpu-permanent-node-878 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank 10 Group  0 Pid 221811 on gpu-permanent-node-878 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank 11 Group  0 Pid 221811 on gpu-permanent-node-878 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank 12 Group  0 Pid 221811 on gpu-permanent-node-878 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank 13 Group  0 Pid 221811 on gpu-permanent-node-878 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank 14 Group  0 Pid 221811 on gpu-permanent-node-878 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank 15 Group  0 Pid 221811 on gpu-permanent-node-878 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #
    #                                                       out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
       10737418240    2684354560     float     sum      -1    90752  118.32  221.84      0    90977  118.02  221.29      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 221.568
    #
A successful result indicates that the cluster is ready to run your generative AI workloads.
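With the containers staged and the fabric validated, the same test can also be submitted as a batch job rather than interactively. The following is a minimal sbatch sketch, not part of the original procedure; the script and job names are placeholders, and the srun line simply reuses the options shown above with the persisted tensorflow container.

    $ cat nccl_check.sbatch
    #!/bin/bash
    #SBATCH --job-name=nccl-check        # placeholder name
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --gpus-per-node=8

    # Reuse the persisted named container; no image download is needed.
    srun --mpi pmi2 --container-name=tensorflow \
         --container-mounts "/home/opc:/home/opc" \
         bash -c "cd /home/opc/nccl-tests; export NCCL_IB_QPS_PER_CONNECTION=4; export NCCL_IB_GID_INDEX=3; ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"

    $ sbatch nccl_check.sbatch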