
Review the configuration, and validate GPU and network performance.



  1. Connect to the bastion host as the opc user (the default for Oracle Linux instances), using its IP address and your private key.
    akua$ ssh -i ~/.ssh/cluster.key opc@
    [opc@epsilon-bastion ~]$
  2. The df command shows the mounted file systems and their capacity:
    [opc@epsilon-bastion ~]$ df -h | grep -v tmp
    Filesystem                     Size  Used Avail Use% Mounted on
    /dev/sda3                       92G   14G   79G  15% /                <- boot (home) volume
    /dev/sda1                      200M  7.4M  193M   4% /boot/efi
    /dev/sdb                        20T   58M   20T   1% /export/cluster  <- Additional volume
                                    20T   57M   20T   1% /nfs/cluster
                                    13T   39G   13T   1% /nfs/scratch     <- worker node NVMe
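  3. Edit the Slurm configuration.

    By default, Slurm automatically removes containers when a job ends. Because you will likely want to reuse a container, it is more efficient to make containers persist across jobs with the container_scope argument. This greatly speeds up subsequent restarts that use the same container.

    In the file /etc/slurm/plugstack.conf, append container_scope=global so that it looks like this: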

    [opc@epsilon-bastion ~]$ cat /etc/slurm/plugstack.conf
    required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
  4. Stop and restart Slurm on each GPU node and on the bastion host.
    The list of GPU node host names is shown in the output of sinfo. Use it with the pdsh command to run systemctl on all nodes.
    [opc@epsilon-bastion ~]$ export PS1="\n$ "
    $ sinfo
    gpu*         up   infinite      2   idle gpu-permanent-node-[517,878]
    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl stop slurmd
    $ sudo systemctl restart slurmctld slurmdbd
    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl start slurmd
    $ cat /etc/slurm/plugstack.conf
    required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
  5. Apply operating system updates.

    Consider updating the operating system to the latest packages. Use pdsh, as in the previous step, to update all nodes efficiently:

    # Oracle Linux 7:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo yum upgrade
    # Oracle Linux 8:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo dnf upgrade
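
Steps 4 and 5 copy the node list from sinfo by hand. As a minimal sketch, the list could instead be extracted from the sinfo output and passed to pdsh. The helper name `sinfo_nodelist` is hypothetical, and the sketch assumes the default sinfo column layout in which NODELIST is the sixth field:

```shell
# Hypothetical helper (not part of Slurm or pdsh): read default `sinfo`
# output on stdin and print the NODELIST column (6th field) of every
# non-header row, comma-joined, in a form that `pdsh -w` accepts.
sinfo_nodelist() {
  awk '$1 != "PARTITION" && NF >= 6 { print $6 }' | paste -sd, -
}

# Demonstration with the sinfo line shown in step 4 (no Slurm required):
echo 'gpu*         up   infinite      2   idle gpu-permanent-node-[517,878]' | sinfo_nodelist

# On the bastion host this could be used as, for example:
#   nodes=$(sinfo | sinfo_nodelist)
#   pdsh -w "$nodes" sudo systemctl stop slurmd
```

If the same nodes appear in more than one partition, the joined list will contain repeats, so treat this as a starting point rather than a finished tool.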

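The NVIDIA Pyxis plugin for Slurm, used with the Enroot container utility, provides a cluster container runtime environment that is integrated with the Slurm workload manager. These components were installed when you checked the Pyxis and Enroot boxes during software configuration.

For more information about the srun --container options provided by Pyxis, see https://github.com/NVIDIA/pyxis/ or run srun --help.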
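
A quick way to confirm that the Pyxis options are registered is to look for them in the srun help text. The following is a small sketch; the function name `pyxis_options_present` is hypothetical, and it only checks for the `--container-image` option used later in this section:

```shell
# Hypothetical helper: succeed if the help text on stdin mentions the
# Pyxis-provided --container-image option.
pyxis_options_present() {
  grep -q -- '--container-image'
}

# Intended use on the bastion host (requires Slurm, so left as a comment):
#   srun --help | pyxis_options_present && echo "Pyxis options available"
```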

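  1. Verify that the container runtime environment works as expected on the cluster.
    This example pulls a TensorFlow container from NVIDIA's nvcr.io repository and runs a simple command, which verifies that the container runtime works as expected. On the first run, it downloads a large container from a remote location and may take 25 minutes or more to load and begin executing.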
    $ srun -N 2 --ntasks-per-node 1 \
      --container-image=nvcr.io#nvidia/tensorflow:22.11-tf2-py3 \
      --container-name=tensorflow bash -c "hostname; grep PRETTY /etc/os-release"
    pyxis: imported docker image: nvcr.io#nvidia/pytorch:21.09-py3
    pyxis: imported docker image: nvcr.io#nvidia/pytorch:21.09-py3
    PRETTY_NAME="Ubuntu 20.04.3 LTS"


    $ time srun -N 2 --ntasks-per-node 1 --container-name=tensorflow bash -c "hostname"
    real	0m0.394s
    user	0m0.006s
    sys	0m0.009s
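  2. Optionally, load additional containers ahead of the jobs that will use them.
    Here you can load the NVIDIA NeMo Framework container in preparation for an LLM job. You might need NVIDIA credentials in ~/.config/enroot/.credentials to access GA or EA containers.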
    $ cat .config/enroot/.credentials
    machine nvcr.io login $oauthtoken password vbmVtc2<snip>zU6YTFjNm
    $ time srun -N 2 --ntasks-per-node 1 \
      --container-image="nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03" \
      --container-name=nemo bash -c "hostname"
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    real	46m27.272s

    This larger container took nearly 47 minutes to import.
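
Because an import of this size is slow, it can be worth checking the credentials file before starting it. The sketch below assumes the one-line netrc-style entry shown in step 2 (machine, login, and password on the same line); the function name `has_nvcr_credentials` is hypothetical:

```shell
# Hypothetical helper: succeed if the netrc-style file "$1" contains a
# single-line entry for nvcr.io with both a login and a password field.
has_nvcr_credentials() {
  [ -r "$1" ] || return 1
  grep -Eq '^[[:space:]]*machine[[:space:]]+nvcr\.io[[:space:]]+login[[:space:]]+[^[:space:]]+[[:space:]]+password[[:space:]]+[^[:space:]]+' "$1"
}

# Example check before a long pull (path taken from step 2 above):
if has_nvcr_credentials "$HOME/.config/enroot/.credentials"; then
  echo "nvcr.io credentials found"
else
  echo "no nvcr.io credentials found; restricted images may fail to import"
fi
```

The check only looks at the file's shape. It does not confirm that the token is valid or that it grants access to a particular image.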

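Validate GPU and Network Performance

NVIDIA NCCL is a stand-alone library of standard communication routines for GPUs. nccl-tests reports the average NCCL operation time, along with algorithm bandwidth and bus bandwidth in GB/s; in the sample output in this section, the time columns are in microseconds (us). These tests measure GPU and network performance and also validate the correctness of the operations.

  1. Get the NVIDIA nccl-tests from GitHub, then run the following command from the bastion host to build the executables inside the tensorflow container: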
    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      bash -c "cd /home/opc; git clone https://github.com/NVIDIA/nccl-tests.git; cd nccl-tests; make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu"
  2. Run nccl-tests.

    The following runs the NCCL AllReduce operation on one cluster node using eight GPUs:

    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      --mpi pmi2 --gpus-per-node=8 bash -c "cd /home/opc/nccl-tests; \
      ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    # Using devices
    #  Rank  0 Group  0 Pid 226178 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 226178 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 226178 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 226178 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 226178 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 226178 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 226178 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 226178 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     10737418240    2684354560     float     sum      -1    80130  134.00  234.50      0    80171  133.93  234.38      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 234.439
  3. Run NCCL AllReduce on two cluster nodes using 16 GPUs.

    $ srun -N 2 --ntasks-per-node 1 --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      --mpi pmi2 --gpus-per-node=8 bash -c "cd /home/opc/nccl-tests; \
      export NCCL_IB_QPS_PER_CONNECTION=4; export NCCL_IB_GID_INDEX=3; \
      ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    # Using devices
    #  Rank  0 Group  0 Pid 231185 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 231185 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 231185 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 231185 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 231185 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 231185 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 231185 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 231185 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #  Rank  8 Group  0 Pid 221811 on gpu-permanent-node-878 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  9 Group  0 Pid 221811 on gpu-permanent-node-878 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank 10 Group  0 Pid 221811 on gpu-permanent-node-878 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank 11 Group  0 Pid 221811 on gpu-permanent-node-878 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank 12 Group  0 Pid 221811 on gpu-permanent-node-878 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank 13 Group  0 Pid 221811 on gpu-permanent-node-878 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank 14 Group  0 Pid 221811 on gpu-permanent-node-878 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank 15 Group  0 Pid 221811 on gpu-permanent-node-878 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     10737418240    2684354560     float     sum      -1    90752  118.32  221.84      0    90977  118.02  221.29      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 221.568

    A successful result indicates that the cluster is ready to run your generative AI workloads.
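
The algbw and busbw columns can be cross-checked from the size and time columns. As a rough sketch, assuming the all-reduce relationship documented by nccl-tests (algbw = size / time, and busbw = algbw * 2(n-1)/n for n ranks), the hypothetical helper `allreduce_bw` recomputes both values for the out-of-place rows above:

```shell
# Hypothetical helper (not part of nccl-tests): recompute all-reduce bandwidths.
#   $1 = message size in bytes, $2 = time in microseconds, $3 = number of ranks
# Prints "algbw busbw" in GB/s (1 GB = 1e9 bytes), rounded to two decimals.
allreduce_bw() {
  awk -v bytes="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = bytes / (us * 1e-6) / 1e9      # bytes per second, scaled to GB/s
    busbw = algbw * 2 * (n - 1) / n        # all-reduce correction factor
    printf "%.2f %.2f\n", algbw, busbw
  }'
}

allreduce_bw 10737418240 80130 8    # one node, 8 GPUs
allreduce_bw 10737418240 90752 16   # two nodes, 16 GPUs
```

The two calls print `134.00 234.50` and `118.32 221.84`, which agree with the out-of-place algbw and busbw values reported in the sample output.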