Review and Validate the Configuration

Review the configuration, and validate GPU and network performance.

Review the Configuration

Log in to the bastion and review the configuration.

  1. Connect to the bastion as the opc user (the default for Oracle Linux instances), using its IP address and your private key.
    akua$ ssh -i ~/.ssh/cluster.key opc@139.87.214.247
    [opc@epsilon-bastion ~]$
  2. The df command shows the mounted file systems and their capacity:
    [opc@epsilon-bastion ~]$ df -h | grep -v tmp
    Filesystem                     Size  Used Avail Use% Mounted on
    /dev/sda3                       92G   14G   79G  15% /                <- boot (home) volume
    /dev/sda1                      200M  7.4M  193M   4% /boot/efi
    /dev/sdb                        20T   58M   20T   1% /export/cluster  <- Additional volume
    172.16.0.75:/export/cluster     20T   57M   20T   1% /nfs/cluster
    172.16.6.4:/mnt/localdisk/nfs   13T   39G   13T   1% /nfs/scratch     <- worker node NVMe
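The mounted file systems shown by df can also be verified non-interactively before submitting jobs. A minimal sketch, assuming a Linux host with util-linux's mountpoint utility; the helper name and the idea of scripting this check are illustrative, not part of the deployment:

```shell
# check_mounts: print OK/MISSING for each path and return non-zero if any
# is not a mount point. Paths would be the ones from the df output above.
check_mounts() {
  status=0
  for m in "$@"; do
    if mountpoint -q -- "$m"; then
      echo "OK: $m"
    else
      echo "MISSING: $m" >&2
      status=1
    fi
  done
  return $status
}

# On the bastion, for example:
#   check_mounts /export/cluster /nfs/cluster /nfs/scratch
```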

Edit the Slurm Configuration

By default, Slurm automatically removes containers at the end of a job. Because you will likely want to use a container again, it is far more efficient to keep containers valid across jobs with the container_scope argument. This greatly speeds up subsequent restarts of jobs that use the same container.

  1. In the file /etc/slurm/plugstack.conf, append container_scope=global so that the line reads as follows:
    [opc@epsilon-bastion ~]$ cat /etc/slurm/plugstack.conf
    required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
  2. Stop and restart Slurm on the bastion and on each GPU node.
    The list of GPU node host names appears in the output of sinfo. Use it with the pdsh command to run systemctl on all nodes. (The export PS1 line simply shortens the shell prompt for readability.)
    [opc@epsilon-bastion ~]$ export PS1="\n$ "
    
    $ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    gpu*         up   infinite      2   idle gpu-permanent-node-[517,878]
    
    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl stop slurmd
    $ sudo systemctl restart slurmctld slurmdbd
    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl start slurmd
    
    $ cat /etc/slurm/plugstack.conf
    required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
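The edit in step 1 can be scripted so it is safe to rerun, for example when configuring several clusters. A hedged sketch; the function name is illustrative, and on a real system you would run it with sudo against /etc/slurm/plugstack.conf:

```shell
# append_container_scope: add container_scope=global to the spank_pyxis line
# of a plugstack.conf file, doing nothing if it is already present.
append_container_scope() {
  conf=$1
  if grep -q 'spank_pyxis\.so' "$conf" && ! grep -q 'container_scope=global' "$conf"; then
    # Append the argument right after the plugin path.
    sed -i 's|spank_pyxis\.so|spank_pyxis.so container_scope=global|' "$conf"
  fi
}
```

Because the function checks for the argument before editing, running it twice leaves the file unchanged the second time.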

Apply OS Updates

Consider updating the OS to the latest packages.

  • Use pdsh, as in the previous step, to update all nodes efficiently:
    # Oracle Linux 7:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo yum upgrade
    
    # Oracle Linux 8:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo dnf upgrade
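The choice between yum and dnf can be derived from /etc/os-release rather than hard-coded per release. A small sketch; the function name is illustrative:

```shell
# pkg_tool: map an Oracle Linux VERSION_ID (e.g. "7.9", "8.8") to its
# package manager. On a node you might call it as:
#   pkg_tool "$(. /etc/os-release; echo "$VERSION_ID")"
pkg_tool() {
  case "${1%%.*}" in
    7) echo yum ;;   # Oracle Linux 7 uses yum
    *) echo dnf ;;   # Oracle Linux 8 and later use dnf
  esac
}
```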

Pull or Upload Containers

The NVIDIA Pyxis plugin for Slurm, together with the Enroot container utility, provides a containerized execution environment for the cluster that is integrated with the Slurm workload manager. These components were installed if the Pyxis and Enroot boxes were checked during software configuration.

For details on the srun --container options provided by Pyxis, see https://github.com/NVIDIA/pyxis and srun --help.

  1. Validate that the container execution environment works across the cluster as expected.
    This example pulls a TensorFlow container from NVIDIA's nvcr.io registry and runs a simple command. The first time you run it, a large container is downloaded from a remote location, which can take 25 minutes or more before execution begins.
    $ srun -N 2 --ntasks-per-node 1 \
      --container-image=nvcr.io#nvidia/tensorflow:22.11-tf2-py3 \
      --container-name=tensorflow bash -c "hostname; grep PRETTY /etc/os-release"
    pyxis: imported docker image: nvcr.io#nvidia/tensorflow:22.11-tf2-py3
    pyxis: imported docker image: nvcr.io#nvidia/tensorflow:22.11-tf2-py3
    gpu-permanent-node-517
    PRETTY_NAME="Ubuntu 20.04.3 LTS"
    gpu-permanent-node-878

    Subsequent jobs that use the named container do not need the download and begin executing immediately.

    $ time srun -N 2 --ntasks-per-node 1 --container-name=tensorflow bash -c "hostname"
    gpu-permanent-node-878
    gpu-permanent-node-517
    
    real	0m0.394s
    user	0m0.006s
    sys	0m0.009s
  2. Optionally, load additional containers ahead of the jobs that will use them.
    Here you can load the NVIDIA NeMo Framework container in preparation for an LLM job. NVIDIA credentials in ~/.config/enroot/.credentials may be required to access GA or EA containers.
    $ cat .config/enroot/.credentials
    machine nvcr.io login $oauthtoken password vbmVtc2<snip>zU6YTFjNm
    $ time srun -N 2 --ntasks-per-node 1 \
      --container-image="nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03" \
      --container-name=nemo bash -c "hostname"
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    gpu-permanent-node-878
    gpu-permanent-node-517
    
    real	46m27.272s

    This larger container took nearly 47 minutes to import.
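Preloading several containers can be scripted as a loop over image/name pairs. A sketch under the assumption that the cluster and credentials above are in place; the helper name is illustrative, and it only builds the srun command string (running `true` in place of `hostname`) so it can be reviewed before execution:

```shell
# preload_cmd: build the srun command used above to pull an image into a
# named, persistent container on every node.
preload_cmd() {
  image=$1 name=$2 nodes=${3:-2}   # node count defaults to 2, as in this cluster
  printf 'srun -N %s --ntasks-per-node 1 --container-image=%s --container-name=%s true\n' \
    "$nodes" "$image" "$name"
}

# Example (review, then eval or paste the output):
#   preload_cmd "nvcr.io#nvidia/tensorflow:22.11-tf2-py3" tensorflow
```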

Validate GPU and Network Performance

NVIDIA NCCL is a stand-alone library of standard communication routines for GPUs. nccl-tests reports the average NCCL operation time (ms), algorithm bandwidth, and bus bandwidth (GB/s). These tests measure GPU and network performance and validate that the operations are correct.
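For AllReduce, nccl-tests derives bus bandwidth from algorithm bandwidth as busbw = algbw × 2(n−1)/n, where n is the number of GPUs (see the nccl-tests performance notes). A quick check of that relation against the sample output in this section:

```shell
# busbw: compute AllReduce bus bandwidth (GB/s) from algorithm bandwidth
# and GPU count, as nccl-tests does for this collective.
busbw() {
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

busbw 134.00 8   # 234.50, matching the single-node run's reported busbw
```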

  1. Fetch NVIDIA nccl-tests from GitHub and build the executables on the bastion by running:
    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      bash -c "cd /home/opc; git clone https://github.com/NVIDIA/nccl-tests.git; cd nccl-tests; make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu"
  2. Run nccl-tests.

    The following command runs the NCCL AllReduce operation on a single cluster node using eight GPUs:

    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      --mpi pmi2 --gpus-per-node=8 bash -c "cd /home/opc/nccl-tests; \
      ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid 226178 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 226178 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 226178 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 226178 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 226178 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 226178 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 226178 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 226178 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     10737418240    2684354560     float     sum      -1    80130  134.00  234.50      0    80171  133.93  234.38      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 234.439
    #
  3. Run NCCL AllReduce on two cluster nodes with 16 GPUs.

    This test uses the internode cluster network.

    $ srun -N 2 --ntasks-per-node 1 --container-name=tensorflow \
      --container-mounts "/home/opc:/home/opc" --mpi pmi2 --gpus-per-node=8 \
      bash -c "cd /home/opc/nccl-tests; export NCCL_IB_QPS_PER_CONNECTION=4; \
      export NCCL_IB_GID_INDEX=3; ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid 231185 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 231185 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 231185 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 231185 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 231185 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 231185 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 231185 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 231185 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #  Rank  8 Group  0 Pid 221811 on gpu-permanent-node-878 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  9 Group  0 Pid 221811 on gpu-permanent-node-878 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank 10 Group  0 Pid 221811 on gpu-permanent-node-878 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank 11 Group  0 Pid 221811 on gpu-permanent-node-878 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank 12 Group  0 Pid 221811 on gpu-permanent-node-878 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank 13 Group  0 Pid 221811 on gpu-permanent-node-878 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank 14 Group  0 Pid 221811 on gpu-permanent-node-878 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank 15 Group  0 Pid 221811 on gpu-permanent-node-878 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     10737418240    2684354560     float     sum      -1    90752  118.32  221.84      0    90977  118.02  221.29      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 221.568
    #

    A successful result indicates that the cluster is ready to run generative AI workloads.
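When validating several clusters, the summary line can be extracted and compared against an expected floor. A sketch; the helper name and the 200 GB/s threshold are illustrative and depend on the GPU shape and network:

```shell
# avg_busbw: pull the "Avg bus bandwidth" value out of nccl-tests output.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[# ]/, "", $2); print $2 }'
}

# Example: fail a validation run if the measured average falls below a floor.
#   bw=$(srun ... ./build/all_reduce_perf ... | avg_busbw)
#   awk -v b="$bw" 'BEGIN { exit !(b >= 200) }'
```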