Review and Validate the Configuration

Review the configuration, and validate GPU and network performance.

Review the Configuration

Log in to the bastion and review the configuration.

  1. Connect to the bastion as the opc user (the default for Oracle Linux instances), using its IP address and your private key.
    akua$ ssh -i ~/.ssh/cluster.key opc@139.87.214.247
    [opc@epsilon-bastion ~]$
  2. The df command shows the mounted file systems and their capacity:
    [opc@epsilon-bastion ~]$ df -h | grep -v tmp
    Filesystem                     Size  Used Avail Use% Mounted on
    /dev/sda3                       92G   14G   79G  15% /                <- boot (home) volume
    /dev/sda1                      200M  7.4M  193M   4% /boot/efi
    /dev/sdb                        20T   58M   20T   1% /export/cluster  <- Additional volume
    172.16.0.75:/export/cluster     20T   57M   20T   1% /nfs/cluster
    172.16.6.4:/mnt/localdisk/nfs   13T   39G   13T   1% /nfs/scratch     <- worker node NVMe
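The mounted file systems shown by df can also be verified non-interactively before submitting jobs. A minimal sketch, assuming a Linux host with util-linux's mountpoint utility; the helper name and the idea of scripting this check are illustrative, not part of the deployment:

```shell
# check_mounts: print OK/MISSING for each path and return non-zero if any
# is not a mount point. Paths would be the ones from the df output above.
check_mounts() {
  status=0
  for m in "$@"; do
    if mountpoint -q -- "$m"; then
      echo "OK: $m"
    else
      echo "MISSING: $m" >&2
      status=1
    fi
  done
  return $status
}

# On the bastion, for example:
#   check_mounts /export/cluster /nfs/cluster /nfs/scratch
```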

Edit the Slurm Configuration

By default, Slurm automatically removes containers at the end of a job. Because you will likely want to use a container again, it is far more efficient to keep containers valid across jobs with the container_scope argument. This greatly speeds up subsequent restarts of jobs that use the same container.

  1. In the file /etc/slurm/plugstack.conf, append container_scope=global so that the line reads as follows:
    [opc@epsilon-bastion ~]$ cat /etc/slurm/plugstack.conf
    required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
  2. Stop and restart Slurm on the bastion and on each GPU node.
    The list of GPU node host names appears in the output of sinfo. Use it with the pdsh command to run systemctl on all nodes. (The export PS1 line simply shortens the shell prompt for readability.)
    [opc@epsilon-bastion ~]$ export PS1="\n$ "
    
    $ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    gpu*         up   infinite      2   idle gpu-permanent-node-[517,878]
    
    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl stop slurmd
    $ sudo systemctl restart slurmctld slurmdbd
    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl start slurmd
    
    $ cat /etc/slurm/plugstack.conf
    required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
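The edit in step 1 can be scripted so it is safe to rerun, for example when configuring several clusters. A hedged sketch; the function name is illustrative, and on a real system you would run it with sudo against /etc/slurm/plugstack.conf:

```shell
# append_container_scope: add container_scope=global to the spank_pyxis line
# of a plugstack.conf file, doing nothing if it is already present.
append_container_scope() {
  conf=$1
  if grep -q 'spank_pyxis\.so' "$conf" && ! grep -q 'container_scope=global' "$conf"; then
    # Append the argument right after the plugin path.
    sed -i 's|spank_pyxis\.so|spank_pyxis.so container_scope=global|' "$conf"
  fi
}
```

Because the function checks for the argument before editing, running it twice leaves the file unchanged the second time.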

Apply OS Updates

Consider updating the OS to the latest packages.

  • Use pdsh, as in the previous step, to update all nodes efficiently:
    # Oracle Linux 7:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo yum upgrade
    
    # Oracle Linux 8:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo dnf upgrade
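The choice between yum and dnf can be derived from /etc/os-release rather than hard-coded per release. A small sketch; the function name is illustrative:

```shell
# pkg_tool: map an Oracle Linux VERSION_ID (e.g. "7.9", "8.8") to its
# package manager. On a node you might call it as:
#   pkg_tool "$(. /etc/os-release; echo "$VERSION_ID")"
pkg_tool() {
  case "${1%%.*}" in
    7) echo yum ;;   # Oracle Linux 7 uses yum
    *) echo dnf ;;   # Oracle Linux 8 and later use dnf
  esac
}
```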

Pull or Upload Containers

The NVIDIA Pyxis plugin for Slurm, together with the Enroot container utility, provides a containerized execution environment for the cluster that is integrated with the Slurm workload manager. These components were installed if the Pyxis and Enroot boxes were checked during software configuration.

For details on the srun --container options provided by Pyxis, see https://github.com/NVIDIA/pyxis and srun --help.

  1. Validate that the container execution environment works across the cluster as expected.
    This example pulls a TensorFlow container from NVIDIA's nvcr.io registry and runs a simple command. The first time you run it, a large container is downloaded from a remote location, which can take 25 minutes or more before execution begins.
    $ srun -N 2 --ntasks-per-node 1 \
      --container-image=nvcr.io#nvidia/tensorflow:22.11-tf2-py3 \
      --container-name=tensorflow bash -c "hostname; grep PRETTY /etc/os-release"
    pyxis: imported docker image: nvcr.io#nvidia/tensorflow:22.11-tf2-py3
    pyxis: imported docker image: nvcr.io#nvidia/tensorflow:22.11-tf2-py3
    gpu-permanent-node-517
    PRETTY_NAME="Ubuntu 20.04.3 LTS"
    gpu-permanent-node-878

    Subsequent jobs that use the named container do not need the download and begin executing immediately.

    $ time srun -N 2 --ntasks-per-node 1 --container-name=tensorflow bash -c "hostname"
    gpu-permanent-node-878
    gpu-permanent-node-517
    
    real	0m0.394s
    user	0m0.006s
    sys	0m0.009s
  2. Optionally, load additional containers ahead of the jobs that will use them.
    Here you can load the NVIDIA NeMo Framework container in preparation for an LLM job. NVIDIA credentials in ~/.config/enroot/.credentials may be required to access GA or EA containers.
    $ cat .config/enroot/.credentials
    machine nvcr.io login $oauthtoken password vbmVtc2<snip>zU6YTFjNm
    $ time srun -N 2 --ntasks-per-node 1 \
      --container-image="nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03" \
      --container-name=nemo bash -c "hostname"
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    gpu-permanent-node-878
    gpu-permanent-node-517
    
    real	46m27.272s

    This larger container took nearly 47 minutes to import.
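Preloading several containers can be scripted as a loop over image/name pairs. A sketch under the assumption that the cluster and credentials above are in place; the helper name is illustrative, and it only builds the srun command string (running `true` in place of `hostname`) so it can be reviewed before execution:

```shell
# preload_cmd: build the srun command used above to pull an image into a
# named, persistent container on every node.
preload_cmd() {
  image=$1 name=$2 nodes=${3:-2}   # node count defaults to 2, as in this cluster
  printf 'srun -N %s --ntasks-per-node 1 --container-image=%s --container-name=%s true\n' \
    "$nodes" "$image" "$name"
}

# Example (review, then eval or paste the output):
#   preload_cmd "nvcr.io#nvidia/tensorflow:22.11-tf2-py3" tensorflow
```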

Validate GPU and Network Performance

NVIDIA NCCL is a stand-alone library of standard communication routines for GPUs. nccl-tests reports the average NCCL operation time (ms), algorithm bandwidth, and bus bandwidth (GB/s). These tests measure GPU and network performance and validate that the operations are correct.
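For AllReduce, nccl-tests derives bus bandwidth from algorithm bandwidth as busbw = algbw × 2(n−1)/n, where n is the number of GPUs (see the nccl-tests performance notes). A quick check of that relation against the sample output in this section:

```shell
# busbw: compute AllReduce bus bandwidth (GB/s) from algorithm bandwidth
# and GPU count, as nccl-tests does for this collective.
busbw() {
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

busbw 134.00 8   # 234.50, matching the single-node run's reported busbw
```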

  1. Fetch NVIDIA nccl-tests from GitHub and build the executables on the bastion by running:
    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      bash -c "cd /home/opc; git clone https://github.com/NVIDIA/nccl-tests.git; cd nccl-tests; make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu"
  2. Run nccl-tests.

    The following command runs the NCCL AllReduce operation on a single cluster node using eight GPUs:

    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      --mpi pmi2 --gpus-per-node=8 bash -c "cd /home/opc/nccl-tests; \
      ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid 226178 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 226178 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 226178 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 226178 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 226178 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 226178 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 226178 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 226178 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     10737418240    2684354560     float     sum      -1    80130  134.00  234.50      0    80171  133.93  234.38      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 234.439
    #
  3. Run NCCL AllReduce on two cluster nodes with 16 GPUs.

    This test uses the internode cluster network.

    $ srun -N 2 --ntasks-per-node 1 --container-name=tensorflow \
      --container-mounts "/home/opc:/home/opc" --mpi pmi2 --gpus-per-node=8 \
      bash -c "cd /home/opc/nccl-tests; export NCCL_IB_QPS_PER_CONNECTION=4; \
      export NCCL_IB_GID_INDEX=3; ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid 231185 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 231185 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 231185 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 231185 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 231185 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 231185 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 231185 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 231185 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #  Rank  8 Group  0 Pid 221811 on gpu-permanent-node-878 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  9 Group  0 Pid 221811 on gpu-permanent-node-878 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank 10 Group  0 Pid 221811 on gpu-permanent-node-878 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank 11 Group  0 Pid 221811 on gpu-permanent-node-878 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank 12 Group  0 Pid 221811 on gpu-permanent-node-878 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank 13 Group  0 Pid 221811 on gpu-permanent-node-878 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank 14 Group  0 Pid 221811 on gpu-permanent-node-878 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank 15 Group  0 Pid 221811 on gpu-permanent-node-878 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     10737418240    2684354560     float     sum      -1    90752  118.32  221.84      0    90977  118.02  221.29      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 221.568
    #

    A successful result indicates that the cluster is ready to run generative AI workloads.
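When validating several clusters, the summary line can be extracted and compared against an expected floor. A sketch; the helper name and the 200 GB/s threshold are illustrative and depend on the GPU shape and network:

```shell
# avg_busbw: pull the "Avg bus bandwidth" value out of nccl-tests output.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[# ]/, "", $2); print $2 }'
}

# Example: fail a validation run if the measured average falls below a floor.
#   bw=$(srun ... ./build/all_reduce_perf ... | avg_busbw)
#   awk -v b="$bw" 'BEGIN { exit !(b >= 200) }'
```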