Review and Validate the Configuration
Review the configuration, and validate GPU and network performance.
Review the Configuration
Log in to the bastion and review the configuration.
- Connect to the bastion as the opc user (the default on Oracle Linux instances) using its IP address and your private key.

    akua$ ssh -i ~/.ssh/cluster.key opc@139.87.214.247
    [opc@epsilon-bastion ~]$

  The df command shows the mounted file systems and their capacity:

    [opc@epsilon-bastion ~]$ df -h | grep -v tmp
    Filesystem                     Size  Used Avail Use% Mounted on
    /dev/sda3                       92G   14G   79G  15% /                  <- boot (home) volume
    /dev/sda1                      200M  7.4M  193M   4% /boot/efi
    /dev/sdb                        20T   58M   20T   1% /export/cluster    <- Additional volume
    172.16.0.75:/export/cluster     20T   57M   20T   1% /nfs/cluster
    172.16.6.4:/mnt/localdisk/nfs   13T   39G   13T   1% /nfs/scratch       <- worker node NVMe
- Edit the Slurm configuration.
  By default, Slurm automatically removes containers when a job ends. Because you will likely want to use a container again, it is far more efficient to keep containers valid across jobs with the container_scope parameter. This greatly speeds up subsequent restarts that use the same container.
  In the file /etc/slurm/plugstack.conf, append container_scope=global so that it reads:

    [opc@epsilon-bastion ~]$ cat /etc/slurm/plugstack.conf
    required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
- Stop and restart Slurm on the bastion and on every GPU node.
  The output of sinfo shows the list of GPU node host names. Use it with the pdsh command to run systemctl on all nodes (a post-restart check is sketched after this list):

    [opc@epsilon-bastion ~]$ export PS1="\n$ "

    $ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    gpu*         up   infinite      2   idle gpu-permanent-node-[517,878]

    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl stop slurmd
    $ sudo systemctl restart slurmctld slurmdbd
    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl start slurmd
- Apply OS updates.
  Consider updating the OS to the latest packages. As in the previous step, use pdsh to update all nodes efficiently:

    # Oracle Linux 7:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo yum upgrade
    # Oracle Linux 8:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo dnf upgrade
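Before continuing, you may want to confirm that the Slurm daemons restarted cleanly and that both GPU nodes report as idle again. The commands below are a minimal sketch added for convenience (not part of the original procedure), assuming the same node list shown above:

    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl is-active slurmd   # each node should report "active"
    $ sinfo                                                                  # both GPU nodes should again show STATE "idle"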
Pull or Upload Containers
The NVIDIA Pyxis plugin for Slurm, together with the Enroot container utility, provides a cluster container execution environment integrated with the Slurm workload manager. These components were installed because the Pyxis and Enroot boxes were checked during software configuration.
For details about the srun --container options provided by Pyxis, see https://github.com/NVIDIA/pyxis/ or run srun --help.
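As a quick sanity check that the Pyxis plugin is registered with Slurm, you can search the srun help text for its container options. This is a small sketch added here for convenience, not part of the original procedure:

    $ srun --help | grep -i container   # should list Pyxis options such as --container-image and --container-name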
- Verify that the container execution environment works as expected across the cluster. This example pulls a TensorFlow container from NVIDIA's nvcr.io repository and runs a simple command on each node. The first time you run it, the large container is downloaded from the remote registry, which can take 25 minutes or more before execution begins.

    $ srun -N 2 --ntasks-per-node 1 \
      --container-image=nvcr.io#nvidia/tensorflow:22.11-tf2-py3 \
      --container-name=tensorflow bash -c "hostname; grep PRETTY /etc/os-release"
    pyxis: imported docker image: nvcr.io#nvidia/pytorch:21.09-py3
    pyxis: imported docker image: nvcr.io#nvidia/pytorch:21.09-py3
    gpu-permanent-node-517
    PRETTY_NAME="Ubuntu 20.04.3 LTS"
    gpu-permanent-node-878

  Subsequent jobs that use the named container do not need to download it and begin executing immediately:

    $ time srun -N 2 --ntasks-per-node 1 --container-name=tensorflow bash -c "hostname"
    gpu-permanent-node-878
    gpu-permanent-node-517

    real    0m0.394s
    user    0m0.006s
    sys     0m0.009s
- Optionally, load other containers ahead of the jobs that will use them. Here you can load the NVIDIA NeMo Framework container in preparation for an LLM job. NVIDIA credentials in ~/.config/enroot/.credentials may be required to access GA or EA containers (a reuse check is sketched after this list).

    $ cat .config/enroot/.credentials
    machine nvcr.io login $oauthtoken password vbmVtc2<snip>zU6YTFjNm

    $ time srun -N 2 --ntasks-per-node 1 \
      --container-image="nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03" \
      --container-name=nemo bash -c "hostname"
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    gpu-permanent-node-878
    gpu-permanent-node-517

    real    46m27.272s

  This larger container takes nearly 47 minutes to import.
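Because container_scope=global keeps imported containers valid between jobs, the NeMo container can now be reused by name without a new import. A minimal reuse check, mirroring the earlier TensorFlow example (a sketch, not output from the original run):

    $ time srun -N 2 --ntasks-per-node 1 --container-name=nemo bash -c "hostname"   # should start in seconds instead of re-importing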
Validate GPU and Network Performance
NVIDIA NCCL is a stand-alone library of standard communication routines for GPUs. The nccl-tests report the average NCCL operation time in milliseconds, and the algorithm and bus bandwidths in GB/s. These tests measure the performance of the GPUs and the network, and validate that operations complete correctly.
- Get NVIDIA nccl-tests from GitHub and build the executables by running the following on the bastion:

    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      bash -c "cd /home/opc; git clone https://github.com/NVIDIA/nccl-tests.git; \
      cd nccl-tests; make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu"
- Run the nccl-tests. The following command runs the NCCL AllReduce operation on one cluster node using eight GPUs:

    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      --mpi pmi2 --gpus-per-node=8 bash -c "cd /home/opc/nccl-tests; \
      ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid 226178 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 226178 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 226178 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 226178 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 226178 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 226178 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 226178 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 226178 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #
    #                                                       out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     10737418240    2684354560     float     sum      -1    80130  134.00  234.50      0    80171  133.93  234.38      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 234.439
    #
- Run NCCL AllReduce on two cluster nodes with 16 GPUs. This test uses the inter-node cluster network (a multi-size sweep is sketched after this list).

    bastion$ srun -N 2 --ntasks-per-node 1 --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      --mpi pmi2 --gpus-per-node=8 bash -c "cd /home/opc/nccl-tests; \
      export NCCL_IB_QPS_PER_CONNECTION=4; export NCCL_IB_GID_INDEX=3; \
      ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid 231185 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 231185 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 231185 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 231185 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 231185 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 231185 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 231185 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 231185 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #  Rank  8 Group  0 Pid 221811 on gpu-permanent-node-878 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  9 Group  0 Pid 221811 on gpu-permanent-node-878 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank 10 Group  0 Pid 221811 on gpu-permanent-node-878 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank 11 Group  0 Pid 221811 on gpu-permanent-node-878 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank 12 Group  0 Pid 221811 on gpu-permanent-node-878 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank 13 Group  0 Pid 221811 on gpu-permanent-node-878 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank 14 Group  0 Pid 221811 on gpu-permanent-node-878 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank 15 Group  0 Pid 221811 on gpu-permanent-node-878 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #
    #                                                       out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     10737418240    2684354560     float     sum      -1    90752  118.32  221.84      0    90977  118.02  221.29      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 221.568
    #
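If you want more than a single-size data point, nccl-tests can sweep a range of message sizes in one run. The following is a sketch added here for illustration, assuming the -b/-e/-f (minimum size, maximum size, step factor) options described in the nccl-tests README:

    $ srun -N 2 --ntasks-per-node 1 --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      --mpi pmi2 --gpus-per-node=8 bash -c "cd /home/opc/nccl-tests; \
      export NCCL_IB_QPS_PER_CONNECTION=4; export NCCL_IB_GID_INDEX=3; \
      ./build/all_reduce_perf -b 8M -e 10G -f 2 -t 1 -g 8"   # reports one result line per message size, doubling from 8 MB to 10 GB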
A successful result indicates that the cluster is ready to run generative AI workloads.