Check and Validate the Configuration

Check the configuration, the GPUs, and network performance.

Check the Configuration

Log in to the bastion and check the configuration.

Use ssh to connect to the bastion as the user opc (the default for Oracle Linux instances), using the bastion's IP address and your private key.
```
akua$ ssh -i ~/.ssh/cluster.key opc@139.87.214.247
[opc@epsilon-bastion ~]$
```

The df command shows the mounted file systems and their capacities:
```
[opc@epsilon-bastion ~]$ df -h | grep -v tmp
Filesystem                     Size  Used Avail Use% Mounted on
/dev/sda3                       92G   14G   79G  15% /                <- boot (home) volume
/dev/sda1                      200M  7.4M  193M   4% /boot/efi
/dev/sdb                        20T   58M   20T   1% /export/cluster  <- additional volume
172.16.0.75:/export/cluster     20T   57M   20T   1% /nfs/cluster
172.16.6.4:/mnt/localdisk/nfs   13T   39G   13T   1% /nfs/scratch     <- worker node NVMe
```
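
The mount check above can also be scripted. The following is a minimal sketch, assuming the six-column df layout shown in the listing; the function name check_mounts is hypothetical and not part of the cluster tooling.

```shell
# check_mounts MOUNTPOINT...
# Read `df` output on stdin and confirm that every mount point given as an
# argument appears in the sixth column ("Mounted on").
# Prints one line per mount point and returns non-zero if any is missing.
check_mounts() {
  df_output="$(cat)"
  missing=0
  for mp in "$@"; do
    if printf '%s\n' "$df_output" |
      awk -v mp="$mp" '$6 == mp { found = 1 } END { exit !found }'; then
      echo "ok: $mp"
    else
      echo "MISSING: $mp"
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the bastion (not run here); -P keeps each file system on one line:
#   df -hP | check_mounts /export/cluster /nfs/cluster /nfs/scratch
```

Matching on the sixth column rather than the last one keeps the check working even when trailing annotations follow the mount point, as in the listing above.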

Edit the Slurm configuration.

By default, Slurm (through the Pyxis plugin) removes containers automatically when a job ends. Because you will probably want to reuse the container, it is more efficient to keep containers across jobs by setting the container_scope argument. Subsequent starts that use the same container are then considerably faster.

In the file /etc/slurm/plugstack.conf, append container_scope=global so that the file looks like this:
```
[opc@epsilon-bastion ~]$ cat /etc/slurm/plugstack.conf
required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
```
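
If you prefer to script the change, the following sketch appends the argument only when it is missing, so it can be run repeatedly. The function name set_container_scope is hypothetical; the sketch assumes the file contains a single spank_pyxis.so line like the one shown above and writes the result to standard output instead of editing the file in place.

```shell
# set_container_scope FILE
# Print FILE with container_scope=global appended to the spank_pyxis.so line,
# unless that line already carries a container_scope argument.
set_container_scope() {
  awk '
    /spank_pyxis\.so/ && !/container_scope=/ {
      print $0 " container_scope=global"
      next
    }
    { print }
  ' "$1"
}

# Usage on the bastion (not run here); review the new file before moving it:
#   set_container_scope /etc/slurm/plugstack.conf > /tmp/plugstack.conf.new
#   sudo cp /tmp/plugstack.conf.new /etc/slurm/plugstack.conf
```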

Stop and restart Slurm on each of the GPU nodes and on the bastion.

The output of sinfo lists the hostnames of the GPU nodes. Use that list with the pdsh command to run systemctl on all nodes. The example first shortens the shell prompt with export PS1="\n$ ".
```
[opc@epsilon-bastion ~]$ export PS1="\n$ "

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu*         up   infinite      2   idle gpu-permanent-node-[517,878]

$ pdsh -w gpu-permanent-node-[517,878] sudo systemctl stop slurmd
$ sudo systemctl restart slurmctld slurmdbd
$ pdsh -w gpu-permanent-node-[517,878] sudo systemctl start slurmd

$ cat /etc/slurm/plugstack.conf
required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
```
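
Rather than copying the node list by hand, you can extract it from the sinfo output. This is a minimal sketch that assumes the default six-column sinfo layout shown above; the function name gpu_nodelist is hypothetical.

```shell
# gpu_nodelist
# Read default `sinfo` output on stdin, skip the header row, and print the
# NODELIST column of every remaining row joined with commas, which is the
# host list format that `pdsh -w` accepts.
gpu_nodelist() {
  awk 'NR > 1 { print $6 }' | paste -sd, -
}

# Usage on the bastion (not run here):
#   nodes="$(sinfo | gpu_nodelist)"
#   pdsh -w "$nodes" sudo systemctl stop slurmd
```

Quoting "$nodes" keeps the shell from treating the square brackets in the host list as a file name pattern.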

Apply the operating system updates.

Update the OS to the latest packages. Use pdsh, as in the previous step, to update all nodes efficiently. Because pdsh runs the command without an interactive terminal, you may need to add -y to confirm the upgrade.
```
# Oracle Linux 7:
$ pdsh -w localhost,gpu-permanent-node-[517,878] sudo yum upgrade

# Oracle Linux 8:
$ pdsh -w localhost,gpu-permanent-node-[517,878] sudo dnf upgrade
```
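
If the cluster image may be either Oracle Linux 7 or 8, the choice between yum and dnf can be derived from the VERSION_ID field of an os-release file. The following is only a sketch; the function name pkg_upgrade_cmd is hypothetical, and it handles just the two releases named above.

```shell
# pkg_upgrade_cmd OS_RELEASE_FILE
# Print the upgrade command for the major version found in the VERSION_ID
# field of an os-release file (for example /etc/os-release).
pkg_upgrade_cmd() {
  major="$(sed -n 's/^VERSION_ID="\{0,1\}\([0-9][0-9]*\).*/\1/p' "$1")"
  case "$major" in
    7) echo "sudo yum upgrade" ;;
    8) echo "sudo dnf upgrade" ;;
    *) echo "unsupported VERSION_ID: ${major:-unknown}" >&2; return 1 ;;
  esac
}

# Usage on the bastion (not run here):
#   pdsh -w localhost,gpu-permanent-node-[517,878] "$(pkg_upgrade_cmd /etc/os-release)"
```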

Pull or Upload Containers

The NVIDIA Pyxis plugin for Slurm, together with the Enroot container utility, provides a cluster container runtime environment that is integrated with the Slurm workload manager. These components were installed when you selected the Pyxis and Enroot check boxes during software configuration.

For details on the srun --container options provided by Pyxis, see https://github.com/NVIDIA/pyxis/ or run srun --help.

Verify that the container runtime environment works as expected in your cluster.
This example pulls the TensorFlow container from NVIDIA's nvcr.io repository and runs a simple command. On the first run, it downloads a large container from a remote location, and it can take 25 minutes or longer before the container is loaded and execution begins.
```
$ srun -N 2 --ntasks-per-node 1 \
  --container-image=nvcr.io#nvidia/tensorflow:22.11-tf2-py3 \
  --container-name=tensorflow bash -c "hostname; grep PRETTY /etc/os-release"
pyxis: imported docker image: nvcr.io#nvidia/pytorch:21.09-py3
pyxis: imported docker image: nvcr.io#nvidia/pytorch:21.09-py3
gpu-permanent-node-517
PRETTY_NAME="Ubuntu 20.04.3 LTS"
gpu-permanent-node-878
```
Subsequent jobs that use the named container require no download and start running immediately.
```
$ time srun -N 2 --ntasks-per-node 1 --container-name=tensorflow bash -c "hostname"
gpu-permanent-node-878
gpu-permanent-node-517

real	0m0.394s
user	0m0.006s
sys	0m0.009s
```

You can load additional containers ahead of the jobs that use them.

For example, you can load the NVIDIA NeMo Framework container in preparation for an LLM job. NVIDIA authentication credentials in ~/.config/enroot/.credentials may be required to access GA or EA containers.
```
$ cat .config/enroot/.credentials
machine nvcr.io login $oauthtoken password vbmVtc2<snip>zU6YTFjNm
$ time srun -N 2 --ntasks-per-node 1 \
  --container-image="nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03" \
  --container-name=nemo bash -c "hostname"
pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
gpu-permanent-node-878
gpu-permanent-node-517

real	46m27.272s
```

Importing this larger container took almost 47 minutes.
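
The credentials entry shown above uses the literal login name $oauthtoken, which an unquoted echo would expand to an empty string. The following sketch writes the entry safely; the function name enroot_credentials_line and the NGC_API_KEY variable in the usage comment are hypothetical.

```shell
# enroot_credentials_line API_KEY
# Print a credentials entry for nvcr.io in the format shown above. The single
# quotes around the format string keep the shell from expanding $oauthtoken;
# the password is the API key passed as the first argument.
enroot_credentials_line() {
  printf 'machine nvcr.io login $oauthtoken password %s\n' "$1"
}

# Usage on the bastion (not run here), assuming the key is in NGC_API_KEY:
#   mkdir -p ~/.config/enroot
#   enroot_credentials_line "$NGC_API_KEY" > ~/.config/enroot/.credentials
#   chmod 600 ~/.config/enroot/.credentials
```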

Validate GPU and Network Performance

NVIDIA NCCL is a stand-alone library of standard communication routines for GPUs. nccl-tests reports the average NCCL operation time (listed in microseconds, us, in the output below) and the algorithm bandwidth and bus bandwidth in GB/s. These tests measure the performance of the GPUs and the network and also validate the correctness of the operations.

For more information, see https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md.
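
For AllReduce, the linked PERFORMANCE.md document relates the two bandwidths by busbw = algbw * 2 * (n - 1) / n, where n is the number of ranks. The following sketch reproduces that arithmetic with awk so you can cross-check the tables later in this section; the function name allreduce_busbw is hypothetical.

```shell
# allreduce_busbw ALGBW NRANKS
# Bus bandwidth in GB/s for an AllReduce over NRANKS ranks, given the
# algorithm bandwidth in GB/s:
#   busbw = algbw * 2 * (n - 1) / n
allreduce_busbw() {
  awk -v algbw="$1" -v n="$2" \
    'BEGIN { printf "%.2f\n", algbw * 2 * (n - 1) / n }'
}
```

With the single-node figures from this section, allreduce_busbw 134.00 8 prints 234.50, which matches the reported bus bandwidth. For the two-node run, allreduce_busbw 118.32 16 prints 221.85 against a reported 221.84; the difference comes from the algorithm bandwidth being rounded to two decimals in the table.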

Fetch NVIDIA nccl-tests from GitHub and build the executables by running the following command on the bastion:
```
$ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
  bash -c "cd /home/opc; git clone https://github.com/NVIDIA/nccl-tests.git; cd nccl-tests; make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu"
```

Run nccl-tests.

The following runs the NCCL AllReduce operation on one cluster node with eight GPUs:
```
$ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
  --mpi pmi2 --gpus-per-node=8 bash -c "cd /home/opc/nccl-tests; \
  ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
# nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 226178 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 226178 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 226178 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 226178 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
#  Rank  4 Group  0 Pid 226178 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
#  Rank  5 Group  0 Pid 226178 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
#  Rank  6 Group  0 Pid 226178 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
#  Rank  7 Group  0 Pid 226178 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
 10737418240    2684354560     float     sum      -1    80130  134.00  234.50      0    80171  133.93  234.38      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 234.439
#
```
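
To turn a run like the one above into a pass or fail result, you can scan the summary lines of the output. This is a minimal sketch; the function name nccl_check is hypothetical, and the minimum bandwidth you pass in is a site-specific threshold that you choose yourself, not a figure from NVIDIA or Oracle.

```shell
# nccl_check MIN_BUSBW
# Read nccl-tests output on stdin. Succeed only if the run reports
# "Out of bounds values : 0 OK" and an average bus bandwidth of at least
# MIN_BUSBW GB/s (a threshold chosen by the caller).
nccl_check() {
  awk -v min="$1" '
    /^# Out of bounds values/ { oob_ok = ($NF == "OK") }
    /^# Avg bus bandwidth/    { bw = $NF; have_bw = 1 }
    END {
      if (!have_bw) { print "FAIL: no average bus bandwidth found"; exit 1 }
      if (!oob_ok)  { print "FAIL: out-of-bounds values reported"; exit 1 }
      if (bw + 0 < min + 0) {
        printf "FAIL: %.3f GB/s < %.3f GB/s\n", bw, min
        exit 1
      }
      printf "PASS: %.3f GB/s\n", bw
    }
  '
}

# Usage on the bastion (not run here): pipe the srun command above into
#   ... | nccl_check 200
```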

Run NCCL AllReduce on two cluster nodes with 16 GPUs.

This test uses the cluster network between the nodes.
```
$ srun -N 2 --ntasks-per-node 1 --container-name=tensorflow \
  --container-mounts "/home/opc:/home/opc" --mpi pmi2 --gpus-per-node=8 \
  bash -c "cd /home/opc/nccl-tests; \
  export NCCL_IB_QPS_PER_CONNECTION=4; export NCCL_IB_GID_INDEX=3; \
  ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
# nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 231185 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 231185 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 231185 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 231185 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
#  Rank  4 Group  0 Pid 231185 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
#  Rank  5 Group  0 Pid 231185 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
#  Rank  6 Group  0 Pid 231185 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
#  Rank  7 Group  0 Pid 231185 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
#  Rank  8 Group  0 Pid 221811 on gpu-permanent-node-878 device  0 [0x0f] NVIDIA A100-SXM4-40GB
#  Rank  9 Group  0 Pid 221811 on gpu-permanent-node-878 device  1 [0x15] NVIDIA A100-SXM4-40GB
#  Rank 10 Group  0 Pid 221811 on gpu-permanent-node-878 device  2 [0x50] NVIDIA A100-SXM4-40GB
#  Rank 11 Group  0 Pid 221811 on gpu-permanent-node-878 device  3 [0x53] NVIDIA A100-SXM4-40GB
#  Rank 12 Group  0 Pid 221811 on gpu-permanent-node-878 device  4 [0x8c] NVIDIA A100-SXM4-40GB
#  Rank 13 Group  0 Pid 221811 on gpu-permanent-node-878 device  5 [0x91] NVIDIA A100-SXM4-40GB
#  Rank 14 Group  0 Pid 221811 on gpu-permanent-node-878 device  6 [0xd6] NVIDIA A100-SXM4-40GB
#  Rank 15 Group  0 Pid 221811 on gpu-permanent-node-878 device  7 [0xda] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
 10737418240    2684354560     float     sum      -1    90752  118.32  221.84      0    90977  118.02  221.29      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 221.568
#
```
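
One way to read the two results together is to express the two-node bus bandwidth as a percentage of the single-node value. The sketch below does only that arithmetic; the function name busbw_ratio is hypothetical, and what ratio counts as acceptable is left to you.

```shell
# busbw_ratio SINGLE_NODE_BUSBW MULTI_NODE_BUSBW
# Print the multi-node average bus bandwidth as a percentage of the
# single-node value, rounded to one decimal place.
busbw_ratio() {
  awk -v single="$1" -v multi="$2" \
    'BEGIN { printf "%.1f\n", 100 * multi / single }'
}
```

For the averages reported in this section, busbw_ratio 234.439 221.568 prints 94.5, meaning the run across the cluster network reached about 94.5% of the bus bandwidth measured inside one node.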

Successful results indicate that the cluster is ready to run your generative AI workloads.