Review and Validate the Configuration

Review the configuration, then validate GPU and network performance.

Review the Configuration

Log in to the bastion and review the configuration.

  1. Connect to the bastion with ssh as user opc (default for Oracle Linux instances) using the IP address and your private key.
    akua$ ssh -i ~/.ssh/cluster.key opc@139.87.214.247
    [opc@epsilon-bastion ~]$
  2. The df command shows mounted filesystems and capacities:
    [opc@epsilon-bastion ~]$ df -h | grep -v tmp
    Filesystem                     Size  Used Avail Use% Mounted on
    /dev/sda3                       92G   14G   79G  15% /                <- boot (home) volume
    /dev/sda1                      200M  7.4M  193M   4% /boot/efi
    /dev/sdb                        20T   58M   20T   1% /export/cluster  <- Additional volume
    172.16.0.75:/export/cluster     20T   57M   20T   1% /nfs/cluster
    172.16.6.4:/mnt/localdisk/nfs   13T   39G   13T   1% /nfs/scratch     <- worker node NVMe

Edit the Slurm Configuration

By default, Slurm removes containers automatically at the end of a job. Because you will likely want to use the container again, it is much more efficient to make containers persist across jobs with the container_scope argument. This greatly accelerates subsequent jobs that use the same container.

  1. In file /etc/slurm/plugstack.conf, append container_scope=global so that it looks like the following:
    [opc@epsilon-bastion ~]$ cat /etc/slurm/plugstack.conf
    required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
  2. Stop and restart Slurm on each of the GPU nodes and on the Bastion.
    The output of sinfo lists the GPU node host names. Use that list with the pdsh command to run systemctl on all nodes. The example first shortens the shell prompt with export PS1.
    [opc@epsilon-bastion ~]$ export PS1="\n$ "
    
    $ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    gpu*         up   infinite      2   idle gpu-permanent-node-[517,878]
    
    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl stop slurmd
    $ sudo systemctl restart slurmctld slurmdbd
    $ pdsh -w gpu-permanent-node-[517,878] sudo systemctl start slurmd
    
    $ cat /etc/slurm/plugstack.conf
    required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
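
The two steps above can be sketched as a pair of shell helpers. This is a minimal sketch, not an official procedure: the function names set_container_scope_global, slurm_hosts, and restart_slurm are hypothetical, and the sketch assumes that every node reported by sinfo is a GPU node, as in the example cluster.

```shell
# Sketch only: helper names are hypothetical, not part of Slurm or Pyxis.

# Append container_scope=global to the Pyxis line of the given plugstack.conf,
# unless that line already carries a container_scope argument.
set_container_scope_global() {
    conf="$1"
    if grep -q 'spank_pyxis\.so.*container_scope=' "$conf"; then
        return 0    # already configured; leave the file untouched
    fi
    # Keep a .bak copy, then replace any trailing whitespace on the Pyxis
    # line with the new argument.
    sed -i.bak '/spank_pyxis\.so/ s/[[:space:]]*$/ container_scope=global/' "$conf"
}

# Build a comma-separated host list from sinfo (one node per line, no header).
slurm_hosts() {
    sinfo -N -h -o "%N" | sort -u | paste -sd, -
}

# Stop slurmd on the nodes, restart the controller daemons, start slurmd again.
restart_slurm() {
    hosts="$(slurm_hosts)"
    if [ -z "$hosts" ]; then
        echo "sinfo reported no nodes" >&2
        return 1
    fi
    pdsh -w "$hosts" sudo systemctl stop slurmd &&
    sudo systemctl restart slurmctld slurmdbd &&
    pdsh -w "$hosts" sudo systemctl start slurmd
}
```

On the bastion, set_container_scope_global would be run from a root shell against /etc/slurm/plugstack.conf, followed by restart_slurm. Running the first helper twice leaves the file unchanged the second time.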

Apply the OS Updates

Consider updating the OS to the latest packages.

  • Use pdsh as in the previous step to efficiently update all nodes:
    # Oracle Linux 7:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo yum upgrade
    
    # Oracle Linux 8:
    $ pdsh -w localhost,gpu-permanent-node-[517,878] sudo dnf upgrade
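
The choice between yum and dnf can be scripted. The following is a minimal sketch that reads VERSION_ID from an os-release file and prints the matching tool. The function name pkg_tool is hypothetical, and only the two releases named above are handled.

```shell
# Sketch only: pkg_tool is a hypothetical helper name.
# Print the package tool for the Oracle Linux major release found in the
# given os-release file (default: /etc/os-release).
pkg_tool() {
    release_file="${1:-/etc/os-release}"
    # Source the file in a subshell and keep only the major version number.
    major="$(. "$release_file" && printf '%s' "${VERSION_ID%%.*}")"
    case "$major" in
        7) echo yum ;;
        8) echo dnf ;;
        *) echo "unsupported release: $major" >&2; return 1 ;;
    esac
}
```

Assuming the bastion and the GPU nodes run the same release, the result could be substituted into the command above, for example: pdsh -w localhost,gpu-permanent-node-[517,878] sudo "$(pkg_tool)" upgrade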

Pull or Upload the Containers

The NVIDIA Pyxis plugin for Slurm, together with the Enroot container utility, provides a cluster container execution environment integrated with the Slurm workload manager. These components were installed when you checked the Pyxis and Enroot boxes during software configuration.

See https://github.com/NVIDIA/pyxis/ or srun --help for details of the srun --container options provided by Pyxis.

  1. Verify that the container execution environment is working as expected in your cluster.
    This example pulls the TensorFlow container from NVIDIA’s nvcr.io repository and runs a simple command. The first run downloads a large container from a remote location and may take 25 minutes or more to load and begin execution.
    $ srun -N 2 --ntasks-per-node 1 \
      --container-image=nvcr.io#nvidia/tensorflow:22.11-tf2-py3 \
      --container-name=tensorflow bash -c "hostname; grep PRETTY /etc/os-release"
    pyxis: imported docker image: nvcr.io#nvidia/pytorch:21.09-py3
    pyxis: imported docker image: nvcr.io#nvidia/pytorch:21.09-py3
    gpu-permanent-node-517
    PRETTY_NAME="Ubuntu 20.04.3 LTS"
    gpu-permanent-node-878

    Subsequent jobs using the named container do not require a download, and will begin execution immediately.

    $ time srun -N 2 --ntasks-per-node 1 --container-name=tensorflow bash -c "hostname"
    gpu-permanent-node-878
    gpu-permanent-node-517
    
    real	0m0.394s
    user	0m0.006s
    sys	0m0.009s
  2. You may choose to load additional containers in advance of jobs that will use them.
    This example loads the NVIDIA NeMo Framework container in preparation for an LLM job. NVIDIA authentication information in ~/.config/enroot/.credentials may be needed to access GA or EA containers.
    $ cat ~/.config/enroot/.credentials
    machine nvcr.io login $oauthtoken password vbmVtc2<snip>zU6YTFjNm
    $ time srun -N 2 --ntasks-per-node 1 \
      --container-image="nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03" \
      --container-name=nemo bash -c "hostname"
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    pyxis: imported docker image: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03
    gpu-permanent-node-878
    gpu-permanent-node-517
    
    real	46m27.272s

    This larger container took nearly 47 minutes to import.
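
Because an import can take this long, it may help to check credentials before starting one and to wrap the preload command. The following is a minimal sketch under stated assumptions: the function names has_registry_credentials and preload_container are hypothetical, and the credentials file is assumed to use the one-line "machine ... login ... password ..." layout shown above.

```shell
# Sketch only: both helper names are hypothetical.

# Succeed if the credentials file has a "machine <registry>" entry.
has_registry_credentials() {
    registry="$1"
    cred_file="${2:-$HOME/.config/enroot/.credentials}"
    [ -r "$cred_file" ] || return 1
    awk -v host="$registry" \
        '$1 == "machine" && $2 == host { found = 1 } END { exit !found }' \
        "$cred_file"
}

# Import an image under a container name on the given number of nodes
# (default 2) by running a trivial command in it, as in the steps above.
preload_container() {
    image="$1"
    name="$2"
    nodes="${3:-2}"
    srun -N "$nodes" --ntasks-per-node 1 \
        --container-image="$image" --container-name="$name" \
        bash -c "hostname"
}
```

For example, has_registry_credentials nvcr.io && preload_container "nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03" nemo would skip the import when no nvcr.io entry is present.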

Validate the GPU and Network Performance

NVIDIA NCCL is a stand-alone library of standard communication routines for GPUs. The nccl-tests suite reports the average NCCL operation time (in microseconds in the output shown in this section), along with algorithm bandwidth and bus bandwidth in GB/s. These tests measure the performance of the GPUs and network, and also validate the correctness of the operations.
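
The two bandwidth figures are related. For AllReduce, the nccl-tests performance notes give algorithm bandwidth as message size divided by time, and bus bandwidth as algorithm bandwidth multiplied by 2(n-1)/n, where n is the number of ranks. The following is a minimal sketch of that arithmetic; the function name allreduce_bw is hypothetical.

```shell
# Sketch only: allreduce_bw is a hypothetical helper name.
# Arguments: message size in bytes, time in microseconds, number of ranks.
# Prints "algbw busbw" in GB/s, using the AllReduce factor 2(n-1)/n.
allreduce_bw() {
    awk -v size="$1" -v us="$2" -v n="$3" 'BEGIN {
        algbw = size / (us * 1e-6) / 1e9    # bytes per second, scaled to GB/s
        busbw = algbw * 2 * (n - 1) / n     # AllReduce correction factor
        printf "%.2f %.2f\n", algbw, busbw
    }'
}
```

With the out-of-place figures from the single-node run later in this section, allreduce_bw 10737418240 80130 8 prints 134.00 234.50, which matches the algbw and busbw columns of that run.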

  1. Get NVIDIA nccl-tests from GitHub and build the executables by running the following command from the bastion:
    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      bash -c "cd /home/opc; git clone https://github.com/NVIDIA/nccl-tests.git; cd nccl-tests; make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu"
  2. Run nccl-tests.

    The following runs the NCCL AllReduce operation on one cluster node using eight GPUs:

    $ srun --container-name=tensorflow --container-mounts "/home/opc:/home/opc" \
      --mpi pmi2 --gpus-per-node=8 bash -c "cd /home/opc/nccl-tests; \
      ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid 226178 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 226178 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 226178 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 226178 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 226178 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 226178 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 226178 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 226178 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     10737418240    2684354560     float     sum      -1    80130  134.00  234.50      0    80171  133.93  234.38      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 234.439
    #
  3. Run NCCL AllReduce on two cluster nodes with 16 GPUs.

    This test makes use of the inter-node cluster network.

    $ srun -N 2 --ntasks-per-node 1 --container-name=tensorflow \
      --container-mounts "/home/opc:/home/opc" --mpi pmi2 --gpus-per-node=8 \
      bash -c "cd /home/opc/nccl-tests; export NCCL_IB_QPS_PER_CONNECTION=4; \
      export NCCL_IB_GID_INDEX=3; ./build/all_reduce_perf -b 10G -e 10G -t 1 -g 8"
    # nThread 1 nGpus 8 minBytes 10737418240 maxBytes 10737418240 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid 231185 on gpu-permanent-node-517 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  1 Group  0 Pid 231185 on gpu-permanent-node-517 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank  2 Group  0 Pid 231185 on gpu-permanent-node-517 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank  3 Group  0 Pid 231185 on gpu-permanent-node-517 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank  4 Group  0 Pid 231185 on gpu-permanent-node-517 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank  5 Group  0 Pid 231185 on gpu-permanent-node-517 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank  6 Group  0 Pid 231185 on gpu-permanent-node-517 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank  7 Group  0 Pid 231185 on gpu-permanent-node-517 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #  Rank  8 Group  0 Pid 221811 on gpu-permanent-node-878 device  0 [0x0f] NVIDIA A100-SXM4-40GB
    #  Rank  9 Group  0 Pid 221811 on gpu-permanent-node-878 device  1 [0x15] NVIDIA A100-SXM4-40GB
    #  Rank 10 Group  0 Pid 221811 on gpu-permanent-node-878 device  2 [0x50] NVIDIA A100-SXM4-40GB
    #  Rank 11 Group  0 Pid 221811 on gpu-permanent-node-878 device  3 [0x53] NVIDIA A100-SXM4-40GB
    #  Rank 12 Group  0 Pid 221811 on gpu-permanent-node-878 device  4 [0x8c] NVIDIA A100-SXM4-40GB
    #  Rank 13 Group  0 Pid 221811 on gpu-permanent-node-878 device  5 [0x91] NVIDIA A100-SXM4-40GB
    #  Rank 14 Group  0 Pid 221811 on gpu-permanent-node-878 device  6 [0xd6] NVIDIA A100-SXM4-40GB
    #  Rank 15 Group  0 Pid 221811 on gpu-permanent-node-878 device  7 [0xda] NVIDIA A100-SXM4-40GB
    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     10737418240    2684354560     float     sum      -1    90752  118.32  221.84      0    90977  118.02  221.29      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 221.568
    #

    Successful results indicate that the cluster is ready to run your generative AI workloads.
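
To turn the results above into a pass/fail check, the summary line can be parsed from the test output. The following is a minimal sketch: the function names avg_busbw and busbw_at_least are hypothetical, and the threshold is an assumption you would choose for your own hardware, not a value from this document.

```shell
# Sketch only: both helper names are hypothetical.

# Read nccl-tests output on stdin and print the "Avg bus bandwidth" value.
avg_busbw() {
    awk -F: '/Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Read nccl-tests output on stdin; succeed if the average bus bandwidth is
# at least the given minimum (GB/s). Returns 2 if no summary line is found.
busbw_at_least() {
    min="$1"
    value="$(avg_busbw)"
    [ -n "$value" ] || return 2
    awk -v v="$value" -v m="$min" 'BEGIN { exit !(v + 0 >= m + 0) }'
}
```

For example, piping the output of an all_reduce_perf run into busbw_at_least 200 would succeed for both runs shown above.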