Review and Validate the Configuration
Review the configuration, then validate GPU and network performance.
Review the Configuration
Log into the Bastion and review the configuration.
Edit the Slurm Configuration
By default, Slurm removes containers automatically at the end of a job. Because
you will likely use the same container again, it is much more efficient to make
containers persist across jobs by setting the container_scope argument.
A persistent container does not have to be recreated, which greatly speeds up
subsequent restarts that use it.
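As a minimal sketch of the edit described above, the helper below rewrites the Pyxis line of a SPANK plugin configuration so that it carries container_scope=global. Pyxis accepts job and global for this argument. The plugin path in the example and the function name set_container_scope are assumptions for illustration; check where your cluster's plugstack.conf actually loads spank_pyxis.so before changing anything.

```python
def set_container_scope(plugstack_text: str, scope: str = "global") -> str:
    """Return plugstack.conf text with container_scope set on the Pyxis line."""
    if scope not in ("job", "global"):
        raise ValueError("scope must be 'job' or 'global'")
    out = []
    for line in plugstack_text.splitlines():
        stripped = line.strip()
        # Only touch active (non-comment) lines that load the Pyxis plugin.
        if "spank_pyxis.so" in stripped and not stripped.startswith("#"):
            # Drop any existing container_scope=... argument, then append ours.
            fields = [f for f in stripped.split()
                      if not f.startswith("container_scope=")]
            fields.append(f"container_scope={scope}")
            line = " ".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical plugin path; the real location depends on how Pyxis was installed.
example = "required /usr/local/lib/slurm/spank_pyxis.so runtime_path=/run/pyxis\n"
print(set_container_scope(example), end="")
```

The function only transforms text, so you can inspect the result before writing it back to the configuration file.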
Pull or Upload the Containers
The NVIDIA Pyxis plugin for Slurm, together with the Enroot container utility, provides a cluster container execution environment that is integrated with the Slurm workload manager. These components were installed when you selected the Pyxis and Enroot check boxes during software configuration.
See https://github.com/NVIDIA/pyxis/ or srun --help for details of the srun --container options provided by Pyxis.
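To illustrate how the Pyxis options fit together, here is a small sketch that assembles an srun command line. The flags --container-image, --container-name, and --container-mounts are Pyxis options; the image ubuntu:22.04, the container name demo, the mount path, and the helper name pyxis_srun_command are hypothetical values chosen for the example.

```python
import shlex


def pyxis_srun_command(image, name, command, mounts=None):
    """Build an srun command line that runs `command` in a Pyxis container."""
    args = [
        "srun",
        f"--container-image={image}",  # image that Enroot pulls or imports
        f"--container-name={name}",    # named container, reusable by later steps
    ]
    if mounts:
        # Pyxis takes a comma-separated list of SRC:DST mount entries.
        args.append("--container-mounts=" + ",".join(mounts))
    args.extend(command)
    # shlex.join quotes each argument so the line can be pasted into a shell.
    return shlex.join(args)


# Hypothetical image, container name, and mount; substitute your own.
print(pyxis_srun_command("ubuntu:22.04", "demo", ["nvidia-smi"],
                         mounts=["/data:/data"]))
```

Naming the container is what lets a later job pick it up again when containers are configured to persist across jobs.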
Validate the GPU and Network Performance
The NVIDIA Collective Communications Library (NCCL) is a stand-alone library of
standard communication routines for GPUs. The nccl-tests suite reports the
average NCCL operation time in ms, and the algorithm bandwidth and bus bandwidth
in GB/s. These tests measure the performance of the GPUs and the network, and
also validate the correctness of the operations.
For details see https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md.
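The relationship between the two bandwidth figures can be sketched as follows. Algorithm bandwidth is the data size divided by the operation time, and bus bandwidth scales it by a factor that depends on the collective and the number of ranks, as described in the nccl-tests performance notes. The function names and the sample size, time, and rank count are assumptions made for the example, not measured results.

```python
# Correction factors from algorithm bandwidth to bus bandwidth for n ranks.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def algorithm_bandwidth(size_bytes, time_s):
    """Algorithm bandwidth in GB/s: bytes moved per second, divided by 1e9."""
    return size_bytes / time_s / 1e9


def bus_bandwidth(algbw, collective, n_ranks):
    """Bus bandwidth in GB/s for the given collective and rank count."""
    return algbw * BUS_FACTORS[collective](n_ranks)


# Hypothetical inputs: an 8 GiB all-reduce across 16 ranks taking 0.1 s.
algbw = algorithm_bandwidth(size_bytes=8 * 1024**3, time_s=0.1)
print(f"algbw = {algbw:.2f} GB/s")
print(f"busbw = {bus_bandwidth(algbw, 'all_reduce', 16):.2f} GB/s")
```

Because the factor removes the dependence on rank count, bus bandwidth is the figure to compare against the hardware's peak link speed.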