Run NVIDIA NeMo Framework Training Jobs

NVIDIA NeMo Framework Launcher is a cloud-native tool for launching end-to-end NeMo Framework training jobs across thousands of GPUs for large-scale LLM training. In this example, we use the NeMo Framework Launcher to run the data preparation and training stages for the gpt3_5b large language model.

See the NVIDIA documentation for more details on both NeMo and the NeMo Framework Launcher.

Run an LLM Training Workload

Install Python and run a training workload.

  1. Install Python 3.8 and make it the default Python for the opc user.

    Python 3 is required, along with the Python modules listed in requirements.txt; Oracle Linux 7 (and some other OS releases) still default to Python 2.

    $ sudo yum install -y oracle-softwarecollection-release-el7
    $ sudo yum -y install scl-utils rh-python38
    $ scl enable rh-python38 bash
    $ cat <<EOF >> ~/.bashrc
    [ -f /opt/rh/rh-python38/enable ] && source /opt/rh/rh-python38/enable
    EOF
  2. Use the installation commands at https://github.com/NVIDIA/NeMo-Megatron-Launcher.
    $ cd /nfs/scratch
    $ git clone https://github.com/NVIDIA/NeMo-Megatron-Launcher.git
    Cloning into 'NeMo-Megatron-Launcher'...
    remote: Enumerating objects: 29018, done.
    remote: Counting objects: 100% (1062/1062), done.
    remote: Compressing objects: 100% (452/452), done.
    remote: Total 29018 (delta 665), reused 898 (delta 564), pack-reused 27956
    Receiving objects: 100% (29018/29018), 27.66 MiB | 14.16 MiB/s, done.
    Resolving deltas: 100% (18124/18124), done.
    
    $ cd NeMo-Megatron-Launcher
    $ pip install -r requirements.txt --user
    $ pip install --upgrade requests --user

    Note:

    If users other than opc will be sharing the cluster, install the Python modules for all users with sudo pip install -r requirements.txt. Before moving on, verify the installation as sketched below.
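
    This quick check confirms that the Software Collections Python is now the default and that the launcher's modules resolve. It is a minimal sketch; hydra and omegaconf are assumed to be among the pinned requirements, which may vary by release.

    $ python --version    # expect Python 3.8.x from the rh-python38 software collection
    $ python -c "import hydra, omegaconf; print('launcher requirements importable')"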

Data Preparation

The data preparation stage performs three tasks: download “The Pile” uncopyrighted dataset, extract (uncompress) the data, and preprocess the data.
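
Before submitting the stage, it can help to check free space on the shared filesystem and to locate the data-preparation configuration that the launcher will use. This is a minimal sketch; the layout under conf/data_preparation varies between launcher releases, so adjust the path to match your clone.

    $ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
    $ df -h /nfs/scratch                          # the downloaded, extracted, and preprocessed copies of the dataset are large
    $ find conf/data_preparation -name '*gpt3*'   # locate the stage's configuration file to review its defaults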

  1. Edit launcher_scripts/conf/config.yaml:
    1. Set the job stage to data_preparation.
      stages:
        - data_preparation
        #- training
    2. Set the launcher_scripts_path.
      # Path to NeMo Megatron Launch scripts, should end with /launcher_scripts
      launcher_scripts_path: /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
  2. From the launcher_scripts directory, run main.py to submit the job to Slurm.
    $ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
    $ python main.py
    Job nemo-megatron-download_gpt3_pile submission file created at '/nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts/results/download_gpt3_pile/download/nemo-megatron-download_gpt3_pile_submission.sh' 
    . . .
  3. Use Slurm’s squeue command to observe job status.
    $ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        191_[20-29%30]       gpu nemo-meg      opc PD       0:00      1 (Resources)
            192_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency,Priority)
            193_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency)
                191_19       gpu nemo-meg      opc  R       0:50      1 gpu-permanent-node-517
                191_18       gpu nemo-meg      opc  R       0:56      1 gpu-permanent-node-878
    

    Jobs with status (ST) R are running. The other jobs shown are waiting for resources or for other jobs to complete first (dependencies).

    Look in results/download_gpt3_pile/download to see the output of completed and running jobs. You can also find the bash script that was used to submit the Slurm jobs; the script and the output may be useful for troubleshooting any jobs that do not run as expected. A sketch for verifying the finished data preparation follows at the end of this section.

On a two-node cluster, the steps took:
  • 90 minutes to download
  • 46 minutes to extract
  • 5 hours 45 minutes to preprocess
Run times would be substantially lower on a larger cluster, where the data shards can be processed in parallel across up to 30 nodes.
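
When all three data-preparation jobs have finished, confirm that the preprocessed shards exist before moving on to training. This is a minimal sketch; it assumes the launcher's default data directory under launcher_scripts_path and Megatron-style .bin/.idx output, so adjust the paths if your configuration differs.

    $ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
    $ ls data                                            # downloads and preprocessed output land here by default
    $ find data -name '*.bin' -o -name '*.idx' | head    # preprocessed shards ready for the training stage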

Training

In the training stage, you edit the configuration files and submit the job that performs the LLM training.

  1. Edit launcher_scripts/conf/config.yaml:
    1. Set the job stage to training.
      stages:
        #- data_preparation
        - training
    2. Add variables to configure NVIDIA NVLink on OCI.
      In the env_vars section, add these variables to configure NVIDIA NVLink on OCI. Leave the existing variables in place, but comment out any that are being replaced with new values.
      env_vars:
        TRANSFORMERS_OFFLINE: 0 # (was 1)
        . . .
        RX_QUEUE_LEN: 8192
        IB_RX_QUEUE_LEN: 8192
        UCX_TLS: tcp
        HCOLL_ENABLE_MCAST_ALL: 0
        coll_hcoll_enable: 0
        UCX_NET_DEVICES: ens300
        NCCL_SOCKET_IFNAME: ens300
        NCCL_IB_TIMEOUT: 16
        NCCL_IB_SL: 0
        NCCL_IB_TC: 41
        NCCL_ALGO: Auto  # tree, ring
        NCCL_IB_GID_INDEX: 3
        NCCL_IB_QPS_PER_CONNECTION: 16  # was 4
        NCCL_IB_HCA: \'mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_11,mlx5_12,mlx5_14,mlx5_15,mlx5_16,mlx5_17\'
        NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
  2. Edit launcher_scripts/conf/training/gpt3/5b.yaml:
    run:
      time_limit: "6-00:00:00"  # allow the training job to run for 6 days
    
    trainer:
      num_nodes: 2 # (was 16) set to the size of your cluster
    
    model:
      micro_batch_size: 2           # (was 4) change to fit in A100/40GB memory
      tensor_model_parallel_size: 2 # (was 1) change to fit in A100/40GB memory
    
      optim:
        bucket_cap_mb: 200 # (was 400)
  3. From the launcher_scripts directory, run main.py to submit the job to Slurm. Use the squeue command to verify that the job is running (“R”).
    $ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
    $ python main.py
    Job nemo-megatron-gpt3_5b submission file created at '/nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts/results/gpt3_5b/nemo-megatron-gpt3_5b_submission.sh'
    Job nemo-megatron-gpt3_5b submitted with Job ID 285
    $ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                   285       gpu nemo-meg      opc  R       0:06      2 gpu-permanent-node-[517,878]
  4. Look at the files in results/gpt3_5b/* to view output and error messages and to monitor the progress of running jobs.
    The LLM training is working if you see lines similar to these in gpt3_5b_nnn.out:
    Training:   0%|          | 0/75375 [00:00<?]
    Epoch 0: :   0%|          | 0/75375 [00:00<?]
    Epoch 0: :   0%|          | 1/75375 [00:52<1089:12:32]
    Epoch 0: :   0%|          | 1/75375 [00:52<1089:13:02 ... train_step_timing in s=52.00]
    Epoch 0: :   0%|          | 2/75375 [01:33<980:51:18 ... train_step_timing in s=52.00]
    Epoch 0: :   0%|          | 2/75375 [01:33<980:51:33 ... train_step_timing in s=46.80]
    Epoch 0: :   0%|          | 3/75375 [02:15<945:29:05 ... train_step_timing in s=46.80]
    Epoch 0: :   0%|          | 3/75375 [02:15<945:29:14 ... train_step_timing in s=45.20]
    Epoch 0: :   0%|          | 4/75375 [02:57<926:40:09 ... train_step_timing in s=45.20]
    Epoch 0: :   0%|          | 4/75375 [02:57<926:40:16 ... train_step_timing in s=44.30]
    Epoch 0: :   0%|          | 5/75375 [03:38<915:10:14 ... train_step_timing in s=44.30]
    Epoch 0: :   0%|          | 5/75375 [03:38<915:10:20 ... train_step_timing in s=43.70]
    Epoch 0: :   0%|          | 6/75375 [04:20<907:25:47 ... train_step_timing in s=43.70]
    Epoch 0: :   0%|          | 6/75375 [04:20<907:25:52 ... train_step_timing in s=41.60]
    Epoch 0: :   0%|          | 7/75375 [05:01<901:53:34 ... train_step_timing in s=41.60]
    Epoch 0: :   0%|          | 7/75375 [05:01<901:53:38 ... train_step_timing in s=41.60]
    Epoch 0: :   0%|          | 8/75375 [05:43<897:38:17 ... train_step_timing in s=41.60]
    Epoch 0: :   0%|          | 8/75375 [05:43<897:38:21 ... train_step_timing in s=41.50]
    Epoch 0: :   0%|          | 9/75375 [06:24<894:16:56 ... train_step_timing in s=41.50]
    Epoch 0: :   0%|          | 9/75375 [06:24<894:16:59 ... train_step_timing in s=41.50]
    Epoch 0: :   0%|          | 10/75375 [07:05<891:30:50 ... train_step_timing in s=41.50]

    The train_step_timing values show that each training step is completing in about 41.5 seconds. Much faster step times can be achieved with more GPUs in the cluster.
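
    To follow a run without opening each log by hand, you can watch the Slurm queue and filter the step timings out of the output files. This is a minimal sketch, assuming the *.out naming shown above; adjust the glob if your results layout differs.

    $ squeue -u opc     # confirm the training job is still in state R
    $ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
    $ tail -f results/gpt3_5b/*.out | grep --line-buffered train_step_timing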