Run NVIDIA NeMo Framework Training Jobs

NVIDIA NeMo Framework Launcher is a cloud-native tool for launching end-to-end NeMo Framework training jobs across thousands of GPUs for large-scale LLM training. In this example, we use NeMo Framework Launcher to run the data preparation and training stages for the gpt3_5b large language model.

See the NVIDIA documentation for more details on both NeMo and the NeMo Framework Launcher.

Run an LLM Training Workload

Install Python, then run a training workload.

  1. Install Python 3.8 and make it the default Python for user opc.

    The launcher requires Python 3, plus the Python modules listed in requirements.txt. Oracle Linux 7 (and some other OS releases) still default to Python 2.

    $ sudo yum install -y oracle-softwarecollection-release-el7
    $ sudo yum -y install scl-utils rh-python38
    $ scl enable rh-python38 bash
    $ cat <<EOF >> ~/.bashrc
    [ -f /opt/rh/rh-python38/enable ] && source /opt/rh/rh-python38/enable
    EOF
  2. Use the Installation commands at
    $ cd /nfs/scratch
    $ git clone
    $ cd NeMo-Megatron-Launcher
    $ pip install -r requirements.txt --user
    $ pip install --upgrade requests --user


    If users other than opc will share the cluster, install the Python modules for all users with: sudo pip install -r requirements.txt.
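After step 1, a quick way to confirm that the default interpreter is the one you just installed is to ask Python for its own version. The following is a minimal sketch; the helper name meets_minimum is hypothetical, and the 3.8 threshold simply mirrors the version installed above rather than a documented minimum for the launcher.

```python
import sys

# Version installed in step 1 above (an assumption about what the
# launcher needs, not a documented requirement).
MIN_VERSION = (3, 8)


def meets_minimum(version_info=None, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    if meets_minimum():
        print("OK: default python is " + current)
    else:
        print("Too old: default python is " + current)
```

If the script reports an old version in a new login shell, the line appended to ~/.bashrc has probably not been sourced yet.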

Data Preparation

The data preparation stage performs three tasks: it downloads the uncopyrighted “The Pile” dataset, extracts (uncompresses) the data, and preprocesses the data.

  1. Edit launcher_scripts/conf/config.yaml:
    1. Set the job stage to data_preparation.
        - data_preparation
        #- training
    2. Set the launcher_scripts_path.
      # Path to NeMo Megatron Launch scripts, should ends with /launcher_scripts
      launcher_scripts_path: /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
  2. From the launcher_scripts directory, submit the job to Slurm.
    $ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
    $ python
    In the squeue output, jobs with status (ST) R are running. Other jobs shown are waiting for resources or for other jobs to complete first (dependencies).

    Look in results/download_gpt3_pile/download to see output of completed and running jobs. You can also find the bash script that was used to submit the Slurm jobs. The script and the output may be useful for troubleshooting any jobs that do not run as expected.
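When many jobs write to the results directory, it can help to see the most recently updated files first. The sketch below walks the directory named above and prints the last few lines of the newest files. The helper names newest_files and tail are hypothetical, and no particular file naming scheme is assumed.

```python
import os


def newest_files(root, limit=5):
    """Return up to `limit` file paths under `root`, newest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            found.append((os.path.getmtime(path), path))
    found.sort(reverse=True)  # largest modification time first
    return [path for _mtime, path in found[:limit]]


def tail(path, lines=10):
    """Return the last `lines` lines of a text file."""
    with open(path, "r", errors="replace") as handle:
        return handle.read().splitlines()[-lines:]


if __name__ == "__main__":
    # Directory from the text above, relative to launcher_scripts.
    results_dir = "results/download_gpt3_pile/download"
    for path in newest_files(results_dir):
        print("==> " + path)
        for line in tail(path):
            print("    " + line)
```

Run it from the launcher_scripts directory. If the directory does not exist yet, the script prints nothing.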
On a two-node cluster the steps took:
  • 90 minutes to download
  • 46 minutes to extract
  • 5 hours 45 minutes to preprocess
Run times would be substantially lower on a larger cluster where data shards can be parallelized on up to 30 nodes.
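To put the three durations in perspective, the following short sketch adds them up and shows each stage's share of the total. It uses only the figures reported above for the two-node cluster.

```python
# Stage durations reported above for the two-node cluster, in minutes.
stage_minutes = {
    "download": 90,
    "extract": 46,
    "preprocess": 5 * 60 + 45,
}

total_minutes = sum(stage_minutes.values())
hours, minutes = divmod(total_minutes, 60)
print("total: {} min ({} h {} min)".format(total_minutes, hours, minutes))

# Share of the total time spent in each stage.
for stage, spent in stage_minutes.items():
    share = 100.0 * spent / total_minutes
    print("{:<10} {:>4} min  {:5.1f}%".format(stage, spent, share))
```

The total comes to 481 minutes (8 hours 1 minute), with preprocessing accounting for roughly 72% of it, so preprocessing is the stage that benefits most from additional nodes.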


Training

In the training stage, you edit the configuration files and run LLM training.

  1. Edit launcher_scripts/conf/config.yaml:
    1. Set the job stage to training.
        #- data_preparation
        - training
    2. Add variables to configure NVIDIA NVLink on OCI.
      In the env_vars section, add the following variables. Leave existing variables in place, but comment out any whose values are replaced here.
        TRANSFORMERS_OFFLINE: 0 # (was 1)
        . . .
        RX_QUEUE_LEN: 8192
        IB_RX_QUEUE_LEN: 8192
        UCX_TLS: tcp
        coll_hcoll_enable: 0
        UCX_NET_DEVICES: ens300
        NCCL_SOCKET_IFNAME: ens300
        NCCL_IB_TIMEOUT: 16
        NCCL_IB_SL: 0
        NCCL_IB_TC: 41
        NCCL_ALGO: Auto  # tree, ring
        NCCL_IB_GID_INDEX: 3
        NCCL_IB_QPS_PER_CONNECTION: 16  # was 4
        NCCL_IB_HCA: \'mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_11,mlx5_12,mlx5_14,mlx5_15,mlx5_16,mlx5_17\'
        NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
  2. Edit conf/training/gpt3/5b.yaml.
      time_limit: "6-00:00:00"  # allow the training job to run for 6 days
      num_nodes: 2 # (was 16) set to the size of your cluster
      micro_batch_size: 2           # (was 4) change to fit in A100/40GB memory
      tensor_model_parallel_size: 2 # (was 1) change to fit in A100/40GB memory
        bucket_cap_mb: 200 # (was 400)
  3. From the launcher_scripts directory, submit the job to Slurm. Use the squeue command to verify that the job is running (status “R”).
    $ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
    $ python
    Job nemo-megatron-gpt3_5b submission file created at '/nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts/results/gpt3_5b/'
    Job nemo-megatron-gpt3_5b submitted with Job ID 285
    $ squeue
                   285       gpu nemo-meg      opc  R       0:06      2 gpu-permanent-node-[517,878]
  4. Look at files in results/gpt3_5b/* to view output and error messages and monitor progress of running jobs.
    The LLM training is working if you see lines similar to these in the gpt3_5b_nnn.out file:
    Training:   0%|          | 0/75375 [00:00<?]
    Epoch 0: :   0%|          | 0/75375 [00:00<?]
    This indicates that each training step completed in 41.5 seconds. Much faster step times can be achieved with more GPUs in the cluster.
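The settings and figures in this section lend themselves to a back-of-the-envelope check. The sketch below computes the number of data-parallel replicas implied by the 5b.yaml edits and estimates how long the full run would take at the reported step time. Several values are assumptions not stated in the text: 8 GPUs per node, a pipeline model parallel size of 1, a constant step time, and that the progress bar total (75375) counts training steps. The helper names are hypothetical.

```python
# Values taken from the text above.
NUM_NODES = 2                   # num_nodes in 5b.yaml
TENSOR_MODEL_PARALLEL_SIZE = 2  # tensor_model_parallel_size in 5b.yaml
TOTAL_STEPS = 75375             # total shown in the progress bar
SECONDS_PER_STEP = 41.5         # step time reported above
TIME_LIMIT_DAYS = 6             # time_limit "6-00:00:00"

# Assumptions, not stated in the text; adjust for your cluster.
GPUS_PER_NODE = 8
PIPELINE_MODEL_PARALLEL_SIZE = 1

SECONDS_PER_DAY = 24 * 60 * 60


def data_parallel_size(num_nodes, gpus_per_node, tensor_parallel,
                       pipeline_parallel):
    """GPUs divided by the number of GPUs that hold one model copy."""
    total_gpus = num_nodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if total_gpus % model_parallel != 0:
        raise ValueError("GPU count must be divisible by TP x PP")
    return total_gpus // model_parallel


def days_for_steps(steps, seconds_per_step):
    """Wall-clock days needed for `steps` at a constant step time."""
    return steps * seconds_per_step / SECONDS_PER_DAY


def steps_within(days, seconds_per_step):
    """Whole steps that fit in `days` at a constant step time."""
    return int(days * SECONDS_PER_DAY // seconds_per_step)


if __name__ == "__main__":
    replicas = data_parallel_size(NUM_NODES, GPUS_PER_NODE,
                                  TENSOR_MODEL_PARALLEL_SIZE,
                                  PIPELINE_MODEL_PARALLEL_SIZE)
    print("data-parallel replicas: {}".format(replicas))
    print("days for all steps: {:.1f}".format(
        days_for_steps(TOTAL_STEPS, SECONDS_PER_STEP)))
    print("steps within time limit: {}".format(
        steps_within(TIME_LIMIT_DAYS, SECONDS_PER_STEP)))
```

Under these assumptions, the two-node cluster has 8 data-parallel replicas, the full 75375 steps would take about 36.2 days, and about 12491 steps fit within one 6-day time limit. This estimate ignores validation and checkpointing time.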