Run an NVIDIA NeMo Framework Training Job

NVIDIA NeMo Framework Launcher is a cloud-native tool for launching end-to-end NeMo Framework training jobs on thousands of GPUs for large-scale LLM training. In this example, we use the NeMo Framework Launcher to run the data preparation and training stages for the gpt3_5b large language model.

For more details about NeMo and the NeMo Framework Launcher, see the NVIDIA documentation.

Run the LLM Training Workload

Install Python and run the training workload.

  1. Install Python 3.8 and make it the default Python for the opc user.

    Python 3 is required, along with the Python modules listed in requirements.txt; Oracle Linux 7 (and some other OS distributions) still defaults to Python 2. A quick way to verify the resulting environment is sketched after this list.

    $ sudo yum install -y oracle-softwarecollection-release-el7
    $ sudo yum -y install scl-utils rh-python38
    $ scl enable rh-python38 bash
    $ cat <<EOF >> ~/.bashrc
    [ -f /opt/rh/rh-python38/enable ] && source /opt/rh/rh-python38/enable
    EOF
  2. Install NeMo-Megatron-Launcher using the installation commands from https://github.com/NVIDIA/NeMo-Megatron-Launcher.
    $ cd /nfs/scratch
    $ git clone https://github.com/NVIDIA/NeMo-Megatron-Launcher.git
    Cloning into 'NeMo-Megatron-Launcher'...
    remote: Enumerating objects: 29018, done.
    remote: Counting objects: 100% (1062/1062), done.
    remote: Compressing objects: 100% (452/452), done.
    remote: Total 29018 (delta 665), reused 898 (delta 564), pack-reused 27956
    Receiving objects: 100% (29018/29018), 27.66 MiB | 14.16 MiB/s, done.
    Resolving deltas: 100% (18124/18124), done.
    
    $ cd NeMo-Megatron-Launcher
    $ pip install -r requirements.txt --user
    $ pip install --upgrade requests --user

    Note:

    If users other than opc will share the cluster, use sudo pip install -r requirements.txt so that the Python modules are installed for all users.
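    If you want to confirm the setup before continuing, the sketch below (assuming the rh-python38 software collection installed above) checks the default Python and shows one way to install the requirements for every user on a shared cluster; adapt it to your environment.

    # Confirm that Python 3.8 from the software collection is now the default
    $ python --version

    # On a shared cluster, install the launcher requirements system-wide
    # (run from the NeMo-Megatron-Launcher directory)
    $ sudo scl enable rh-python38 "pip install -r requirements.txt"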

Data Preparation

The data preparation stage performs three tasks: downloading the unreplicated "Pile" dataset, extracting (uncompressing) the data, and preprocessing the data.

  1. Edit launcher_scripts/conf/config.yaml:
    1. Set the job stage to data_preparation.
      stages:
        - data_preparation
        #- training
    2. Set launcher_scripts_path.
      # Path to NeMo Megatron Launch scripts, should ends with /launcher_scripts
      launcher_scripts_path: /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
  2. From the launcher_scripts directory, run main.py to submit the job to Slurm.
    $ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
    $ python main.py
    Job nemo-megatron-download_gpt3_pile submission file created at '/nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts/results/download_gpt3_pile/download/nemo-megatron-download_gpt3_pile_submission.sh' 
    . . .
  3. Use Slurm's squeue command to observe the status of the jobs.
    $ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        191_[20-29%30]       gpu nemo-meg      opc PD       0:00      1 (Resources)
            192_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency,Priority)
            193_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency)
                191_19       gpu nemo-meg      opc  R       0:50      1 gpu-permanent-node-517
                191_18       gpu nemo-meg      opc  R       0:56      1 gpu-permanent-node-878
    

    Jobs with a state (ST) of R indicate that some of the jobs are running. The other jobs shown are waiting either for resources or for other jobs to complete first (dependencies).

    Look in results/download_gpt3_pile/download for the output of completed and running jobs. You can also find there the bash scripts used to submit the Slurm jobs. The scripts and the output can be useful for troubleshooting any jobs that do not run as expected.
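    If Slurm accounting is enabled on your cluster, sacct gives a compact summary of the array tasks from each stage, for example (the job IDs will differ on your cluster):

    $ sacct -j 191 --format=JobID,JobName%30,State,Elapsed,ExitCode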
On a two-node cluster, the steps took:
  • 90 minutes to download
  • 46 minutes to extract
  • 5 hours 45 minutes to preprocess
On a larger cluster, where the data shards can be parallelized across up to 30 nodes, the run times are significantly lower.
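Before moving on to training, you can confirm that preprocessing produced the tokenized .bin/.idx file pairs in the directory configured by data_dir in config.yaml; a minimal check, assuming the default data directory under launcher_scripts:

    $ ls -lh /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts/data/*.{bin,idx}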

Training

In the training stage, you edit the scripts to perform LLM training.

  1. Edit launcher_scripts/conf/config.yaml:
    1. Set the job stage to training.
      stages:
        #- data_preparation
        - training
    2. Add variables to configure NVIDIA NVLink on OCI.
      In the env_vars section, add these variables to configure NVIDIA NVLink on OCI. Keep the existing variables, but comment out any variables that you are replacing with new values.
      env_vars:
        TRANSFORMERS_OFFLINE: 0 # (was 1)
        . . .
        RX_QUEUE_LEN: 8192
        IB_RX_QUEUE_LEN: 8192
        UCX_TLS: tcp
        HCOLL_ENABLE_MCAST_ALL: 0
        coll_hcoll_enable: 0
        UCX_NET_DEVICES: ens300
        NCCL_SOCKET_IFNAME: ens300
        NCCL_IB_TIMEOUT: 16
        NCCL_IB_SL: 0
        NCCL_IB_TC: 41
        NCCL_ALGO: Auto  # tree, ring
        NCCL_IB_GID_INDEX: 3
        NCCL_IB_QPS_PER_CONNECTION: 16  # was 4
        NCCL_IB_HCA: 'mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_11,mlx5_12,mlx5_14,mlx5_15,mlx5_16,mlx5_17'
        NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
  2. Edit conf/training/gpt3/5b.yaml:
    run:
      time_limit: "6-00:00:00"  # allow the training job to run for 6 days
    
    trainer:
      num_nodes: 2 # (was 16) set to the size of your cluster
    
    model:
      micro_batch_size: 2           # (was 4) change to fit in A100/40GB memory
      tensor_model_parallel_size: 2 # (was 1) change to fit in A100/40GB memory
    
      optim:
        bucket_cap_mb: 200 # (was 400)
  3. From the launcher_scripts directory, run main.py to submit the job to Slurm. Use the squeue command to verify that the job is running ("R").
    $ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
    $ python main.py
    Job nemo-megatron-gpt3_5b submission file created at '/nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts/results/gpt3_5b/nemo-megatron-gpt3_5b_submission.sh'
    Job nemo-megatron-gpt3_5b submitted with Job ID 285
    $ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                   285       gpu nemo-meg      opc  R       0:06      2 gpu-permanent-node-[517,878]
  4. Review the files in results/gpt3_5b/* to see output and error messages and to monitor the progress of the running job.
    LLM training is working if you see lines similar to these in gpt3_5b_nnn.out:
    Training:   0%|          | 0/75375 [00:00<?]
    Epoch 0: :   0%|          | 0/75375 [00:00<?]
    Epoch 0: :   0%|          | 1/75375 [00:52<1089:12:32]
    Epoch 0: :   0%|          | 1/75375 [00:52<1089:13:02 ... train_step_timing in s=52.00]
    Epoch 0: :   0%|          | 2/75375 [01:33<980:51:18 ... train_step_timing in s=52.00]
    Epoch 0: :   0%|          | 2/75375 [01:33<980:51:33 ... train_step_timing in s=46.80]
    Epoch 0: :   0%|          | 3/75375 [02:15<945:29:05 ... train_step_timing in s=46.80]
    Epoch 0: :   0%|          | 3/75375 [02:15<945:29:14 ... train_step_timing in s=45.20]
    Epoch 0: :   0%|          | 4/75375 [02:57<926:40:09 ... train_step_timing in s=45.20]
    Epoch 0: :   0%|          | 4/75375 [02:57<926:40:16 ... train_step_timing in s=44.30]
    Epoch 0: :   0%|          | 5/75375 [03:38<915:10:14 ... train_step_timing in s=44.30]
    Epoch 0: :   0%|          | 5/75375 [03:38<915:10:20 ... train_step_timing in s=43.70]
    Epoch 0: :   0%|          | 6/75375 [04:20<907:25:47 ... train_step_timing in s=43.70]
    Epoch 0: :   0%|          | 6/75375 [04:20<907:25:52 ... train_step_timing in s=41.60]
    Epoch 0: :   0%|          | 7/75375 [05:01<901:53:34 ... train_step_timing in s=41.60]
    Epoch 0: :   0%|          | 7/75375 [05:01<901:53:38 ... train_step_timing in s=41.60]
    Epoch 0: :   0%|          | 8/75375 [05:43<897:38:17 ... train_step_timing in s=41.60]
    Epoch 0: :   0%|          | 8/75375 [05:43<897:38:21 ... train_step_timing in s=41.50]
    Epoch 0: :   0%|          | 9/75375 [06:24<894:16:56 ... train_step_timing in s=41.50]
    Epoch 0: :   0%|          | 9/75375 [06:24<894:16:59 ... train_step_timing in s=41.50]
    Epoch 0: :   0%|          | 10/75375 [07:05<891:30:50 ... train_step_timing in s=41.50]

    This shows that each training step completes in about 41.5 seconds. At that rate, the full 75,375-step run would take roughly 36 days on this two-node cluster, consistent with the ~900-hour estimate in the progress bar. With more GPUs in the cluster, the steps complete faster.
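    To follow a long run without scrolling the full log, you can filter for the step-timing lines and spot-check GPU utilization on one of the allocated nodes; a small sketch, assuming the log files match results/gpt3_5b/*.out and that password-less SSH works between cluster nodes (node names come from squeue):

    # Show the most recent step timings from the training logs
    $ grep "train_step_timing" results/gpt3_5b/*.out | tail -n 5

    # Check GPU utilization on one of the nodes running the job
    $ ssh gpu-permanent-node-517 nvidia-smi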