Run NVIDIA NeMo Framework Training Jobs

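NVIDIA NeMo Framework Launcher is a cloud-native tool for launching end-to-end NeMo Framework training jobs across thousands of GPUs for large-scale LLM training. In this example, we use the NeMo Framework Launcher to run the data preparation and training stages for the gpt3_5b large language model.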

For more information about NeMo and the NeMo Framework Launcher, see the NVIDIA documentation.

Run an LLM Training Workload

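Install Python and run the training workload.

Install Python 3.8 and make it the default Python for the opc user.

Python 3 is required, along with the Python modules listed in requirements.txt. Oracle Linux 7 (and some other OS releases) still defaults to Python 2.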

$ sudo yum install -y oracle-softwarecollection-release-el7
$ sudo yum -y install scl-utils rh-python38
$ scl enable rh-python38 bash
$ cat <<EOF >> ~/.bashrc
[ -f /opt/rh/rh-python38/enable ] && source /opt/rh/rh-python38/enable
EOF
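
Before installing the launcher requirements, it can help to confirm that a new shell really picks up Python 3.8 or later. The following is a minimal sketch, not part of the launcher; the function name python_is_new_enough and the MIN_VERSION constant are hypothetical names chosen for this illustration.

```python
import sys

# Minimum version used in this walkthrough (hypothetical constant name).
MIN_VERSION = (3, 8)


def python_is_new_enough(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when (major, minor) of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if python_is_new_enough():
        print(f"OK: python {found}")
    else:
        # The software collection is only active in shells that sourced its enable script.
        print(f"Too old: python {found}; open a new shell or run 'scl enable rh-python38 bash'")
```

Save the sketch to a file and run it with python in a fresh login shell; if it reports an old version, the ~/.bashrc line above was probably not sourced.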

Follow the Installation instructions at https://github.com/NVIDIA/NeMo-Megatron-Launcher.

$ cd /nfs/scratch
$ git clone https://github.com/NVIDIA/NeMo-Megatron-Launcher.git
Cloning into 'NeMo-Megatron-Launcher'...
remote: Enumerating objects: 29018, done.
remote: Counting objects: 100% (1062/1062), done.
remote: Compressing objects: 100% (452/452), done.
remote: Total 29018 (delta 665), reused 898 (delta 564), pack-reused 27956
Receiving objects: 100% (29018/29018), 27.66 MiB | 14.16 MiB/s, done.
Resolving deltas: 100% (18124/18124), done.

$ cd NeMo-Megatron-Launcher
$ pip install -r requirements.txt --user
$ pip install --upgrade requests --user

Note:

If users other than opc will share the cluster, install the Python modules for all users: sudo pip install -r requirements.txt.

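Prepare the Data

The data preparation stage performs three tasks: it downloads the uncopyrighted "The Pile" dataset, extracts (decompresses) the data, and preprocesses the data.

Edit launcher_scripts/conf/config.yaml:

Set the job stage to data_preparation.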

stages:
  - data_preparation
  #- training

Set launcher_scripts_path.

# Path to NeMo Megatron Launch scripts, should ends with /launcher_scripts
launcher_scripts_path: /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
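
The comment in the configuration file says the path should end with /launcher_scripts. A small check like the one below can catch a mistyped value before any job is submitted. This is only a sketch: check_launcher_scripts_path is a hypothetical helper, and the requirement that the path be absolute is an assumption made here, not something the launcher documentation states in this section.

```python
from pathlib import PurePosixPath


def check_launcher_scripts_path(path):
    """Return a list of problems found in a launcher_scripts_path value.

    An empty list means the value passed both checks.
    """
    problems = []
    p = PurePosixPath(path)
    # Assumption for this sketch: the path should be absolute, as in the example above.
    if not p.is_absolute():
        problems.append("path is not absolute")
    # From the config comment: the path should end with /launcher_scripts.
    if p.name != "launcher_scripts":
        problems.append("path does not end with /launcher_scripts")
    return problems


print(check_launcher_scripts_path("/nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts"))
print(check_launcher_scripts_path("/nfs/scratch/NeMo-Megatron-Launcher"))
```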

Run main.py from the launcher_scripts directory to submit the jobs to Slurm.

$ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
$ python main.py
Job nemo-megatron-download_gpt3_pile submission file created at '/nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts/results/download_gpt3_pile/download/nemo-megatron-download_gpt3_pile_submission.sh' 
. . .

Use Slurm's squeue command to monitor job status.

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    191_[20-29%30]       gpu nemo-meg      opc PD       0:00      1 (Resources)
        192_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency,Priority)
        193_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency)
            191_19       gpu nemo-meg      opc  R       0:50      1 gpu-permanent-node-517
            191_18       gpu nemo-meg      opc  R       0:56      1 gpu-permanent-node-878

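Jobs with state (ST) R are running. The other jobs shown are waiting for resources or for other jobs to finish first (dependencies).

Look in results/download_gpt3_pile/download for the output of completed and running jobs. You can also find the bash scripts that were used to submit the Slurm jobs there. The scripts and output can be very useful for troubleshooting any job that does not run as expected.

On a two-node cluster, the steps took:

90 minutes to download
46 minutes to extract
5 hours 45 minutes to preprocess

Run times are substantially lower on larger clusters, where the data shards can be parallelized across up to 30 nodes.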
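
To summarize a long squeue listing, the state column can be tallied with a few lines of Python. The sketch below works on the sample output shown above, pasted in as text; count_states is a hypothetical helper, and it assumes the default column layout with an ST header and no spaces inside any field. Each array-job row counts once, so the totals are rows, not individual array tasks.

```python
from collections import Counter

# Sample copied from the squeue output above.
SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    191_[20-29%30]       gpu nemo-meg      opc PD       0:00      1 (Resources)
        192_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency,Priority)
        193_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency)
            191_19       gpu nemo-meg      opc  R       0:50      1 gpu-permanent-node-517
            191_18       gpu nemo-meg      opc  R       0:56      1 gpu-permanent-node-878
"""


def count_states(squeue_text):
    """Count squeue rows per job state, locating the ST column from the header."""
    lines = squeue_text.strip().splitlines()
    st_col = lines[0].split().index("ST")
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > st_col:
            counts[fields[st_col]] += 1
    return counts


print(dict(count_states(SAMPLE)))
```

For the sample, this reports three pending (PD) rows and two running (R) rows.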
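
As a quick worked sum of the three timings above, assuming the steps run one after another (which the job dependencies in the squeue output suggest), the whole data preparation stage takes about eight hours on two nodes. The names STEP_DURATIONS and total_duration are hypothetical and used only for this illustration.

```python
from datetime import timedelta

# Timings reported above for a two-node cluster.
STEP_DURATIONS = {
    "download": timedelta(minutes=90),
    "extract": timedelta(minutes=46),
    "preprocess": timedelta(hours=5, minutes=45),
}


def total_duration(durations):
    """Sum the step durations, assuming the steps run sequentially."""
    return sum(durations.values(), timedelta())


total = total_duration(STEP_DURATIONS)
hours, remainder = divmod(int(total.total_seconds()), 3600)
print(f"{hours} h {remainder // 60} min")  # prints "8 h 1 min"
```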

Training

In the training stage, you edit the launcher configuration files to run the LLM training.

Edit launcher_scripts/conf/config.yaml:

Set the job stage to training.

stages:
  #- data_preparation
  - training

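Add variables to configure NVIDIA NVLink on OCI.

In the env_vars section, add these variables to configure NVIDIA NVLink on OCI. Leave existing variables in place, but comment out any variable that you replace with a new value.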

env_vars:
  TRANSFORMERS_OFFLINE: 0 # (was 1)
  . . .
  RX_QUEUE_LEN: 8192
  IB_RX_QUEUE_LEN: 8192
  UCX_TLS: tcp
  HCOLL_ENABLE_MCAST_ALL: 0
  coll_hcoll_enable: 0
  UCX_NET_DEVICES: ens300
  NCCL_SOCKET_IFNAME: ens300
  NCCL_IB_TIMEOUT: 16
  NCCL_IB_SL: 0
  NCCL_IB_TC: 41
  NCCL_ALGO: Auto  # tree, ring
  NCCL_IB_GID_INDEX: 3
  NCCL_IB_QPS_PER_CONNECTION: 16  # was 4
  NCCL_IB_HCA: \'mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_11,mlx5_12,mlx5_14,mlx5_15,mlx5_16,mlx5_17\'
  NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information

See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html for reference.

Edit conf/training/gpt3/5b.yaml.
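
The NCCL_IB_HCA value is long and easy to mistype. The sketch below generates the same comma-separated device list; hca_list is a hypothetical helper, and the pattern (mlx5_1 through mlx5_17 without mlx5_13) is simply read off the listing above. Which devices belong in the list depends on the hardware, so treat the range and the skipped index as assumptions to confirm on your own nodes.

```python
def hca_list(first, last, skip=()):
    """Build a comma-separated list of mlx5_<n> device names, omitting indices in `skip`."""
    return ",".join(f"mlx5_{i}" for i in range(first, last + 1) if i not in skip)


# Reproduces the device list used in the env_vars example above.
value = hca_list(1, 17, skip={13})
print(value)
print(len(value.split(",")), "devices")
```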

run:
  time_limit: "6-00:00:00"  # allow the training job to run for 6 days

trainer:
  num_nodes: 2 # (was 16) set to the size of your cluster

model:
  micro_batch_size: 2           # (was 4) change to fit in A100/40GB memory
  tensor_model_parallel_size: 2 # (was 1) change to fit in A100/40GB memory

  optim:
    bucket_cap_mb: 200 # (was 400)
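
To see how num_nodes and tensor_model_parallel_size interact, the usual Megatron-style relationship is that the data-parallel size equals the total GPU count divided by the product of the tensor-parallel and pipeline-parallel sizes. The sketch below applies it to the values above. It assumes 8 GPUs per node and a pipeline-parallel size of 1, neither of which is stated in this section, and data_parallel_size is a hypothetical helper.

```python
def data_parallel_size(num_nodes, gpus_per_node, tensor_parallel, pipeline_parallel=1):
    """Return the number of data-parallel model replicas for a given layout."""
    world_size = num_nodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel:
        raise ValueError(
            f"{world_size} GPUs cannot be split evenly into groups of {model_parallel}"
        )
    return world_size // model_parallel


# Values from the edited 5b.yaml; gpus_per_node=8 is an assumption for this sketch.
replicas = data_parallel_size(num_nodes=2, gpus_per_node=8, tensor_parallel=2)
micro_batch_size = 2
print("data-parallel replicas:", replicas)
print("samples per micro-step across the cluster:", replicas * micro_batch_size)
```

Under these assumptions, the two-node cluster holds 8 replicas, so each micro-step processes 16 samples in total.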

Run main.py from the launcher_scripts directory to submit the job to Slurm. Use the squeue command to confirm that the job is running (R).

$ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
$ python main.py
Job nemo-megatron-gpt3_5b submission file created at '/nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts/results/gpt3_5b/nemo-megatron-gpt3_5b_submission.sh'
Job nemo-megatron-gpt3_5b submitted with Job ID 285
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               285       gpu nemo-meg      opc  R       0:06      2 gpu-permanent-node-[517,878]

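Look at the files in results/gpt3_5b/* to view output and error messages and to monitor the progress of the running job.

If you see lines similar to the following in gpt3_5b_nnn.out, LLM training is working: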

Training:   0%|          | 0/75375 [00:00<?]
Epoch 0: :   0%|          | 0/75375 [00:00<?]
Epoch 0: :   0%|          | 1/75375 [00:52<1089:12:32]
Epoch 0: :   0%|          | 1/75375 [00:52<1089:13:02 ... train_step_timing in s=52.00]
Epoch 0: :   0%|          | 2/75375 [01:33<980:51:18 ... train_step_timing in s=52.00]
Epoch 0: :   0%|          | 2/75375 [01:33<980:51:33 ... train_step_timing in s=46.80]
Epoch 0: :   0%|          | 3/75375 [02:15<945:29:05 ... train_step_timing in s=46.80]
Epoch 0: :   0%|          | 3/75375 [02:15<945:29:14 ... train_step_timing in s=45.20]
Epoch 0: :   0%|          | 4/75375 [02:57<926:40:09 ... train_step_timing in s=45.20]
Epoch 0: :   0%|          | 4/75375 [02:57<926:40:16 ... train_step_timing in s=44.30]
Epoch 0: :   0%|          | 5/75375 [03:38<915:10:14 ... train_step_timing in s=44.30]
Epoch 0: :   0%|          | 5/75375 [03:38<915:10:20 ... train_step_timing in s=43.70]
Epoch 0: :   0%|          | 6/75375 [04:20<907:25:47 ... train_step_timing in s=43.70]
Epoch 0: :   0%|          | 6/75375 [04:20<907:25:52 ... train_step_timing in s=41.60]
Epoch 0: :   0%|          | 7/75375 [05:01<901:53:34 ... train_step_timing in s=41.60]
Epoch 0: :   0%|          | 7/75375 [05:01<901:53:38 ... train_step_timing in s=41.60]
Epoch 0: :   0%|          | 8/75375 [05:43<897:38:17 ... train_step_timing in s=41.60]
Epoch 0: :   0%|          | 8/75375 [05:43<897:38:21 ... train_step_timing in s=41.50]
Epoch 0: :   0%|          | 9/75375 [06:24<894:16:56 ... train_step_timing in s=41.50]
Epoch 0: :   0%|          | 9/75375 [06:24<894:16:59 ... train_step_timing in s=41.50]
Epoch 0: :   0%|          | 10/75375 [07:05<891:30:50 ... train_step_timing in s=41.50]

This indicates that each training step completes in about 41.5 seconds. Adding more GPUs to the cluster yields faster step times.
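
The step timing can also be used for a rough estimate of the remaining run time. The sketch below pulls the most recent step count and train_step_timing value out of progress lines like the ones above and multiplies them out. The regular expression is written against this sample output only and may need adjusting for other log formats; latest_step_timing and remaining_hours are hypothetical helper names, and the estimate assumes the step time stays constant.

```python
import re

# Matches e.g. "10/75375 [07:05<891:30:50 ... train_step_timing in s=41.50]"
STEP_RE = re.compile(r"(\d+)/(\d+) \[.*train_step_timing in s=([\d.]+)\]")

SAMPLE_LOG = """\
Epoch 0: :   0%|          | 9/75375 [06:24<894:16:59 ... train_step_timing in s=41.50]
Epoch 0: :   0%|          | 10/75375 [07:05<891:30:50 ... train_step_timing in s=41.50]
"""


def latest_step_timing(log_text):
    """Return (step, total_steps, seconds_per_step) from the last matching line, or None."""
    matches = STEP_RE.findall(log_text)
    if not matches:
        return None
    step, total, seconds = matches[-1]
    return int(step), int(total), float(seconds)


def remaining_hours(step, total, seconds_per_step):
    """Estimate hours left, assuming every remaining step takes seconds_per_step."""
    return (total - step) * seconds_per_step / 3600


step, total, seconds = latest_step_timing(SAMPLE_LOG)
hours = remaining_hours(step, total, seconds)
print(f"step {step}/{total}, about {hours:.0f} h ({hours / 24:.1f} days) remaining")
```

For the sample above, this works out to roughly 869 hours, or about 36 days, on the two-node cluster, which is far longer than the 6-day time_limit set in 5b.yaml.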