Run Training Jobs on NVIDIA NeMo Framework

NVIDIA NeMo Framework Launcher is a cloud-native tool for launching end-to-end NeMo Framework training jobs across thousands of GPUs for large-scale LLM training. This example uses NeMo Framework Launcher to run the data preparation and training stages for the gpt3_5b large language model.

For more details on both NeMo and NeMo Framework Launcher, see the NVIDIA documentation.

Run an LLM Training Workload

Install Python and run a training workload.

Install Python 3.8 and make it the default Python for the opc user.

Python 3 is required, along with the Python modules listed in requirements.txt. Oracle Linux 7 (and some other OS releases) still defaults to Python 2.

$ sudo yum install -y oracle-softwarecollection-release-el7
$ sudo yum -y install scl-utils rh-python38
$ scl enable rh-python38 bash
$ cat <<EOF >> ~/.bashrc
[ -f /opt/rh/rh-python38/enable ] && source /opt/rh/rh-python38/enable
EOF
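Before installing the launcher requirements, it can help to confirm that the shell really picks up Python 3.8. The following is a minimal sketch, not part of the launcher; the function name is illustrative and the threshold simply mirrors the version installed above.

```python
import sys

MIN_VERSION = (3, 8)  # version installed from the rh-python38 collection above


def python_is_recent_enough(version_info=None, minimum=MIN_VERSION):
    """Return True when the interpreter is at least `minimum` (major, minor)."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_recent_enough() else "too old"
    print("python {}: {}".format(found, status))
```

The sketch avoids f-strings on purpose, so it also runs under Python 2 and reports "too old" there instead of failing with a syntax error.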

Use the Installation commands from https://github.com/NVIDIA/NeMo-Megatron-Launcher.

$ cd /nfs/scratch
$ git clone https://github.com/NVIDIA/NeMo-Megatron-Launcher.git
Cloning into 'NeMo-Megatron-Launcher'...
remote: Enumerating objects: 29018, done.
remote: Counting objects: 100% (1062/1062), done.
remote: Compressing objects: 100% (452/452), done.
remote: Total 29018 (delta 665), reused 898 (delta 564), pack-reused 27956
Receiving objects: 100% (29018/29018), 27.66 MiB | 14.16 MiB/s, done.
Resolving deltas: 100% (18124/18124), done.

$ cd NeMo-Megatron-Launcher
$ pip install -r requirements.txt --user
$ pip install --upgrade requests --user

Note

If users other than opc will share the cluster, install the Python modules for all users with: sudo pip install -r requirements.txt.
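To check whether the modules are visible to a given user, one option is to compare the names in requirements.txt with what the interpreter can find. This is a rough sketch under some assumptions: it uses only the standard library (importlib.metadata, available from Python 3.8), the helper names are hypothetical, and the name parsing is deliberately simple (it ignores pip options and does not handle URL-style requirements).

```python
import re
from importlib import metadata  # standard library from Python 3.8

_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def requirement_names(text):
    """Extract distribution names from requirements.txt content."""
    names = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = _NAME.match(line)
        if match:
            names.append(match.group(0))
    return names


def missing_distributions(names):
    """Return the names that importlib.metadata cannot find."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Passing the contents of requirements.txt through requirement_names and then missing_distributions, while logged in as the user in question, returns an empty list when every listed module is installed for that user. Version constraints are not checked.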

Data Preparation

The data preparation stage performs three tasks: it downloads the uncopyrighted "The Pile" dataset, extracts (uncompresses) the data, and preprocesses the data.

Edit launcher_scripts/conf/config.yaml:

Set the job stage to data_preparation.

stages:
  - data_preparation
  #- training

Set launcher_scripts_path.

# Path to NeMo Megatron Launch scripts, should ends with /launcher_scripts
launcher_scripts_path: /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
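A quick sanity check of the two settings above can catch typos before a job is submitted. The sketch below is an assumption-laden illustration: it reads config.yaml as plain text with the standard library rather than a YAML parser, both function names are hypothetical, and it only understands the simple layout shown in this example.

```python
def active_stages(config_text):
    """Return the uncommented entries listed under the top-level 'stages:' key."""
    stages, in_block = [], False
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # strip YAML comments
        if not in_block:
            in_block = line.strip() == "stages:"
            continue
        if not line.strip():
            continue  # blank line or fully commented-out entry
        if line.strip().startswith("- "):
            stages.append(line.strip()[2:].strip())
        else:
            break  # reached the next key
    return stages


def launcher_path_problem(config_text):
    """Return a message describing a launcher_scripts_path problem, or None."""
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.startswith("launcher_scripts_path:"):
            value = line.split(":", 1)[1].strip().strip("'\"")
            if not value.rstrip("/").endswith("/launcher_scripts"):
                return "launcher_scripts_path should end with /launcher_scripts: " + value
            return None
    return "launcher_scripts_path is not set"
```

For the settings in this step, active_stages should return only data_preparation and launcher_path_problem should return None.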

From the launcher_scripts directory, run main.py to submit the job to Slurm.

$ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
$ python main.py
Job nemo-megatron-download_gpt3_pile submission file created at '/nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts/results/download_gpt3_pile/download/nemo-megatron-download_gpt3_pile_submission.sh' 
. . .

Use the Slurm squeue command to monitor the job status.
```
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    191_[20-29%30]       gpu nemo-meg      opc PD       0:00      1 (Resources)
        192_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency,Priority)
        193_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency)
            191_19       gpu nemo-meg      opc  R       0:50      1 gpu-permanent-node-517
            191_18       gpu nemo-meg      opc  R       0:56      1 gpu-permanent-node-878
```
Jobs with state (ST) R are running. The other jobs shown are waiting for resources or waiting for other jobs to complete first (dependencies).
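When the queue is long, a per-state tally is easier to read than the raw listing. The following sketch counts rows by their ST value; it assumes the default squeue column layout shown above (a header row containing ST), and the function name is illustrative. SAMPLE is the listing from this example.

```python
from collections import Counter

SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    191_[20-29%30]       gpu nemo-meg      opc PD       0:00      1 (Resources)
        192_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency,Priority)
        193_[0-29]       gpu nemo-meg      opc PD       0:00      1 (Dependency)
            191_19       gpu nemo-meg      opc  R       0:50      1 gpu-permanent-node-517
            191_18       gpu nemo-meg      opc  R       0:56      1 gpu-permanent-node-878
"""


def count_job_states(squeue_output):
    """Count squeue rows per ST value (a row may stand for a whole array range)."""
    lines = [line for line in squeue_output.splitlines() if line.strip()]
    if not lines:
        return Counter()
    state_column = lines[0].split().index("ST")  # raises ValueError if ST is absent
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > state_column:
            counts[fields[state_column]] += 1
    return counts


counts = count_job_states(SAMPLE)
print(counts["R"], counts["PD"])  # prints: 2 3
```

Note that a pending row such as 192_[0-29] represents a range of array tasks, so the tally counts rows, not individual tasks.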

Look in results/download_gpt3_pile/download to see the output of the completed and running jobs. The bash script that was used to submit the Slurm jobs is also there. The script and output can be useful for troubleshooting any jobs that do not run as expected.

On a two-node cluster, the steps took:

90 minutes to download
46 minutes to extract
5 hours and 45 minutes to preprocess

Run times would be considerably lower on a larger cluster, where the data shards can be processed in parallel across up to 30 nodes.
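Adding up the three timings gives the end-to-end data preparation time on the two-node cluster. A small worked example using only the figures above (the variable and function names are illustrative):

```python
from datetime import timedelta

# Timings reported above for the two-node cluster.
STEP_DURATIONS = {
    "download": timedelta(minutes=90),
    "extract": timedelta(minutes=46),
    "preprocess": timedelta(hours=5, minutes=45),
}


def total_duration(durations):
    """Sum a mapping of step name -> timedelta."""
    return sum(durations.values(), timedelta())


total = total_duration(STEP_DURATIONS)
hours, remainder = divmod(int(total.total_seconds()), 3600)
print("{} h {} min".format(hours, remainder // 60))  # prints: 8 h 1 min
```

Preprocessing accounts for most of that time, which is why spreading the shards over more nodes has the largest effect.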

Training

In the training stage, you edit the launcher configuration to run the LLM training.

Edit launcher_scripts/conf/config.yaml:

Set the job stage to training.

stages:
  #- data_preparation
  - training

Add variables to configure NVIDIA NVLink on OCI.

In the env_vars section, add the following variables. Leave the existing variables in place, but comment out any that are being replaced with new values.

env_vars:
  TRANSFORMERS_OFFLINE: 0 # (was 1)
  . . .
  RX_QUEUE_LEN: 8192
  IB_RX_QUEUE_LEN: 8192
  UCX_TLS: tcp
  HCOLL_ENABLE_MCAST_ALL: 0
  coll_hcoll_enable: 0
  UCX_NET_DEVICES: ens300
  NCCL_SOCKET_IFNAME: ens300
  NCCL_IB_TIMEOUT: 16
  NCCL_IB_SL: 0
  NCCL_IB_TC: 41
  NCCL_ALGO: Auto  # tree, ring
  NCCL_IB_GID_INDEX: 3
  NCCL_IB_QPS_PER_CONNECTION: 16  # was 4
  NCCL_IB_HCA: \'mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_11,mlx5_12,mlx5_14,mlx5_15,mlx5_16,mlx5_17\'
  NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information

For more information, see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html.

Edit conf/training/gpt3/5b.yaml.

run:
  time_limit: "6-00:00:00"  # allow the training job to run for 6 days

trainer:
  num_nodes: 2 # (was 16) set to the size of your cluster

model:
  micro_batch_size: 2           # (was 4) change to fit in A100/40GB memory
  tensor_model_parallel_size: 2 # (was 1) change to fit in A100/40GB memory

  optim:
    bucket_cap_mb: 200 # (was 400)
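The relationship between num_nodes and tensor_model_parallel_size can be illustrated with a little arithmetic: in Megatron-style training, the GPUs are split into model-parallel groups, and the number of such groups is the data-parallel size. The sketch below rests on two assumptions that are not stated in this example: 8 GPUs per node, and a pipeline_model_parallel_size of 1. The function name is hypothetical.

```python
def data_parallel_size(num_nodes, gpus_per_node, tensor_parallel, pipeline_parallel=1):
    """Number of model replicas: world size divided by the model-parallel group size."""
    world_size = num_nodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel:
        raise ValueError("world size must be divisible by tensor * pipeline parallel size")
    return world_size // model_parallel


# num_nodes=2 and tensor_model_parallel_size=2 come from 5b.yaml above;
# gpus_per_node=8 is an assumption about the node shape.
replicas = data_parallel_size(num_nodes=2, gpus_per_node=8, tensor_parallel=2)
print(replicas)  # prints: 8
```

Under those assumptions, raising tensor_model_parallel_size from 1 to 2 halves the number of replicas but lets each copy of the model spread across two GPUs, which is the trade made here to fit in 40 GB of memory.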

From the launcher_scripts directory, run main.py to submit the job to Slurm. Use the squeue command to verify that the job is running ("R").

$ cd /nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts
$ python main.py
Job nemo-megatron-gpt3_5b submission file created at '/nfs/scratch/NeMo-Megatron-Launcher/launcher_scripts/results/gpt3_5b/nemo-megatron-gpt3_5b_submission.sh'
Job nemo-megatron-gpt3_5b submitted with Job ID 285
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               285       gpu nemo-meg      opc  R       0:06      2 gpu-permanent-node-[517,878]

Examine the files in results/gpt3_5b/* to view output and error messages and to monitor the progress of running jobs.
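Picking out the most recent output file and reading its last few lines is a common way to follow progress. A minimal sketch, assuming the output files end in .out as in this example; the helper names are illustrative, and the directory would be the results/gpt3_5b directory mentioned above.

```python
from collections import deque
from pathlib import Path


def newest_output_file(results_dir, pattern="*.out"):
    """Return the most recently modified file matching `pattern`, or None."""
    candidates = list(Path(results_dir).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


def tail(path, count=5):
    """Return the last `count` lines of a text file without loading it all."""
    with open(path, errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

For example, calling tail on the result of newest_output_file for the results directory returns the latest progress lines of the current run.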

LLM training is working if you see lines similar to these in gpt3_5b_nnn.out:

Training:   0%|          | 0/75375 [00:00<?]
Epoch 0: :   0%|          | 0/75375 [00:00<?]
Epoch 0: :   0%|          | 1/75375 [00:52<1089:12:32]
Epoch 0: :   0%|          | 1/75375 [00:52<1089:13:02 ... train_step_timing in s=52.00]
Epoch 0: :   0%|          | 2/75375 [01:33<980:51:18 ... train_step_timing in s=52.00]
Epoch 0: :   0%|          | 2/75375 [01:33<980:51:33 ... train_step_timing in s=46.80]
Epoch 0: :   0%|          | 3/75375 [02:15<945:29:05 ... train_step_timing in s=46.80]
Epoch 0: :   0%|          | 3/75375 [02:15<945:29:14 ... train_step_timing in s=45.20]
Epoch 0: :   0%|          | 4/75375 [02:57<926:40:09 ... train_step_timing in s=45.20]
Epoch 0: :   0%|          | 4/75375 [02:57<926:40:16 ... train_step_timing in s=44.30]
Epoch 0: :   0%|          | 5/75375 [03:38<915:10:14 ... train_step_timing in s=44.30]
Epoch 0: :   0%|          | 5/75375 [03:38<915:10:20 ... train_step_timing in s=43.70]
Epoch 0: :   0%|          | 6/75375 [04:20<907:25:47 ... train_step_timing in s=43.70]
Epoch 0: :   0%|          | 6/75375 [04:20<907:25:52 ... train_step_timing in s=41.60]
Epoch 0: :   0%|          | 7/75375 [05:01<901:53:34 ... train_step_timing in s=41.60]
Epoch 0: :   0%|          | 7/75375 [05:01<901:53:38 ... train_step_timing in s=41.60]
Epoch 0: :   0%|          | 8/75375 [05:43<897:38:17 ... train_step_timing in s=41.60]
Epoch 0: :   0%|          | 8/75375 [05:43<897:38:21 ... train_step_timing in s=41.50]
Epoch 0: :   0%|          | 9/75375 [06:24<894:16:56 ... train_step_timing in s=41.50]
Epoch 0: :   0%|          | 9/75375 [06:24<894:16:59 ... train_step_timing in s=41.50]
Epoch 0: :   0%|          | 10/75375 [07:05<891:30:50 ... train_step_timing in s=41.50]

This indicates that each training step completed in about 41.5 seconds. Much faster step times can be achieved with more GPUs in the cluster.
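The step time also gives a rough estimate of how long the full run would take. The sketch below pulls the step counter and train_step_timing value out of a progress line and multiplies out the remaining steps. It assumes the log format shown above and a constant step time; the regular expression and function names are illustrative.

```python
import re

# Matches e.g. "9/75375 [06:24<894:16:59 ... train_step_timing in s=41.50]"
_PROGRESS = re.compile(r"(\d+)/(\d+) \[.*train_step_timing in s=([0-9.]+)\]")

LINE = "Epoch 0: :   0%|          | 9/75375 [06:24<894:16:59 ... train_step_timing in s=41.50]"


def parse_progress(line):
    """Return (step, total_steps, seconds_per_step), or None if the line has no timing."""
    match = _PROGRESS.search(line)
    if match is None:
        return None
    step, total, seconds = match.groups()
    return int(step), int(total), float(seconds)


def remaining_days(step, total_steps, seconds_per_step):
    """Estimate the remaining wall-clock time in days at a constant step time."""
    return (total_steps - step) * seconds_per_step / 86400.0


step, total, seconds = parse_progress(LINE)
print(round(remaining_days(step, total, seconds), 1))  # prints: 36.2
```

At roughly 36 days, the estimate for two nodes is well beyond the 6-day time_limit set in 5b.yaml, so a single job on a cluster of this size would not complete all 75375 steps.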