Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Fine-Tune and Serve Large Language Models on Oracle Cloud Infrastructure with dstack
Introduction
dstack is an open-source tool that simplifies Artificial Intelligence (AI) container orchestration and makes distributed training and deployment of Large Language Models (LLMs) more accessible. Combining dstack and Oracle Cloud Infrastructure (OCI) unlocks a streamlined process for setting up cloud infrastructure for distributed training and scalable model deployment.
How does dstack work?
dstack offers a unified interface for the development, training, and deployment of AI models across any cloud or data center. For example, you can specify a configuration for a training task or a model to be deployed, and dstack will take care of setting up the required infrastructure and orchestrating the containers. One of the advantages dstack offers is that it allows the use of any hardware, frameworks, and scripts.
Objectives
- Fine-tune an LLM using dstack on OCI, incorporating best practices from the Hugging Face Alignment Handbook, and deploy the model using Hugging Face Text Generation Inference (TGI).

Note: The experiment described in this tutorial uses an OCI cluster of three nodes, each with 2 x A10 GPUs, to fine-tune the Gemma 7B model.
Prerequisites
- Access to an OCI tenancy.
- Access to shapes with NVIDIA GPUs, such as A10 GPUs (for example, VM.GPU.A10.2). For more information on requesting a limit increase, see Service Limits.
- A Hugging Face account with an access token configured to download the Gemma 7B model.
Task 1: Set up dstack with OCI
- Install the dstack Python package using the following command. Since dstack supports multiple cloud providers, we can narrow down the scope to OCI.

  pip install dstack[oci]
- Configure the OCI-specific credentials inside the ~/.dstack/server/config.yml file. The following code assumes that you have credentials for the OCI Command Line Interface (CLI) configured. For other configuration options, see the dstack documentation.

  projects:
  - name: main
    backends:
    - type: oci
      creds:
        type: default
- Run the dstack server using the following command.

  dstack server

  INFO     Applying ~/.dstack/server/config.yml...
  INFO     Configured the main project in ~/.dstack/config.yml
  INFO     The admin token is ab6e8759-9cd9-4e84-8d47-5b94ac877ebf
  INFO     The dstack server 0.18.4 is running at http://127.0.0.1:3000
- Switch to the folder with your project scripts and initialize dstack using the following command.

  dstack init
Task 2: Fine-Tune Job on OCI with dstack
To fine-tune the Gemma 7B model, we will use the Hugging Face Alignment Handbook to incorporate fine-tuning best practices. The source code for this tutorial is available on GitHub. Let us dive into the practical steps for fine-tuning your LLM.
Once you switch to the project folder, use the following command to initiate the fine-tuning job on OCI with dstack.
ACCEL_CONFIG_PATH=fsdp_qlora_full_shard.yaml \
FT_MODEL_CONFIG_PATH=qlora_finetune_config.yaml \
HUGGING_FACE_HUB_TOKEN=xxxx \
WANDB_API_KEY=xxxx \
dstack run . -f ft.task.dstack.yml
The FT_MODEL_CONFIG_PATH, ACCEL_CONFIG_PATH, HUGGING_FACE_HUB_TOKEN, and WANDB_API_KEY environment variables are defined inside the ft.task.dstack.yml task configuration. dstack run submits the task defined in ft.task.dstack.yml on OCI.
Note: dstack automatically copies the current directory content when executing the task.
Let us explore the key parts of each YAML file (for the full contents, check the repository).
The qlora_finetune_config.yaml file is the recipe configuration that the Alignment Handbook uses to understand how you want to fine-tune the LLM.
# Model arguments
model_name_or_path: google/gemma-7b
tokenizer_name_or_path: philschmid/gemma-tokenizer-chatml
torch_dtype: bfloat16
bnb_4bit_quant_storage: bfloat16
# LoRA arguments
load_in_4bit: true
use_peft: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
# ...
# Data training arguments
dataset_mixer:
  chansung/mental_health_counseling_conversations: 1.0
dataset_splits:
- train
- test
# ...
- Model Arguments:
  - model_name_or_path: Google's Gemma 7B is chosen as the base model.
  - tokenizer_name_or_path: The Alignment Handbook uses the apply_chat_template() method of the chosen tokenizer. This tutorial uses the ChatML template instead of Gemma 7B's standard conversation template.
  - torch_dtype and bnb_4bit_quant_storage: These two values should be the same to leverage the FSDP+QLoRA fine-tuning method. Since Gemma 7B is hard to fit into a single A10 GPU, this tutorial uses FSDP+QLoRA to shard the model across 2 x A10 GPUs while leveraging the QLoRA technique. (See the first sketch after this list for how these values map to the underlying quantization and LoRA configuration.)
- LoRA Arguments: LoRA-specific configurations. Since this tutorial leverages the FSDP+QLoRA technique, load_in_4bit is set to true. Other configurations can vary from experiment to experiment.
- Data Training Arguments: We have prepared a dataset based on Amod's mental health counseling conversations dataset. Since the Alignment Handbook only understands data in the form of [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, …], which can be interpreted with the tokenizer's apply_chat_template() method, the prepared dataset is basically a conversion of the original dataset into an apply_chat_template()-compatible format. (A conversion sketch is shown after this list.)
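To make the Model and LoRA arguments more concrete, here is a minimal sketch of how recipe values such as torch_dtype, bnb_4bit_quant_storage, load_in_4bit, and the LoRA parameters map onto Hugging Face objects. This is illustrative only: the Alignment Handbook builds these objects internally from the YAML recipe, and bnb_4bit_quant_type is a typical QLoRA choice rather than a value shown in the excerpt above.

# Illustrative sketch; the Alignment Handbook constructs equivalent objects from the recipe.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig

# torch_dtype and bnb_4bit_quant_storage must match (bfloat16 here) so that FSDP
# can uniformly wrap and shard the 4-bit quantized modules.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load_in_4bit: true
    bnb_4bit_quant_type="nf4",              # common QLoRA choice (assumption, not in the excerpt)
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # bnb_4bit_quant_storage: bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",                      # model_name_or_path
    torch_dtype=torch.bfloat16,             # torch_dtype: bfloat16
    quantization_config=bnb_config,
)

# LoRA arguments from the recipe.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj"],
    task_type="CAUSAL_LM",
)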
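For the Data Training Arguments, the conversion into the message format can be done with the datasets library. The sketch below assumes the source dataset exposes Context and Response columns and that you push the converted dataset to your own Hugging Face Hub repository; adjust the column names and repository ID to your setup.

# A sketch of converting the source dataset into messages that
# apply_chat_template() understands. Column names are assumptions.
from datasets import load_dataset

ds = load_dataset("Amod/mental_health_counseling_conversations", split="train")

def to_messages(example):
    return {
        "messages": [
            {"role": "user", "content": example["Context"]},
            {"role": "assistant", "content": example["Response"]},
        ]
    }

converted = ds.map(to_messages, remove_columns=ds.column_names)
# The recipe's dataset_splits expects train and test splits, so split before pushing,
# for example: converted.train_test_split(test_size=0.1)
# Then push the result to the Hub and reference it in dataset_mixer.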
The fsdp_qlora_full_shard.yaml file tells accelerate how to use the underlying infrastructure for fine-tuning the LLM.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP # Use Fully Sharded Data Parallelism
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_use_orig_params: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  # ... (other FSDP configurations)
# ... (other configurations)
- distributed_type: FSDP indicates the use of Fully Sharded Data Parallel (FSDP), a technique that enables training large models that would otherwise not fit on a single GPU.
- fsdp_config: These settings control how FSDP operates, such as how the model is sharded (fsdp_sharding_strategy) and whether parameters are offloaded to the CPU (fsdp_offload_params).
With distributed_type set to FSDP and fsdp_config's fsdp_sharding_strategy set to FULL_SHARD, the model's parameters, gradients, and optimizer states are sharded across all the GPUs participating in the job, spanning every node when multiple compute nodes are used. If you instead want each node to hold its own copy of the model, sharded only across the GPUs within that node, while the nodes process different batches of your dataset, set fsdp_sharding_strategy to HYBRID_SHARD, which combines FSDP within each node with data parallelism across nodes.
Additional parameters like machine_rank, num_machines, and num_processes are important for coordination. However, it is recommended to set these values dynamically at runtime, as this provides flexibility when switching between different infrastructure setups.
Task 3: Instruct dstack using Simplified Configuration to Provision Infrastructure
Let us explore the ft.task.dstack.yml configuration that puts everything together and instructs dstack on how to provision infrastructure and run the task.
type: task
nodes: 3
python: "3.11"
env:
- ACCEL_CONFIG_PATH
- FT_MODEL_CONFIG_PATH
- HUGGING_FACE_HUB_TOKEN
- WANDB_API_KEY
commands:
# ... (setup steps, cloning repo, installing requirements)
- ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file recipes/custom/accel_config.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    scripts/run_sft.py recipes/custom/config.yaml
ports:
- 6006
resources:
  gpu: 1..2
  shm_size: 24GB
Key Points to Highlight:
- Seamless Integration: dstack effortlessly integrates with the Hugging Face open source ecosystem. In particular, you can use the accelerate library with the configurations defined in fsdp_qlora_full_shard.yaml as usual.
- Automatic Configuration: The DSTACK_MASTER_NODE_IP, DSTACK_NODE_RANK, DSTACK_GPUS_NUM, and DSTACK_NODES_NUM variables are automatically managed by dstack, reducing manual setup.
- Resource Allocation: dstack makes it easy to specify the number of nodes and GPUs (gpu: 1..2) for your fine-tuning job. Hence, for this tutorial, there are three nodes, each equipped with 2 x A10 (24 GB) GPUs.
Task 4: Serve your Fine-Tuned Model with dstack
Once your model is fine-tuned, dstack makes it a breeze to deploy it on OCI using the Hugging Face Text Generation Inference (TGI) framework.
Here is an example of how you can define a service in dstack:
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
- HUGGING_FACE_HUB_TOKEN
- MODEL_ID=chansung/mental_health_counseling_merged_v0.1
commands:
- text-generation-launcher \
    --max-input-tokens 512 --max-total-tokens 1024 \
    --max-batch-prefill-tokens 512 --port 8000
port: 8000
resources:
  gpu:
    memory: 48GB
# (Optional) Enable the OpenAI-compatible endpoint
model:
  format: tgi
  type: chat
  name: chansung/mental_health_counseling_merged_v0.1
Key Advantages of This Approach:
- Secure HTTPS Gateway: dstack simplifies the process of setting up a secure HTTPS connection through a gateway, a crucial aspect of production-level model serving.
- Optimized for Inference: The TGI framework is designed for efficient text generation inference, ensuring your model delivers responsive and reliable results.
- Auto-scaling: dstack allows you to specify an autoscaling policy, including the minimum and maximum number of model replicas.
At this point, you can interact with the service using the standard curl command, the Python requests library, the OpenAI SDK, or Hugging Face's InferenceClient library. For instance, the following code snippet shows an example using curl, followed by a Python equivalent.
curl -X POST https://black-octopus-1.mycustomdomain.com/generate \
-H "Authorization: Bearer <dstack-token>" \
-H 'Content-Type: application/json' \
-d '{"inputs": "I feel bad...", "parameters": {"max_new_tokens": 128}}'
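The same request can be sent from Python. The sketch below uses the requests library with the same placeholder URL and dstack token as the curl example, and also shows Hugging Face's InferenceClient pointed at the deployed TGI service, assuming the gateway accepts the same bearer token; substitute your own domain and token.

# Python equivalents of the curl example above. The URL and token are placeholders.
import requests
from huggingface_hub import InferenceClient

response = requests.post(
    "https://black-octopus-1.mycustomdomain.com/generate",
    headers={
        "Authorization": "Bearer <dstack-token>",
        "Content-Type": "application/json",
    },
    json={"inputs": "I feel bad...", "parameters": {"max_new_tokens": 128}},
)
print(response.json())

# Alternatively, point Hugging Face's InferenceClient at the deployed TGI service.
client = InferenceClient(
    model="https://black-octopus-1.mycustomdomain.com",  # gateway URL placeholder
    token="<dstack-token>",
)
print(client.text_generation("I feel bad...", max_new_tokens=128))

Because the service configuration above also enables the OpenAI-compatible endpoint, the OpenAI SDK can be used as well; see the dstack documentation for the endpoint details.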
Additionally, for a deployed model, dstack automatically provides a user interface to directly interact with the model.
Next Steps
By following the steps outlined in this tutorial, you have unlocked a powerful approach to fine-tuning and deploying LLMs using the combined capabilities of dstack, OCI, and the Hugging Face ecosystem. You can now leverage dstack's user-friendly interface to manage your OCI resources effectively, streamlining the process of setting up distributed training environments for your LLM projects.
Furthermore, the integration with Hugging Face’s Alignment Handbook and TGI framework empowers you to fine-tune and serve your models seamlessly, ensuring they are optimized for performance and ready for real-world applications. We encourage you to explore the possibilities further and experiment with different models and configurations to achieve your desired outcomes in the world of natural language processing.
Acknowledgments
- Authors - Chansung Park (Hugging Face fellow, AI researcher), Yann Caniou (AI Infra/GPU Specialist), Bruno Garbaccio (AI Infra/GPU Specialist)
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.