Fine-Tune and Serve Large Language Models on Oracle Cloud Infrastructure with dstack

Introduction

dstack is an open-source tool that simplifies Artificial Intelligence (AI) container orchestration and makes distributed training and deployment of Large Language Models (LLMs) more accessible. Combining dstack and Oracle Cloud Infrastructure (OCI) unlocks a streamlined process for setting up cloud infrastructure for distributed training and scalable model deployment.

How does dstack work?

dstack offers a unified interface for the development, training, and deployment of AI models across any cloud or data center. For example, you can specify a configuration for a training task or a model to be deployed, and dstack will take care of setting up the required infrastructure and orchestrating the containers. One advantage of dstack is that it lets you use any hardware, frameworks, and scripts.
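
To make this concrete, here is a minimal, illustrative sketch of a dstack task configuration; the script name and GPU memory size are placeholders and not part of this tutorial’s code. Submitting such a file with dstack run provisions matching infrastructure in the configured backend and runs the commands in a container.

type: task
python: "3.11"
commands:
  - pip install -r requirements.txt
  - python train.py        # placeholder training script
resources:
  gpu:
    memory: 24GB           # placeholder GPU memory requirement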

Objectives

Prerequisites

Task 1: Set up dstack with OCI

  1. Install the dstack Python package using the following command. Since dstack supports multiple cloud providers, the [oci] extra narrows the scope to OCI.

    pip install "dstack[oci]"
    
  2. Configure the OCI-specific credentials inside the ~/.dstack/server/config.yml file. The following code assumes that you have OCI Command Line Interface (CLI) credentials configured. For other configuration options, see the dstack documentation.

    projects:
    - name: main
      backends:
      - type: oci
        creds:
          type: default
    
  3. Run the dstack server using the following command.

    dstack server
    INFO     Applying ~/.dstack/server/config.yml...
    INFO     Configured the main project in ~/.dstack/config.yml
    INFO     The admin token is ab6e8759-9cd9-4e84-8d47-5b94ac877ebf
    INFO     The dstack server 0.18.4 is running at http://127.0.0.1:3000
    
  4. Switch to the folder with your project scripts and initialize dstack using the following command.

    dstack init
    

Task 2: Fine-Tune Job on OCI with dstack

To fine-tune the Gemma 7B model, we will use the Hugging Face Alignment Handbook to incorporate fine-tuning best practices. The source code for this tutorial is available on GitHub. Let us dive into the practical steps for fine-tuning your LLM.

Once you switch to the project folder, use the following commands to initiate the fine-tuning job on OCI with dstack.

ACCEL_CONFIG_PATH=fsdp_qlora_full_shard.yaml \
  FT_MODEL_CONFIG_PATH=qlora_finetune_config.yaml \
  HUGGING_FACE_HUB_TOKEN=xxxx \
  WANDB_API_KEY=xxxx \
  dstack run . -f ft.task.dstack.yml

The FT_MODEL_CONFIG_PATH, ACCEL_CONFIG_PATH, HUGGING_FACE_HUB_TOKEN, and WANDB_API_KEY environment variables are defined inside the ft.task.dstack.yml task configuration. dstack run submits the task defined in ft.task.dstack.yml on OCI.

Note: dstack automatically copies the current directory content when executing the task.

Let us explore the key parts of each YAML file (for the full contents, check the repository).

The qlora_finetune_config.yaml file is the recipe configuration that the Alignment Handbook uses to understand how you want to fine-tune the LLM.

# Model arguments
model_name_or_path: google/gemma-7b
tokenizer_name_or_path: philschmid/gemma-tokenizer-chatml
torch_dtype: bfloat16
bnb_4bit_quant_storage: bfloat16

# LoRA arguments
load_in_4bit: true
use_peft: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
# ...

# Data training arguments
dataset_mixer:
  chansung/mental_health_counseling_conversations: 1.0
dataset_splits:
  - train
  - test
# ...

The fsdp_qlora_full_shard.yaml file tells accelerate how to use the underlying infrastructure for fine-tuning the LLM.

compute_environment: LOCAL_MACHINE
distributed_type: FSDP  # Use Fully Sharded Data Parallelism
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_use_orig_params: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  # ... (other FSDP configurations)
# ... (other configurations)

Hybrid sharding

With distributed_type set to FSDP and fsdp_config’s fsdp_sharding_strategy set to FULL_SHARD, the model is sharded across all GPUs in the job, so on multiple compute nodes a single model instance is partitioned across every GPU on every node. If you instead want each node to hold its own replica of the model, sharded only across the GPUs within that node while the replicas process different batches of your dataset, set fsdp_sharding_strategy to HYBRID_SHARD.
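
For example, switching to hybrid sharding is a one-line change in fsdp_qlora_full_shard.yaml; only the relevant lines are shown in this sketch.

fsdp_config:
  fsdp_sharding_strategy: HYBRID_SHARD  # shard within each node, replicate across nodes
  # ... (other FSDP configurations unchanged)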

Additional parameters like machine_rank, num_machines, and num_processes are important for coordination. However, it is recommended to set these values dynamically at runtime, as this provides flexibility when switching between different infrastructure setups; the task configuration in the next section does exactly that through the DSTACK_* environment variables.

Task 3: Instruct dstack using Simplified Configuration to Provision Infrastructure

Let us explore the ft.task.dstack.yml configuration that puts everything together and instructs dstack on how to provision infrastructure and run the task.

type: task
nodes: 3

python: "3.11"
env:
  - ACCEL_CONFIG_PATH
  - FT_MODEL_CONFIG_PATH
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY
commands:
  # ... (setup steps, cloning repo, installing requirements)
  - ACCELERATE_LOG_LEVEL=info accelerate launch \
      --config_file recipes/custom/accel_config.yaml \
      --main_process_ip=$DSTACK_MASTER_NODE_IP \
      --main_process_port=8008 \
      --machine_rank=$DSTACK_NODE_RANK \
      --num_processes=$DSTACK_GPUS_NUM \
      --num_machines=$DSTACK_NODES_NUM \
      scripts/run_sft.py recipes/custom/config.yaml
ports:
  - 6006
resources:
  gpu: 1..2
  shm_size: 24GB

Key Points to Highlight:

Task 4: Serve your Fine-Tuned Model with dstack

Once your model is fine-tuned, dstack makes it a breeze to deploy it on OCI using the Hugging Face Text Generation Inference (TGI) framework.

Here is an example of how you can define a service in dstack:

type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=chansung/mental_health_counseling_merged_v0.1
commands:
  - text-generation-launcher \
    --max-input-tokens 512 --max-total-tokens 1024 \
    --max-batch-prefill-tokens 512 --port 8000
port: 8000

resources:
  gpu:
    memory: 48GB

# (Optional) Enable the OpenAI-compatible endpoint
model:
  format: tgi
  type: chat
  name: chansung/mental_health_counseling_merged_v0.1
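
Assuming the service definition above is saved as, for example, serve.dstack.yml (the file name here is illustrative), it is submitted the same way as the fine-tuning task:

dstack run . -f serve.dstack.yml

dstack then provisions an OCI instance that satisfies the resources block and starts TGI with the specified arguments.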

Key Advantages of This Approach:

At this point, you can interact with the service using the standard curl command, Python’s requests library, the OpenAI SDK, or Hugging Face’s Inference Client. For instance, the following snippet shows an example with curl.

curl -X POST https://black-octopus-1.mycustomdomain.com/generate \
  -H "Authorization: Bearer <dstack-token>" \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "I feel bad...", "parameters": {"max_new_tokens": 128}}'

Additionally, for a deployed model, dstack automatically provides a user interface to directly interact with the model.

User Interface

Next Steps

By following the steps outlined in this tutorial, you have unlocked a powerful approach to fine-tuning and deploying LLMs using the combined capabilities of dstack, OCI, and the Hugging Face ecosystem. You can now leverage dstack’s user-friendly interface to manage your OCI resources effectively, streamlining the process of setting up distributed training environments for your LLM projects.

Furthermore, the integration with Hugging Face’s Alignment Handbook and TGI framework empowers you to fine-tune and serve your models seamlessly, ensuring they are optimized for performance and ready for real-world applications. We encourage you to explore the possibilities further and experiment with different models and configurations to achieve your desired outcomes in the world of natural language processing.

Acknowledgments

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.