Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Fine-Tune and Serve Large Language Models on Oracle Cloud Infrastructure with dstack
Introduction
dstack is an open-source tool that simplifies Artificial Intelligence (AI) container orchestration and makes distributed training and deployment of Large Language Models (LLMs) more accessible. Combining dstack and Oracle Cloud Infrastructure (OCI) unlocks a streamlined process for setting up cloud infrastructure for distributed training and scalable model deployment.
How does dstack work?
dstack offers a unified interface for the development, training, and deployment of AI models across any cloud or data center. For example, you can specify a configuration for a training task or a model to be deployed, and dstack will take care of setting up the required infrastructure and orchestrating the containers. One of the advantages dstack offers is that it allows the use of any hardware, frameworks, and scripts.
Objectives
- Fine-tune an LLM using dstack on OCI, incorporating best practices from the Hugging Face Alignment Handbook, and deploy the model using Hugging Face Text Generation Inference (TGI).

Note: The experiment described in this tutorial uses an OCI cluster of three nodes, each with 2 x A10 GPUs, to fine-tune the Gemma 7B model.
Prerequisites
- Access to an OCI tenancy.
- Access to shapes with NVIDIA GPUs, such as A10 GPUs (for example, VM.GPU.A10.2). For more information on requesting a limit increase, see Service Limits.
- A Hugging Face account with an access token configured to download the Gemma 7B model.
Task 1: Set up dstack with OCI
- Install the dstack Python package using the following command. Since dstack supports multiple cloud providers, we can narrow down the scope to OCI.

  pip install dstack[oci]
- Configure the OCI-specific credentials inside the ~/.dstack/server/config.yml file. The following code assumes that you have credentials for the OCI Command Line Interface (CLI) configured. For other configuration options, see the dstack documentation.

  projects:
  - name: main
    backends:
    - type: oci
      creds:
        type: default
- Run the dstack server using the following command.

  dstack server

  INFO     Applying ~/.dstack/server/config.yml...
  INFO     Configured the main project in ~/.dstack/config.yml
  INFO     The admin token is ab6e8759-9cd9-4e84-8d47-5b94ac877ebf
  INFO     The dstack server 0.18.4 is running at http://127.0.0.1:3000
- Switch to the folder with your project scripts and initialize dstack using the following command.

  dstack init
Task 2: Fine-Tune Job on OCI with dstack
To fine-tune the Gemma 7B model, we will use the Hugging Face Alignment Handbook to incorporate fine-tuning best practices. The source code for this tutorial is available on GitHub. Let us dive into the practical steps for fine-tuning your LLM.
Once you switch to the project folder, use the following command to initiate the fine-tuning job on OCI with dstack.
ACCEL_CONFIG_PATH=fsdp_qlora_full_shard.yaml \
FT_MODEL_CONFIG_PATH=qlora_finetune_config.yaml \
HUGGING_FACE_HUB_TOKEN=xxxx \
WANDB_API_KEY=xxxx \
dstack run . -f ft.task.dstack.yml
The FT_MODEL_CONFIG_PATH, ACCEL_CONFIG_PATH, HUGGING_FACE_HUB_TOKEN, and WANDB_API_KEY environment variables are defined inside the ft.task.dstack.yml task configuration. dstack run submits the task defined in ft.task.dstack.yml on OCI.
Note: dstack automatically copies the current directory content when executing the task.
Let us explore the key parts of each YAML file (for the full contents, check the repository).
The qlora_finetune_config.yaml file is the recipe configuration that the Alignment Handbook uses to understand how you want to fine-tune the LLM.
# Model arguments
model_name_or_path: google/gemma-7b
tokenizer_name_or_path: philschmid/gemma-tokenizer-chatml
torch_dtype: bfloat16
bnb_4bit_quant_storage: bfloat16
# LoRA arguments
load_in_4bit: true
use_peft: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
# ...
# Data training arguments
dataset_mixer:
  chansung/mental_health_counseling_conversations: 1.0
dataset_splits:
- train
- test
# ...
- Model Arguments:
  - model_name_or_path: Google's Gemma 7B is chosen as the base model.
  - tokenizer_name_or_path: The Alignment Handbook uses the apply_chat_template() method of the chosen tokenizer. This tutorial uses the ChatML template instead of Gemma 7B's standard conversation template.
  - torch_dtype and bnb_4bit_quant_storage: These two values should be the same to leverage the FSDP+QLoRA fine-tuning method. Since Gemma 7B is hard to fit into a single A10 GPU, this tutorial uses FSDP+QLoRA to shard the model across 2 x A10 GPUs while leveraging the QLoRA technique. (See the first sketch after this list for how these values map to the underlying quantization and LoRA configuration.)
- LoRA Arguments: LoRA-specific configurations. Since this tutorial leverages the FSDP+QLoRA technique, load_in_4bit is set to true. Other configurations can vary from experiment to experiment.
- Data Training Arguments: We have prepared a dataset based on Amod's mental health counseling conversations dataset. Since the Alignment Handbook only understands data in the form of [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, …], which can be interpreted with the tokenizer's apply_chat_template() method, the prepared dataset is basically a conversion of the original dataset into an apply_chat_template()-compatible format. (A conversion sketch is shown after this list.)
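To make the Model and LoRA arguments more concrete, here is a minimal sketch of how recipe values such as torch_dtype, bnb_4bit_quant_storage, load_in_4bit, and the LoRA parameters map onto Hugging Face objects. This is illustrative only: the Alignment Handbook builds these objects internally from the YAML recipe, and bnb_4bit_quant_type is a typical QLoRA choice rather than a value shown in the excerpt above.

# Illustrative sketch; the Alignment Handbook constructs equivalent objects from the recipe.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig

# torch_dtype and bnb_4bit_quant_storage must match (bfloat16 here) so that FSDP
# can uniformly wrap and shard the 4-bit quantized modules.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load_in_4bit: true
    bnb_4bit_quant_type="nf4",              # common QLoRA choice (assumption, not in the excerpt)
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # bnb_4bit_quant_storage: bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",                      # model_name_or_path
    torch_dtype=torch.bfloat16,             # torch_dtype: bfloat16
    quantization_config=bnb_config,
)

# LoRA arguments from the recipe.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj"],
    task_type="CAUSAL_LM",
)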
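For the Data Training Arguments, the conversion into the message format can be done with the datasets library. The sketch below assumes the source dataset exposes Context and Response columns and that you push the converted dataset to your own Hugging Face Hub repository; adjust the column names and repository ID to your setup.

# A sketch of converting the source dataset into messages that
# apply_chat_template() understands. Column names are assumptions.
from datasets import load_dataset

ds = load_dataset("Amod/mental_health_counseling_conversations", split="train")

def to_messages(example):
    return {
        "messages": [
            {"role": "user", "content": example["Context"]},
            {"role": "assistant", "content": example["Response"]},
        ]
    }

converted = ds.map(to_messages, remove_columns=ds.column_names)
# The recipe's dataset_splits expects train and test splits, so split before pushing,
# for example: converted.train_test_split(test_size=0.1)
# Then push the result to the Hub and reference it in dataset_mixer.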
The fsdp_qlora_full_shard.yaml file tells accelerate how to use the underlying infrastructure for fine-tuning the LLM.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP # Use Fully Sharded Data Parallelism
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_use_orig_params: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  # ... (other FSDP configurations)
# ... (other configurations)
- distributed_type: FSDP indicates the use of Fully Sharded Data Parallel (FSDP), a technique that enables training large models that would otherwise not fit on a single GPU.
- fsdp_config: These settings control how FSDP operates, such as how the model is sharded (fsdp_sharding_strategy) and whether parameters are offloaded to the CPU (fsdp_offload_params).
With distributed_type set to FSDP and fsdp_config's fsdp_sharding_strategy set to FULL_SHARD, the model's parameters, gradients, and optimizer states are sharded across all the GPUs participating in the job, spanning every node when multiple compute nodes are used. If you instead want each node to hold its own copy of the model, sharded only across the GPUs within that node, while the nodes process different batches of your dataset, set fsdp_sharding_strategy to HYBRID_SHARD, which combines FSDP within each node with data parallelism across nodes.
Additional parameters like machine_rank, num_machines, and num_processes are important for coordination. However, it is recommended to set these values dynamically at runtime, as this provides flexibility when switching between different infrastructure setups.
Task 3: Instruct dstack using Simplified Configuration to Provision Infrastructure
Let us explore the ft.task.dstack.yml configuration that puts everything together and instructs dstack on how to provision infrastructure and run the task.
type: task
nodes: 3
python: "3.11"
env:
- ACCEL_CONFIG_PATH
- FT_MODEL_CONFIG_PATH
- HUGGING_FACE_HUB_TOKEN
- WANDB_API_KEY
commands:
# ... (setup steps, cloning repo, installing requirements)
- ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file recipes/custom/accel_config.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    scripts/run_sft.py recipes/custom/config.yaml
ports:
- 6006
resources:
  gpu: 1..2
  shm_size: 24GB
Key Points to Highlight:
- Seamless Integration: dstack effortlessly integrates with the Hugging Face open source ecosystem. In particular, you can use the accelerate library with the configurations defined in fsdp_qlora_full_shard.yaml as usual.
- Automatic Configuration: The DSTACK_MASTER_NODE_IP, DSTACK_NODE_RANK, DSTACK_GPUS_NUM, and DSTACK_NODES_NUM variables are automatically managed by dstack, reducing manual setup.
- Resource Allocation: dstack makes it easy to specify the number of nodes and GPUs (gpu: 1..2) for your fine-tuning job. Hence, for this tutorial, there are three nodes, each equipped with 2 x A10 (24 GB) GPUs.
Task 4: Serve your Fine-Tuned Model with dstack
Once your model is fine-tuned, dstack makes it a breeze to deploy it on OCI using the Hugging Face Text Generation Inference (TGI) framework.
Here is an example of how you can define a service in dstack:
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
- HUGGING_FACE_HUB_TOKEN
- MODEL_ID=chansung/mental_health_counseling_merged_v0.1
commands:
- text-generation-launcher \
    --max-input-tokens 512 --max-total-tokens 1024 \
    --max-batch-prefill-tokens 512 --port 8000
port: 8000
resources:
  gpu:
    memory: 48GB
# (Optional) Enable the OpenAI-compatible endpoint
model:
  format: tgi
  type: chat
  name: chansung/mental_health_counseling_merged_v0.1
Key Advantages of This Approach:
- Secure HTTPS Gateway: dstack simplifies the process of setting up a secure HTTPS connection through a gateway, a crucial aspect of production-level model serving.
- Optimized for Inference: The TGI framework is designed for efficient text generation inference, ensuring your model delivers responsive and reliable results.
- Auto-scaling: dstack allows you to specify an autoscaling policy, including the minimum and maximum number of model replicas.
At this point, you can interact with the service using the standard curl command, the Python requests library, the OpenAI SDK, or Hugging Face's InferenceClient library. For instance, the following code snippet shows an example using curl, followed by a Python equivalent.
curl -X POST https://black-octopus-1.mycustomdomain.com/generate \
-H "Authorization: Bearer <dstack-token>" \
-H 'Content-Type: application/json' \
-d '{"inputs": "I feel bad...", "parameters": {"max_new_tokens": 128}}'
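The same request can be sent from Python. The sketch below uses the requests library with the same placeholder URL and dstack token as the curl example, and also shows Hugging Face's InferenceClient pointed at the deployed TGI service, assuming the gateway accepts the same bearer token; substitute your own domain and token.

# Python equivalents of the curl example above. The URL and token are placeholders.
import requests
from huggingface_hub import InferenceClient

response = requests.post(
    "https://black-octopus-1.mycustomdomain.com/generate",
    headers={
        "Authorization": "Bearer <dstack-token>",
        "Content-Type": "application/json",
    },
    json={"inputs": "I feel bad...", "parameters": {"max_new_tokens": 128}},
)
print(response.json())

# Alternatively, point Hugging Face's InferenceClient at the deployed TGI service.
client = InferenceClient(
    model="https://black-octopus-1.mycustomdomain.com",  # gateway URL placeholder
    token="<dstack-token>",
)
print(client.text_generation("I feel bad...", max_new_tokens=128))

Because the service configuration above also enables the OpenAI-compatible endpoint, the OpenAI SDK can be used as well; see the dstack documentation for the endpoint details.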
Additionally, for a deployed model, dstack automatically provides a user interface to directly interact with the model.
Next Steps
By following the steps outlined in this tutorial, you have unlocked a powerful approach to fine-tuning and deploying LLMs using the combined capabilities of dstack, OCI, and the Hugging Face ecosystem. You can now leverage dstack's user-friendly interface to manage your OCI resources effectively, streamlining the process of setting up distributed training environments for your LLM projects.
Furthermore, the integration with Hugging Face’s Alignment Handbook and TGI framework empowers you to fine-tune and serve your models seamlessly, ensuring they are optimized for performance and ready for real-world applications. We encourage you to explore the possibilities further and experiment with different models and configurations to achieve your desired outcomes in the world of natural language processing.
Acknowledgments
- Authors - Chansung Park (Hugging Face fellow, AI researcher), Yann Caniou (AI Infra/GPU Specialist), Bruno Garbaccio (AI Infra/GPU Specialist)
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.