Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Set up a Simple LLM Inference Benchmarking System with vLLM on Oracle Cloud Infrastructure Compute
Introduction
Understanding a system’s performance characteristics allows you to make informed choices about its components and what to expect of them. In this tutorial, we will set up a complete benchmarking system for AI large language model (LLM) inferencing. With it, you can run a range of experiments, from assessing whether a particular Oracle Cloud Infrastructure (OCI) Compute shape is suitable for the LLM and performance requirements you have in mind, to comparing different LLMs against each other.
In this tutorial, we use the popular open source inference server vLLM, which supports many state-of-the-art LLMs, offers the performance optimizations crucial for serving these models efficiently, and can handle thousands of concurrent requests. In addition, we use Ray LLMPerf to run the benchmarks.
Note: This tutorial assumes that you have an OCI tenancy with a GPU quota. For more information about quotas, see Compute Quotas.
Objectives
- Provision OCI Compute GPU instances.
- Install typical AI stack prerequisites.
- Set up vLLM and LLMPerf.
- Run an LLM inference performance benchmark.
Prerequisites
- Access to an OCI tenancy.
- Access to shapes with NVIDIA GPUs, such as A10 GPUs (for example, VM.GPU.A10.1). For more information about requesting a limit increase, see Service Limits.
- A Hugging Face account with an access token configured and permission to download the Llama-3.2-3B-Instruct model.
Task 1: Configure your Network
In this task, configure your Virtual Cloud Network (VCN) to provide a functional and secure setup to run your benchmarks. In this example, we need a VCN with one public and one private subnet. You can use the VCN setup wizard or manually set up all the components, making sure security lists allow SSH access and a NAT gateway is assigned to the private subnet.
Before provisioning OCI Compute instances and proceeding with the benchmark setup, it is important to ensure that your network settings are properly configured. By default, the provisioned network allows only essential traffic between subnets. To allow the network traffic required for benchmarking, you will need to adjust these settings to permit HTTP traffic between the public and private subnets. Specifically, add an ingress rule to the security list of the private subnet that allows traffic from the VCN CIDR block to destination port 8000.
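If you prefer to script this change rather than use the Console, the ingress rule can also be added with the OCI CLI. The following is a minimal sketch only: the security list OCID and the 10.0.0.0/16 VCN CIDR are placeholder values for your environment, and because the update call replaces the entire ingress rule set, the rules file must also contain your existing rules.

# Sketch only: ingress-rules.json must contain ALL ingress rules for the
# private subnet's security list (your existing rules plus the new one below),
# because the update operation replaces the whole list.
cat > ingress-rules.json <<'EOF'
[
  {
    "protocol": "6",
    "source": "10.0.0.0/16",
    "tcpOptions": { "destinationPortRange": { "min": 8000, "max": 8000 } }
  }
]
EOF

# Replace the placeholder OCID with the security list of your private subnet.
oci network security-list update \
  --security-list-id <private-subnet-security-list-ocid> \
  --ingress-security-rules file://ingress-rules.json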
Task 2: Provision OCI Compute Instances
In this task, provision two OCI Compute instances for our benchmark setup: a CPU-only virtual machine (VM) instance that acts as both the bastion and the benchmark client, and the GPU-equipped instance under test.
- Provision a VM.Standard.E5.Flex instance with eight OCPUs and Oracle Linux 8. The bastion/benchmark client does not require a GPU, since it only sends requests to the model hosted on a separate machine, hence the choice of a CPU shape. Make sure to select the public subnet in the network parameters, and do not forget to upload your SSH public key (or download the provided private key if you prefer).
- Provision a VM.GPU.A10.1 instance. To streamline the setup, provision the instance using an Oracle Linux 8 image that includes NVIDIA drivers and the CUDA framework: select the desired shape first, then navigate back to the image selection menu and choose the Oracle Linux 8 variant with built-in GPU support. This instance should be provisioned in the private subnet, meaning it can only be accessed through the bastion host. Make sure to set up the SSH key here too. If you prefer the command line, an example OCI CLI invocation for this step follows the list.
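The following OCI CLI invocation is an illustrative sketch of this step only; every <...> placeholder (availability domain, compartment, subnet, and image OCIDs) is an assumed value you must replace with one from your own tenancy, and the image must be the GPU-enabled Oracle Linux 8 image described above.

# Example only: replace every <...> placeholder with values from your tenancy.
oci compute instance launch \
  --availability-domain <availability-domain-name> \
  --compartment-id <compartment-ocid> \
  --shape VM.GPU.A10.1 \
  --subnet-id <private-subnet-ocid> \
  --image-id <gpu-enabled-oracle-linux-8-image-ocid> \
  --ssh-authorized-keys-file ~/.ssh/id_rsa.pub \
  --display-name llm-benchmark-target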
Task 3: Set up the Benchmark Client
In this task, we will install all the components required on the benchmark client to run our performance benchmarks.
- Log in to your bastion with your SSH key and set up all requirements for LLMPerf using the following commands.

sudo dnf install epel-release -y
sudo yum-config-manager --enable ol8_baseos_latest ol8_appstream ol8_addons ol8_developer_EPEL
sudo dnf install git python3.11 python3.11-devel python3.11-pip -y
- Clone the Ray LLMPerf repository and set up a Python venv using the following commands.

git clone https://github.com/ray-project/llmperf.git
cd llmperf
mkdir venv && python3.11 -mvenv venv && source venv/bin/activate
- Before installing LLMPerf (and its Python dependencies), edit the pyproject.toml file and remove the Python requirement clause, which unnecessarily limits the Python version to less than 3.11.

diff --git a/pyproject.toml b/pyproject.toml
index 7687fb2..521a2a7 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -8,7 +8,6 @@ version = "0.1.0"
 description = "A framework for load testing LLM APIs"
 authors = [{name="Avnish Narayan", email="avnish@anyscale.com"}]
 license = {text= "Apache-2.0"}
-requires-python = ">=3.8, <3.11"
 dependencies = ["pydantic<2.5",
                 "ray",
                 "pytest>=6.0",
- Run the pip install -e command to finalize the setup.
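In practice this means running the editable install from the root of the llmperf repository, with the venv still active; for example:

pip install -e .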
Task 4: Set up the Benchmark Target
In this task, we will set up the benchmark target itself. Since we have already provisioned the node with the necessary drivers and the Compute Unified Device Architecture (CUDA) framework, we simply have to install vLLM and its Python dependencies, and deploy the model that we wish to benchmark.
Note: The GPU compute instance you have provisioned resides in a private subnet. To reach it, you must first log in to the bastion host and set up the private SSH key that you chose for the benchmark target. Only then will you be able to log in to the benchmark target using its private IP address.
- Install the necessary packages, as in the initial setup on the bastion host. Additionally, update the host’s firewall to permit incoming traffic on port 8000 using the following commands.

sudo dnf install epel-release -y
sudo yum-config-manager --enable ol8_baseos_latest ol8_appstream ol8_addons ol8_developer_EPEL
sudo dnf install git python3.11 python3.11-devel python3.11-pip -y
sudo firewall-cmd --permanent --zone=public --add-port=8000/tcp
sudo firewall-cmd --reload
- Run the following command to install vLLM and its requirements.

mkdir venv
python3.11 -mvenv venv && source venv/bin/activate
pip install -U pip "bitsandbytes>=0.44.0" vllm gpustat mistral_common
- We are ready to start vLLM with the model we would like to test.

export HF_TOKEN=<your huggingface token>
export MODEL="meta-llama/Llama-3.2-3B-Instruct"
ulimit -Sn 65536 # increase limits to avoid running out of files that can be opened
vllm serve $MODEL --tokenizer-mode auto --config-format hf --load-format auto \
    --enforce-eager --max-model-len 65536
Note: To fit the model into the A10 GPU memory, we reduce the context length from the default size of 128k tokens to 64k tokens. After loading the model, vLLM should start printing its inference statistics at regular intervals.
INFO 12-09 15:46:36 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
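Optionally, you can confirm the remaining GPU memory headroom with the gpustat utility installed earlier; the watch-interval flag shown here is an assumption based on common gpustat usage, and nvidia-smi works just as well.

gpustat -i 5   # refresh GPU utilization and memory usage every 5 seconds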
- We let the server continue running in the background while we switch to another terminal, where we will now test and benchmark our setup. A simple test can be run with curl as follows.

export HOST="<the ip address of the benchmark machine>"
export MODEL="meta-llama/Llama-3.2-3B-Instruct"
curl --location "http://${HOST}:8000/v1/chat/completions" \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer token' \
    --data '{
        "model": "'"$MODEL"'",
        "messages": [
            {
                "role": "user",
                "content": "What is the question to the answer to the ultimate question of life, the universe, and everything. You may give a humorous response."
            }
        ]
    }' -s | jq
Output will be:
{ "id": "chatcmpl-f11306f943604d41bad84be1dadd9da6", "object": "chat.completion", "created": 1733997638, "model": "meta-llama/Llama-3.2-3B-Instruct", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "You want to know the ultimate question to the answer of 42?\n\nWell, after years of intense research and contemplation, I've discovered that the answer is actually a giant, cosmic joke. The question is: \"What's for lunch?\"", "tool_calls": [] }, "logprobs": null, "finish_reason": "stop", "stop_reason": null } ], "usage": { "prompt_tokens": 62, "total_tokens": 112, "completion_tokens": 50, "prompt_tokens_details": null }, "prompt_logprobs": null }
Task 5: Run the Benchmark
Now we are ready to run our benchmark. Given a particular application scenario with a chosen large language model, we would like to understand the performance characteristics of running concurrent inference requests on the target system.
Let us benchmark the following scenario:
PARAMETER | VALUE
---|---
MODEL | Meta Llama 3.2 3B Instruct
USE CASE | chat
INPUT TOKENS | N(200, 40)
OUTPUT TOKENS | N(100, 10)
CONCURRENT REQUESTS | 1 - 32
- Create the following script named llm_benchmark.sh on the benchmark client (bastion host) inside the home directory.

#!/usr/bin/env bash
set -xe

# Use the vLLM OpenAI-compatible endpoint
export OPENAI_API_BASE="http://<benchmark host>:8000/v1"
# The API key is not in use, but needs to be set for llmperf
export OPENAI_API_KEY="none"

model="meta-llama/Llama-3.2-3B-Instruct"
modelname="${model##*/}"
mkdir "$modelname"

concurrent_requests=(1 2 4 8 16 32)

pushd llmperf
source venv/bin/activate

for cr in "${concurrent_requests[@]}"
do
    python3 token_benchmark_ray.py --model $model \
        --mean-input-tokens 200 --stddev-input-tokens 40 \
        --mean-output-tokens 100 --stddev-output-tokens 10 \
        --max-num-completed-requests $((cr * 100)) --num-concurrent-requests $cr \
        --metadata "use_case=chatbot" \
        --timeout 1800 --results-dir "../$modelname/result_outputs_chat_creq_$cr" --llm-api openai
done
popd
This script automatically runs through a series of benchmarks with llmperf, using an increasing number of concurrent requests (starting at 1 and successively doubling up to 32). As you can see from the arguments passed to the token_benchmark_ray.py script, the token input and output distributions are set as defined in the table above.
- Run your benchmark script using the following command.

bash -x llm_benchmark.sh

Once done, you will find a new directory called Llama-3.2-3B-Instruct in your home directory, where all the experiment results are stored in JSON format. You can download them and postprocess them using your favorite data analysis tool.
Note: One easy way to turn your benchmarks into a plot is to extract the figures that interest you most using a small shell script and jq into .csv format, which can easily be copy-pasted into Excel.

echo "concurrent_requests,token_throughput"
for i in *; do
    cat $i/*_summary.json | jq -r '[.num_concurrent_requests, .results_request_output_throughput_token_per_s_mean] | join(",")'
done;
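For example, assuming you save the snippet above as extract_results.sh in your home directory (the file name is just an illustration), you could run it from inside the results directory and redirect its output into a CSV file:

cd ~/Llama-3.2-3B-Instruct
bash ~/extract_results.sh > results.csv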
Acknowledgments
- Author - Omar Awile (GPU Solution Specialist, Oracle EMEA GPU Specialist Team)
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
Set up a Simple LLM Inference Benchmarking System with vLLM on Oracle Cloud Infrastructure Compute
G27261-01
February 2025
Copyright ©2025, Oracle and/or its affiliates.