
Set up a Simple LLM Inference Benchmarking System with vLLM on Oracle Cloud Infrastructure Compute

Introduction

Understanding a system’s performance characteristics allows you to make informed choices about its components and what to expect of them. In this tutorial, we will set up a complete benchmarking system for AI large language model (LLM) inferencing. With it, you can run a range of experiments, from assessing whether a particular Oracle Cloud Infrastructure (OCI) Compute shape meets your LLM and performance requirements to comparing different LLMs against each other.

In this tutorial, we use the popular open source inference server vLLM, which supports many state-of-the-art LLMs, offers performance optimizations crucial for serving these models efficiently, and can handle thousands of concurrent requests. Furthermore, we use Ray LLMPerf to run the benchmarks.

Note: This tutorial assumes that you have an OCI tenancy with a GPU quota. For more information about quotas, see Compute Quotas.

Objectives

Prerequisites

Task 1: Configure your Network

In this task, configure your Virtual Cloud Network (VCN) to provide a functional and secure setup to run your benchmarks. In this example, we need a VCN with one public and one private subnet. You can use the VCN setup wizard or manually set up all the components, making sure security lists allow SSH access and a NAT gateway is assigned to the private subnet.

Overview of the solution architecture, comprising a public and a private subnet, a bastion host, and the GPU VM under test

Before provisioning OCI Compute instances and proceeding with the benchmark setup, it is important to ensure that your network settings are properly configured. By default, the provisioned network allows only essential traffic between subnets. To allow the network traffic required for benchmarking, you will need to adjust these settings to permit HTTP traffic between the public and private subnets. Specifically, add an ingress rule to the security list of the private subnet that allows traffic from the VCN CIDR block to destination port 8000.
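
If you prefer to script this change, the following is a minimal OCI CLI sketch. The security list OCID and the VCN CIDR block (10.0.0.0/16 here) are placeholders and assumptions for illustration. Note that this update replaces the entire ingress rule list, so include your existing rules in the JSON as well (or simply make the change in the OCI Console).

    # Add an ingress rule allowing TCP traffic from the VCN CIDR to destination port 8000
    # (include any pre-existing ingress rules in the JSON array, since this call replaces the list)
    oci network security-list update \
        --security-list-id <private subnet security list OCID> \
        --ingress-security-rules '[{"source": "10.0.0.0/16", "protocol": "6", "isStateless": false, "tcpOptions": {"destinationPortRange": {"min": 8000, "max": 8000}}}]'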

Task 2: Provision OCI Compute Instances

In this task, provision two OCI Compute instances for our benchmark setup. One is a CPU-only virtual machine (VM) instance that acts as the bastion and benchmark client machine; the other is the GPU-equipped instance under test. If you prefer scripted provisioning, an example OCI CLI invocation is shown after the steps below.

  1. Provision a VM.Standard.E5.Flex instance with eight OCPUs and Oracle Linux 8. The bastion/benchmark client does not require a GPU, since it only sends requests to the model hosted on a separate machine; hence the choice of a CPU shape. Make sure to select the public subnet in the network parameters, and do not forget to upload your SSH public key (or download the provided private key if you prefer).

  2. Provision a VM.GPU.A10.1 instance. To streamline the setup, provision the instance using an Oracle Linux 8 image that includes NVIDIA drivers and the CUDA framework. Begin by selecting the desired shape, then navigate back to the image selection menu and choose the Oracle Linux 8 variant with built-in GPU support. This instance should be provisioned in the private subnet, meaning it can only be accessed through the bastion host. Make sure to set up the SSH key too.

Screenshot of selecting the Oracle Linux 8 image with built-in GPU support
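
As an alternative to the Console, you can launch the bastion with the OCI CLI. The following is a minimal sketch; the availability domain, OCIDs, memory size, and key path are placeholders and assumptions you must adapt. The GPU node can be launched analogously with --shape VM.GPU.A10.1, the GPU-enabled Oracle Linux 8 image, and the private subnet (without a public IP).

    # Launch the CPU-only bastion/benchmark client in the public subnet
    oci compute instance launch \
        --display-name llm-benchmark-bastion \
        --availability-domain <availability domain> \
        --compartment-id <compartment OCID> \
        --shape VM.Standard.E5.Flex \
        --shape-config '{"ocpus": 8, "memoryInGBs": 96}' \
        --image-id <Oracle Linux 8 image OCID> \
        --subnet-id <public subnet OCID> \
        --assign-public-ip true \
        --ssh-authorized-keys-file ~/.ssh/id_rsa.pub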

Task 3: Set up the Benchmark Client

In this task, we will install all the required components for our performance benchmarks on the benchmark client (the bastion host).

  1. Log in to your bastion with your SSH key and set up all requirements for LLMPerf using the following command.

    sudo dnf install epel-release -y
    sudo yum-config-manager --enable ol8_baseos_latest ol8_appstream ol8_addons ol8_developer_EPEL
    sudo dnf install git python3.11 python3.11-devel python3.11-pip -y
    
  2. Clone the Ray LLMPerf repository and set up a Python virtual environment using the following command.

    git clone https://github.com/ray-project/llmperf.git
    cd llmperf
    python3.11 -m venv venv && source venv/bin/activate
    
  3. Before installing LLMPerf (and its Python dependencies), edit the pyproject.toml file and remove the Python requirement clause. The clause unnecessarily limits the Python version to be less than 3.11.

    diff --git a/pyproject.toml b/pyproject.toml
    index 7687fb2..521a2a7 100644
    --- a/pyproject.toml
    +++ b/pyproject.toml
    @@ -8,7 +8,6 @@ version = "0.1.0"
     description = "A framework for load testing LLM APIs"
     authors = [{name="Avnish Narayan", email="avnish@anyscale.com"}]
     license = {text= "Apache-2.0"}
    -requires-python = ">=3.8, <3.11"
     dependencies = ["pydantic<2.5",
                     "ray",
                     "pytest>=6.0",
    
  4. Run pip install -e . from within the llmperf directory to finalize the setup, as shown below.
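
    Assuming you are still in the llmperf directory with the virtual environment active, the editable install looks like this:

    # Install llmperf and its Python dependencies into the active venv
    pip install -e .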

Task 4: Set up the Benchmark Target

In this task, we will set up the benchmark target itself. As we have already provisioned the node with the necessary drivers and Compute Unified Device Architecture (CUDA) framework, we simply have to install vLLM and its Python dependencies and deploy the model that we wish to benchmark.

Note: The GPU compute instance you have provisioned resides in a private subnet. To reach it, you must first log in to the bastion host and set up the private SSH key that you chose for the benchmark target. Only then will you be able to log in to the benchmark target using its private IP address.
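
For example, one possible way to do this, assuming the default opc user on Oracle Linux and illustrative key file names, is as follows.

    # On your workstation: copy the target's private key to the bastion
    scp -i ~/.ssh/bastion_key ~/.ssh/target_key opc@<bastion public IP>:~/.ssh/

    # On the bastion: restrict the key permissions and connect to the GPU node
    ssh -i ~/.ssh/bastion_key opc@<bastion public IP>
    chmod 600 ~/.ssh/target_key
    ssh -i ~/.ssh/target_key opc@<GPU node private IP>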

  1. Install the necessary packages, as in the initial setup on the bastion host. Additionally, update the host’s firewall to permit incoming traffic on port 8000 using the following command.

    sudo dnf install epel-release -y
    sudo yum-config-manager --enable ol8_baseos_latest ol8_appstream ol8_addons ol8_developer_EPEL
    sudo dnf install git python3.11 python3.11-devel python3.11-pip -y
    sudo firewall-cmd --permanent --zone=public --add-port=8000/tcp
    sudo firewall-cmd --reload
    
  2. Run the following command to install vLLM and its requirements.

    python3.11 -m venv venv && source venv/bin/activate
    pip install -U pip "bitsandbytes>=0.44.0" vllm gpustat mistral_common
    
  3. We are ready to start vLLM with the model we would like to test.

    export HF_TOKEN=<your huggingface token>
    export MODEL="meta-llama/Llama-3.2-3B-Instruct"
    ulimit -Sn 65536 # increase limits to avoid running out of files that can be opened
    vllm serve $MODEL --tokenizer-mode auto --config-format hf --load-format auto \
                      --enforce-eager --max-model-len 65536
    

    Note: To fit the model into the A10 GPU’s memory, we reduce the context length from the default of 128k tokens to 64k tokens. After loading the model, vLLM should start outputting its inference statistics at regular intervals.

    INFO 12-09 15:46:36 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    
  4. We let the server continue running in the background while we switch to another terminal where we will now test and benchmark our setup. A simple test can be run with curl as follows.

    export HOST="<the private IP address of the benchmark target>"
    export MODEL="meta-llama/Llama-3.2-3B-Instruct"
    curl --location "http://${HOST}:8000/v1/chat/completions" \
        --header 'Content-Type: application/json' \
        --header 'Authorization: Bearer token' \
        --data '{
            "model": "'"$MODEL"'",
            "messages": [
              {
                "role": "user",
                "content": "What is the question to the answer to the ultimate question of life, the universe, and everything. You may give a humorous response."
              }
            ]
        }' -s | jq
    

    The output will be similar to the following:

    {
      "id": "chatcmpl-f11306f943604d41bad84be1dadd9da6",
      "object": "chat.completion",
      "created": 1733997638,
      "model": "meta-llama/Llama-3.2-3B-Instruct",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "You want to know the ultimate question to the answer of 42?\n\nWell, after years of intense research and contemplation, I've discovered that the answer is actually a giant, cosmic joke. The question is: \"What's for lunch?\"",
            "tool_calls": []
          },
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null
        }
      ],
      "usage": {
        "prompt_tokens": 62,
        "total_tokens": 112,
        "completion_tokens": 50,
        "prompt_tokens_details": null
      },
      "prompt_logprobs": null
    }
    

Task 5: Run the Benchmark

Now, we are ready to run our benchmark. Given a particular application scenario and a chosen large language model, we would like to understand the performance characteristics of running concurrent inference requests on the target system.

Let us benchmark the following scenario:

PARAMETER            VALUE
MODEL                Meta Llama 3.2 3B Instruct
USE-CASE             chat
INPUT TOKENS         N(200, 40)
OUTPUT TOKENS        N(100, 10)
CONCURRENT REQUESTS  1 - 32

  1. Create the following script named llm_benchmark.sh on the benchmark client (bastion host) inside the home directory.

    #!/usr/bin/env bash
    
    set -xe
    
    # Use vLLM OpenAPI endpoint
    export OPENAI_API_BASE="http://<benchmark host>:8000/v1"
    # API key is not in use, but needs to be set for llmperf
    export OPENAI_API_KEY="none"
    
    model="meta-llama/Llama-3.2-3B-Instruct"
    
    modelname="${model##*/}"
    mkdir "$modelname"
    
    concurrent_requests=(1 2 4 8 16 32)
    
    pushd llmperf
    
    source venv/bin/activate
    
    for cr in "${concurrent_requests[@]}"
    do
        python3 token_benchmark_ray.py --model $model \
            --mean-input-tokens 200 --stddev-input-tokens 40 \
            --mean-output-tokens 100 --stddev-output-tokens 10 \
            --max-num-completed-requests $((cr * 100)) --num-concurrent-requests $cr \
            --metadata "use_case=chatbot" \
            --timeout 1800 --results-dir "../$modelname/result_outputs_chat_creq_$cr" --llm-api openai
    done
    popd
    

    This script will automatically run through a series of benchmarks with llmperf, using an increasing number of concurrent requests (starting at 1 and successively doubling up to 32). As you can see from the arguments passed to the token_benchmark_ray.py script, we set the input and output token distributions as defined in the table above.

  2. Run your benchmark script using the following command.

    bash -x llm_benchmark.sh
    

Once done, you will find a new directory called Llama-3.2-3B-Instruct in your home directory, where all the experiment results are stored in JSON format. You can download these results and postprocess them using your favorite data analysis tool.

Note: One easy way to turn your benchmark results into a plot is to use a small shell script and jq to extract the figures that interest you most into .csv format, which can easily be copy-pasted into Excel.

echo "concurrent_requests,token_throughput"
for i in *; do
    cat $i/*_summary.json | jq -r '[.num_concurrent_requests, .results_request_output_throughput_token_per_s_mean] | join(",")'
done;
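
For example, if you save the snippet above as extract_results.sh (a name chosen here for illustration) in your home directory, you could produce the .csv like this:

cd ~/Llama-3.2-3B-Instruct
bash ~/extract_results.sh > ../chat_throughput.csv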

Acknowledgments

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.