Note:

This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.

Deploy OpenAI vLLM Production Stack on Oracle Kubernetes Engine (OKE)

Introduction

Organizations adopting large language models (LLMs) for production workloads face a critical infrastructure decision: Rely on third-party inference APIs or deploy a self-hosted inference stack. Self-hosted deployments offer significant advantages: full data privacy and compliance control, sub-100 millisecond inference latency by eliminating network round-trips, predictable cost at scale, and the freedom to fine-tune and serve any open-source model without vendor lock-in.

However, building a production-grade LLM inference stack from scratch is complex. It requires GPU-aware container orchestration, intelligent request routing across multiple model replicas, persistent storage for large model weights, and continuous monitoring — all integrated and running reliably.

Oracle Cloud Infrastructure offers multiple paths for AI inference. The OCI Generative AI Service provides a fully managed experience with dedicated AI clusters isolated to your tenancy, ideal for teams that want to get started quickly with supported models. This tutorial takes the alternative approach: deploying your own inference stack on OKE. This path is designed for teams that need precise control over GPU drivers, CUDA versions, model configurations, and serving parameters, or teams that are training and fine-tuning custom models and want to serve them directly. OCI provides bare metal GPU instances with NVIDIA A10, A100, and H100 GPUs, connected by ultra-low-latency RDMA cluster networking, giving you the same level of hardware control you would have on-premises while benefiting from cloud elasticity.

The vLLM Production Stack solves the complexity of self-hosted inference by providing an open-source, Kubernetes-native platform built on vLLM, the high-throughput inference engine used in production by organizations such as Meta, Mistral AI, and IBM. It delivers up to 24x higher throughput compared to standard serving frameworks through efficient GPU memory management and KV cache optimization. Combined with OKE and OCI GPU shapes, you get a production-ready inference platform with enterprise-grade networking, storage, and security. The OCI deployment scripts used in this tutorial are contributed and maintained in the official vLLM production-stack repository.

This tutorial walks you through deploying the vLLM Production Stack on OKE, from infrastructure provisioning to running your first inference request.

Note: This tutorial provisions resources step-by-step using the OCI CLI to help you understand the full flow of OCI cloud resources required for a GPU inference deployment. For production environments, it is recommended to codify this infrastructure using Terraform or OCI Resource Manager (Shepherd) for repeatable, version-controlled deployments.

Architecture diagram showing VCN with subnets, OKE cluster, GPU node pool, vLLM pods, and load balancer

The following OCI services are used in this tutorial:

Service	Purpose
Oracle Kubernetes Engine (OKE)	Managed Kubernetes cluster for container orchestration and GPU workload scheduling
OCI Compute (GPU Shapes)	NVIDIA A10 (24GB) and A100 (80GB) GPU instances for model inference
OCI Block Volumes	Persistent storage for model weights with configurable performance tiers
OCI Virtual Cloud Network (VCN)	Network infrastructure including subnets, gateways, and security lists
OCI Load Balancer	External access to inference endpoints
OCI Bastion	Managed SSH tunnels for private cluster access
OCI Object Storage	Alternative model source using Pre-Authenticated Request (PAR) URLs

Objectives

In this tutorial, you will:

Deploy an OKE cluster with GPU-enabled node pools using OCI CLI
Configure OCI networking (VCN, subnets, gateways) for Kubernetes workloads
Install and configure the NVIDIA device plugin for GPU scheduling
Expand the GPU node filesystem to use the full boot volume capacity
Deploy the vLLM Production Stack with an OpenAI-compatible inference endpoint
Test LLM inference with API requests against the OpenAI GPT-OSS-20B model
Configure multi-GPU tensor parallelism for larger models on bare metal shapes
Use OCI Object Storage as an alternative model source
Clean up all OCI resources to avoid ongoing charges

Prerequisites

An Oracle Cloud Infrastructure account with the following:
- A compartment with permissions to create and manage OKE clusters
- GPU compute quota for your desired shape (for example, VM.GPU.A10.1 or BM.GPU.A100-v2.8). If you do not have GPU quota, request it via a support ticket
- Permissions to create VCN, subnets, internet gateways, and load balancers
- Access to OCI Block Volumes for persistent storage
- Access to OCI Object Storage (for the advanced model loading section)

Note: The example outputs and screenshots in this tutorial use us-chicago-1. You can deploy in any supported region by setting OCI_REGION. GPU capacity varies by region and availability domain, so confirm that your target GPU shape is available before deploying. Check GPU shape availability by region and be prepared to try a different availability domain (GPU_AD_INDEX) if you hit capacity errors.

Note: This tutorial provisions paid GPU resources (for example, VM.GPU.A10.1). It is not an OCI Always Free workload. Always run the cleanup steps when finished to avoid ongoing charges.

OCI CLI installed and configured with oci setup config
jq installed for JSON parsing
kubectl installed
Helm installed
An SSH key pair (for example, ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub) for bastion access. Generate one with ssh-keygen -t rsa -b 4096 if needed
Familiarity with Kubernetes concepts (pods, services, deployments, node pools)

Note: This tutorial deploys openai/gpt-oss-20b, an Apache 2.0 licensed model from OpenAI. No Hugging Face token is required. If you want to deploy gated models such as Meta Llama 3.1, you will need a Hugging Face account with an API token.

Task 1: Configure Environment Variables

Set the required OCI configuration before deploying the infrastructure.

Find your compartment OCID in the OCI Console. Navigate to Identity & Security > Compartments, then click on your target compartment and copy the OCID.
```
oci iam compartment list --query 'data[].{name:name,id:id}' --output table
```

Export the required environment variable.

export OCI_COMPARTMENT_ID="ocid1.compartment.oc1..xxxxx"

Optionally, override the default configuration by setting any of the following environment variables.

Variable	Default	Description
`OCI_REGION`	`us-ashburn-1`	OCI region for deployment
`OCI_PROFILE`	`DEFAULT`	OCI CLI configuration profile
`CLUSTER_NAME`	`production-stack`	Name of the OKE cluster
`GPU_SHAPE`	`VM.GPU.A10.1`	GPU compute shape for the node pool
`GPU_NODE_COUNT`	`1`	Number of GPU nodes in the pool
`GPU_BOOT_VOLUME_GB`	`200`	Boot volume size in GB for GPU nodes
`CPU_BOOT_VOLUME_GB`	`100`	Boot volume size in GB for CPU nodes
`GPU_AD_INDEX`	`1`	Availability domain index (0-based) for GPU placement
`PRIVATE_CLUSTER`	`true`	Set to `false` for a public Kubernetes API endpoint
`KUBERNETES_VERSION`	`v1.31.10`	Kubernetes version for the OKE cluster

For example, to deploy with two A100 GPU nodes:

export OCI_COMPARTMENT_ID="ocid1.compartment.oc1..xxxxx"
export GPU_SHAPE="BM.GPU.A100-v2.8"
export GPU_NODE_COUNT="2"

Review the available GPU shapes and select one based on your model size requirements.

Shape	GPUs	GPU Type	GPU Memory	Recommended For
`VM.GPU.A10.1`	1	NVIDIA A10	24 GB	7B–13B parameter models
`VM.GPU.A10.2`	2	NVIDIA A10	48 GB	Tensor parallel with small models
`BM.GPU4.8`	8	NVIDIA A100 40 GB	320 GB	70B models, cost-effective
`BM.GPU.A100-v2.8`	8	NVIDIA A100 80 GB	640 GB	70B+ parameter models
`BM.GPU.H100.8`	8	NVIDIA H100	640 GB	Largest models, RDMA support

Note: Bare metal shapes (BM.*) provide dedicated hardware with no virtualization overhead and support multi-GPU tensor parallelism. Virtual machine shapes (VM.*) are more cost-effective for smaller models.

Note: This tutorial uses VM.GPU.A10.1 (single NVIDIA A10 with 24 GB GPU memory) to deploy openai/gpt-oss-20b, a Mixture of Experts (MoE) model with 3.6B active parameters that typically fits on a single A10 GPU. The advanced sections demonstrate multi-GPU configurations using BM.GPU.H100.8 for larger models such as Llama 3.1 70B.

Task 2: Deploy Using the Automated Script (Quick Start)

The vLLM Production Stack includes an automated deployment script that provisions all OCI resources and deploys the inference stack with a single command. Use this approach for a quick deployment. Tasks 3 through 10 cover each step individually for users who want to customize the process.

Clone the vLLM Production Stack repository.

git clone https://github.com/vllm-project/production-stack.git
cd production-stack/deployment_on_cloud/oci

Export your compartment OCID.

export OCI_COMPARTMENT_ID="ocid1.compartment.oc1..xxxxx"

Run the deployment script.
```
./entry_point.sh setup
```
For public clusters (PRIVATE_CLUSTER=false), setup creates all infrastructure and deploys the vLLM stack in a single command. Pass the Helm values file as a second argument:
```
PRIVATE_CLUSTER=false ./entry_point.sh setup ./production_stack_specification.yaml
```
For private clusters (the default), setup creates the infrastructure but cannot reach the Kubernetes API directly. Open a separate terminal and start the tunnel, then deploy:
```
# In a separate terminal, start the SSH tunnel (auto-reconnects on drops):
./entry_point.sh tunnel

# Back in the first terminal, deploy vLLM:
./entry_point.sh deploy-vllm ./production_stack_specification.yaml
```

Verify the deployment is running.

kubectl get pods

Expected output:

NAME                                            READY   STATUS    RESTARTS   AGE
vllm-deployment-router-xxxxxxxxxx-xxxxx          1/1     Running   0          5m
vllm-gpt-oss-deployment-vllm-xxxxxxxxxx-xxxxx    1/1     Running   0          5m

Note: If both pods show Running status, your deployment is ready. Skip ahead to Task 10: Test the Inference Endpoint.

Note: GPU instances are subject to OCI capacity constraints. If the script stays in the “Waiting for GPU node” loop for more than 15 minutes, the GPU shape may not be available in the selected availability domain. Check the node pool status with oci ce node-pool get and look for “Out of host capacity” errors. To resolve this, clean up with ./entry_point.sh cleanup and redeploy with a different availability domain (for example, GPU_AD_INDEX=0 or GPU_AD_INDEX=2) or a different GPU shape (for example, GPU_SHAPE=VM.GPU.A10.2).

Note: The deployment script uses GPU instances that incur significant costs (~$50/day for a single A10 GPU). Always run ./entry_point.sh cleanup when you are done to avoid ongoing charges.

Task 3: Create VCN and Networking

Create the OCI network infrastructure required for the OKE cluster. This includes a Virtual Cloud Network (VCN), gateways, route tables, security lists, and subnets. Each networking resource is created in a few seconds; the full set of commands completes in under 2 minutes.

Create a VCN with a 10.0.0.0/16 CIDR block.

VCN_ID=$(oci network vcn create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --display-name "${CLUSTER_NAME}-vcn" \
    --cidr-blocks '["10.0.0.0/16"]' \
    --dns-label "prodstack" \
    --query "data.id" \
    --raw-output)

Create an Internet Gateway for public subnet routing.

IGW_ID=$(oci network internet-gateway create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-igw" \
    --is-enabled true \
    --query "data.id" \
    --raw-output)

Create a NAT Gateway for outbound traffic from private subnets.

NAT_ID=$(oci network nat-gateway create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-nat" \
    --query "data.id" \
    --raw-output)

Create a Service Gateway for access to Oracle Services Network. The OKE cloud controller uses Oracle Services to initialize worker nodes (set availability domain labels, remove initialization taints). Without a Service Gateway, GPU nodes may remain in an uninitialized state and block volume provisioning will fail.

SGW_SERVICE_ID=$(oci network service list \
    --query "data[?contains(name, 'All') && contains(name, 'Services')].id | [0]" \
    --raw-output)

SGW_SERVICE_NAME=$(oci network service list \
    --query "data[?contains(name, 'All') && contains(name, 'Services')].\"cidr-block\" | [0]" \
    --raw-output)

SGW_ID=$(oci network service-gateway create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-sgw" \
    --services "[{\"serviceId\": \"${SGW_SERVICE_ID}\"}]" \
    --query "data.id" \
    --raw-output)

Create route tables for private and public subnets.

PRIVATE_RT_ID=$(oci network route-table create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-private-rt" \
    --route-rules "[
        {\"cidrBlock\": \"0.0.0.0/0\", \"networkEntityId\": \"${NAT_ID}\"},
        {\"destination\": \"${SGW_SERVICE_NAME}\", \"destinationType\": \"SERVICE_CIDR_BLOCK\", \"networkEntityId\": \"${SGW_ID}\"}
    ]" \
    --query "data.id" \
    --raw-output)

PUBLIC_RT_ID=$(oci network route-table create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-public-rt" \
    --route-rules "[{\"cidrBlock\": \"0.0.0.0/0\", \"networkEntityId\": \"${IGW_ID}\"}]" \
    --query "data.id" \
    --raw-output)

Note: The private route table has two rules: a NAT Gateway route for general internet access (pulling container images, downloading models), and a Service Gateway route for direct access to Oracle Services Network. The Service Gateway route is critical. Without it, the OKE cloud controller cannot initialize worker nodes, which prevents block volume provisioning. The public route table uses the Internet Gateway for load balancer access.

Create a security list with the required ingress and egress rules for OKE.

SL_ID=$(oci network security-list create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-sl" \
    --egress-security-rules '[{"destination": "0.0.0.0/0", "protocol": "all", "isStateless": false}]' \
    --ingress-security-rules '[
        {"source": "0.0.0.0/0", "protocol": "6", "isStateless": false, "tcpOptions": {"destinationPortRange": {"min": 22, "max": 22}}, "description": "SSH access"},
        {"source": "10.0.0.0/16", "protocol": "all", "isStateless": false, "description": "VCN internal traffic"},
        {"source": "10.244.0.0/16", "protocol": "all", "isStateless": false, "description": "Kubernetes pods CIDR"},
        {"source": "10.96.0.0/16", "protocol": "all", "isStateless": false, "description": "Kubernetes services CIDR"},
        {"source": "0.0.0.0/0", "protocol": "1", "isStateless": false, "icmpOptions": {"type": 3, "code": 4}, "description": "Path MTU discovery"}
    ]' \
    --query "data.id" \
    --raw-output)

Security Note: This example security list is intentionally broad for simplicity. For production, restrict SSH to the bastion subnet and your IP range, and prefer separate security lists or NSGs per subnet so the load balancer subnet does not allow SSH from 0.0.0.0/0.

Secure default: Start by limiting SSH to your public IP and attaching SSH rules only to the bastion subnet. You can keep the Kubernetes pod/service CIDRs on the worker subnet and omit SSH entirely from the load balancer subnet.

Optional (recommended) split: Create a small SSH-only security list for the bastion subnet and a separate list for worker/LB subnets.

BASTION_SL_ID=$(oci network security-list create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-bastion-sl" \
    --egress-security-rules '[{"destination": "0.0.0.0/0", "protocol": "all", "isStateless": false}]' \
    --ingress-security-rules '[
        {"source": "YOUR_PUBLIC_IP/32", "protocol": "6", "isStateless": false, "tcpOptions": {"destinationPortRange": {"min": 22, "max": 22}}, "description": "SSH from your IP"}
    ]' \
    --query "data.id" \
    --raw-output)

WORKER_SL_ID=$(oci network security-list create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-worker-sl" \
    --egress-security-rules '[{"destination": "0.0.0.0/0", "protocol": "all", "isStateless": false}]' \
    --ingress-security-rules '[
        {"source": "10.0.0.0/16", "protocol": "all", "isStateless": false, "description": "VCN internal traffic"},
        {"source": "10.244.0.0/16", "protocol": "all", "isStateless": false, "description": "Kubernetes pods CIDR"},
        {"source": "10.96.0.0/16", "protocol": "all", "isStateless": false, "description": "Kubernetes services CIDR"},
        {"source": "0.0.0.0/0", "protocol": "1", "isStateless": false, "icmpOptions": {"type": 3, "code": 4}, "description": "Path MTU discovery"}
    ]' \
    --query "data.id" \
    --raw-output)

LB_SL_ID=$(oci network security-list create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-lb-sl" \
    --egress-security-rules '[{"destination": "0.0.0.0/0", "protocol": "all", "isStateless": false}]' \
    --ingress-security-rules '[
        {"source": "10.0.0.0/16", "protocol": "all", "isStateless": false, "description": "VCN internal traffic"},
        {"source": "0.0.0.0/0", "protocol": "6", "isStateless": false, "tcpOptions": {"destinationPortRange": {"min": 80, "max": 80}}, "description": "HTTP (public LB)"},
        {"source": "0.0.0.0/0", "protocol": "6", "isStateless": false, "tcpOptions": {"destinationPortRange": {"min": 443, "max": 443}}, "description": "HTTPS (public LB)"}
    ]' \
    --query "data.id" \
    --raw-output)

Note: If you only use an internal load balancer, replace the 0.0.0.0/0 sources above with 10.0.0.0/16 (or your VCN CIDR). Usage: Attach BASTION_SL_ID to the bastion subnet, WORKER_SL_ID to the API/worker subnets, and LB_SL_ID to the load balancer subnet.

Note: The Kubernetes pods CIDR (10.244.0.0/16) and services CIDR (10.96.0.0/16) rules are required for GPU worker nodes to register with the cluster. The ICMP type 3 code 4 rule enables path MTU discovery, which prevents packet fragmentation issues.

Create the subnets. The cluster requires four subnets: one for the Kubernetes API endpoint, one for worker nodes, one for load balancers, and one for the bastion host used to access the private cluster.

API_SUBNET_ID=$(oci network subnet create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-api-subnet" \
    --cidr-block "10.0.0.0/28" \
    --route-table-id "${PRIVATE_RT_ID}" \
    --security-list-ids "[\"${SL_ID}\"]" \
    --dns-label "kubeapi" \
    --prohibit-public-ip-on-vnic true \
    --query "data.id" \
    --raw-output)

WORKER_SUBNET_ID=$(oci network subnet create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-worker-subnet" \
    --cidr-block "10.0.10.0/24" \
    --route-table-id "${PRIVATE_RT_ID}" \
    --security-list-ids "[\"${SL_ID}\"]" \
    --dns-label "workers" \
    --prohibit-public-ip-on-vnic true \
    --query "data.id" \
    --raw-output)

LB_SUBNET_ID=$(oci network subnet create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-lb-subnet" \
    --cidr-block "10.0.20.0/24" \
    --route-table-id "${PUBLIC_RT_ID}" \
    --security-list-ids "[\"${SL_ID}\"]" \
    --dns-label "loadbalancers" \
    --query "data.id" \
    --raw-output)

BASTION_SUBNET_ID=$(oci network subnet create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --vcn-id "${VCN_ID}" \
    --display-name "${CLUSTER_NAME}-bastion-subnet" \
    --cidr-block "10.0.30.0/24" \
    --route-table-id "${PUBLIC_RT_ID}" \
    --security-list-ids "[\"${SL_ID}\"]" \
    --dns-label "bastion" \
    --query "data.id" \
    --raw-output)

Subnet	CIDR	Visibility	Purpose
API endpoint	`10.0.0.0/28`	Private	Kubernetes API server
Worker nodes	`10.0.10.0/24`	Private	GPU compute nodes
Load balancers	`10.0.20.0/24`	Public	External service access
Bastion	`10.0.30.0/24`	Public	SSH tunnel for private cluster access

Task 4: Create the OKE Cluster

Deploy a managed Kubernetes cluster on OKE using the networking resources created in Task 3. The cluster takes approximately 10 minutes to provision. This tutorial creates a private cluster (the script default), which does not consume a reserved public IP for the Kubernetes API endpoint. Private clusters are the recommended approach for production workloads because the API server is not exposed to the public internet.

Create the OKE cluster with a private endpoint.

oci ce cluster create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --name "${CLUSTER_NAME}" \
    --vcn-id "${VCN_ID}" \
    --kubernetes-version "${KUBERNETES_VERSION}" \
    --endpoint-subnet-id "${API_SUBNET_ID}" \
    --service-lb-subnet-ids "[\"${LB_SUBNET_ID}\"]" \
    --endpoint-public-ip-enabled false

The command returns a work request ID. Get the cluster ID from the cluster list.

CLUSTER_ID=$(oci ce cluster list \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --lifecycle-state CREATING \
    --query 'data[0].id' \
    --raw-output)
echo "CLUSTER_ID=${CLUSTER_ID}"

Note: Private clusters do not require a reserved public IP. Worker nodes still access the internet through the NAT Gateway to pull container images and download models. Only kubectl access requires an SSH tunnel through the bastion (configured in the next steps).

Wait for the cluster to become ACTIVE. This step takes approximately 10 minutes.

oci ce cluster get \
    --cluster-id "${CLUSTER_ID}" \
    --query "data.\"lifecycle-state\"" \
    --raw-output

Poll the command until the output returns ACTIVE.

Optional: display a concise status summary (including the private API endpoint).

oci ce cluster get \
    --cluster-id "${CLUSTER_ID}" \
    --query 'data.{name:name,state:"lifecycle-state","k8s-version":"kubernetes-version",endpoint:endpoints."private-endpoint"}' \
    --output table

OCI CLI output showing OKE cluster in ACTIVE state

kubectl get nodes -o wide

kubectl get nodes output showing two Ready nodes

Create an OCI Bastion to access the private cluster. The bastion provides a managed SSH tunnel to the private Kubernetes API endpoint.
```
BASTION_ID=$(oci bastion bastion create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --bastion-type STANDARD \
    --target-subnet-id "${BASTION_SUBNET_ID}" \
    --name "${CLUSTER_NAME}-bastion" \
    --client-cidr-list '["YOUR_PUBLIC_IP/32"]' \
    --query "data.id" \
    --raw-output)
```
Note: Replace YOUR_PUBLIC_IP/32 with your current public IP. For shared networks, use your corporate CIDR block instead.

Wait for the bastion to become ACTIVE (approximately 1 minute).
```
oci bastion bastion get \
    --bastion-id "${BASTION_ID}" \
    --query "data.\"lifecycle-state\"" \
    --raw-output
```
Security Note: For production, do not use 0.0.0.0/0. Restrict --client-cidr-list to your public IP or corporate CIDR (for example, "YOUR_PUBLIC_IP/32"), otherwise anyone on the internet can attempt a bastion session.

Download the kubeconfig using the private endpoint.

oci ce cluster create-kubeconfig \
    --cluster-id "${CLUSTER_ID}" \
    --file "${HOME}/.kube/config" \
    --region "${OCI_REGION}" \
    --token-version 2.0.0 \
    --kube-endpoint PRIVATE_ENDPOINT

Get the private endpoint IP address for the SSH tunnel.

PRIVATE_ENDPOINT=$(oci ce cluster get \
    --cluster-id "${CLUSTER_ID}" \
    --query "data.endpoints.\"private-endpoint\"" \
    --raw-output)
PRIVATE_IP="${PRIVATE_ENDPOINT%:*}"
echo "Private endpoint IP: ${PRIVATE_IP}"

Create a bastion port-forwarding session. You will need an SSH public key file.

SESSION_ID=$(oci bastion session create-port-forwarding \
    --bastion-id "${BASTION_ID}" \
    --target-private-ip "${PRIVATE_IP}" \
    --target-port 6443 \
    --session-ttl 10800 \
    --display-name "kubectl-tunnel" \
    --ssh-public-key-file ~/.ssh/id_rsa.pub \
    --query "data.id" \
    --raw-output)

Wait for the session to become ACTIVE, then get the SSH command.

oci bastion session get \
    --session-id "${SESSION_ID}" \
    --query "data.{state:\"lifecycle-state\", ssh:\"ssh-metadata\".command}" 2>&1

Open a separate terminal and start the SSH tunnel using the command from the previous step. The tunnel forwards local port 6443 to the private Kubernetes API.
```
ssh -i ~/.ssh/id_rsa -o IdentitiesOnly=yes -N -L 6443:<PRIVATE_IP>:6443 \
    -p 22 -o ServerAliveInterval=30 \
    <SESSION_OCID>@host.bastion.<REGION>.oci.oraclecloud.com
```
Note: Replace <PRIVATE_IP>, <SESSION_OCID>, and <REGION> with the values from the previous step. Keep this terminal open for the duration of your session. The -o IdentitiesOnly=yes flag prevents “too many authentication failures” errors when your SSH agent has multiple keys loaded.
Update the kubeconfig to connect through the local tunnel.
```
CLUSTER_NAME_KUBE=$(kubectl config view --minify -o jsonpath='{.clusters[0].name}')
kubectl config set-cluster "${CLUSTER_NAME_KUBE}" \
    --server=https://127.0.0.1:6443 \
    --insecure-skip-tls-verify=true
```
Note: The --insecure-skip-tls-verify flag is required because the cluster certificate was issued for the private endpoint IP, not 127.0.0.1. This is safe because traffic is encrypted through the SSH tunnel.
If you are using a non-default OCI CLI profile (for example, API_KEY_AUTH), update the kubeconfig to use it. The generated kubeconfig defaults to the DEFAULT profile for token generation.
```
kubectl config set-credentials \
    $(kubectl config view --minify -o jsonpath='{.users[0].name}') \
    --exec-env=OCI_CLI_PROFILE=${OCI_PROFILE}
```
Tip: Steps 6-9 are automated by ./entry_point.sh tunnel, which also auto-reconnects if the SSH tunnel drops during long-running operations like disk expansion. Run it in a separate terminal and leave it running for the duration of your session.
Verify cluster access.

kubectl get nodes

At this point the output will show no nodes, since the GPU node pool has not been added yet.

No resources found

Task 5: Add GPU Node Pool

Add a node pool with GPU compute instances to the OKE cluster.

Find the latest GPU-compatible OKE node image. OKE requires specific images with kubelet and node registration components pre-installed. Use the node-pool-options API to find the correct image for your Kubernetes version.
```
GPU_IMAGE_ID=$(oci ce node-pool-options get \
    --node-pool-option-id all \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --query "data.sources[?contains(\"source-name\", 'GPU') && contains(\"source-name\", 'OKE-${KUBERNETES_VERSION#v}') && contains(\"source-name\", '8.10')].\"image-id\" | [0]" \
    --raw-output)
echo "GPU Image: ${GPU_IMAGE_ID}"
```
Note: The query filters for Oracle Linux 8.10 GPU images matching your Kubernetes version (for example, OKE-1.31.10). If you need ARM-based images, replace 8.10 with the appropriate filter.
Determine the availability domain that has GPU shapes. Not all availability domains have GPU capacity.
```
AD=$(oci iam availability-domain list \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --query "data[${GPU_AD_INDEX}].name" \
    --raw-output)
echo "Availability Domain: ${AD}"
```
Note: GPU capacity varies by region and availability domain. If node pool creation fails with an “Out of host capacity” error, try a different availability domain (GPU_AD_INDEX) or GPU shape, or request capacity via your normal OCI process.

Create the GPU node pool with a 200 GB boot volume.

oci ce node-pool create \
    --compartment-id "${OCI_COMPARTMENT_ID}" \
    --cluster-id "${CLUSTER_ID}" \
    --name "${GPU_NODE_POOL_NAME:-gpu-pool}" \
    --kubernetes-version "${KUBERNETES_VERSION}" \
    --node-shape "${GPU_SHAPE}" \
    --node-image-id "${GPU_IMAGE_ID}" \
    --node-boot-volume-size-in-gbs "${GPU_BOOT_VOLUME_GB:-200}" \
    --size "${GPU_NODE_COUNT}" \
    --placement-configs "[{\"availabilityDomain\": \"${AD}\", \"subnetId\": \"${WORKER_SUBNET_ID}\"}]" \
    --initial-node-labels '[{"key": "app", "value": "gpu"}, {"key": "nvidia.com/gpu", "value": "true"}]'

Note: The node labels app=gpu and nvidia.com/gpu=true are used later by the vLLM Helm chart to schedule inference pods on GPU nodes. The 200 GB boot volume provides space for the vLLM container image (~10 GB) and model weights, but the filesystem must be expanded before use (see Task 8).

Wait for the GPU nodes to become Ready. This typically takes 5–10 minutes while the node provisions, boots, installs GPU drivers, and registers with the cluster.

Note: GPU instances are subject to capacity constraints. If the node pool stays in CREATING state, check the node status in the OCI Console or with oci ce node-pool get. An “Out of host capacity” error means no GPU instances are available in that availability domain. To resolve this, try a different availability domain (GPU_AD_INDEX=0 or GPU_AD_INDEX=2), try a different GPU shape, or request a capacity reservation through the OCI Console or a support ticket.
```
kubectl get nodes -w
```
Expected output once the node is ready:
```
NAME          STATUS   ROLES   AGE   VERSION
10.0.10.x     Ready    node    5m    v1.31.10
```

Verify that the GPU is detected on the node.

kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'

Expected output:

NAME          GPUs
10.0.10.x     1

Patch CoreDNS to schedule on GPU nodes. OKE GPU nodes have a nvidia.com/gpu=present:NoSchedule taint. In clusters that only have GPU nodes, system pods like CoreDNS cannot schedule without a toleration for this taint. Without DNS, pods cannot resolve external hostnames to download models.

kubectl patch deployment coredns -n kube-system --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/tolerations/-", "value": {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}}]'

kubectl patch deployment kube-dns-autoscaler -n kube-system --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/tolerations/-", "value": {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}}]'

Verify CoreDNS is running.

kubectl get pods -n kube-system | grep coredns

Note: If your cluster has a dedicated CPU node pool for system workloads, this step is not necessary. This patch is only needed when GPU nodes are the only nodes in the cluster.

oci ce node-pool list --compartment-id "${OCI_COMPARTMENT_ID}" --cluster-id "${CLUSTER_ID}" \
    --query 'data[].{name:name,state:"lifecycle-state",shape:"node-shape",size:"node-config-details".size}' --output table

OCI CLI output showing cpu-pool and gpu-pool both in ACTIVE state

Task 6: Install NVIDIA Device Plugin

Install the NVIDIA device plugin so Kubernetes can detect and schedule workloads on the GPU.

Apply the NVIDIA device plugin DaemonSet.

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

Wait for the plugin pods to be ready.
```
kubectl wait --for=condition=Ready pods -l name=nvidia-device-plugin-ds -n kube-system --timeout=300s
```
Note: Some OKE GPU node images include a pre-installed NVIDIA device plugin (nvidia-gpu-device-plugin). If the image already includes it, applying the upstream DaemonSet creates a second instance which does not cause conflicts. The automated script (entry_point.sh deploy-vllm) always installs it to ensure GPU detection works regardless of the node image version.

Confirm the GPU is allocatable by Kubernetes.

kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.allocatable.'nvidia\.com/gpu'

Expected output:

NAME          GPUs
10.0.10.x     1

Patch CoreDNS to tolerate GPU node taints. In clusters where GPU nodes are the only worker nodes, CoreDNS pods cannot schedule because OKE GPU nodes carry a nvidia.com/gpu=present:NoSchedule taint. Without DNS, pods cannot resolve image registries or model download URLs.
```
kubectl patch deployment coredns -n kube-system --type='json' -p='[
  {"op": "add", "path": "/spec/template/spec/tolerations/-", "value": {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}}
]'
kubectl rollout status deployment/coredns -n kube-system --timeout=60s
```
Note: This step is only needed when GPU nodes are the sole worker nodes in the cluster. If you have a dedicated CPU node pool for system workloads, CoreDNS schedules there by default and this patch is unnecessary.

Task 7: Configure Storage

Apply the OCI Block Volume StorageClass to provide persistent storage for model weights.

Apply the StorageClass definition.

kubectl apply -f oci-block-storage-sc.yaml

The file defines two performance tiers:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: oci-block-storage-enc
provisioner: blockvolume.csi.oraclecloud.com
parameters:
  vpusPerGB: "10"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

StorageClass	Performance	Use Case
`oci-block-storage-enc`	Balanced (`vpusPerGB: 10`)	Default, cost-effective for most models
`oci-block-storage-hp`	High performance (`vpusPerGB: 20`)	Faster model loading for larger models

Verify the StorageClasses are available.
```
kubectl get storageclass
```

Note: For multi-node deployments requiring shared storage across multiple pods, use OCI File Storage Service (NFS) with ReadWriteMany access mode instead of Block Volumes.

Task 8: Expand GPU Node Filesystem

OCI boot volumes have a fixed ~47 GB partition regardless of the boot volume size you specify. The vLLM container image alone is approximately 10 GB, and model weights require additional space. You must expand the filesystem before deploying vLLM to avoid DiskPressure evictions.

Note: This is an OCI-specific requirement. The boot volume is provisioned at 200 GB, but the operating system only partitions ~47 GB by default. The remaining space must be claimed manually.

Verify the current filesystem size on the GPU node.

GPU_NODE=$(kubectl get nodes -l app=gpu -o jsonpath='{.items[0].metadata.name}')
kubectl run check-disk --rm -i --restart=Never \
  --image=busybox:latest \
  --overrides="{\"spec\":{\"nodeName\":\"${GPU_NODE}\",\"tolerations\":[{\"operator\":\"Exists\"}],\"containers\":[{\"name\":\"check\",\"image\":\"busybox:latest\",\"command\":[\"sh\",\"-c\",\"chroot /host df -h / | tail -1\"],\"securityContext\":{\"privileged\":true},\"volumeMounts\":[{\"name\":\"host\",\"mountPath\":\"/host\"}]}],\"volumes\":[{\"name\":\"host\",\"hostPath\":{\"path\":\"/\"}}]}}"

The output will show approximately 47 GB total, confirming expansion is needed.

Create a privileged pod on the GPU node to run the expansion commands.

kubectl run expand-disk --restart=Never \
  --image=busybox:latest \
  --overrides="{\"spec\":{\"nodeName\":\"${GPU_NODE}\",\"tolerations\":[{\"operator\":\"Exists\"}],\"containers\":[{\"name\":\"expand\",\"image\":\"busybox:latest\",\"command\":[\"sleep\",\"600\"],\"securityContext\":{\"privileged\":true},\"volumeMounts\":[{\"name\":\"host\",\"mountPath\":\"/host\"}]}],\"volumes\":[{\"name\":\"host\",\"hostPath\":{\"path\":\"/\"}}]}}"

Wait for the pod to start.

kubectl wait --for=condition=Ready pod/expand-disk --timeout=60s

Run all four expansion steps in a single kubectl exec command. Running them together avoids the risk of kubectl exec returning exit code 137 (SIGKILL) between steps, which can happen during heavy disk I/O on the host.

kubectl exec expand-disk -- chroot /host bash -c '
  set -x
  growpart /dev/sda 3 || echo "growpart: partition may already be expanded"
  sleep 3
  pvresize /dev/sda3
  lvextend -l +100%FREE /dev/ocivolume/root || echo "lvextend: may already be extended"
  xfs_growfs /
  echo "EXPANSION_COMPLETE"
  df -h /
'

Step	Command	Purpose
1	`growpart /dev/sda 3`	Expand partition 3 to use full disk
2	`pvresize /dev/sda3`	Resize LVM physical volume
3	`lvextend -l +100%FREE /dev/ocivolume/root`	Extend logical volume
4	`xfs_growfs /`	Grow XFS filesystem to fill the volume

Note: All four operations are idempotent. If the exec returns exit code 137, you can safely re-run the entire block. Look for EXPANSION_COMPLETE in the output to confirm success.

Restart kubelet so the node reports the updated allocatable storage, then verify and clean up.
```
kubectl exec expand-disk -- nsenter -t 1 -m -p -- systemctl restart kubelet 2>/dev/null \
    || echo "Warning: kubelet restart returned non-zero (non-critical)"
kubectl exec expand-disk -- chroot /host df -h /
kubectl delete pod expand-disk --force
```
Note: The nsenter command enters the host’s PID namespace to access systemd. A plain chroot /host systemctl restart kubelet fails because it cannot connect to the systemd bus from within a chroot.

Expected output should show approximately 189 GB total.

Task 9: Deploy the vLLM Production Stack

Install the vLLM inference stack using Helm.

Add the vLLM Helm repository.

helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

Review the Helm values file. The production_stack_specification.yaml configures the model, resources, and storage for OCI.

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "gpt-oss"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "openai/gpt-oss-20b"

    replicaCount: 1
    requestCPU: 4
    requestMemory: "24Gi"
    requestGPU: 1

    # No HF token needed - Apache 2.0 licensed OpenAI open source model

    pvcStorage: "100Gi"
    pvcAccessMode:
      - ReadWriteOnce
    storageClass: "oci-block-storage-enc"

    nodeSelector:
      app: gpu
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

    extraArgs:
      - "--max-model-len=8192"
      - "--gpu-memory-utilization=0.90"

Note: The openai/gpt-oss-20b model is a Mixture of Experts (MoE) model with 20B total parameters and 3.6B active parameters per forward pass. It is released under the Apache 2.0 license, so no Hugging Face token is required. The vllm/vllm-openai container image provides an OpenAI-compatible API server, allowing clients to use standard OpenAI SDK calls against your self-hosted endpoint.

Deploy the stack. Do not use --wait here because the router pod will CrashLoop until patched in the next step.
```
helm upgrade -i \
    vllm vllm/vllm-stack \
    -f production_stack_specification.yaml
```
Wait for the vLLM engine pod to start (the router will be patched next).
```
kubectl wait --for=condition=Ready pods -l model=gpt-oss --timeout=600s
```
Note: The engine pod takes several minutes to become Ready because it downloads the model weights on first start. If the pod stays in ContainerCreating, the container image (~10 GB) is still being pulled. Use kubectl describe pod <pod-name> to check progress.
Patch the router deployment. The router needs a GPU toleration (so it can schedule when GPU nodes are the only nodes with capacity) and increased memory limits (the default 500 Mi can cause OOMKill).
```
kubectl patch deployment vllm-deployment-router --type='json' -p='[
  {"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "512Mi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}
]'
```
Note: The GPU toleration is needed because OKE GPU nodes have a nvidia.com/gpu=present:NoSchedule taint that prevents non-GPU workloads from scheduling. Since the router does not use a GPU but needs to run somewhere, this toleration allows it to schedule on GPU nodes. In clusters with dedicated CPU node pools, this toleration is not needed.
Confirm the Helm release is deployed.
```
helm list
```

Verify the pods are running.

kubectl get pods

Expected output:

NAME                                            READY   STATUS    RESTARTS   AGE
vllm-deployment-router-xxxxxxxxxx-xxxxx          1/1     Running   0          5m
vllm-gpt-oss-deployment-vllm-xxxxxxxxxx-xxxxx    1/1     Running   0          5m

kubectl get pods

kubectl get pods showing router and engine pods both Running 1/1

Check the model loading progress in the pod logs.
```
kubectl logs -f deployment/vllm-gpt-oss-deployment-vllm
```
Wait until you see a message indicating that the model has loaded and the server is ready to accept requests.

Task 10: Test the Inference Endpoint

Validate that the deployment is serving inference requests. The vLLM Production Stack exposes an OpenAI-compatible API through the router service, so any OpenAI SDK client or curl command can interact with it.

The following diagram shows the inference request lifecycle: from client request through the router’s engine selection logic, into the vLLM engine’s prefill and decode phases, and back as a streamed response.

Sequence diagram showing inference request lifecycle from client through router to vLLM engine and GPU

List the available models to confirm the deployment is healthy.
```
kubectl get svc vllm-router-service
```
The router service provides the API gateway to all deployed models. Since the cluster uses a private endpoint, you access the service through kubectl port-forward.
Start a port-forward from your local machine to the router service. Open a new terminal (keep the SSH tunnel running in the other one) and run:
```
kubectl port-forward svc/vllm-router-service 8080:80
```
This maps localhost:8080 on your machine to port 80 on the router service inside the cluster.

Security Note: kubectl port-forward binds locally and does not expose the service publicly. This is the safest way to test when using a private cluster over a bastion tunnel.

Note: The port-forward command runs in the foreground. Keep this terminal open while testing. Press Ctrl+C to stop it when done.

In another terminal, verify the model is available by querying the models endpoint.

curl -s http://localhost:8080/v1/models | python3 -m json.tool

Expected output:

{
    "object": "list",
    "data": [
        {
            "id": "openai/gpt-oss-20b",
            "object": "model",
            "created": 1234567890,
            "owned_by": "vllm"
        }
    ]
}

Send a text completion request.

curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "prompt": "Oracle Cloud Infrastructure is",
    "max_tokens": 50
  }' | python3 -m json.tool

Expected output (abbreviated):

{
    "id": "cmpl-xxxxxxxxxxxx",
    "object": "text_completion",
    "model": "openai/gpt-oss-20b",
    "choices": [
        {
            "index": 0,
            "text": " a cloud computing platform that provides ...",
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 5,
        "completion_tokens": 50,
        "total_tokens": 55
    }
}

Send a chat completion request. This is the same format used by the OpenAI Python SDK and is the most common way to interact with LLMs programmatically.

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is Oracle Cloud Infrastructure in one sentence?"}],
    "max_tokens": 100
  }' | python3 -m json.tool

Expected output (abbreviated):

{
    "id": "chatcmpl-xxxxxxxxxxxx",
    "object": "chat.completion",
    "model": "openai/gpt-oss-20b",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Oracle Cloud Infrastructure (OCI) is Oracle's enterprise-grade cloud platform that provides a full range of services for building, deploying, and managing applications and workloads..."
            },
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 71,
        "completion_tokens": 100,
        "total_tokens": 171
    }
}

Both responses include the model’s generated text in the choices array, token usage statistics, and a finish_reason of either stop (the model finished naturally) or length (output was truncated at max_tokens).

Note: The model name in the API request is the full model path (openai/gpt-oss-20b), which matches the modelURL field in the Helm values. Any OpenAI-compatible client can use this endpoint by setting base_url to http://localhost:8080/v1.

Terminal output showing curl chat completion request and JSON response from openai/gpt-oss-20b

Measure end-to-end latency for a short request.
```
time curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"What is Kubernetes?"}],"max_tokens":50}' > /dev/null
```
On a single A10 GPU, expect 1-3 seconds end-to-end for short completions. Time to first token (TTFT) is typically 50-200ms depending on prompt length. For higher throughput, increase replicaCount in the Helm values to add more engine replicas behind the router.

Task 11: Expose via OCI Load Balancer (Optional)

Make the inference endpoint accessible externally through an OCI Load Balancer.

Security Note: This exposes the inference API to the public internet by default. Do not enable this in production without TLS, authentication (API key/JWT/mTLS), and IP allow-listing or WAF controls. If you must expose it, front it with an ingress controller or API gateway that enforces auth and rate limits, or use an internal load balancer.

Patch the router service to use a LoadBalancer type.
```
kubectl patch svc vllm-router-service \
  -p '{"spec": {"type": "LoadBalancer"}}'
```
Note: If you want an internal load balancer, add the OCI annotation to the service (example below). This keeps the endpoint private inside the VCN.
```
kubectl annotate svc vllm-router-service \
  "service.beta.kubernetes.io/oci-load-balancer-internal"="true"
```

Wait for the external IP to be assigned.

kubectl get svc vllm-router-service -w

Expected output once the load balancer is provisioned:

NAME                   TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)
vllm-router-service    LoadBalancer   10.96.x.x      129.xxx.xxx.xx   80:xxxxx/TCP

kubectl get svc

kubectl get svc output showing router and engine services

Test the external endpoint.

curl http://<EXTERNAL-IP>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "prompt": "Hello from OCI!",
    "max_tokens": 50
  }'

Task 12: Configure Multi-GPU Tensor Parallelism (Advanced)

Deploy larger models across multiple GPUs on bare metal shapes.

Tensor parallelism splits a model across multiple GPUs on a single node. This is required when a model’s memory requirements exceed a single GPU. For example, Meta Llama 3.1 70B requires approximately 140 GB of GPU memory, which exceeds the capacity of any single GPU but fits across 8x A100 80 GB or 8x H100 GPUs.

Create a Kubernetes secret with your Hugging Face token. Gated models such as Llama 3.1 70B require authentication.
```
kubectl create secret generic hf-token-secret \
  --from-literal=token=YOUR_HUGGINGFACE_TOKEN
```

Update the production_stack_specification.yaml with a multi-GPU configuration.

servingEngineSpec:
  modelSpec:
  - name: "llama70b"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-70B-Instruct"

    replicaCount: 1
    tensorParallelSize: 8

    requestCPU: 32
    requestMemory: "256Gi"
    requestGPU: 8

    hf_token:
      secretName: "hf-token-secret"
      secretKey: "token"

    pvcStorage: "500Gi"
    pvcAccessMode:
      - ReadWriteOnce
    storageClass: "oci-block-storage-enc"

    nodeSelector:
      node.kubernetes.io/instance-type: "BM.GPU.H100.8"
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

    extraArgs:
      - "--max-model-len=8192"
      - "--gpu-memory-utilization=0.95"
      - "--tensor-parallel-size=8"

Deploy with the updated values. As with Task 9, do not use --wait. The router will CrashLoop until patched.

helm upgrade -i \
    vllm vllm/vllm-stack \
    -f production_stack_specification.yaml
kubectl wait --for=condition=Ready pods -l model=llama70b --timeout=900s

Then patch the router (same as Task 9, step 4) and verify:

kubectl patch deployment vllm-deployment-router --type='json' -p='[
  {"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "512Mi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}
]'

Verify the pod is running and all GPUs are in use.

kubectl get pods
kubectl logs -f deployment/vllm-llama70b-deployment-vllm

Note: Ensure your OKE cluster has a node pool with the appropriate bare metal GPU shape (for example, BM.GPU.H100.8) before deploying a multi-GPU configuration.

Task 13: Use OCI Object Storage for Models (Advanced)

Load model weights from OCI Object Storage instead of downloading from Hugging Face. This is useful for private models, faster downloads within OCI, or environments without external internet access.

Upload your model weights to an OCI Object Storage bucket. Navigate to Storage > Object Storage in the OCI Console and create a bucket if you do not already have one.
Create a Pre-Authenticated Request (PAR) URL for your bucket. In the OCI Console, select your bucket, click Pre-Authenticated Requests, and create a new request with read access.

Update the production_stack_specification.yaml to use the PAR URL.

servingEngineSpec:
  modelSpec:
  - name: "custom-model"
    repository: "iad.ocir.io/YOUR_TENANCY/vllm-custom:latest"
    modelURL: "/models/custom-model"

    env:
      - name: BUCKET_PAR_URL
        value: "https://objectstorage.us-ashburn-1.oraclecloud.com/p/xxx/n/namespace/b/bucket/o"
      - name: MODEL_NAME
        value: "custom-model"

Deploy with the updated values (Without --wait. See Task 9 for why).

helm upgrade -i \
    vllm vllm/vllm-stack \
    -f production_stack_specification.yaml
kubectl wait --for=condition=Ready pods -l model=custom-model --timeout=600s

Verify the model loads from Object Storage by checking the pod logs.
```
kubectl logs -f deployment/vllm-custom-model-deployment-vllm
```

Task 14: Clean Up Resources

Remove all deployed resources to avoid ongoing charges.

Remove the Kubernetes resources using the cleanup script.
```
cd production-stack/deployment_on_cloud/oci
./clean_up.sh
```
This uninstalls the Helm release, deletes all PersistentVolumeClaims, PersistentVolumes, and custom vLLM resources.
Delete the OKE cluster and all OCI networking resources.
```
./entry_point.sh cleanup
```
This deletes the following resources in order:
- GPU node pool
- OKE cluster
- Bastion host (if created)
- Subnets (API, worker, load balancer, bastion)
- Security lists
- Service Gateway, NAT Gateway, and Internet Gateway
- Route tables
- VCN
Verify that all resources have been removed in the OCI Console under Developer Services > Kubernetes Clusters and Networking > Virtual Cloud Networks.
```
./entry_point.sh cleanup
```

Note: Ensure all resources are deleted to avoid ongoing charges. GPU instances and block volumes incur costs even when idle.

What’s Next

This tutorial deployed a functional inference stack. For production workloads, consider the following enhancements:

Monitoring: vLLM exposes Prometheus metrics at /metrics on each engine pod. Connect a Prometheus + Grafana stack to track time to first token, tokens per second, KV cache utilization, and request queue depth.
Autoscaling: Configure Kubernetes Horizontal Pod Autoscaler with custom metrics (request queue depth or GPU utilization via DCGM) to scale engine replicas automatically under load.
Security hardening: Restrict BASTION_CLIENT_CIDR to your IP range, add Kubernetes network policies to isolate the inference namespace, and store model credentials in OCI Vault instead of Kubernetes secrets.
High availability: Spread GPU nodes across multiple availability domains and increase replicaCount in the Helm values so the router can failover between engine replicas.
Cost optimization: Use ./entry_point.sh cleanup when not in use. For dev/test workloads, consider preemptible GPU instances. Monitor GPU utilization to right-size your node pool.
Infrastructure as Code: Codify this deployment with the OCI Terraform provider or OCI Resource Manager for repeatable, version-controlled infrastructure.

Acknowledgments

Author - Federico Kamelhar (Senior Principal Architect, Agentic AI)

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.

Title and Copyright Information

Deploy OpenAI vLLM Production Stack on Oracle Kubernetes Engine (OKE)

G50911-01

Oracle and/or its affiliates.

Deploy OpenAI vLLM Production Stack on Oracle Kubernetes Engine (OKE)

Introduction

Objectives

Prerequisites

Task 1: Configure Environment Variables

Task 2: Deploy Using the Automated Script (Quick Start)

Task 3: Create VCN and Networking

Task 4: Create the OKE Cluster

Task 5: Add GPU Node Pool

Task 6: Install NVIDIA Device Plugin

Task 7: Configure Storage

Task 8: Expand GPU Node Filesystem

Task 9: Deploy the vLLM Production Stack

Task 10: Test the Inference Endpoint

Task 11: Expose via OCI Load Balancer (Optional)

Task 12: Configure Multi-GPU Tensor Parallelism (Advanced)

Task 13: Use OCI Object Storage for Models (Advanced)

Task 14: Clean Up Resources

What’s Next

Related Links

Acknowledgments

More Learning Resources