Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Run Elyza LLM Model on OCI Compute A10.2 Instance with Oracle Resource Manager using One Click Deployment
Introduction
Oracle Cloud Infrastructure (OCI) Compute lets you create different types of shapes to test Graphics Processing Units (GPUs) with Artificial Intelligence (AI) models deployed locally. In this tutorial, we will use an A10.2 shape with pre-existing VCN and subnet resources that you can select from Oracle Resource Manager.
The Terraform code also configures the instance to run local Virtual Large Language Model (vLLM) Elyza model(s) for natural language processing tasks.
Objectives
- Create an A10.2 shape on OCI Compute, download the Elyza LLM model, and query the local vLLM model.
Prerequisites
- Ensure you have an OCI Virtual Cloud Network (VCN) and a subnet where the virtual machine (VM) will be deployed.
- Understanding of the network components and their relationships. For more information, see Networking Overview.
- Understanding of networking in the cloud. For more information, watch the following video: Video for Networking in the Cloud EP.01: Virtual Cloud Networks.
- Requirements (a CLI sketch of these settings follows this list):
  - Instance Type: A10.2 shape with two NVIDIA GPUs.
  - Operating System: Oracle Linux.
  - Image Selection: The deployment script selects the latest Oracle Linux image with GPU support.
  - Tags: Adds a freeform tag GPU_TAG = "A10-2".
  - Boot Volume Size: 250 GB.
  - Initialization: Uses cloud-init to download and configure the vLLM Elyza model(s).
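For reference, the following OCI CLI command is a minimal sketch of roughly what the stack provisions. It is not a substitute for the one click deployment: the OCIDs, the availability domain, and the cloudinit.sh path are placeholders, and the image OCID is selected dynamically by the deployment script.

# Hedged sketch only: the ORM Terraform stack performs the actual provisioning.
# All OCIDs and the availability domain below are placeholders to substitute.
oci compute instance launch \
  --compartment-id ocid1.compartment.oc1..example \
  --availability-domain "Uocm:EU-FRANKFURT-1-AD-1" \
  --shape VM.GPU.A10.2 \
  --subnet-id ocid1.subnet.oc1..example \
  --image-id ocid1.image.oc1..example \
  --boot-volume-size-in-gbs 250 \
  --freeform-tags '{"GPU_TAG": "A10-2"}' \
  --user-data-file cloudinit.sh \
  --display-name a10-2-elyza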
Task 1: Download the Terraform Code for One Click Deployment
Download the ORM Terraform code from here: orm_stack_a10_2_gpu_elyza_models.zip. The stack lets you select an existing VCN and subnet and deploys the Elyza vLLM model(s) locally on an A10.2 instance shape.
Once you have downloaded the ORM Terraform code locally, follow the steps from here: Creating a Stack from a Folder to upload the stack and apply the Terraform code.
Note: Ensure you have created an OCI Virtual Cloud Network (VCN) and a subnet where the VM will be deployed.
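If you prefer scripting over the OCI Console, the stack can also be created and applied with the OCI CLI. The following is a minimal sketch: the OCIDs and the variable names (vcn_id, subnet_id) are illustrative placeholders, not necessarily the stack's actual variable names.

# Hedged sketch: OCIDs and variable names below are placeholders.
oci resource-manager stack create \
  --compartment-id ocid1.compartment.oc1..example \
  --config-source orm_stack_a10_2_gpu_elyza_models.zip \
  --display-name "a10-2-elyza-stack" \
  --variables '{"vcn_id": "ocid1.vcn.oc1..example", "subnet_id": "ocid1.subnet.oc1..example"}'

# Run an apply job against the new stack, using the stack OCID returned above.
oci resource-manager job create-apply-job \
  --stack-id ocid1.ormstack.oc1..example \
  --execution-plan-strategy AUTO_APPROVED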
Task 2: Create a VCN on OCI (Optional if not created already)
To create a VCN in Oracle Cloud Infrastructure, see: Video for Explore how to create a Virtual Cloud Network on OCI, or follow the steps below (a CLI equivalent is sketched at the end of this task):
- Log in to the OCI Console, enter Cloud Tenant Name, User Name, and Password.
- Click the hamburger menu (≡) from the upper left corner.
- Go to Networking, Virtual Cloud Networks and select the appropriate compartment from the List Scope section.
- Select VCN with Internet Connectivity, and click Start VCN Wizard.
- In the Create a VCN with Internet Connectivity page, enter the following information and click Next.
  - VCN NAME: Enter OCI_HOL_VCN.
  - COMPARTMENT: Select the appropriate compartment.
  - VCN CIDR BLOCK: Enter 10.0.0.0/16.
  - PUBLIC SUBNET CIDR BLOCK: Enter 10.0.2.0/24.
  - PRIVATE SUBNET CIDR BLOCK: Enter 10.0.1.0/24.
  - DNS Resolution: Select USE DNS HOSTNAMES IN THIS VCN.
- In the Review page, review your settings and click Create.
It will take a moment to create the VCN and a progress screen will keep you apprised of the workflow.
- Once the VCN is created, click View Virtual Cloud Network.
In real-world situations, you will create multiple VCNs based on their need for access (which ports to open) and who can access them.
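If you prefer the CLI, the VCN and its public subnet can also be created with commands like the following. This is a minimal sketch with placeholder OCIDs; note that the VCN wizard additionally creates the internet gateway, route tables, and security lists, which you would need to configure yourself when using the CLI.

# Hedged sketch: create only the VCN and a public subnet; OCIDs are placeholders.
oci network vcn create \
  --compartment-id ocid1.compartment.oc1..example \
  --display-name OCI_HOL_VCN \
  --cidr-block 10.0.0.0/16 \
  --dns-label ociholvcn

oci network subnet create \
  --compartment-id ocid1.compartment.oc1..example \
  --vcn-id ocid1.vcn.oc1..example \
  --display-name public-subnet \
  --cidr-block 10.0.2.0/24 \
  --dns-label public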
Task 3: See cloud-init Configuration Details
The cloud-init script installs all the necessary dependencies, starts Docker, and downloads and starts the vLLM Elyza model(s). You can find the following code in the cloudinit.sh file downloaded in Task 1.
dnf install -y dnf-utils zip unzip
dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
dnf remove -y runc
dnf install -y docker-ce --nobest
systemctl enable docker.service
dnf install -y nvidia-container-toolkit
systemctl start docker.service
...
Cloud-init downloads all the files needed to run the Elyza model and does not need your API token predefined in Hugging Face. The API token is only needed when launching the Elyza model with Docker in Task 6.
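For context, the model download step performed by cloud-init typically resembles the following. This is a hedged sketch rather than the exact contents of cloudinit.sh, and $MODEL stands for the Elyza model name supplied by the stack.

# Hedged sketch of a model download step; the cloudinit.sh from Task 1 is authoritative.
# Public Elyza repositories on Hugging Face can be downloaded without a token.
python3 -m pip install --upgrade huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
import os
model = os.environ['MODEL']
snapshot_download(repo_id=f'elyza/{model}', local_dir=f'/home/opc/models/{model}')
"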
Task 4: Monitor the System
Track cloud-init completion and GPU resource usage with the following commands (if needed).
- Monitor cloud-init completion (a completion check is also sketched after this list):

  tail -f /var/log/cloud-init-output.log

- Monitor GPU utilization:

  nvidia-smi dmon -s mu -c 100 --id 0,1

- Deploy and interact with the vLLM Elyza model using Python (change the parameters only if needed; the command is already included in the cloud-init script):

  python -O -u -m vllm.entrypoints.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /home/opc/models/${MODEL} \
    --tokenizer hf-internal-testing/llama-tokenizer \
    --enforce-eager \
    --max-num-seqs 1 \
    --tensor-parallel-size 2 \
    >> /home/opc/${MODEL}.log 2>&1
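In addition to tailing the log, you can block until cloud-init reports that it has finished and then verify that the vLLM server is listening; a minimal check, assuming the server runs on port 8000 as configured above:

# Wait for all cloud-init stages to finish and report the result.
sudo cloud-init status --wait

# Confirm a process is listening on the vLLM port (8000).
ss -ltn | grep ':8000'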
Task 5: Test the Model Integration
Interact with the model in the following ways, using CLI commands or a Jupyter Notebook.
- Test the model from the Command Line Interface (CLI) once cloud-init has completed.

  curl -X POST "http://0.0.0.0:8000/generate" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a humorous limerick about the wonders of GPU computing.", "max_tokens": 64, "temperature": 0.7, "top_p": 0.9}'
- Test the model from a Jupyter Notebook (ensure that port 8888 is open; a JupyterLab launch sketch follows this list).

  import requests
  import json

  url = "http://0.0.0.0:8000/generate"
  headers = {
      "accept": "application/json",
      "Content-Type": "application/json",
  }
  data = {
      "prompt": "Write a short conclusion.",
      "max_tokens": 64,
      "temperature": 0.7,
      "top_p": 0.9
  }

  response = requests.post(url, headers=headers, json=data)

  if response.status_code == 200:
      result = response.json()
      # Pretty print the response for better readability
      formatted_response = json.dumps(result, indent=4)
      print("Response:", formatted_response)
  else:
      print("Request failed with status code:", response.status_code)
      print("Response:", response.text)
- Integrate Gradio with Chatbot to query the model.

  import requests
  import gradio as gr
  import os

  # Function to interact with the model via API
  def interact_with_model(prompt):
      url = 'http://0.0.0.0:8000/generate'
      headers = {
          "accept": "application/json",
          "Content-Type": "application/json",
      }
      data = {
          "prompt": prompt,
          "max_tokens": 64,
          "temperature": 0.7,
          "top_p": 0.9
      }
      response = requests.post(url, headers=headers, json=data)
      if response.status_code == 200:
          result = response.json()
          completion_text = result["text"][0].strip()  # Extract the generated text
          return completion_text
      else:
          return {"error": f"Request failed with status code {response.status_code}"}

  # Retrieve the MODEL environment variable
  model_name = os.getenv("MODEL")

  # Example Gradio interface
  iface = gr.Interface(
      fn=interact_with_model,
      inputs=gr.Textbox(lines=2, placeholder="Write a prompt..."),
      outputs=gr.Textbox(type="text", placeholder="Response..."),
      title=f"{model_name} Interface",  # Use model_name to dynamically set the title
      description=f"Interact with the {model_name} deployed locally via Gradio.",  # Use model_name to dynamically set the description
      live=True
  )

  # Launch the Gradio interface
  iface.launch(share=True)
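If a notebook server is not already present on the instance, a minimal way to install JupyterLab and serve it on port 8888 is shown below; this assumes nothing about your environment beyond Python 3 and pip, so adjust as needed.

# Hedged sketch: install JupyterLab for the current user and serve it on port 8888.
python3 -m pip install --user jupyterlab
~/.local/bin/jupyter lab --ip 0.0.0.0 --port 8888 --no-browser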
Task 6: Deploy the Model using Docker (if needed)
Alternatively, deploy the model using Docker for encapsulated environments (a quick verification sketch follows these commands):
- Model from an external source.

  docker run --gpus all \
    --env "HUGGING_FACE_HUB_TOKEN=$TOKEN_ACCESS" \
    -p 8000:8000 \
    --ipc=host \
    --restart always \
    vllm/vllm-openai:latest \
    --tensor-parallel-size 2 \
    --model elyza/$MODEL
- Model running with Docker using the already downloaded local files (starts quicker).

  docker run --gpus all \
    -v /home/opc/models/$MODEL/:/mnt/model/ \
    --env "HUGGING_FACE_HUB_TOKEN=$TOKEN_ACCESS" \
    -p 8000:8000 \
    --env "TRANSFORMERS_OFFLINE=1" \
    --env "HF_DATASET_OFFLINE=1" \
    --ipc=host vllm/vllm-openai:latest \
    --model="/mnt/model/" \
    --tensor-parallel-size 2
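Either way, you can confirm that the container started and that the OpenAI-compatible server reports the expected model; <container_id> below is a placeholder for the ID shown by docker ps.

# List running containers and follow the startup logs of the vLLM container.
docker ps
docker logs -f <container_id>

# The OpenAI-compatible server lists the model it is serving.
curl -s http://0.0.0.0:8000/v1/models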
You can query the model in the following ways:
- Query the model launched with Docker from the CLI:
  - Model started with Docker from an external source.

    curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "elyza/'${MODEL}'",
        "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
      }'
  - Model started locally with Docker.

    curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "/mnt/model/",
        "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
      }'
- Query the model started with Docker from a Jupyter Notebook.
  - Model started from Docker Hub.

    import requests
    import json
    import os

    url = "http://0.0.0.0:8000/v1/chat/completions"
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json",
    }

    # Assuming `MODEL` is an environment variable set appropriately
    model = f"elyza/{os.getenv('MODEL')}"

    data = {
        "model": model,
        "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        # Extract the generated text from the response
        completion_text = result["choices"][0]["message"]["content"].strip()
        print("Generated Text:", completion_text)
    else:
        print("Request failed with status code:", response.status_code)
        print("Response:", response.text)
  - Container started locally with Docker.

    import requests
    import json

    url = "http://0.0.0.0:8000/v1/chat/completions"
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json",
    }

    model = "/mnt/model/"  # Adjust this based on your specific model path or name

    data = {
        "model": model,
        "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        # Extract the generated text from the response
        completion_text = result["choices"][0]["message"]["content"].strip()
        print("Generated Text:", completion_text)
    else:
        print("Request failed with status code:", response.status_code)
        print("Response:", response.text)
- Query the model started with Docker using Gradio integrated with Chatbot.
  - Model started with Docker from an external source.

    import requests
    import gradio as gr
    import os

    # Function to interact with the model via API
    def interact_with_model(prompt):
        url = 'http://0.0.0.0:8000/v1/chat/completions'  # OpenAI-compatible endpoint
        headers = {
            "accept": "application/json",
            "Content-Type": "application/json",
        }
        # Assuming `MODEL` is an environment variable set appropriately
        model = f"elyza/{os.getenv('MODEL')}"
        data = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],  # Use the user-provided prompt
            "max_tokens": 64,
            "temperature": 0.7,
            "top_p": 0.9
        }
        response = requests.post(url, headers=headers, json=data)
        if response.status_code == 200:
            result = response.json()
            completion_text = result["choices"][0]["message"]["content"].strip()  # Extract the generated text
            return completion_text
        else:
            return {"error": f"Request failed with status code {response.status_code}"}

    # Retrieve the MODEL environment variable
    model_name = os.getenv("MODEL")

    # Example Gradio interface
    iface = gr.Interface(
        fn=interact_with_model,
        inputs=gr.Textbox(lines=2, placeholder="Write a prompt..."),
        outputs=gr.Textbox(type="text", placeholder="Response..."),
        title=f"{model_name} Interface",  # Use model_name to dynamically set the title
        description=f"Interact with the {model_name} model deployed locally via Gradio.",
        live=True
    )

    # Launch the Gradio interface
    iface.launch(share=True)
  - Container started locally with Docker using Gradio.

    import requests
    import gradio as gr

    # Function to interact with the model via API
    def interact_with_model(prompt):
        url = 'http://0.0.0.0:8000/v1/chat/completions'  # OpenAI-compatible endpoint
        headers = {
            "accept": "application/json",
            "Content-Type": "application/json",
        }
        model = "/mnt/model/"  # Adjust this based on your specific model path or name
        data = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 64,
            "temperature": 0.7,
            "top_p": 0.9
        }
        response = requests.post(url, headers=headers, json=data)
        if response.status_code == 200:
            result = response.json()
            completion_text = result["choices"][0]["message"]["content"].strip()
            return completion_text
        else:
            return {"error": f"Request failed with status code {response.status_code}"}

    # Example Gradio interface
    iface = gr.Interface(
        fn=interact_with_model,
        inputs=gr.Textbox(lines=2, placeholder="Write a humorous limerick about the wonders of GPU computing."),
        outputs=gr.Textbox(type="text", placeholder="Response..."),
        title="Model Interface",  # Set your desired title here
        description="Interact with the model deployed locally via Gradio.",
        live=True
    )

    # Launch the Gradio interface
    iface.launch(share=True)
Note: Firewall commands to open the 8888 port for Jupyter Notebook.

sudo firewall-cmd --zone=public --permanent --add-port 8888/tcp
sudo firewall-cmd --reload
sudo firewall-cmd --list-all
Acknowledgments
- Author - Bogdan Bazarca (Senior Cloud Engineer)
- Contributors - Oracle NACI-AI-CN-DEV team
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
Run Elyza LLM Model on OCI Compute A10.2 Instance with Oracle Resource Manager using One Click Deployment
G11811-01
July 2024