Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Run Elyza LLM Model on OCI Compute A10.2 Instance with Oracle Resource Manager using One Click Deployment
Introduction
Oracle Cloud Infrastructure (OCI) Compute lets you create different types of shapes to test Graphics Processing Units (GPUs) with Artificial Intelligence (AI) models deployed locally. In this tutorial, we will use an A10.2 shape with pre-existing VCN and subnet resources that you can select from Oracle Resource Manager.
The Terraform code also configures the instance to run local Virtual Large Language Model (vLLM) Elyza model(s) for natural language processing tasks.
Objectives
- Create an A10.2 shape on OCI Compute, download the Elyza LLM model, and query the local vLLM model.
Prerequisites
- Ensure you have an OCI Virtual Cloud Network (VCN) and a subnet where the virtual machine (VM) will be deployed.
- Understanding of the network components and their relationships. For more information, see Networking Overview.
- Understanding of networking in the cloud. For more information, watch the following video: Video for Networking in the Cloud EP.01: Virtual Cloud Networks.
- Requirements (a CLI sketch of these settings follows this list):
  - Instance Type: A10.2 shape with two NVIDIA GPUs.
  - Operating System: Oracle Linux.
  - Image Selection: The deployment script selects the latest Oracle Linux image with GPU support.
  - Tags: Adds a freeform tag GPU_TAG = "A10-2".
  - Boot Volume Size: 250 GB.
  - Initialization: Uses cloud-init to download and configure the vLLM Elyza model(s).
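For reference, the following OCI CLI command is a minimal sketch of roughly what the stack provisions. It is not a substitute for the one click deployment: the OCIDs, the availability domain, and the cloudinit.sh path are placeholders, and the image OCID is selected dynamically by the deployment script.

# Hedged sketch only: the ORM Terraform stack performs the actual provisioning.
# All OCIDs and the availability domain below are placeholders to substitute.
oci compute instance launch \
  --compartment-id ocid1.compartment.oc1..example \
  --availability-domain "Uocm:EU-FRANKFURT-1-AD-1" \
  --shape VM.GPU.A10.2 \
  --subnet-id ocid1.subnet.oc1..example \
  --image-id ocid1.image.oc1..example \
  --boot-volume-size-in-gbs 250 \
  --freeform-tags '{"GPU_TAG": "A10-2"}' \
  --user-data-file cloudinit.sh \
  --display-name a10-2-elyza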
Task 1: Download the Terraform Code for One Click Deployment
Download the ORM Terraform code from here: orm_stack_a10_2_gpu_elyza_models.zip. The stack lets you select an existing VCN and subnet and deploys the Elyza vLLM model(s) locally on an A10.2 instance shape.
Once you have downloaded the ORM Terraform code locally, follow the steps from here: Creating a Stack from a Folder to upload the stack and apply the Terraform code.
Note: Ensure you have created an OCI Virtual Cloud Network (VCN) and a subnet where the VM will be deployed.
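If you prefer scripting over the OCI Console, the stack can also be created and applied with the OCI CLI. The following is a minimal sketch: the OCIDs and the variable names (vcn_id, subnet_id) are illustrative placeholders, not necessarily the stack's actual variable names.

# Hedged sketch: OCIDs and variable names below are placeholders.
oci resource-manager stack create \
  --compartment-id ocid1.compartment.oc1..example \
  --config-source orm_stack_a10_2_gpu_elyza_models.zip \
  --display-name "a10-2-elyza-stack" \
  --variables '{"vcn_id": "ocid1.vcn.oc1..example", "subnet_id": "ocid1.subnet.oc1..example"}'

# Run an apply job against the new stack, using the stack OCID returned above.
oci resource-manager job create-apply-job \
  --stack-id ocid1.ormstack.oc1..example \
  --execution-plan-strategy AUTO_APPROVED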
Task 2: Create a VCN on OCI (Optional if not created already)
To create a VCN in Oracle Cloud Infrastructure, see: Video for Explore how to create a Virtual Cloud Network on OCI, or follow the steps below (a CLI equivalent is sketched at the end of this task):
- Log in to the OCI Console, enter Cloud Tenant Name, User Name, and Password.
- Click the hamburger menu (≡) from the upper left corner.
- Go to Networking, Virtual Cloud Networks and select the appropriate compartment from the List Scope section.
- Select VCN with Internet Connectivity, and click Start VCN Wizard.
- In the Create a VCN with Internet Connectivity page, enter the following information and click Next.
  - VCN NAME: Enter OCI_HOL_VCN.
  - COMPARTMENT: Select the appropriate compartment.
  - VCN CIDR BLOCK: Enter 10.0.0.0/16.
  - PUBLIC SUBNET CIDR BLOCK: Enter 10.0.2.0/24.
  - PRIVATE SUBNET CIDR BLOCK: Enter 10.0.1.0/24.
  - DNS Resolution: Select USE DNS HOSTNAMES IN THIS VCN.
- In the Review page, review your settings and click Create.
It will take a moment to create the VCN and a progress screen will keep you apprised of the workflow.
- Once the VCN is created, click View Virtual Cloud Network.
In real-world situations, you will create multiple VCNs based on their need for access (which ports to open) and who can access them.
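If you prefer the CLI, the VCN and its public subnet can also be created with commands like the following. This is a minimal sketch with placeholder OCIDs; note that the VCN wizard additionally creates the internet gateway, route tables, and security lists, which you would need to configure yourself when using the CLI.

# Hedged sketch: create only the VCN and a public subnet; OCIDs are placeholders.
oci network vcn create \
  --compartment-id ocid1.compartment.oc1..example \
  --display-name OCI_HOL_VCN \
  --cidr-block 10.0.0.0/16 \
  --dns-label ociholvcn

oci network subnet create \
  --compartment-id ocid1.compartment.oc1..example \
  --vcn-id ocid1.vcn.oc1..example \
  --display-name public-subnet \
  --cidr-block 10.0.2.0/24 \
  --dns-label public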
Task 3: See cloud-init Configuration Details
The cloud-init script installs all the necessary dependencies, starts Docker, and downloads and starts the vLLM Elyza model(s). You can find the following code in the cloudinit.sh file downloaded in Task 1.
dnf install -y dnf-utils zip unzip
dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
dnf remove -y runc
dnf install -y docker-ce --nobest
systemctl enable docker.service
dnf install -y nvidia-container-toolkit
systemctl start docker.service
...
Cloud-init downloads all the files needed to run the Elyza model and does not need your API token predefined in Hugging Face. The API token is only needed when launching the Elyza model with Docker in Task 6.
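For context, the model download step performed by cloud-init typically resembles the following. This is a hedged sketch rather than the exact contents of cloudinit.sh, and $MODEL stands for the Elyza model name supplied by the stack.

# Hedged sketch of a model download step; the cloudinit.sh from Task 1 is authoritative.
# Public Elyza repositories on Hugging Face can be downloaded without a token.
python3 -m pip install --upgrade huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
import os
model = os.environ['MODEL']
snapshot_download(repo_id=f'elyza/{model}', local_dir=f'/home/opc/models/{model}')
"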
Task 4: Monitor the System
Track cloud-init completion and GPU resource usage with the following commands (if needed).
- Monitor cloud-init completion (a completion check is also sketched after this list):

  tail -f /var/log/cloud-init-output.log

- Monitor GPU utilization:

  nvidia-smi dmon -s mu -c 100 --id 0,1

- Deploy and interact with the vLLM Elyza model using Python (change the parameters only if needed; the command is already included in the cloud-init script):

  python -O -u -m vllm.entrypoints.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /home/opc/models/${MODEL} \
    --tokenizer hf-internal-testing/llama-tokenizer \
    --enforce-eager \
    --max-num-seqs 1 \
    --tensor-parallel-size 2 \
    >> /home/opc/${MODEL}.log 2>&1
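In addition to tailing the log, you can block until cloud-init reports that it has finished and then verify that the vLLM server is listening; a minimal check, assuming the server runs on port 8000 as configured above:

# Wait for all cloud-init stages to finish and report the result.
sudo cloud-init status --wait

# Confirm a process is listening on the vLLM port (8000).
ss -ltn | grep ':8000'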
Task 5: Test the Model Integration
Interact with the model in the following ways, using CLI commands or a Jupyter Notebook.
- Test the model from the Command Line Interface (CLI) once cloud-init has completed.

  curl -X POST "http://0.0.0.0:8000/generate" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a humorous limerick about the wonders of GPU computing.", "max_tokens": 64, "temperature": 0.7, "top_p": 0.9}'
- Test the model from a Jupyter Notebook (ensure that port 8888 is open; a JupyterLab launch sketch follows this list).

  import requests
  import json

  url = "http://0.0.0.0:8000/generate"
  headers = {
      "accept": "application/json",
      "Content-Type": "application/json",
  }
  data = {
      "prompt": "Write a short conclusion.",
      "max_tokens": 64,
      "temperature": 0.7,
      "top_p": 0.9
  }

  response = requests.post(url, headers=headers, json=data)

  if response.status_code == 200:
      result = response.json()
      # Pretty print the response for better readability
      formatted_response = json.dumps(result, indent=4)
      print("Response:", formatted_response)
  else:
      print("Request failed with status code:", response.status_code)
      print("Response:", response.text)
- Integrate Gradio with Chatbot to query the model.

  import requests
  import gradio as gr
  import os

  # Function to interact with the model via API
  def interact_with_model(prompt):
      url = 'http://0.0.0.0:8000/generate'
      headers = {
          "accept": "application/json",
          "Content-Type": "application/json",
      }
      data = {
          "prompt": prompt,
          "max_tokens": 64,
          "temperature": 0.7,
          "top_p": 0.9
      }
      response = requests.post(url, headers=headers, json=data)
      if response.status_code == 200:
          result = response.json()
          completion_text = result["text"][0].strip()  # Extract the generated text
          return completion_text
      else:
          return {"error": f"Request failed with status code {response.status_code}"}

  # Retrieve the MODEL environment variable
  model_name = os.getenv("MODEL")

  # Example Gradio interface
  iface = gr.Interface(
      fn=interact_with_model,
      inputs=gr.Textbox(lines=2, placeholder="Write a prompt..."),
      outputs=gr.Textbox(type="text", placeholder="Response..."),
      title=f"{model_name} Interface",  # Use model_name to dynamically set the title
      description=f"Interact with the {model_name} deployed locally via Gradio.",  # Use model_name to dynamically set the description
      live=True
  )

  # Launch the Gradio interface
  iface.launch(share=True)
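If a notebook server is not already present on the instance, a minimal way to install JupyterLab and serve it on port 8888 is shown below; this assumes nothing about your environment beyond Python 3 and pip, so adjust as needed.

# Hedged sketch: install JupyterLab for the current user and serve it on port 8888.
python3 -m pip install --user jupyterlab
~/.local/bin/jupyter lab --ip 0.0.0.0 --port 8888 --no-browser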
Task 6: Deploy the Model using Docker (if needed)
Alternatively, deploy the model using Docker for encapsulated environments (a quick verification sketch follows these commands):
- Model from an external source.

  docker run --gpus all \
    --env "HUGGING_FACE_HUB_TOKEN=$TOKEN_ACCESS" \
    -p 8000:8000 \
    --ipc=host \
    --restart always \
    vllm/vllm-openai:latest \
    --tensor-parallel-size 2 \
    --model elyza/$MODEL
- Model running with Docker using the already downloaded local files (starts quicker).

  docker run --gpus all \
    -v /home/opc/models/$MODEL/:/mnt/model/ \
    --env "HUGGING_FACE_HUB_TOKEN=$TOKEN_ACCESS" \
    -p 8000:8000 \
    --env "TRANSFORMERS_OFFLINE=1" \
    --env "HF_DATASET_OFFLINE=1" \
    --ipc=host vllm/vllm-openai:latest \
    --model="/mnt/model/" \
    --tensor-parallel-size 2
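Either way, you can confirm that the container started and that the OpenAI-compatible server reports the expected model; <container_id> below is a placeholder for the ID shown by docker ps.

# List running containers and follow the startup logs of the vLLM container.
docker ps
docker logs -f <container_id>

# The OpenAI-compatible server lists the model it is serving.
curl -s http://0.0.0.0:8000/v1/models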
You can query the model in the following ways:
- Query the model launched with Docker from the CLI:
  - Model started with Docker from an external source.

    curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "elyza/'${MODEL}'",
        "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
      }'
  - Model started locally with Docker.

    curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "/mnt/model/",
        "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
      }'
- Query the model started with Docker from a Jupyter Notebook.
  - Model started from Docker Hub.

    import requests
    import json
    import os

    url = "http://0.0.0.0:8000/v1/chat/completions"
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json",
    }

    # Assuming `MODEL` is an environment variable set appropriately
    model = f"elyza/{os.getenv('MODEL')}"

    data = {
        "model": model,
        "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        # Extract the generated text from the response
        completion_text = result["choices"][0]["message"]["content"].strip()
        print("Generated Text:", completion_text)
    else:
        print("Request failed with status code:", response.status_code)
        print("Response:", response.text)
  - Container started locally with Docker.

    import requests
    import json

    url = "http://0.0.0.0:8000/v1/chat/completions"
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json",
    }

    model = "/mnt/model/"  # Adjust this based on your specific model path or name

    data = {
        "model": model,
        "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        # Extract the generated text from the response
        completion_text = result["choices"][0]["message"]["content"].strip()
        print("Generated Text:", completion_text)
    else:
        print("Request failed with status code:", response.status_code)
        print("Response:", response.text)
- Query the model started with Docker using Gradio integrated with Chatbot.
  - Model started with Docker from an external source.

    import requests
    import gradio as gr
    import os

    # Function to interact with the model via API
    def interact_with_model(prompt):
        url = 'http://0.0.0.0:8000/v1/chat/completions'  # OpenAI-compatible endpoint
        headers = {
            "accept": "application/json",
            "Content-Type": "application/json",
        }
        # Assuming `MODEL` is an environment variable set appropriately
        model = f"elyza/{os.getenv('MODEL')}"
        data = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],  # Use the user-provided prompt
            "max_tokens": 64,
            "temperature": 0.7,
            "top_p": 0.9
        }
        response = requests.post(url, headers=headers, json=data)
        if response.status_code == 200:
            result = response.json()
            completion_text = result["choices"][0]["message"]["content"].strip()  # Extract the generated text
            return completion_text
        else:
            return {"error": f"Request failed with status code {response.status_code}"}

    # Retrieve the MODEL environment variable
    model_name = os.getenv("MODEL")

    # Example Gradio interface
    iface = gr.Interface(
        fn=interact_with_model,
        inputs=gr.Textbox(lines=2, placeholder="Write a prompt..."),
        outputs=gr.Textbox(type="text", placeholder="Response..."),
        title=f"{model_name} Interface",  # Use model_name to dynamically set the title
        description=f"Interact with the {model_name} model deployed locally via Gradio.",
        live=True
    )

    # Launch the Gradio interface
    iface.launch(share=True)
  - Container started locally with Docker using Gradio.

    import requests
    import gradio as gr

    # Function to interact with the model via API
    def interact_with_model(prompt):
        url = 'http://0.0.0.0:8000/v1/chat/completions'  # OpenAI-compatible endpoint
        headers = {
            "accept": "application/json",
            "Content-Type": "application/json",
        }
        model = "/mnt/model/"  # Adjust this based on your specific model path or name
        data = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 64,
            "temperature": 0.7,
            "top_p": 0.9
        }
        response = requests.post(url, headers=headers, json=data)
        if response.status_code == 200:
            result = response.json()
            completion_text = result["choices"][0]["message"]["content"].strip()
            return completion_text
        else:
            return {"error": f"Request failed with status code {response.status_code}"}

    # Example Gradio interface
    iface = gr.Interface(
        fn=interact_with_model,
        inputs=gr.Textbox(lines=2, placeholder="Write a humorous limerick about the wonders of GPU computing."),
        outputs=gr.Textbox(type="text", placeholder="Response..."),
        title="Model Interface",  # Set your desired title here
        description="Interact with the model deployed locally via Gradio.",
        live=True
    )

    # Launch the Gradio interface
    iface.launch(share=True)
Note: Firewall commands to open the 8888 port for Jupyter Notebook.

sudo firewall-cmd --zone=public --permanent --add-port 8888/tcp
sudo firewall-cmd --reload
sudo firewall-cmd --list-all
Acknowledgments
- Author - Bogdan Bazarca (Senior Cloud Engineer)
- Contributors - Oracle NACI-AI-CN-DEV team
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
Run Elyza LLM Model on OCI Compute A10.2 Instance with Oracle Resource Manager using One Click Deployment
G11811-01
July 2024