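Note:

This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.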

Run Mistral LLM Model on OCI Compute A10 Instance with Oracle Resource Manager using One Click Deployment

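Introduction

Oracle Cloud Infrastructure (OCI) Compute lets you create different types of shapes to test graphics processing units (GPUs) for artificial intelligence (AI) models deployed locally. In this tutorial, we will use an A10 shape with a pre-existing VCN and subnet, both of which you can select from Oracle Resource Manager.

The Terraform code also configures the instance to run a local Mistral model with vLLM (Virtual Large Language Model) for natural language processing tasks.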

Objectives

Create an A10 shape on OCI Compute, download the Mistral AI LLM model, and query the local vLLM model.

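Prerequisites

Ensure you have an OCI Virtual Cloud Network (VCN) and a subnet where the virtual machine (VM) will be deployed.
Understanding of the network components and their relationships. For more information, see Networking Overview.
Understanding of networking in the cloud. For more information, watch the following video: Networking in the Cloud EP.01: Virtual Cloud Networks.
Requirements:
- Instance type: A10 shape with one Nvidia GPU.
- Operating system: Oracle Linux.
- Image selection: The deployment script selects the latest Oracle Linux image with GPU support.
- Tags: Add the free-form tag GPU_TAG = "A10-1".
- Boot volume size: 250GB.
- Initialization: Uses cloud-init to download and configure the vLLM Mistral model(s).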

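Task 1: Download the Terraform Code for One Click Deployment

Download the ORM Terraform code for running the Mistral vLLM model(s) locally from here: orm_stack_a10_gpu-main.zip. The stack lets you select an existing VCN and subnet to test the local deployment of the Mistral vLLM model(s) on an A10 instance shape.

After you have downloaded the ORM Terraform code locally, follow the steps in Creating a Stack from a Folder to upload the stack and run the apply job for the Terraform code.

Note: Ensure you have already created an OCI Virtual Cloud Network (VCN) and a subnet where the VM will be deployed.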

Task 2: Create a VCN on OCI (Optional if not created already)

To create a VCN in Oracle Cloud Infrastructure, watch the video on how to create a virtual cloud network on OCI.

or

To create a VCN, follow these steps:

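Log in to the OCI Console and enter your cloud tenant name, user name, and password.
Click the hamburger menu (≡) in the upper left corner.
Go to Networking, Virtual Cloud Networks, and select the appropriate compartment from the List Scope section.
Select VCN with Internet Connectivity and click Start VCN Wizard.
On the Create a VCN with Internet Connectivity page, enter the following information and click Next.
- VCN NAME: Enter OCI_HOL_VCN.
- COMPARTMENT: Select the appropriate compartment.
- VCN CIDR BLOCK: Enter 10.0.0.0/16.
- PUBLIC SUBNET CIDR BLOCK: Enter 10.0.2.0/24.
- PRIVATE SUBNET CIDR BLOCK: Enter 10.0.1.0/24.
- DNS RESOLUTION: Select USE DNS HOSTNAMES IN THIS VCN.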
On the Review page, review your settings and click Create.

Creating the VCN takes a moment, and a progress screen keeps you informed about the workflow.

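Once the VCN is created, click View Virtual Cloud Network.

In real-world situations, you will create multiple VCNs based on their access requirements (which ports to open) and on who may access them.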
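
The CIDR blocks entered in the wizard have to fit together: each subnet must lie inside the VCN range, and the subnets must not overlap one another. The following is a minimal sketch of that check using Python's standard ipaddress module. The helper name check_layout is hypothetical and not part of the tutorial's Terraform code.

```python
import ipaddress

# CIDR values taken from the VCN wizard fields in this task.
vcn = ipaddress.ip_network("10.0.0.0/16")
public_subnet = ipaddress.ip_network("10.0.2.0/24")
private_subnet = ipaddress.ip_network("10.0.1.0/24")


def check_layout(vcn, subnets):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    # Every subnet must be fully contained in the VCN range.
    for name, net in subnets.items():
        if not net.subnet_of(vcn):
            problems.append(f"{name} {net} is outside VCN {vcn}")
    # No two subnets may share addresses.
    names = list(subnets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if subnets[first].overlaps(subnets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


problems = check_layout(vcn, {"public": public_subnet, "private": private_subnet})
print(problems or "CIDR layout is consistent")
```

Running the same check with your own values before starting the wizard can catch a mistyped prefix early.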

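Task 3: See cloud-init Configuration Details

The cloud-init script installs all the necessary dependencies, starts Docker, and downloads and starts the vLLM Mistral model(s). You can find the following code in the cloudinit.sh file downloaded in Task 1.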

dnf install -y dnf-utils zip unzip
dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
dnf remove -y runc
dnf install -y docker-ce --nobest
systemctl enable docker.service
dnf install -y nvidia-container-toolkit
systemctl start docker.service
...

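Cloud-init downloads all the files needed to run the Mistral model, based on your API token predefined in Hugging Face.

The Mistral model is selected according to your input in the ORM GUI, and the API token provides the authentication needed to download the model files locally. For more information, see User access tokens.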
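
For orientation, here is a rough sketch of what a token-authenticated download looks like with the huggingface_hub Python package. It is an assumption that the stack behaves this way: the actual cloudinit.sh may use a different tool, and the helper names model_locations and download_model are hypothetical. The mistralai/ repository prefix and the /home/opc/models/ directory are the ones used elsewhere in this tutorial.

```python
def model_locations(model_name):
    """Return (Hugging Face repo id, local directory) for a Mistral model name."""
    repo_id = f"mistralai/{model_name}"
    local_dir = f"/home/opc/models/{model_name}"
    return repo_id, local_dir


def download_model(model_name, token):
    """Download every file of the model repository into the local directory."""
    # Imported lazily so model_locations() works without the package installed.
    from huggingface_hub import snapshot_download

    repo_id, local_dir = model_locations(model_name)
    # token is the Hugging Face user access token; gated repositories reject
    # requests from accounts that have not accepted the model licence.
    return snapshot_download(repo_id=repo_id, local_dir=local_dir, token=token)
```

A call such as download_model(os.environ["MODEL"], os.environ["ACCESS_TOKEN"]) would mirror the environment variables used in the commands later in this tutorial, assuming both are set in your shell.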

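Task 4: Monitor the System

Use the following commands (if needed) to track the completion of the cloud-init script and the GPU resource usage.

Monitor cloud-init completion: tail -f /var/log/cloud-init-output.log.
Monitor GPU utilization: nvidia-smi dmon -s mu -c 100.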
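
If you would rather collect the GPU figures from a script or a notebook, the sketch below reads them through nvidia-smi's CSV query mode. The function names are hypothetical, and the parser assumes every field is numeric (some GPUs report [N/A] for certain fields, which this sketch does not handle).

```python
import subprocess

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line):
    """Parse one CSV line produced by the nvidia-smi query in read_gpus()."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}


def read_gpus():
    """Return one dict per GPU; requires nvidia-smi on the PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Calling read_gpus() in a loop with a short sleep gives a simple utilization log while the model is loading or answering requests.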

Deploy and interact with the vLLM Mistral model using Python (change the parameters only if needed; the command is already included in the cloud-init script):

python -O -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model "/home/opc/models/${MODEL}" \
    --tokenizer hf-internal-testing/llama-tokenizer \
    --max-model-len 16384 \
    --enforce-eager \
    --gpu-memory-utilization 0.8 \
    --max-num-seqs 2 \
    >> "${MODEL}.log" 2>&1 &
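
Because the server is started in the background and first has to load the model, requests sent immediately afterwards may be refused. The sketch below polls the server until it answers. It assumes the vLLM OpenAI-compatible server exposes the standard /v1/models route on port 8000 (the port used by the requests in this tutorial); the function names and the timeout values are hypothetical.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url="http://127.0.0.1:8000/v1/models"):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(probe=server_is_up, timeout_s=1800, interval_s=10, sleep=time.sleep):
    """Call probe() until it returns True or timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_until_ready() before the first request avoids confusing a server that is still loading with one that failed; if it returns False, the "${MODEL}.log" file written by the command above is the place to look.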

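Task 5: Test the Model Integration

Interact with the model in the following ways, using the command line or Jupyter Notebook.

Test the model from the command line interface (CLI) once the cloud-init script has completed.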

curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
    "model": "/home/opc/models/'"$MODEL"'",
    "messages": [{"role":"user", "content":"Write a small poem."}],
    "max_tokens": 64
}'

Test the model from Jupyter Notebook (make sure port 8888 is open).

import requests
import json
import os

# Retrieve the MODEL environment variable
model = os.environ.get('MODEL')

url = 'http://0.0.0.0:8000/v1/chat/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
}

data = {
    "model": f"/home/opc/models/{model}",
    "messages": [{"role": "user", "content": "Write a short conclusion."}],
    "max_tokens": 64
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    result = response.json()
    # Pretty print the response for better readability
    formatted_response = json.dumps(result, indent=4)
    print("Response:", formatted_response)
else:
    print("Request failed with status code:", response.status_code)
    print("Response:", response.text)
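
The pretty-printed JSON is verbose. Assuming the response follows the OpenAI chat-completions schema (which the /v1/chat/completions path suggests), a small helper can pull out the fields that matter most. The helper name summarize_completion is hypothetical.

```python
def summarize_completion(result):
    """Return the generated text and token counts from a chat-completions response."""
    choice = result["choices"][0]
    usage = result.get("usage", {})
    return {
        "text": choice["message"]["content"].strip(),
        # "length" means generation stopped because max_tokens was reached.
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }
```

With "max_tokens": 64 as in the request above, a finish_reason of "length" indicates the answer was cut off, and raising max_tokens is the usual remedy.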

Integrate Gradio with a chatbot to query the model.

import requests
import gradio as gr
import os

def interact_with_model(prompt):
    model = os.getenv("MODEL")  # Retrieve the MODEL environment variable within the function
    url = 'http://0.0.0.0:8000/v1/chat/completions'
    headers = {
        'accept': 'application/json',
        'Content-Type': 'application/json',
    }

    data = {
        "model": f"/home/opc/models/{model}",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        completion_text = result["choices"][0]["message"]["content"].strip()  # Extract the generated text
        return completion_text
    else:
        return f"Error: request failed with status code {response.status_code}"  # Return a string so the Textbox output can display it

# Example Gradio interface
iface = gr.Interface(
    fn=interact_with_model,
    inputs=gr.Textbox(lines=2, placeholder="Write a prompt..."),
    outputs=gr.Textbox(type="text", placeholder="Response..."),
    title="Mistral 7B Chat Interface",
    description="Interact with the Mistral 7B model deployed locally via Gradio.",
    live=True
)

# Launch the Gradio interface
iface.launch(share=True)

Task 6: Deploy the Model using Docker (if needed)

Alternatively, deploy the model using Docker and an external source.

docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$ACCESS_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    --restart always \
    vllm/vllm-openai:latest \
    --model mistralai/$MODEL \
    --max-model-len 16384

You can query the model in the following ways:

Query the model launched with Docker from an external source using the CLI.

curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "mistralai/'"$MODEL"'",
"messages": [{"role": "user", "content": "Write a small poem."}],
"max_tokens": 64
}'

Query the model launched with Docker from an external source using Jupyter Notebook.

import requests
import json
import os

# Retrieve the MODEL environment variable
model = os.environ.get('MODEL')

url = 'http://0.0.0.0:8000/v1/chat/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
}

data = {
    "model": f"mistralai/{model}",
    "messages": [{"role": "user", "content": "Write a short conclusion."}],
    "max_tokens": 64
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    result = response.json()
    # Pretty print the response for better readability
    formatted_response = json.dumps(result, indent=4)
    print("Response:", formatted_response)
else:
    print("Request failed with status code:", response.status_code)
    print("Response:", response.text)

Query the model launched with Docker from an external source using Jupyter Notebook and a Gradio chatbot.

import requests
import gradio as gr
import os

# Function to interact with the model via API
def interact_with_model(prompt):
    url = 'http://0.0.0.0:8000/v1/chat/completions'
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json",
    }

    # Retrieve the MODEL environment variable
    model = os.environ.get('MODEL')

    data = {
        "model": f"mistralai/{model}",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        completion_text = result["choices"][0]["message"]["content"].strip()  # Extract the generated text
        return completion_text
    else:
        return f"Error: request failed with status code {response.status_code}"  # Return a string so the Textbox output can display it

# Example Gradio interface
iface = gr.Interface(
    fn=interact_with_model,
    inputs=gr.Textbox(lines=2, placeholder="Write a prompt..."),
    outputs=gr.Textbox(type="text", placeholder="Response..."),
    title="Model Interface",  # Set a title for your Gradio interface
    description="Interact with the model deployed via Gradio.",  # Set a description
    live=True
)

# Launch the Gradio interface
iface.launch(share=True)

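Run the model with Docker using the local files already downloaded (starts quicker). Note that --tensor-parallel-size 2 splits the model across two GPUs; on the single-GPU A10 shape listed in the prerequisites, set it to 1 or omit the flag.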

docker run --gpus all \
-v /home/opc/models/$MODEL/:/mnt/model/ \
--env "HUGGING_FACE_HUB_TOKEN=$ACCESS_TOKEN" \
-p 8000:8000 \
--env "TRANSFORMERS_OFFLINE=1" \
--env "HF_DATASETS_OFFLINE=1" \
--ipc=host vllm/vllm-openai:latest \
--model="/mnt/model/" \
--max-model-len 16384 \
--tensor-parallel-size 2
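
The value passed to --tensor-parallel-size cannot exceed the number of GPUs visible on the instance. The sketch below counts the GPUs listed by nvidia-smi -L and caps the requested value accordingly. The function names are hypothetical, and the sketch does not check the additional constraint that the value must evenly divide the model's attention heads.

```python
import subprocess


def count_gpus(listing):
    """Count GPUs in the text printed by `nvidia-smi -L` (one 'GPU n: ...' line each)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def tensor_parallel_flag(requested, listing):
    """Return the CLI flag, capped at the number of GPUs that are visible."""
    available = count_gpus(listing)
    if available == 0:
        raise RuntimeError("no GPU found in nvidia-smi output")
    return f"--tensor-parallel-size {min(requested, available)}"


def current_listing():
    """Run `nvidia-smi -L`; requires the NVIDIA driver to be installed."""
    return subprocess.run(
        ["nvidia-smi", "-L"], check=True, capture_output=True, text=True
    ).stdout
```

On the instance, tensor_parallel_flag(2, current_listing()) yields the flag to append to the docker run command; on a shape with a single A10 it resolves to a size of 1.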

Query the model launched with Docker from the local files using the CLI.

curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
    "model": "/mnt/model/",
    "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9
}'

Query the model launched with Docker from the local files using Jupyter Notebook.

import requests
import json
import os

url = "http://0.0.0.0:8000/v1/chat/completions"
headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
}

# Path where the local model directory is mounted inside the container (see the docker run command)
model = "/mnt/model/"  # Adjust this based on your specific model path or name

data = {
    "model": model,
    "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    result = response.json()
    # Extract the generated text from the response
    completion_text = result["choices"][0]["message"]["content"].strip()
    print("Generated Text:", completion_text)
else:
    print("Request failed with status code:", response.status_code)
    print("Response:", response.text)

Query the model launched with Docker from the local files using Jupyter Notebook and a Gradio chatbot.

import requests
import gradio as gr
import os

# Function to interact with the model via API
def interact_with_model(prompt):
    url = 'http://0.0.0.0:8000/v1/chat/completions'  # Update the URL to match the correct endpoint
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json",
    }

    # Path where the local model directory is mounted inside the container (see the docker run command)
    model = "/mnt/model/"  # Adjust this based on your specific model path or name

    data = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        completion_text = result["choices"][0]["message"]["content"].strip()
        return completion_text
    else:
        return f"Error: request failed with status code {response.status_code}"  # Return a string so the Textbox output can display it

# Example Gradio interface
iface = gr.Interface(
    fn=interact_with_model,
    inputs=gr.Textbox(lines=2, placeholder="Write a humorous limerick about the wonders of GPU computing."),
    outputs=gr.Textbox(type="text", placeholder="Response..."),
    title="Model Interface",  # Set your desired title here
    description="Interact with the model deployed locally via Gradio.",
    live=True
)

# Launch the Gradio interface
iface.launch(share=True)

Note: Use the following firewall commands to open port 8888 for Jupyter Notebook.
sudo firewall-cmd --zone=public --permanent --add-port 8888/tcp
sudo firewall-cmd --reload
sudo firewall-cmd --list-all
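
To confirm that the port is actually reachable after the firewall change, a plain TCP connection test is enough. The sketch below uses only the standard library; the function name port_is_open is hypothetical, and the address you pass depends on where you run the check from.

```python
import socket


def port_is_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_is_open("127.0.0.1", 8888) run on the instance shows whether Jupyter is listening. If the same check fails from your workstation against the instance's public IP address, the subnet's security list or network security group in OCI may also need an ingress rule for the port.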

Acknowledgments

Author - Bogdan Bazarca (Senior Cloud Engineer)
Contributors - Oracle NACI-AI-CN-DEV team

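More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.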

Title and Copyright Information

Run Mistral LLM Model on OCI Compute A10 Instance with Oracle Resource Manager using One Click Deployment

G11820-01

July 2024

Oracle and/or its affiliates.