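Note:

This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.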

Run Mistral LLM Model on OCI Compute A10 Instance with Oracle Resource Manager using One Click Deployment

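Introduction

Oracle Cloud Infrastructure (OCI) Compute lets you create different types of shapes to test Graphics Processing Units (GPUs) with locally deployed Artificial Intelligence (AI) models. In this tutorial, we use an A10 shape with pre-existing VCN and subnet resources that you select from Oracle Resource Manager.

The Terraform code also configures the instance to run local vLLM Mistral model(s) for natural language processing tasks.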

Objectives

Create an A10 shape on OCI Compute, download the Mistral AI LLM model, and query the local vLLM model.

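Prerequisites

Ensure you have an OCI Virtual Cloud Network (VCN) and a subnet where the virtual machine (VM) will be deployed.
Understand the network components and their relationships. For more information, see Networking Overview.
Understand networking in the cloud. For more information, watch the following video: Video for Networking in the Cloud EP.01: Virtual Cloud Networks.
Requirements:
- Instance type: A10 shape with one Nvidia GPU.
- Operating system: Oracle Linux.
- Image selection: The deployment script selects the latest Oracle Linux image with GPU support.
- Tags: Adds the free-form tag GPU_TAG = "A10-1".
- Boot volume size: 250 GB.
- Initialization: Uses cloud-init to download and configure the vLLM Mistral model(s).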

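Task 1: Download the Terraform Code for One Click Deployment

Download the ORM Terraform code, orm_stack_a10_gpu-main.zip. It deploys the Mistral vLLM model(s) locally on an A10 instance shape and lets you select an existing VCN and subnet for the test.

After downloading the ORM Terraform code locally, follow the steps in Creating a Stack from a Folder to upload the stack and run the Terraform apply job.

Note: Ensure you have created an OCI Virtual Cloud Network (VCN) and a subnet where the VM will be deployed.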
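
Before uploading the archive, it can help to confirm that the download is intact and contains what the later tasks refer to. The following is a minimal sketch, not part of the original stack: it lists the Terraform files in the zip and checks for the cloudinit.sh file mentioned in Task 3. The function name `summarize_stack_zip` is hypothetical, and the internal layout of the archive is an assumption.

```python
import zipfile
from pathlib import PurePosixPath


def summarize_stack_zip(zip_path):
    """Return the Terraform files in the archive and whether cloudinit.sh is present."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries; keep only file members.
        names = [name for name in archive.namelist() if not name.endswith("/")]
    terraform_files = sorted(name for name in names if name.endswith(".tf"))
    # cloudinit.sh may sit in a subfolder, so compare on the base name only.
    has_cloud_init = any(PurePosixPath(name).name == "cloudinit.sh" for name in names)
    return {"terraform_files": terraform_files, "has_cloud_init": has_cloud_init}
```

For example, `summarize_stack_zip("orm_stack_a10_gpu-main.zip")` should report at least one .tf file and `has_cloud_init` set to True if the archive matches the description in this tutorial.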

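Task 2: Create a VCN on OCI (Optional if not Created Already)

To create a VCN in Oracle Cloud Infrastructure, see: Video for Explore how to Create a Virtual Cloud Network on OCI.

or

To create a VCN, follow these steps:

Log in to the OCI Console and enter your Cloud Tenant Name, User Name, and Password.
Click the hamburger menu (≡) in the upper left corner.
Go to Networking, Virtual Cloud Networks, and select the appropriate compartment from the List Scope section.
Select VCN with Internet Connectivity, and click Start VCN Wizard.
On the Create a VCN with Internet Connectivity page, enter the following information and click Next.
- VCN NAME: Enter OCI_HOL_VCN.
- COMPARTMENT: Select the appropriate compartment.
- VCN CIDR BLOCK: Enter 10.0.0.0/16.
- PUBLIC SUBNET CIDR BLOCK: Enter 10.0.2.0/24.
- PRIVATE SUBNET CIDR BLOCK: Enter 10.0.1.0/24.
- DNS RESOLUTION: Select USE DNS HOSTNAMES IN THIS VCN.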
On the Review page, review your settings and click Create.

Creating the VCN takes a moment; a progress screen keeps you informed of the workflow.

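Once the VCN is created, click View Virtual Cloud Network.

In real-world situations, you will create multiple VCNs based on their access needs (which ports to open) and on who can access them.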
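
The CIDR values entered in the wizard must be consistent: each subnet has to fall inside the VCN block, and the subnets must not overlap. As a small illustration (the helper name `check_subnets` is hypothetical), Python's standard `ipaddress` module can verify the example values used above before you type them into the console.

```python
import ipaddress

# Example values from the VCN wizard step in this tutorial.
VCN_CIDR = "10.0.0.0/16"
PUBLIC_SUBNET_CIDR = "10.0.2.0/24"
PRIVATE_SUBNET_CIDR = "10.0.1.0/24"


def check_subnets(vcn_cidr, subnet_cidrs):
    """Return a list of problems; an empty list means the layout is consistent."""
    vcn = ipaddress.ip_network(vcn_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    # Every subnet must be contained in the VCN block.
    for subnet in subnets:
        if not subnet.subnet_of(vcn):
            problems.append(f"{subnet} is outside {vcn}")
    # No two subnets may share addresses.
    for index, first in enumerate(subnets):
        for second in subnets[index + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems


print(check_subnets(VCN_CIDR, [PUBLIC_SUBNET_CIDR, PRIVATE_SUBNET_CIDR]))  # prints []
```

If you substitute your own ranges, any message in the returned list points at the value to correct.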

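Task 3: See cloud-init Configuration Details

The cloud-init script installs all the necessary dependencies, starts Docker, and downloads and starts the vLLM Mistral model(s). You can find the following code in the cloudinit.sh file downloaded in Task 1.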

dnf install -y dnf-utils zip unzip
dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
dnf remove -y runc
dnf install -y docker-ce --nobest
systemctl enable docker.service
dnf install -y nvidia-container-toolkit
systemctl start docker.service
...

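Cloud-init downloads all the files needed to run the Mistral model, using an API token predefined in Hugging Face.

The API token you supply in the ORM GUI, together with the Mistral model you select there, provides the authentication needed to download the model files locally. For more information, see User access tokens.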
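
To make the token's role concrete, here is a hedged sketch of what an authenticated download could look like in Python. It is not taken from cloudinit.sh, whose download step is elided above. It assumes the `huggingface_hub` package is installed, that the model lives under the `mistralai` organization (as in the Docker command in Task 6), and that files go to /home/opc/models/<MODEL> (the path used in Task 4). The function names are hypothetical.

```python
def build_download_kwargs(model_name, token):
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    if not token:
        raise ValueError("a Hugging Face access token is required")
    return {
        "repo_id": f"mistralai/{model_name}",            # assumed repository naming
        "local_dir": f"/home/opc/models/{model_name}",   # path used later in this tutorial
        "token": token,                                  # authenticates the download
    }


def download_model(model_name, token):
    """Download all files of the model repository and return the local path."""
    # Imported lazily so build_download_kwargs works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(model_name, token))
```

A call such as `download_model(os.environ["MODEL"], os.environ["ACCESS_TOKEN"])` would mirror the environment variables used elsewhere in this tutorial; without a valid token, the request is expected to be rejected for repositories that require authentication.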

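Task 4: Monitor the System

If needed, track cloud-init completion and GPU resource usage with the following commands.

Monitor cloud-init completion: tail -f /var/log/cloud-init-output.log.
Monitor GPU utilization: nvidia-smi dmon -s mu -c 100.

Deploy and interact with the vLLM Mistral model using Python. This command is already included in the cloud-init script; change the parameters only if needed: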
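
If you prefer to record GPU usage from a script or notebook rather than watch the terminal, the sketch below may serve as a starting point. It uses nvidia-smi's CSV query mode instead of `dmon`, because the CSV output is simpler to parse. The function names and the sample values in the usage note are hypothetical, and the parser assumes every field is numeric.

```python
import subprocess

# One CSV line per GPU: utilization (%), memory used (MiB), memory total (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse the CSV text produced by QUERY into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({
            "util_pct": util,
            "mem_used_mib": used,
            "mem_total_mib": total,
            "mem_pct": round(100 * used / total, 1),
        })
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed statistics."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(output)
```

For instance, `parse_gpu_stats("35, 12000, 24000")` yields a single entry with `mem_pct` equal to 50.0; on the instance itself, call `read_gpu_stats()` instead.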

python -O -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model "/home/opc/models/${MODEL}" \
    --tokenizer hf-internal-testing/llama-tokenizer \
    --max-model-len 16384 \
    --enforce-eager \
    --gpu-memory-utilization 0.8 \
    --max-num-seqs 2 \
    >> "${MODEL}.log" 2>&1 &
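
Because the server is started in the background, it may take a while before it accepts requests while the model loads. The following sketch polls the OpenAI-compatible `/v1/models` endpoint until it answers. The function names, the retry count, and the delay are assumptions chosen for illustration; the `fetch` and `sleep` parameters exist only to make the loop easy to test.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_model_ids(base_url):
    """Return the model ids listed by GET {base_url}/v1/models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload["data"]]


def wait_until_ready(base_url, attempts=60, delay=10.0,
                     fetch=fetch_model_ids, sleep=time.sleep):
    """Poll until the server answers; return the model ids or raise TimeoutError."""
    for _ in range(attempts):
        try:
            return fetch(base_url)
        except (urllib.error.URLError, ConnectionError):
            # Server not listening yet (for example, connection refused): wait and retry.
            sleep(delay)
    raise TimeoutError(f"{base_url} did not respond after {attempts} attempts")
```

Calling `wait_until_ready("http://localhost:8000")` on the instance returns the ids the server reports, which are the values to use in the `model` field of later requests.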

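Task 5: Test the Model Integration

Interact with the model in the following ways, from the command line or from a Jupyter Notebook.

After the cloud-init script has completed, test the model from the Command Line Interface (CLI).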

curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
    "model": "/home/opc/models/'"$MODEL"'",
    "messages": [{"role":"user", "content":"Write a small poem."}],
    "max_tokens": 64
}'

Test the model from Jupyter Notebook (make sure port 8888 is open).

import requests
import json
import os

# Retrieve the MODEL environment variable
model = os.environ.get('MODEL')

url = 'http://0.0.0.0:8000/v1/chat/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
}

data = {
    "model": f"/home/opc/models/{model}",
    "messages": [{"role": "user", "content": "Write a short conclusion."}],
    "max_tokens": 64
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    result = response.json()
    # Pretty print the response for better readability
    formatted_response = json.dumps(result, indent=4)
    print("Response:", formatted_response)
else:
    print("Request failed with status code:", response.status_code)
    print("Response:", response.text)
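
The pretty-printed JSON above contains more than the generated text. As a rough guide to reading it, the sketch below pulls out the reply, checks whether generation stopped because the `max_tokens` limit was reached, and reports the number of generated tokens. The `sample` dictionary is a hypothetical response shaped like an OpenAI-style chat completion, not real output from the model, and `summarize_completion` is a hypothetical helper.

```python
def summarize_completion(result):
    """Extract the reply text, a truncation flag, and the completion token count."""
    choice = result["choices"][0]
    usage = result.get("usage", {})
    return {
        "text": choice["message"]["content"].strip(),
        # finish_reason "length" means the reply hit the max_tokens limit.
        "truncated": choice.get("finish_reason") == "length",
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hypothetical response used only to illustrate the structure.
sample = {
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": " In short, GPUs help. "},
        "finish_reason": "length",
    }],
    "usage": {"prompt_tokens": 12, "completion_tokens": 64, "total_tokens": 76},
}

print(summarize_completion(sample))
```

If `truncated` is True for your own requests, raising `max_tokens` above 64 is the first thing to try.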

Integrate Gradio with a chatbot to query the model.

import requests
import gradio as gr
import os

def interact_with_model(prompt):
    model = os.getenv("MODEL")  # Retrieve the MODEL environment variable within the function
    url = 'http://0.0.0.0:8000/v1/chat/completions'
    headers = {
        'accept': 'application/json',
        'Content-Type': 'application/json',
    }

    data = {
        "model": f"/home/opc/models/{model}",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        completion_text = result["choices"][0]["message"]["content"].strip()  # Extract the generated text
        return completion_text
    else:
        # Return a plain string so the output Textbox can display it
        return f"Request failed with status code {response.status_code}"

# Example Gradio interface
iface = gr.Interface(
    fn=interact_with_model,
    inputs=gr.Textbox(lines=2, placeholder="Write a prompt..."),
    outputs=gr.Textbox(type="text", placeholder="Response..."),
    title="Mistral 7B Chat Interface",
    description="Interact with the Mistral 7B model deployed locally via Gradio.",
    live=True
)

# Launch the Gradio interface
iface.launch(share=True)
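
The interface above sends each prompt on its own, so the model has no memory of earlier turns. If you want conversational context, one possible approach is to rebuild the `messages` list from previous exchanges before each request. The helper below is a sketch under that assumption: `build_messages`, its `history` format (a list of user/assistant pairs), and the `max_turns` cap are all hypothetical and not part of the original tutorial.

```python
def build_messages(history, prompt, max_turns=4):
    """Turn (user, assistant) pairs plus a new prompt into a chat `messages` list."""
    # Keep only the most recent turns so the request stays within the context window.
    recent = history[-max_turns:] if max_turns > 0 else []
    messages = []
    for user_text, assistant_text in recent:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The new prompt always comes last, with the user role.
    messages.append({"role": "user", "content": prompt})
    return messages
```

The returned list can replace the single-element `messages` value in the request body; roles alternate between user and assistant and always end with the user.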

Task 6: Deploy the Model using Docker (if needed)

Alternatively, deploy the model using Docker and an external source.

docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$ACCESS_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    --restart always \
    vllm/vllm-openai:latest \
    --model mistralai/$MODEL \
    --max-model-len 16384

You can query the model in the following ways:

Query the model launched with Docker from an external source using the CLI.

curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "mistralai/'"$MODEL"'",
"messages": [{"role": "user", "content": "Write a small poem."}],
"max_tokens": 64
}'

Query the model launched with Docker from an external source using Jupyter Notebook.

import requests
import json
import os

# Retrieve the MODEL environment variable
model = os.environ.get('MODEL')

url = 'http://0.0.0.0:8000/v1/chat/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
}

data = {
    "model": f"mistralai/{model}",
    "messages": [{"role": "user", "content": "Write a short conclusion."}],
    "max_tokens": 64
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    result = response.json()
    # Pretty print the response for better readability
    formatted_response = json.dumps(result, indent=4)
    print("Response:", formatted_response)
else:
    print("Request failed with status code:", response.status_code)
    print("Response:", response.text)

Query the model launched with Docker from an external source using Jupyter Notebook and a Gradio chatbot.

import requests
import gradio as gr
import os

# Function to interact with the model via API
def interact_with_model(prompt):
    url = 'http://0.0.0.0:8000/v1/chat/completions'
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json",
    }

    # Retrieve the MODEL environment variable
    model = os.environ.get('MODEL')

    data = {
        "model": f"mistralai/{model}",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        completion_text = result["choices"][0]["message"]["content"].strip()  # Extract the generated text
        return completion_text
    else:
        # Return a plain string so the output Textbox can display it
        return f"Request failed with status code {response.status_code}"

# Example Gradio interface
iface = gr.Interface(
    fn=interact_with_model,
    inputs=gr.Textbox(lines=2, placeholder="Write a prompt..."),
    outputs=gr.Textbox(type="text", placeholder="Response..."),
    title="Model Interface",  # Set a title for your Gradio interface
    description="Interact with the model deployed via Gradio.",  # Set a description
    live=True
)

# Launch the Gradio interface
iface.launch(share=True)

Run the model with Docker using the local files already downloaded (starts faster).

docker run --gpus all \
-v /home/opc/models/$MODEL/:/mnt/model/ \
--env "HUGGING_FACE_HUB_TOKEN=$ACCESS_TOKEN" \
-p 8000:8000 \
--env "TRANSFORMERS_OFFLINE=1" \
--env "HF_DATASETS_OFFLINE=1" \
--ipc=host vllm/vllm-openai:latest \
--model="/mnt/model/" \
--max-model-len 16384 \
--tensor-parallel-size 1
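
Since this variant runs offline, the container can only start if the mounted directory already holds the model files. A quick pre-flight check, sketched below, can catch an empty or wrong path before starting Docker. The helper name `missing_model_files` is hypothetical, and treating `config.json` as the minimum required file is an assumption based on the usual Hugging Face checkpoint layout; extend the list to match your model.

```python
from pathlib import Path


def missing_model_files(model_dir, required=("config.json",)):
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

For example, `missing_model_files(f"/home/opc/models/{model}")` returning an empty list suggests the directory is ready to be mounted at /mnt/model/.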

Query the model launched with Docker from local files using the CLI.

curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
    "model": "/mnt/model/",
    "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9
}'

Query the model launched with Docker from local files using Jupyter Notebook.

import requests
import json
import os

url = "http://0.0.0.0:8000/v1/chat/completions"
headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
}

# The container serves the model under the path passed to --model ("/mnt/model/")
model = "/mnt/model/"  # Adjust this if you mounted the model at a different path

data = {
    "model": model,
    "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    result = response.json()
    # Extract the generated text from the response
    completion_text = result["choices"][0]["message"]["content"].strip()
    print("Generated Text:", completion_text)
else:
    print("Request failed with status code:", response.status_code)
    print("Response:", response.text)
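
Across Tasks 5 and 6, the only part of the request that changes is the `model` field, which has to match the `--model` value the server was started with. The sketch below collects the three cases from this tutorial in one place. The mode names and the function `model_field` are hypothetical labels introduced for illustration.

```python
def model_field(mode, model_name=None):
    """Return the `model` value to send for each deployment variant in this tutorial."""
    if mode == "local_python":
        # Server started with --model "/home/opc/models/${MODEL}" (Task 4).
        return f"/home/opc/models/{model_name}"
    if mode == "docker_hub":
        # Container started with --model mistralai/$MODEL (Task 6, external source).
        return f"mistralai/{model_name}"
    if mode == "docker_local":
        # Container started with --model="/mnt/model/" (Task 6, local files).
        return "/mnt/model/"
    raise ValueError(f"unknown deployment mode: {mode}")
```

A mismatch between this value and the served model name is a common cause of a failed request, so it is worth checking first when a query that worked in one variant fails in another.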