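Note:

This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.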

Run Elyza LLM Model on OCI Compute A10.2 Instance with Oracle Resource Manager using One Click Deployment

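Introduction

Oracle Cloud Infrastructure (OCI) Compute lets you create different types of shapes to test graphics processing units (GPUs) with locally deployed artificial intelligence (AI) models. This tutorial uses the A10.2 shape together with an existing VCN and subnet, which you select in Oracle Resource Manager.

The Terraform code also configures the instance to run a local Virtual Large Language Model (vLLM) Elyza model for natural language processing tasks.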

Objectives

Create an A10.2 shape on OCI Compute, download the Elyza LLM model, and query the local vLLM model.

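Prerequisites

Ensure you have an OCI Virtual Cloud Network (VCN) and a subnet where the virtual machine (VM) will be deployed.
Understanding of the network components and their relationships. For more information, see Networking Overview.
Understanding of networking in the cloud. For more information, watch the following video: Video for Networking in the Cloud EP.01: Virtual Cloud Networks.
Requirements:
- Instance type: A10.2 shape with two Nvidia GPUs.
- Operating system: Oracle Linux.
- Image selection: The deployment script selects the latest Oracle Linux image with GPU support.
- Tags: Adds the free-form tag GPU_TAG = "A10-2".
- Boot volume size: 250GB.
- Initialization: Uses cloud-init to download and configure the vLLM Elyza model.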

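Task 1: Download the Terraform Code for One Click Deployment

Download the ORM Terraform code from orm_stack_a10_2_gpu_elyza_models.zip to run the Elyza vLLM model locally. The stack lets you select an existing VCN and subnet, so you can test the local deployment of the Elyza vLLM model on an A10.2 instance shape.

After you have downloaded the ORM Terraform code locally, follow the steps in Creating a Stack from a Folder to upload the stack and run the Terraform apply job.

Note: Ensure you have created an OCI Virtual Cloud Network (VCN) and a subnet where the VM will be deployed.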

Task 2: Create a VCN on OCI (Optional if Not Already Created)

To create a VCN in Oracle Cloud Infrastructure, see the video Explore how to create a Virtual Cloud Network on OCI.

or

To create a VCN, follow these steps:

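Log in to the OCI Console and enter your cloud tenant name, user name, and password.
Click the hamburger menu (≡) in the upper left corner.
Go to Networking, Virtual Cloud Networks and select the appropriate compartment in the List Scope section.
Select VCN with Internet Connectivity and click Start VCN Wizard.
On the Create a VCN with Internet Connectivity page, enter the following information and click Next.
- VCN NAME: Enter OCI_HOL_VCN.
- COMPARTMENT: Select the appropriate compartment.
- VCN CIDR BLOCK: Enter 10.0.0.0/16.
- PUBLIC SUBNET CIDR BLOCK: Enter 10.0.2.0/24.
- PRIVATE SUBNET CIDR BLOCK: Enter 10.0.1.0/24.
- DNS RESOLUTION: Select Use DNS hostnames in this VCN.
Description of the illustration setupVCN3.png
On the Review page, review your settings and click Create.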

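Description of the illustration setupVCN4.png

Creating the VCN takes a moment, and a progress screen keeps you informed of the workflow.

Description of the illustration workflow.png
Once the VCN is created, click View Virtual Cloud Network.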

In real-world situations, you will create multiple VCNs based on their access needs (which ports to open) and on who may access them.
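Before entering the CIDR blocks in the wizard, it can help to confirm that both subnets sit inside the VCN range and do not overlap each other. The following is a minimal sketch using only the Python standard library; the function name check_vcn_layout is hypothetical and is not part of the tutorial's Terraform stack.

```python
import ipaddress

# Values entered in the VCN wizard in this task.
VCN_CIDR = "10.0.0.0/16"
PUBLIC_SUBNET_CIDR = "10.0.2.0/24"
PRIVATE_SUBNET_CIDR = "10.0.1.0/24"


def check_vcn_layout(vcn_cidr, subnet_cidrs):
    """Return a list of problems; an empty list means the layout is consistent."""
    vcn = ipaddress.ip_network(vcn_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    # Every subnet must be contained in the VCN range.
    for subnet in subnets:
        if not subnet.subnet_of(vcn):
            problems.append(f"{subnet} is not inside {vcn}")
    # No two subnets may share addresses.
    for index, first in enumerate(subnets):
        for second in subnets[index + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems


problems = check_vcn_layout(VCN_CIDR, [PUBLIC_SUBNET_CIDR, PRIVATE_SUBNET_CIDR])
print(problems or "CIDR layout is consistent")
```

If you choose different ranges for your own environment, pass them to check_vcn_layout in the same way.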

Task 3: See cloud-init Configuration Details

The cloud-init script installs all the necessary dependencies, starts Docker, and downloads and starts the vLLM Elyza model. You can find the following code in the cloudinit.sh file downloaded in Task 1.

dnf install -y dnf-utils zip unzip
dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
dnf remove -y runc
dnf install -y docker-ce --nobest
systemctl enable docker.service
dnf install -y nvidia-container-toolkit
systemctl start docker.service
...

Cloud-init downloads all the files needed to run the Elyza model and does not need a predefined Hugging Face API token. You will need the API token in Task 6 to launch the Elyza model using Docker.

Task 4: Monitor the System

Track cloud-init completion and GPU resource usage with the following commands (if needed).

Monitor cloud-init completion: tail -f /var/log/cloud-init-output.log
Monitor GPU utilization: nvidia-smi dmon -s mu -c 100 --id 0,1
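Instead of watching the log by eye, a script can check whether cloud-init has finished. The sketch below assumes that cloud-init ends its output log with a line of the form "Cloud-init v. <version> finished at <timestamp>"; that format is an assumption about the cloud-init version on the image, and the helper names are hypothetical.

```python
import re

# Assumption: the last line cloud-init writes looks like
# "Cloud-init v. <version> finished at <timestamp>. Datasource ...".
FINISHED_PATTERN = re.compile(r"^Cloud-init v\. \S+ finished at ", re.MULTILINE)


def cloud_init_finished(log_text):
    """Return True if the log text contains the cloud-init completion line."""
    return FINISHED_PATTERN.search(log_text) is not None


def read_log(path="/var/log/cloud-init-output.log"):
    """Read the cloud-init output log (may require sudo on the instance)."""
    with open(path, "r", errors="replace") as handle:
        return handle.read()
```

On the instance, cloud_init_finished(read_log()) returns True once the completion line has been written.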

Deploy and interact with the vLLM Elyza model using Python. Change the parameters only if needed; the command is already included in the cloud-init script.

python -O -u -m vllm.entrypoints.api_server \
                --host 0.0.0.0 \
                --port 8000 \
                --model /home/opc/models/${MODEL} \
                --tokenizer hf-internal-testing/llama-tokenizer \
                --enforce-eager \
                --max-num-seqs 1 \
                --tensor-parallel-size 2 \
                >> /home/opc/${MODEL}.log 2>&1
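Loading the model onto both GPUs can take a while, so requests sent right after launch may be refused. As a minimal sketch, the hypothetical helper below polls the TCP port passed with --port until the server accepts connections; the timeout and interval defaults are arbitrary choices, not values from the tutorial.

```python
import socket
import time


def wait_for_port(host, port, timeout=600.0, interval=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the API server is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, wait_for_port("127.0.0.1", 8000) blocks until the server started above is reachable or ten minutes have passed. An open port only shows that the server is listening; it does not confirm that a generation request will succeed.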

Task 5: Test the Model Integration

Interact with the model in the following ways, using the commands or the Jupyter Notebook details.

Test the model from the command line interface (CLI) once cloud-init has completed.

curl -X POST "http://0.0.0.0:8000/generate" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a humorous limerick about the wonders of GPU computing.", "max_tokens": 64, "temperature": 0.7, "top_p": 0.9}'

Test the model from Jupyter Notebook (make sure port 8888 is open).

import requests
import json

url = "http://0.0.0.0:8000/generate"
headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
}

data = {
    "prompt": "Write a short conclusion.",
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    result = response.json()
    # Pretty print the response for better readability
    formatted_response = json.dumps(result, indent=4)
    print("Response:", formatted_response)
else:
    print("Request failed with status code:", response.status_code)
    print("Response:", response.text)
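The /generate endpoint returns its results in a "text" list. In the vLLM demo API server, each entry appears to contain the prompt followed by the completion; assuming that behaviour, the hypothetical helper below strips the echoed prompt so that only the generated text remains. If your vLLM version does not echo the prompt, the text is returned unchanged.

```python
def extract_completions(result, prompt):
    """Return the generated texts from a /generate response without the echoed prompt."""
    completions = []
    for text in result.get("text", []):
        # Assumption: the server prefixes each completion with the prompt.
        if text.startswith(prompt):
            text = text[len(prompt):]
        completions.append(text.strip())
    return completions
```

In the notebook cell above, extract_completions(result, data["prompt"]) could replace the json.dumps call when only the generated text is of interest.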

Integrate Gradio with a chatbot to query the model.

import requests
import gradio as gr
import os

# Function to interact with the model via API
def interact_with_model(prompt):
    url = 'http://0.0.0.0:8000/generate'
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json",
    }

    data = {
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        completion_text = result["text"][0].strip()  # Extract the generated text
        return completion_text
    else:
        return {"error": f"Request failed with status code {response.status_code}"}

# Retrieve the MODEL environment variable
model_name = os.getenv("MODEL")

# Example Gradio interface
iface = gr.Interface(
    fn=interact_with_model,
    inputs=gr.Textbox(lines=2, placeholder="Write a prompt..."),
    outputs=gr.Textbox(type="text", placeholder="Response..."),
    title=f"{model_name} Interface",  # Use model_name to dynamically set the title
    description=f"Interact with the {model_name} deployed locally via Gradio.",  # Use model_name to dynamically set the description
    live=True
)

# Launch the Gradio interface
iface.launch(share=True)

Task 6: Deploy the Model using Docker (if needed)

Alternatively, deploy the model using Docker for an encapsulated environment.

Model from an external source.

docker run --gpus all \
    --env "HUGGING_FACE_HUB_TOKEN=$TOKEN_ACCESS" \
    -p 8000:8000 \
    --ipc=host \
    --restart always \
    vllm/vllm-openai:latest \
    --tensor-parallel-size 2 \
    --model elyza/$MODEL 

Model running with Docker from the already downloaded local files (starts more quickly).

docker run --gpus all \
-v /home/opc/models/$MODEL/:/mnt/model/ \
--env "HUGGING_FACE_HUB_TOKEN=$TOKEN_ACCESS" \
-p 8000:8000 \
--env "TRANSFORMERS_OFFLINE=1" \
--env "HF_DATASETS_OFFLINE=1" \
--ipc=host vllm/vllm-openai:latest \
--model="/mnt/model/" \
--tensor-parallel-size 2

You can query the model in the following ways:

Query the model launched with Docker from the CLI (this needs further attention).

Model started with Docker from an external source.

curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
    "model": "elyza/'${MODEL}'",
    "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9
}'

Model started locally with Docker.

curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
    "model": "/mnt/model/",
    "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9
}'
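The "model" field in each request has to match the name the server registered at startup: elyza/$MODEL for the external source, /mnt/model/ for the local files. If a request is rejected because the model is not found, listing the served models can help. The sketch below assumes the OpenAI-compatible server's GET /v1/models endpoint and uses only the standard library; the function names are hypothetical.

```python
import json
import urllib.request


def served_model_ids(payload):
    """Extract the model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_served_model_ids(base_url="http://127.0.0.1:8000"):
    """Query GET /v1/models on the running server (the container must be up)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as response:
        return served_model_ids(json.load(response))
```

Calling fetch_served_model_ids() on the instance returns the names that are valid values for the "model" field.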

Query the model started with Docker from a Jupyter Notebook.

Model started from Docker Hub.

import requests
import json
import os

url = "http://0.0.0.0:8000/v1/chat/completions"
headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
}

# Assuming `MODEL` is an environment variable set appropriately
model = f"elyza/{os.getenv('MODEL')}"

data = {
    "model": model,
    "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    result = response.json()
    # Extract the generated text from the response
    completion_text = result["choices"][0]["message"]["content"].strip()
    print("Generated Text:", completion_text)
else:
    print("Request failed with status code:", response.status_code)
    print("Response:", response.text)

Container started locally with Docker.

import requests
import json
import os

url = "http://0.0.0.0:8000/v1/chat/completions"
headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
}

# The local container serves the model under the path it was mounted at
model = "/mnt/model/"  # Adjust this based on your specific model path or name

data = {
    "model": model,
    "messages": [{"role": "user", "content": "Write a humorous limerick about the wonders of GPU computing."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    result = response.json()
    # Extract the generated text from the response
    completion_text = result["choices"][0]["message"]["content"].strip()
    print("Generated Text:", completion_text)
else:
    print("Request failed with status code:", response.status_code)
    print("Response:", response.text)
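The two notebook cells above differ only in the value of the "model" field. As a sketch of how that choice could be kept in one place, the hypothetical helpers below pick the model name for each launch mode and build the request body. Unlike the cells above, they raise an error when MODEL is unset rather than sending "elyza/None".

```python
import os


def model_name_for(mode, model_env=None):
    """Return the "model" value for a launch mode.

    "hub":   container started with --model elyza/$MODEL
    "local": container started with --model="/mnt/model/"
    """
    if mode == "local":
        return "/mnt/model/"
    if mode == "hub":
        name = model_env if model_env is not None else os.getenv("MODEL")
        if not name:
            raise ValueError("MODEL is not set")
        return f"elyza/{name}"
    raise ValueError(f"unknown mode: {mode!r}")


def build_chat_payload(prompt, model, max_tokens=64, temperature=0.7, top_p=0.9):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
```

The result of build_chat_payload(prompt, model_name_for("local")) can be passed as json= to requests.post exactly as in the cells above.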

Query the model started with Docker using Gradio integrated with a chatbot.

Model started with Docker from an external source.

import requests
import gradio as gr
import os

# Function to interact with the model via API
def interact_with_model(prompt):
    url = 'http://0.0.0.0:8000/v1/chat/completions'  # Update the URL to match the correct endpoint
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json",
    }

    # Assuming `MODEL` is an environment variable set appropriately
    model = f"elyza/{os.getenv('MODEL')}"

    data = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],  # Use the user-provided prompt
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        completion_text = result["choices"][0]["message"]["content"].strip()  # Extract the generated text
        return completion_text
    else:
        return {"error": f"Request failed with status code {response.status_code}"}

# Retrieve the MODEL environment variable
model_name = os.getenv("MODEL")

# Example Gradio interface
iface = gr.Interface(
    fn=interact_with_model,
    inputs=gr.Textbox(lines=2, placeholder="Write a prompt..."),
    outputs=gr.Textbox(type="text", placeholder="Response..."),
    title=f"{model_name} Interface",  # Use model_name to dynamically set the title
    description=f"Interact with the {model_name} model deployed locally via Gradio.",  # Use model_name to dynamically set the description
    live=True
)

# Launch the Gradio interface
iface.launch(share=True)

Container started locally with Docker, queried using Gradio.

import requests
import gradio as gr
import os

# Function to interact with the model via API
def interact_with_model(prompt):
    url = 'http://0.0.0.0:8000/v1/chat/completions'  # Update the URL to match the correct endpoint
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json",
    }

    # The local container serves the model under the path it was mounted at
    model = "/mnt/model/"  # Adjust this based on your specific model path or name

    data = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        completion_text = result["choices"][0]["message"]["content"].strip()
        return completion_text
    else:
        return {"error": f"Request failed with status code {response.status_code}"}

# Example Gradio interface
iface = gr.Interface(
    fn=interact_with_model,
    inputs=gr.Textbox(lines=2, placeholder="Write a humorous limerick about the wonders of GPU computing."),
    outputs=gr.Textbox(type="text", placeholder="Response..."),
    title="Model Interface",  # Set your desired title here
    description="Interact with the model deployed locally via Gradio.",
    live=True
)

# Launch the Gradio interface
iface.launch(share=True)

Note: Use the following firewall commands to open port 8888 for Jupyter Notebook.
sudo firewall-cmd --zone=public --permanent --add-port 8888/tcp
sudo firewall-cmd --reload
sudo firewall-cmd --list-all

Acknowledgments

Author - Bogdan Bazarca (Senior Cloud Engineer)
Contributors - Oracle NACI-AI-CN-DEV team

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.

Title and Copyright Information

Run Elyza LLM Model on OCI Compute A10.2 Instance with Oracle Resource Manager using One Click Deployment

G11828-01

July 2024

Oracle and/or its affiliates.