Deploy the LLM

To deploy the LLM, you must first create a BM.GPU.MI300X.8 instance in OCI.

Then use the OCI Block Volumes service and OCI Object Storage to store data, objects, and unstructured model data. Follow the instructions to complete each task:

  1. Create an Instance
  2. Create a Block Volume
  3. Attach a Block Volume to an Instance
  4. Connect to a Block Volume
  5. Create an OCI Object Storage Bucket

Completing these tasks prepares you to deploy a model from OCI Object Storage to an OKE cluster running on OCI.
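
If you prefer to script these infrastructure tasks, the OCI Python SDK can perform the same steps. The following is a minimal, illustrative sketch only: the availability domain, OCIDs, image, subnet, bucket name, and volume size are placeholders you must replace, and a real setup would add waiters and error handling.

    # Minimal sketch using the OCI Python SDK (pip install oci).
    # All OCIDs and names below are placeholders -- substitute your own values.
    import oci

    config = oci.config.from_file()  # reads ~/.oci/config
    compute = oci.core.ComputeClient(config)
    blockstorage = oci.core.BlockstorageClient(config)
    object_storage = oci.object_storage.ObjectStorageClient(config)

    # 1. Launch the BM.GPU.MI300X.8 bare metal instance.
    instance = compute.launch_instance(
        oci.core.models.LaunchInstanceDetails(
            availability_domain="Uocm:US-ASHBURN-AD-1",
            compartment_id="ocid1.compartment.oc1..<placeholder>",
            shape="BM.GPU.MI300X.8",
            display_name="amd-mi300x-llm-node",
            image_id="ocid1.image.oc1..<placeholder>",
            create_vnic_details=oci.core.models.CreateVnicDetails(
                subnet_id="ocid1.subnet.oc1..<placeholder>"
            ),
        )
    ).data

    # 2. Create a block volume for model and working data.
    volume = blockstorage.create_volume(
        oci.core.models.CreateVolumeDetails(
            availability_domain=instance.availability_domain,
            compartment_id=instance.compartment_id,
            display_name="llm-model-volume",
            size_in_gbs=1024,
        )
    ).data

    # 3. Attach the block volume to the instance (paravirtualized attachment).
    # (Step 4, connecting to the volume, is done on the instance itself,
    # for example by formatting and mounting the attached device.)
    compute.attach_volume(
        oci.core.models.AttachParavirtualizedVolumeDetails(
            instance_id=instance.id,
            volume_id=volume.id,
        )
    )

    # 5. Create an Object Storage bucket to hold the model files.
    namespace = object_storage.get_namespace().data
    object_storage.create_bucket(
        namespace,
        oci.object_storage.models.CreateBucketDetails(
            name="llm-models",
            compartment_id=instance.compartment_id,
        ),
    )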

Create an OKE Cluster

Create an OKE cluster with the OCI CLI, or use the Console to configure and create an OKE cluster with a managed node pool.

To create an OKE cluster, use the following command:

oci ce cluster create --compartment-id ocid1.compartment.oc1..aaaaaaaay______t6q --kubernetes-version v1.24.1 --name amd-mi300x-ai-cluster --vcn-id ocid1.vcn.oc1.iad.aaaaaae___yja

To use the console option, follow these steps:

  1. Once the cluster is created, create a managed node pool using the following command (an OCI Python SDK equivalent is sketched after these steps):
    oci ce node-pool create --cluster-id <cluster-ocid> --compartment-id <compartment-ocid> --name <node-pool-name> --node-shape <shape>
  2. After you set up the cluster and the required access to it, install ROCm using the following instructions (example for Oracle Linux):
    sudo dnf install https://repo.radeon.com/amdgpu-install/6.4/el/9.5/amdgpu-install-6.4.60400-1.el9.noarch.rpm
    sudo dnf clean all
    wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
    sudo rpm -ivh epel-release-latest-9.noarch.rpm
    sudo dnf install dnf-plugin-config-manager
    sudo crb enable
    sudo dnf install python3-setuptools python3-wheel
    sudo usermod -a -G render,video $LOGNAME
  3. With the current user added to the render and video groups (the usermod command above), install ROCm:
    sudo dnf install rocm
  4. Install the AMD GPU driver using the following commands:
    sudo dnf install https://repo.radeon.com/amdgpu-install/6.4/el/9.5/amdgpu-install-6.4.60400-1.el9.noarch.rpm
    sudo dnf clean all
    sudo dnf install "kernel-uek-devel-$(uname -r)"
    sudo dnf install amdgpu-dkms
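
If you would rather script the cluster and node pool creation shown above, the OCI Python SDK offers the same operations as the CLI. The sketch below is illustrative only: all OCIDs, names, and the Kubernetes version are placeholders, and it omits the node source and placement configuration (and the wait for the cluster to become ACTIVE) that a real node pool request needs.

    # Illustrative sketch: create an OKE cluster and managed node pool with the
    # OCI Python SDK. OCIDs, names, and versions are placeholders.
    import oci

    config = oci.config.from_file()
    ce = oci.container_engine.ContainerEngineClient(config)

    # Equivalent of `oci ce cluster create`.
    ce.create_cluster(
        oci.container_engine.models.CreateClusterDetails(
            name="amd-mi300x-ai-cluster",
            compartment_id="ocid1.compartment.oc1..<placeholder>",
            vcn_id="ocid1.vcn.oc1.iad.<placeholder>",
            kubernetes_version="v1.24.1",
        )
    )

    # Equivalent of `oci ce node-pool create`, issued once the cluster is ACTIVE.
    # A production request also supplies node_source_details and
    # node_config_details (placement configuration and pool size).
    ce.create_node_pool(
        oci.container_engine.models.CreateNodePoolDetails(
            cluster_id="ocid1.cluster.oc1.iad.<placeholder>",
            compartment_id="ocid1.compartment.oc1..<placeholder>",
            name="mi300x-node-pool",
            node_shape="BM.GPU.MI300X.8",
            kubernetes_version="v1.24.1",
        )
    )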

Use vLLM Features in ROCm

Follow these steps to use vLLM features in ROCm:
  1. To use the vLLM features in ROCm, clone the vLLM repository and build the Docker image with the following commands:
    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .
  2. After you build the image, you can test it by running a Hugging Face model, replacing <path/to/model> with the location where you downloaded the model from the OCI Object Storage bucket, using the following command (a quick in-container GPU check is sketched after these steps):
    docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri -v <path/to/model>:/app/models vllm-rocm
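
Before serving a model, it can help to confirm that the container actually sees the MI300X GPUs. The short check below assumes the image ships the ROCm build of PyTorch (the upstream Dockerfile.rocm builds on a ROCm PyTorch base), in which case the torch.cuda API is backed by HIP; run it with python3 inside the container started above.

    # Quick GPU visibility check inside the vllm-rocm container.
    # On ROCm builds of PyTorch, the torch.cuda API is backed by HIP/ROCm.
    import torch

    print("ROCm/HIP version:", torch.version.hip)      # None on non-ROCm builds
    print("GPUs visible:", torch.cuda.device_count())  # expect 8 on BM.GPU.MI300X.8
    for i in range(torch.cuda.device_count()):
        print(f"  [{i}] {torch.cuda.get_device_name(i)}")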

Serve LLM

Import the LLM and SamplingParams classes for offline inferencing with a batch of prompts. You can then load and call the model.

The following is an example of a Meta Llama 3 70B model that needs multiple GPUs to run with tensor parallelism. vLLM uses Megatron-LM's tensor parallelism algorithm and Python's multiprocessing to manage the distributed runtime on single nodes.
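
The snippet below is a minimal offline-inference sketch of that flow. The prompts and sampling values are illustrative; the model name and tensor-parallel degree match the serving example in the steps that follow.

    # Offline batch inference with vLLM (run inside the ROCm container).
    from vllm import LLM, SamplingParams

    prompts = [
        "Write a haiku about artificial intelligence",
        "Explain tensor parallelism in one sentence",
    ]

    # Sampling settings mirror the query example at the end of this section.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

    # tensor_parallel_size=4 shards the 70B model across four GPUs; vLLM spawns
    # one worker process per GPU when the multiprocessing backend is used.
    llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", tensor_parallel_size=4)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.prompt, "->", output.outputs[0].text)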

  1. To serve the model across multiple GPUs with tensor parallelism, use the following command:
    vllm serve meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 4 --distributed-executor-backend=mp
  2. Alternatively, to serve the model with the default settings, use the following command:
    vllm serve meta-llama/Meta-Llama-3-70B-Instruct
  3. To query the model, use the following curl command:
    curl http://localhost:8000/v1/completions \
                -H "Content-Type: application/json" \
                -d '{
                    "model": "Qwen/Qwen2-7B-Instruct",
                    "prompt": "Write a haiku about artificial intelligence",
                    "max_tokens": 128,
                    "top_p": 0.95,
                    "top_k": 20,
                    "temperature": 0.8
                    }'
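
Because the server exposes an OpenAI-compatible API, you can issue the same request from Python with the openai client. The snippet below mirrors the curl call above; top_k is a vLLM-specific sampling parameter, so it is passed through extra_body.

    # Query the vLLM OpenAI-compatible server started by `vllm serve`.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        prompt="Write a haiku about artificial intelligence",
        max_tokens=128,
        temperature=0.8,
        top_p=0.95,
        extra_body={"top_k": 20},  # vLLM extension, not part of the OpenAI schema
    )
    print(completion.choices[0].text)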