Deploy the LLM

To deploy the LLM, you must first create a BM.GPU.MI300X.8 instance in OCI.

Then use the OCI Block Volumes service and OCI Object Storage to store data, objects, and unstructured model data. Follow the instructions to complete each task:

  1. Create an Instance
  2. Create a Block Volume
  3. Attach a Block Volume to an Instance
  4. Connect to a Block Volume
  5. Create an OCI Object Storage Bucket

Completing these tasks prepares you to deploy a model from OCI Object Storage to an OKE cluster running on OCI.
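
If you prefer to script these infrastructure tasks, the OCI Python SDK can perform the same steps. The following is a minimal, illustrative sketch only: the availability domain, OCIDs, image, subnet, bucket name, and volume size are placeholders you must replace, and a real setup would add waiters and error handling.

    # Minimal sketch using the OCI Python SDK (pip install oci).
    # All OCIDs and names below are placeholders -- substitute your own values.
    import oci

    config = oci.config.from_file()  # reads ~/.oci/config
    compute = oci.core.ComputeClient(config)
    blockstorage = oci.core.BlockstorageClient(config)
    object_storage = oci.object_storage.ObjectStorageClient(config)

    # 1. Launch the BM.GPU.MI300X.8 bare metal instance.
    instance = compute.launch_instance(
        oci.core.models.LaunchInstanceDetails(
            availability_domain="Uocm:US-ASHBURN-AD-1",
            compartment_id="ocid1.compartment.oc1..<placeholder>",
            shape="BM.GPU.MI300X.8",
            display_name="amd-mi300x-llm-node",
            image_id="ocid1.image.oc1..<placeholder>",
            create_vnic_details=oci.core.models.CreateVnicDetails(
                subnet_id="ocid1.subnet.oc1..<placeholder>"
            ),
        )
    ).data

    # 2. Create a block volume for model and working data.
    volume = blockstorage.create_volume(
        oci.core.models.CreateVolumeDetails(
            availability_domain=instance.availability_domain,
            compartment_id=instance.compartment_id,
            display_name="llm-model-volume",
            size_in_gbs=1024,
        )
    ).data

    # 3. Attach the block volume to the instance (paravirtualized attachment).
    # (Step 4, connecting to the volume, is done on the instance itself,
    # for example by formatting and mounting the attached device.)
    compute.attach_volume(
        oci.core.models.AttachParavirtualizedVolumeDetails(
            instance_id=instance.id,
            volume_id=volume.id,
        )
    )

    # 5. Create an Object Storage bucket to hold the model files.
    namespace = object_storage.get_namespace().data
    object_storage.create_bucket(
        namespace,
        oci.object_storage.models.CreateBucketDetails(
            name="llm-models",
            compartment_id=instance.compartment_id,
        ),
    )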

Create an OKE Cluster

Create an OKE cluster with the OCI CLI, or use the Console to configure and create an OKE cluster with a managed node pool.

To create an OKE cluster, use the following command:

oci ce cluster create --compartment-id ocid1.compartment.oc1..aaaaaaaay______t6q --kubernetes-version v1.24.1 --name amd-mi300x-ai-cluster --vcn-id ocid1.vcn.oc1.iad.aaaaaae___yja

To use the console option, follow these steps:

  1. Once the cluster is created, create a managed node pool using the following command (an OCI Python SDK equivalent is sketched after these steps):
    oci ce node-pool create --cluster-id <cluster-ocid> --compartment-id <compartment-ocid> --name <node-pool-name> --node-shape <shape>
  2. After you set up the cluster and the required access to it, install ROCm using the following instructions (example for Oracle Linux):
    sudo dnf install https://repo.radeon.com/amdgpu-install/6.4/el/9.5/amdgpu-install-6.4.60400-1.el9.noarch.rpm
    sudo dnf clean all
    wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
    sudo rpm -ivh epel-release-latest-9.noarch.rpm
    sudo dnf install dnf-plugin-config-manager
    sudo crb enable
    sudo dnf install python3-setuptools python3-wheel
    sudo usermod -a -G render,video $LOGNAME
  3. With the current user added to the render and video groups (the usermod command above), install ROCm:
    sudo dnf install rocm
  4. Install the AMD GPU driver using the following commands:
    sudo dnf install https://repo.radeon.com/amdgpu-install/6.4/el/9.5/amdgpu-install-6.4.60400-1.el9.noarch.rpm
    sudo dnf clean all
    sudo dnf install "kernel-uek-devel-$(uname -r)"
    sudo dnf install amdgpu-dkms
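
If you would rather script the cluster and node pool creation shown above, the OCI Python SDK offers the same operations as the CLI. The sketch below is illustrative only: all OCIDs, names, and the Kubernetes version are placeholders, and it omits the node source and placement configuration (and the wait for the cluster to become ACTIVE) that a real node pool request needs.

    # Illustrative sketch: create an OKE cluster and managed node pool with the
    # OCI Python SDK. OCIDs, names, and versions are placeholders.
    import oci

    config = oci.config.from_file()
    ce = oci.container_engine.ContainerEngineClient(config)

    # Equivalent of `oci ce cluster create`.
    ce.create_cluster(
        oci.container_engine.models.CreateClusterDetails(
            name="amd-mi300x-ai-cluster",
            compartment_id="ocid1.compartment.oc1..<placeholder>",
            vcn_id="ocid1.vcn.oc1.iad.<placeholder>",
            kubernetes_version="v1.24.1",
        )
    )

    # Equivalent of `oci ce node-pool create`, issued once the cluster is ACTIVE.
    # A production request also supplies node_source_details and
    # node_config_details (placement configuration and pool size).
    ce.create_node_pool(
        oci.container_engine.models.CreateNodePoolDetails(
            cluster_id="ocid1.cluster.oc1.iad.<placeholder>",
            compartment_id="ocid1.compartment.oc1..<placeholder>",
            name="mi300x-node-pool",
            node_shape="BM.GPU.MI300X.8",
            kubernetes_version="v1.24.1",
        )
    )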

Use vLLM Features in ROCm

Follow these steps to use vLLM features in ROCm:
  1. To use the vLLM features in ROCm, clone the vLLM repository and build the Docker image with the following commands:
    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .
  2. After you build the image, you can test it by running a Hugging Face model, replacing <path/to/model> with the location where you downloaded the model from the OCI Object Storage bucket, using the following command (a quick in-container GPU check is sketched after these steps):
    docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri -v <path/to/model>:/app/models vllm-rocm
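
Before serving a model, it can help to confirm that the container actually sees the MI300X GPUs. The short check below assumes the image ships the ROCm build of PyTorch (the upstream Dockerfile.rocm builds on a ROCm PyTorch base), in which case the torch.cuda API is backed by HIP; run it with python3 inside the container started above.

    # Quick GPU visibility check inside the vllm-rocm container.
    # On ROCm builds of PyTorch, the torch.cuda API is backed by HIP/ROCm.
    import torch

    print("ROCm/HIP version:", torch.version.hip)      # None on non-ROCm builds
    print("GPUs visible:", torch.cuda.device_count())  # expect 8 on BM.GPU.MI300X.8
    for i in range(torch.cuda.device_count()):
        print(f"  [{i}] {torch.cuda.get_device_name(i)}")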

Serve LLM

Import the LLM and SamplingParams classes for offline inferencing with a batch of prompts. You can then load and call the model.

The following is an example of a Meta Llama 3 70B model that needs multiple GPUs to run with tensor parallelism. vLLM uses Megatron-LM's tensor parallelism algorithm and Python's multiprocessing to manage the distributed runtime on single nodes.
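
The snippet below is a minimal offline-inference sketch of that flow. The prompts and sampling values are illustrative; the model name and tensor-parallel degree match the serving example in the steps that follow.

    # Offline batch inference with vLLM (run inside the ROCm container).
    from vllm import LLM, SamplingParams

    prompts = [
        "Write a haiku about artificial intelligence",
        "Explain tensor parallelism in one sentence",
    ]

    # Sampling settings mirror the query example at the end of this section.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

    # tensor_parallel_size=4 shards the 70B model across four GPUs; vLLM spawns
    # one worker process per GPU when the multiprocessing backend is used.
    llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", tensor_parallel_size=4)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.prompt, "->", output.outputs[0].text)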

  1. To serve the model across multiple GPUs with tensor parallelism, use the following command:
    vllm serve meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 4 --distributed-executor-backend=mp
  2. Alternatively, to serve the model with the default settings, use the following command:
    vllm serve meta-llama/Meta-Llama-3-70B-Instruct
  3. To query the model, use the following curl command:
    curl http://localhost:8000/v1/completions \
                -H "Content-Type: application/json" \
                -d '{
                    "model": "Qwen/Qwen2-7B-Instruct",
                    "prompt": "Write a haiku about artificial intelligence",
                    "max_tokens": 128,
                    "top_p": 0.95,
                    "top_k": 20,
                    "temperature": 0.8
                    }'
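
Because the server exposes an OpenAI-compatible API, you can issue the same request from Python with the openai client. The snippet below mirrors the curl call above; top_k is a vLLM-specific sampling parameter, so it is passed through extra_body.

    # Query the vLLM OpenAI-compatible server started by `vllm serve`.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        prompt="Write a haiku about artificial intelligence",
        max_tokens=128,
        temperature=0.8,
        top_p=0.95,
        extra_body={"top_k": 20},  # vLLM extension, not part of the OpenAI schema
    )
    print(completion.choices[0].text)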