Deploy Llama2 on Oracle Cloud Infrastructure GPUs

Introduction

Llama 2 is a family of open large language models released by Meta, with pretrained and fine-tuned variants ranging from 7 billion to 70 billion parameters. The pretrained models are trained on roughly 2 trillion tokens of publicly available data, and the fine-tuned chat variants (Llama 2-Chat) are further optimized for dialogue using supervised fine-tuning and reinforcement learning from human feedback (RLHF). This makes Llama 2 suitable for a wide range of text generation tasks, including chat assistants, summarization, and question answering.

Because the model weights are downloadable after accepting Meta's community license, which permits commercial use, organizations can run Llama 2 on their own infrastructure, such as OCI GPU instances, and keep their data private. Overall, Llama 2 gives organizations a capable, self-hosted alternative to proprietary LLM APIs.

The architecture consists of several components working together to generate human-like responses. At the core of the model is a stack of transformer decoder blocks: each block applies self-attention to a sequence of token embeddings and passes the result through a Feedforward Neural Network (FFNN), with residual connections around each sub-layer. The model generates text autoregressively, predicting one token at a time from the tokens that precede it.

The FFNN is composed of fully connected layers that transform each position's representation to produce contextualized embeddings. The residual connections allow the model to learn more complex patterns in the data while keeping deep networks trainable, improving its overall performance.

In addition to these core components, the Llama LLM includes several other modules that support training and inference: a tokenizer that converts text inputs into numerical tokens, a vocabulary that stores the learned subwords, and a causal attention mask that prevents each position from attending to future tokens during training.

Text Generation WebUI is a Gradio web UI for large language models, with an API and command-line tools available as well. It supports multiple model backends, including Transformers, llama.cpp, ExLlama, ExLlamaV2, AutoGPTQ, GPTQ-for-LLaMa, CTransformers, and AutoAWQ. It also supports LoRA models and fine-tuning, including training a new LoRA using QLoRA, provides an extensions framework for loading your favorite extensions, and ships with an OpenAI-compatible API server.
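
As a quick illustration of the OpenAI-compatible API server, the following is a minimal sketch of a request you could send once the server is running. It assumes the server was started with the --api flag, which at the time of writing exposes the API on port 5000 with OpenAI-style paths; check the project documentation for your installed version.

    # Sketch: query text-generation-webui's OpenAI-compatible endpoint.
    # Port 5000 and /v1/chat/completions are assumptions based on the
    # project's defaults; verify them for your version.
    curl http://localhost:5000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello, Llama 2!"}], "max_tokens": 64}'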

Objectives

  • Provision an OCI GPU compute instance and deploy Llama 2 for inference using Text Generation WebUI.

Prerequisites

  • Access to an OCI tenancy with a service limit set for GPU.A10 shapes, permission to create compute and networking resources, and basic familiarity with the Linux command line.

Task 1: Provision a GPU compute instance on OCI

  1. Launch a compute instance on OCI in an existing VCN with a public subnet. For more information, see Launch Compute Instance.

  2. Choose one of these available GPU.A10 shapes:

    VM.GPU.A10.1
    VM.GPU.A10.2
    BM.GPU.A10.4
    
  3. When launching the compute instance, change the shape to one of the shapes above.

  4. If your tenancy does not have a service limit set for GPU.A10, these shapes will not appear in the shape list.

    • To check your tenancy limits in the OCI Console, set the region where you plan to provision the GPU.A10 compute instance, open the navigation menu, and click Governance & Administration.

    • Under Tenancy Management, select Limits, Quotas and Usage.

    • Set Service to Compute, select one of the availability domains in the Scope field, and enter GPU.A10 in the Resource field.

    • Select GPUs for A10-based VM and BM instances.

      [Image: GPU.A10 service limit in the Limits, Quotas and Usage page]

  5. Compute limits are per availability domain. Check whether the limit is set in any of the availability domains of the region. If the service limit is 0 for all availability domains, click the request a service limit increase link and submit a limit increase request for this resource. You can also query the limit from the OCI CLI, as sketched after the note below.

    Note: In order to access Limits, Quotas and Usage, you must be a member of the tenancy administrators group or your group must have a policy assigned to read LimitsAndUsageViewers.
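
    If you prefer the CLI, the following is a minimal sketch using the OCI CLI limits service. The limit name gpu-a10-count and the placeholder <tenancy-ocid> are assumptions for illustration; verify the exact limit name in the Console.

      # Sketch: list the GPU.A10 service limit values per availability domain.
      # <tenancy-ocid> is a placeholder; the limit name may differ.
      oci limits value list \
        --compartment-id <tenancy-ocid> \
        --service-name compute \
        --name gpu-a10-count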

  6. Currently, OCI GPU.A10 compute shapes support Oracle Linux, Ubuntu, and Rocky Linux. Windows is supported on VM shapes only.

    Note: Rocky Linux is not officially supported by NVIDIA.

  7. When provisioning the compute instance, use a standard OS image or a GPU-enabled image. If you choose a standard OS image, you must install the NVIDIA vGPU driver separately.

    [Image: selecting a GPU-enabled image]

  8. Expand the Boot volume section, increase the boot volume to at least 250 GB, and raise the VPU setting to Higher Performance for better read/write throughput during inference.

    [Image: boot volume configuration]

    Launch the instance with the above parameters. If you prefer scripted provisioning, a CLI sketch follows.
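
    The following is a minimal sketch of the same launch using the OCI CLI; every OCID is a placeholder you must supply, and the flags reflect the CLI at the time of writing.

      # Sketch: launch a VM.GPU.A10.1 instance (all OCIDs are placeholders).
      # 300 GB matches the boot volume size used later in this tutorial.
      oci compute instance launch \
        --availability-domain <AD-name> \
        --compartment-id <compartment-ocid> \
        --shape VM.GPU.A10.1 \
        --subnet-id <public-subnet-ocid> \
        --image-id <oracle-linux-gpu-image-ocid> \
        --boot-volume-size-in-gbs 300 \
        --assign-public-ip true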

Task 2: Install Prerequisites for Llama2

  1. As NVIDIA drivers are included in the Oracle Linux GPU build image, verify their presence and functionality by running the nvidia-smi command. This ensures that everything is properly set up and the GPU drivers are functioning as expected.

    [Image: nvidia-smi output]
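
    The check itself is a single command, run on the instance:

      # Should list the A10 GPU(s), driver version, and CUDA version.
      nvidia-smi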

  2. Grow the file system: the root file system on OCI instances comes with about 46.6 GB allocated by default. Since we increased the boot volume to 300 GB, grow the file system with the built-in oci-growfs command from OCI Utilities by following the steps below.

    • Check the current disk usage: before resizing, run the df command to see the current file system size. Until the file system is grown, the reported size still reflects the original allocation rather than the new 300 GB boot volume.

      df -h
      
    • Resize the file system: use the OCI utility command to expand the file system into the increased storage. The exact command varies by OS and file system; the following is for Oracle Linux 8.

      sudo /usr/libexec/oci-growfs
      

      Enter y when asked to confirm that you are extending the partition.

    • Verify the File System Expansion: After running the resize command, check the disk usage again to confirm that the file system has been successfully expanded.

      df -h
      

    The output should now reflect the increased file system size. By following these steps, you can expand the file system on your OCI instance to use the additional space provided by the larger boot volume.

    [Image: oci-growfs output]
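
    If df still shows the old size before you run oci-growfs, you can confirm that the enlarged block device is visible with lsblk; the device reports the new size even before the file system is grown.

      # The boot volume device should already show its enlarged size.
      lsblk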

  3. Install Python 3.10.6 on Oracle Linux 8 using the following commands.

    sudo dnf update -y
    sudo dnf install curl gcc openssl-devel bzip2-devel libffi-devel zlib-devel wget make -y
    wget https://www.python.org/ftp/python/3.10.6/Python-3.10.6.tar.xz
    tar -xf Python-3.10.6.tar.xz
    cd Python-3.10.6/
    ./configure --enable-optimizations
    make -j $(nproc)
    sudo make altinstall
    python3.10 -V
    
  4. Install Git to clone the repository.

    sudo dnf install git
    
  5. Install Miniconda using the following commands.

    mkdir -p ~/miniconda3
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
    bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
    rm -rf ~/miniconda3/miniconda.sh
    ~/miniconda3/bin/conda init bash
    
  6. Create a conda environment.

    conda create -n llama2 python=3.10.9 # llama2 being the conda environment name
    conda activate llama2
    
  7. Install PyTorch 2.0 using the following command.

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
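
    To sanity-check that PyTorch can see the GPU before going further (an optional check, not part of the original steps):

      # Print the torch version and, if CUDA is available, the GPU name.
      python -c "import torch; print(torch.__version__, torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no CUDA device')"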
    
  8. Clone text-generation-webui. After cloning, you should have a text-generation-webui directory in your current working directory.

    git clone https://github.com/oobabooga/text-generation-webui
    
  9. Change directory to text-generation-webui and install the requirements by running the following command.

    pip3 install -r requirements.txt
    
  10. Update the firewall rules on the instance to allow traffic on port 7860. Also make sure the subnet's security list or network security group allows ingress on TCP port 7860; otherwise the UI will not be reachable from your browser.

    sudo firewall-cmd --list-all # List currently open ports
    sudo firewall-cmd --zone=public --permanent --add-port 7860/tcp
    sudo firewall-cmd --reload
    sudo firewall-cmd --list-all # Confirm that the port has been added
    

Task 3: Run Llama2

  1. With the prerequisites installed, we are ready to run text-generation-webui. Navigate to the text-generation-webui directory and run the following command.

    python server.py --sdp-attention --listen
    

    This should load the essential modules and launch the inference server on port 7860. An optional local check is sketched after the screenshot below.

    [Image: server startup log showing the UI running on port 7860]
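
    To verify locally that the server is listening before exposing it to your browser:

      # Confirm something is listening on port 7860.
      ss -tlnp | grep 7860
      # Or fetch the response headers from the local web server.
      curl -I http://localhost:7860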

  2. After confirming a successful deployment with the server running on port 7860 as demonstrated above, access the text-generation-webui application. Open your web browser and go to http://<PublicIP>:7860, replacing <PublicIP> with the instance's public IP address.

    The application should load and appear as illustrated below. Navigate to the Model section at the top, as highlighted.

    [Image: text-generation-webui home page]

  3. In the Model section, enter the Hugging Face repository for your desired Llama 2 model. For this tutorial, we selected the GPTQ model from the Hugging Face repo TheBloke/Llama-2-13B-chat-GPTQ. Download the model and load it in the Model section. A command-line alternative to the UI download is sketched after the screenshot below.

    [Image: downloading and loading the model in the Model section]
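
    As an alternative to downloading through the UI, the text-generation-webui repository ships a download-model.py helper. The following is a sketch of the equivalent command-line download, run from the text-generation-webui directory:

      # Download the Llama-2-13B-chat GPTQ weights from Hugging Face.
      python download-model.py TheBloke/Llama-2-13B-chat-GPTQ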

  4. Once the model is loaded, navigate to the Chat section to start text generation with Llama 2.

    [Image: chatting with Llama 2 in the Chat section]

Task 4: Deploy Text Generation WebUI via the systemd Service Manager (systemctl)

  1. Create a file named llama2.service in /etc/systemd/system and enter the following text.

    [Unit]
    Description=systemd service start llama2
    
    [Service]
    WorkingDirectory=/home/opc/text-generation-webui
    ExecStart=bash /home/opc/text-generation-webui/start.sh
    User=opc
    [Install]
    WantedBy=multi-user.target
    
  2. Make sure to change WorkingDirectory to match your environment. Since start.sh is referenced as the executable, create that file in the text-generation-webui directory and enter the following text.

    #!/bin/sh
    # - the script runs server.py via the conda environment's python
    # - output is unbuffered (-u) so log lines appear immediately
    # - stdout & stderr are captured in the systemd journal
    # - systemd sets the working directory specified in the service file
    /home/opc/miniconda3/envs/llama2/bin/python -u server.py --sdp-attention --listen
    
  3. Invoking the environment's Python binary directly ensures the llama2 conda environment is always used; activating a conda environment from within a systemd service is unreliable, which is why a shell script wraps the command. Run the following commands to reload systemd and enable and start the service.

    sudo systemctl daemon-reload
    sudo systemctl enable llama2.service
    sudo systemctl start llama2.service
    
  4. Run the following command to check the status of the service.

    sudo systemctl status llama2.service
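
    Because stdout and stderr go to the systemd journal, you can follow the server logs with journalctl:

      # Follow the live logs of the llama2 service.
      sudo journalctl -u llama2.service -f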
    

Acknowledgments

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.