Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Deploy Llama2 on Oracle Cloud Infrastructure GPUs
Introduction
Llama2 is a family of state-of-the-art open source large language models released by Meta. It is available in multiple sizes (7, 13, and 70 billion parameters), in both pretrained and chat fine-tuned variants, giving organizations the flexibility to balance response quality against GPU cost. Because the weights can be downloaded and run on your own infrastructure, businesses can build text generation, summarization, and conversational applications while keeping their data inside their own tenancy.
Llama2 was trained on roughly 2 trillion tokens of publicly available data and supports a context length of 4,096 tokens. Its license permits both research and commercial use, which, together with a rich open source tooling ecosystem, has made it one of the most widely deployed foundations for building LLM applications. Overall, Llama2 is a powerful model that can help organizations add generative AI capabilities to their products, giving them a competitive edge in today's fast-paced digital landscape.
The architecture consists of several components working together to generate human-like responses. At the core of the model is a decoder-only transformer, which takes in a sequence of tokens and passes their embeddings through a stack of identical blocks. Each block combines a self-attention layer, which lets every token attend to the tokens before it, with a Feedforward Neural Network (FFNN), and wraps both in residual connections to produce the final output.
The FFNN is composed of fully connected layers that further transform the contextualized embeddings produced by self-attention. The residual connections allow the model to learn more complex patterns in the data while keeping deep stacks of layers stable to train. Llama2 also refines the standard transformer recipe with RMSNorm pre-normalization, the SwiGLU activation function, and rotary positional embeddings (RoPE).
In addition to these core components, the model relies on several other pieces: a tokenizer that converts text inputs into numerical tokens, a learned subword vocabulary, and a causal attention mask that prevents each position from attending to future tokens during training and generation.
Text Generation WebUI is a Gradio-based web UI for large language models. It also offers an API and command-line tools if those are your thing. The WebUI supports multiple model backends, including Transformers, llama.cpp, ExLlama, ExLlamaV2, AutoGPTQ, GPTQ-for-LLaMa, CTransformers, and AutoAWQ. It also supports LoRA models, fine-tuning, and training a new LoRA using QLoRA, provides an extensions framework to load your favorite extensions, and ships with an OpenAI-compatible API server.
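As an illustration of the OpenAI-compatible API, the sketch below starts the server with the API enabled and queries it with curl. This is a minimal example, not one of the installation steps that follow, and it assumes a recent text-generation-webui build where the --api flag exposes the API on port 5000; verify the flag and port against python server.py --help for your version.

  # Start the server with the OpenAI-compatible API enabled
  python server.py --api --listen

  # From another shell, query the chat completions endpoint
  curl http://localhost:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 64}'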
Objectives
- Install everything needed to run the WebUI, and load Llama2 models to run text generation.
Prerequisites
- An Oracle Cloud Infrastructure (OCI) tenancy with GPU.A10 service limits.
- An existing OCI Virtual Cloud Network (VCN) with at least one public subnet and available limits for public IP addresses.
- A Llama2 model checkpoint from your favorite Hugging Face creator. Make sure your model supports the backends mentioned above.
Task 1: Provision a GPU compute instance on OCI
- Launch a compute instance on OCI in an existing VCN with a public subnet. For more information, see Launch Compute Instance.
- Choose one of these available GPU.A10 shapes: VM.GPU.A10.1, VM.GPU.A10.2, or BM.GPU.A10.4.
- When launching the compute instance, change the shape to one of the above shapes. If your tenancy does not have a service limit set for GPU.A10, these shapes will not appear in the shape list.
- To check your tenancy limits in the OCI Console, select the region where you are going to provision a GPU.A10 compute instance, open the navigation menu and click Governance & Administration.
- Under Tenancy Management, select Limits, Quotas and Usage.
- Select the service Compute, select one of the availability domains in the Scope field, and type GPU.A10 in the Resource field.
- Select GPUs for A10 based VM and BM instances.
- Compute limits are per availability domain. Check whether the limit is set in any of the availability domains of the region. If the service limit is set to 0 for all availability domains, click the request a service limit increase link and submit a limit increase request for this resource.

  Note: In order to access Limits, Quotas and Usage, you must be a member of the tenancy administrators group or your group must have a policy assigned to read LimitsAndUsageViewers.
- For more information about service limits, see Service Limits.
- Currently, OCI GPU.A10 compute shapes support Oracle Linux, Ubuntu, and Rocky Linux. Windows is supported by VM shapes only.

  Note: Rocky Linux is not officially supported by NVIDIA.
- When provisioning a compute instance on OCI, use a standard OS image or a GPU-enabled image. If you go with a standard OS image, the NVIDIA vGPU driver must be installed separately.
- Expand the Boot volume section, increase the boot volume size to at least 250 GB, and change the VPUs setting to Higher Performance to get decent read/write throughput for better inferencing.
- Launch the instance with the above parameters.
Task 2: Install Prerequisites for Llama2
- As NVIDIA drivers are included in the Oracle Linux GPU build image, verify their presence and functionality by running the nvidia-smi command. This ensures that everything is properly set up and the GPU drivers are functioning as expected.
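  For example (standard NVIDIA utility; the exact driver and CUDA versions shown depend on your image):

  nvidia-smi    # Should list the A10 GPU(s) along with the driver and CUDA versions

  If the command errors out or shows no GPU, the driver is not installed correctly.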
- Grow the file system: the root file system of an OCI instance comes with 46.6 GB by default, regardless of the boot volume size selected at launch. Since we increased the boot volume, grow the file system with the built-in OCI utility by following the steps below.
- Check the current disk usage: before resizing the file system, it is good practice to check the current disk usage with the df command. The root file system will still report the default size until it is grown.

  df -h
- Resize the file system: use the OCI utility command to resize the file system to make use of the increased storage. The exact command can vary depending on the specific OS and file system you are using. The following is used for Oracle Linux 8.

  sudo /usr/libexec/oci-growfs

  Enter y when asked to confirm that you are extending the partition.
- Verify the file system expansion: after running the resize command, check the disk usage again to confirm that the file system has been successfully expanded.

  df -h

  The output should now reflect the increased file system size.
- Install Python 3.10.6 on Oracle Linux 8 using the following commands.

  sudo dnf update -y
  sudo dnf install curl gcc openssl-devel bzip2-devel libffi-devel zlib-devel wget make -y
  wget https://www.python.org/ftp/python/3.10.6/Python-3.10.6.tar.xz
  tar -xf Python-3.10.6.tar.xz
  cd Python-3.10.6/
  ./configure --enable-optimizations
  make -j $(nproc)
  sudo make altinstall
  python3.10 -V
- Install git to clone the repository.

  sudo dnf install git
- Install conda using the following commands.

  mkdir -p ~/miniconda3
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
  bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
  rm -rf ~/miniconda3/miniconda.sh
  ~/miniconda3/bin/conda init bash
- Create a conda environment.

  conda create -n llama2 python=3.10.9    # llama2 being the conda environment name
  conda activate llama2
- Install PyTorch 2.0 using the following command.

  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
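  To confirm that PyTorch was built with CUDA support and can see the GPU, a quick sanity check (not part of the original steps) is:

  python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"    # Expect True on a working GPU instance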
- Clone text-generation-webui. You should now have the text-generation-webui repository in your current directory.

  git clone https://github.com/oobabooga/text-generation-webui
- Change directory to text-generation-webui and install the requirements by running the following command.

  pip3 install -r requirements.txt
- Update the firewall rules to allow traffic on port 7860.

  sudo firewall-cmd --list-all    # Check existing open ports
  sudo firewall-cmd --zone=public --permanent --add-port 7860/tcp
  sudo firewall-cmd --reload
  sudo firewall-cmd --list-all    # Confirm that the port has been added

  Note: For the WebUI to be reachable from the internet, the subnet's security list or network security group must also allow ingress TCP traffic on port 7860.
Task 3: Run Llama2
- With the prerequisites successfully installed, we are ready to run text-generation-webui. Navigate to the text-generation-webui directory and run the following command.

  python server.py --sdp-attention --listen

  This loads the essential modules and launches the inference server on port 7860.
- After confirming a successful deployment with the server running on port 7860, proceed to access the text-generation-webui application. Open your web browser and enter http://<PublicIP>:7860, replacing <PublicIP> with the instance's public IP address. The application should now load. Navigate to the Model section at the top.
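  If the page does not load, a simple connectivity check from your local machine can help narrow down whether the server or the network path is at fault (replace <PublicIP> as above):

  curl -I http://<PublicIP>:7860    # Any HTTP response means the server is reachable; a timeout points to firewall or security list rules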
- In the Model section, enter the Hugging Face repository for your desired Llama2 model. For our purposes, we selected the GPTQ model from the Hugging Face repo TheBloke/Llama-2-13B-chat-GPTQ. Download the model, and load it in the Model section. A command-line alternative for the download is shown below.
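  As a command-line alternative, the text-generation-webui repository ships a download-model.py helper script. The invocation below is a sketch assuming the current repository layout; check python download-model.py --help if it differs in your clone.

  cd ~/text-generation-webui
  python download-model.py TheBloke/Llama-2-13B-chat-GPTQ    # Saves the weights under the models/ directory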
Once you load it, navigate to the Chat section to start text generation with Llama2.
Task 4: Deploy Text Generation WebUI via the systemd Service Manager
- Create a file llama2.service in the path /etc/systemd/system and enter the following text.

  [Unit]
  Description=systemd service start llama2

  [Service]
  WorkingDirectory=/home/opc/text-generation-webui
  ExecStart=bash /home/opc/text-generation-webui/start.sh
  User=opc

  [Install]
  WantedBy=multi-user.target
- Make sure to change the working directory to match your environment. Since we referenced start.sh as the execution file, create that file in the text-generation-webui directory and enter the following text.

  #!/bin/sh
  # Run server.py with the python binary from the llama2 conda environment,
  # from the working directory specified in the systemd service file.
  /home/opc/miniconda3/envs/llama2/bin/python server.py --sdp-attention --listen
- This ensures the llama2 conda environment is always used. Activating a conda environment directly from a systemd service is not desirable, hence we use a shell script to start the app. Run the following commands to reload the daemon and enable and start the service.

  sudo systemctl daemon-reload
  sudo systemctl enable llama2.service
  sudo systemctl start llama2.service
- Run the following command to check the status of the service.

  sudo systemctl status llama2.service
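  To follow the application's output while it runs under systemd (standard journalctl usage, not part of the original steps), you can tail the unit's journal:

  sudo journalctl -u llama2.service -f    # Press Ctrl+C to stop following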
Acknowledgments
- Author - Abhiram Ampabathina (Senior Cloud Architect)
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
Deploy Llama2 on Oracle Cloud Infrastructure GPUs
F91893-01
January 2024
Copyright © 2024, Oracle and/or its affiliates.