# RESTful-LLaMa-3-8B-app

A simple RESTful service for the Meta-Llama-3-8B-Instruct language model.

## Prerequisites

1. A CUDA-enabled GPU machine with at least 24 GB of GPU memory
2. Access to the LLaMa-3 weights on Hugging Face (the model repo is gated)
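
To confirm the first requirement, you can query the available GPU memory (this assumes the NVIDIA driver is already installed):

```bash
# Query GPU name and total memory; each worker needs roughly 20 GB (see below)
nvidia-smi --query-gpu=name,memory.total --format=csv
```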

## Getting Started

1. Install Docker on the machine: https://docs.docker.com/engine/install/ubuntu/
2. Check your CUDA and NVIDIA driver versions (important for choosing the base Docker image):
   1. Check the CUDA version: `nvcc --version`
   2. Check the driver version: `nvidia-smi`
3. Install the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html (a sketch of the install commands follows this list)
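
On Ubuntu, the toolkit installation and a quick GPU pass-through check look roughly like the sketch below. This assumes you have already added the NVIDIA apt repository as described in the linked guide, and the CUDA image tag is only an example; pick one that matches your `nvcc` output:

```bash
# Install the toolkit and wire it into Docker's runtime
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify that containers can see the GPU (use a CUDA tag matching your host)
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```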

## Installation

1. Clone this repo to your GPU machine.
2. Adapt the base CUDA image in the Dockerfile to match the CUDA and driver versions installed on your machine.
3. Clone the LLaMa-3 weights from Hugging Face into /your/path/to/data/models/ if you want to store the weights locally. An HF access token with access to the gated repo is needed for this step (see the download sketch after this list).
4. (Optional) Change the number of workers in start_app.sh to handle simultaneous requests. Keep in mind that each worker loads its own copy of the model, so you need roughly 20 GB of GPU memory per worker.
5. In the app/ folder, run `docker build -t restful-llama-3 .` to build the Docker image.
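
For step 3, one way to fetch the weights is the `huggingface-cli` tool from the `huggingface_hub` package; the target subdirectory name below is only an illustration, so match it to whatever path the app expects:

```bash
# Log in once with your HF access token (the Llama-3 repo is gated)
huggingface-cli login

# Download the weights into the local models directory
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir /your/path/to/data/models/Meta-Llama-3-8B-Instruct
```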

## How to start

Run the following command to start the Docker container, configuring the run options as desired. It takes a couple of minutes for the container to start and load the model.

```bash
docker run --gpus all -d -it -p 5000:5000 \
  -v /your/path/to/data:/restful-llama-3/data \
  -e GRANT_SUDO=yes --user root --restart always \
  --name restful-llama-3 restful-llama-3
```
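
Since the model takes a few minutes to load, you can follow the container logs to see when the service is ready:

```bash
docker logs -f restful-llama-3
```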

## How to use

If the container is running without problems, you should see a welcome message generated by the model at http://localhost:5000/home.

To interact with the model, send POST requests to http://localhost:5000/chat.

Here is an example with curl:

```bash
curl -X POST http://localhost:5000/chat \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system","content":"You are a helpful assistant called Llama-3. Write out your answer short and succinct!"}, {"role":"user", "content":"What is the capital of Germany?"}], "temperature": 0.6, "top_p":0.75, "max_new_tokens":256}'
```

Another, simpler example that relies on the default generation parameters:

```bash
curl -X POST http://localhost:5000/chat \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user", "content":"Write a short essay about Istanbul."}]}'
```