# RESTful-LLaMa-3-8B-app

A simple RESTful service for the Meta-Llama-3-8B-Instruct language model.

## Prerequisites

1. A CUDA-enabled GPU machine with at least 24 GB of GPU memory
2. Access to the LLaMa-3 weights on Hugging Face (the model repo is gated)
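
To confirm the first requirement, you can query the available GPU memory (this assumes the NVIDIA driver is already installed):

```bash
# Query GPU name and total memory; each worker needs roughly 20 GB (see below)
nvidia-smi --query-gpu=name,memory.total --format=csv
```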

## Getting Started

1. Install Docker on the machine: https://docs.docker.com/engine/install/ubuntu/
2. Check your CUDA and NVIDIA driver versions (important for choosing the base Docker image):
   1. Check the CUDA version: `nvcc --version`
   2. Check the driver version: `nvidia-smi`
3. Install the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html (a sketch of the install commands follows this list)
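
On Ubuntu, the toolkit installation and a quick GPU pass-through check look roughly like the sketch below. This assumes you have already added the NVIDIA apt repository as described in the linked guide, and the CUDA image tag is only an example; pick one that matches your `nvcc` output:

```bash
# Install the toolkit and wire it into Docker's runtime
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify that containers can see the GPU (use a CUDA tag matching your host)
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```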

## Installation

1. Clone this repo to your GPU machine.
2. Adapt the base CUDA image in the Dockerfile to match the CUDA and driver versions installed on your machine.
3. Clone the LLaMa-3 weights from Hugging Face into /your/path/to/data/models/ if you want to store the weights locally. An HF access token with access to the gated repo is needed for this step (see the download sketch after this list).
4. (Optional) Change the number of workers in start_app.sh to handle simultaneous requests. Keep in mind that each worker loads its own copy of the model, so you need roughly 20 GB of GPU memory per worker.
5. In the app/ folder, run `docker build -t restful-llama-3 .` to build the Docker image.
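
For step 3, one way to fetch the weights is the `huggingface-cli` tool from the `huggingface_hub` package; the target subdirectory name below is only an illustration, so match it to whatever path the app expects:

```bash
# Log in once with your HF access token (the Llama-3 repo is gated)
huggingface-cli login

# Download the weights into the local models directory
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir /your/path/to/data/models/Meta-Llama-3-8B-Instruct
```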

## How to start

Run the following command to start the Docker container, configuring the run options as desired. It takes a couple of minutes for the container to start and load the model.

```bash
docker run --gpus all -d -it -p 5000:5000 \
  -v /your/path/to/data:/restful-llama-3/data \
  -e GRANT_SUDO=yes --user root --restart always \
  --name restful-llama-3 restful-llama-3
```
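
Since the model takes a few minutes to load, you can follow the container logs to see when the service is ready:

```bash
docker logs -f restful-llama-3
```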

## How to use

If the container is running without problems, you should see a welcome message generated by the model at http://localhost:5000/home.

To interact with the model, send POST requests to http://localhost:5000/chat.

Here is an example with curl:

```bash
curl -X POST http://localhost:5000/chat \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system","content":"You are a helpful assistant called Llama-3. Write out your answer short and succinct!"}, {"role":"user", "content":"What is the capital of Germany?"}], "temperature": 0.6, "top_p":0.75, "max_new_tokens":256}'
```

Another, simpler example that relies on the default generation parameters:

```bash
curl -X POST http://localhost:5000/chat \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user", "content":"Write a short essay about Istanbul."}]}'
```