Inference and Deployment

This repository offers multiple options for inference and deployment, including Google Colab notebooks, Gradio demos, FastChat, and vLLM. It also provides guidance on running the models locally on your personal computer with llama.cpp.

Quantized Models

The quantized versions of certain models are generously provided by TheBloke!

These versions facilitate testing and development with various popular frameworks, including AutoAWQ, vLLM, AutoGPTQ, GPTQ-for-LLaMa, llama.cpp, text-generation-webui, and more.

You can find these models on the Hugging Face Hub.
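
If you prefer to stay in Python, a quantized checkpoint can typically be loaded with the transformers library. The sketch below is hypothetical: it assumes a GPTQ quantization of Vigogne-2-7B-Instruct published under TheBloke's namespace (verify the exact repository name on the Hub) and the optimum and auto-gptq packages installed alongside transformers.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository name; check the Hugging Face Hub for the exact one
model_id = "TheBloke/Vigogne-2-7B-Instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers dispatches the GPTQ weights to auto-gptq under the hood
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# See the repository's prompt templates for the exact instruction format
prompt = "Donne trois conseils pour rester en bonne santé."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.1, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))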

Google Colab Notebook

You can use the Google Colab notebook below to run inference with the Vigogne instruction-following models.

Open In Colab

For the Vigogne-Chat models, please refer to this notebook.

Open In Colab

Gradio Demo

To launch a Gradio demo in streaming mode and interact with the Vigogne instruction-following models, execute the command given below:

python vigogne/inference/gradio/demo_instruct.py --base_model_name_or_path bofenghuang/vigogne-2-7b-instruct

For the Vigogne-Chat models, use the following command:

python vigogne/inference/gradio/demo_chat.py --base_model_name_or_path bofenghuang/vigogne-2-7b-chat
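
For reference, the sketch below is a minimal, hypothetical streaming chat demo (not the repository's actual script): it streams tokens into a Gradio chat interface using transformers' TextIteratorStreamer.

from threading import Thread

import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "bofenghuang/vigogne-2-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

def respond(message, history):
    # See the repository's prompt templates for the expected instruction format
    inputs = tokenizer(message, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation in a background thread so tokens can be yielded as they arrive
    thread = Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512, temperature=0.1, do_sample=True),
    )
    thread.start()
    partial = ""
    for new_text in streamer:
        partial += new_text
        yield partial  # Gradio refreshes the chat box with the growing reply

gr.ChatInterface(respond).launch()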

llama.cpp

The Vigogne models can now be easily deployed on PCs with the help of tools created by the community. The following instructions provide a detailed guide on how to combine Vigogne LoRA weights with the original LLaMA model, using Vigogne-2-7B-Instruct as an example. Additionally, you will learn how to quantize the resulting model to 4-bit and deploy it on your own PC using llama.cpp. For French-speaking users, you can refer to this excellent tutorial provided by @pereconteur.

Note: the models are quantized to 4-bit, so performance may be worse than the non-quantized versions. Responses may also vary between runs because of the sampling hyperparameters.

Please ensure that the following requirements are met prior to running:

  • As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. You will need at least 13GB of RAM to quantize the 7B model. For more information, refer to this link.
  • It's best to use Python 3.9 or Python 3.10, as sentencepiece has not yet published a wheel for Python 3.11.

1. Convert the Vigogne model to the original LLaMA format

# convert the Vigogne model from Hugging Face's format to the original LLaMA format
python scripts/export_state_dict_checkpoint.py \
    --base_model_name_or_path bofenghuang/vigogne-2-7b-instruct \
    --output_dir ./models/vigogne_2_7b_instruct \
    --base_model_size 7B

# download the tokenizer.model file
wget -P ./models https://huggingface.co/bofenghuang/vigogne-2-7b-instruct/resolve/main/tokenizer.model

# check the files
tree models
# models
# ├── vigogne_2_7b_instruct
# │   ├── consolidated.00.pth
# │   └── params.json
# └── tokenizer.model

2. Clone and build llama.cpp repo

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# make with blas
# see https://github.com/ggerganov/llama.cpp#blas-build

3. Quantize the model

# convert the 7B model to ggml FP16 format
python convert.py path/to/vigogne/models/vigogne_2_7b_instruct

# quantize the model to 4-bits (using q4_0 method)
./quantize path/to/vigogne/models/vigogne_2_7b_instruct/ggml-model-f16.bin path/to/vigogne/models/vigogne_2_7b_instruct/ggml-model-q4_0.bin q4_0

4. Run the inference

# ./main -h for more information
./main -m path/to/vigogne/models/vigogne_2_7b_instruct/ggml-model-q4_0.bin --color -f path/to/vigogne/prompts/instruct.txt -ins -c 2048 -n 256 --temp 0.1 --repeat_penalty 1.1

For the Vigogne-Chat models, the previous steps for conversion and quantization remain the same. However, the final step requires a different command to run the inference.

./main -m path/to/vigogne/models/vigogne_2_7b_chat/ggml-model-q4_0.bin --color -f path/to/vigogne/prompts/chat.txt --reverse-prompt "<|user|>:" --in-prefix " " --in-suffix "<|assistant|>:" --interactive-first -c 2048 -n -1 --temp 0.1

FastChat

FastChat is an open platform for training, serving, and evaluating chatbots based on large language models. Since the Vigogne models are integrated into the FastChat library (see the list of supported models), you can use it to serve them. Below is an example of how to perform inference using the command-line interface:

# First, install FastChat
# pip install "fschat[model_worker,webui]"

# Infer Vigogne-Instruct models
# python -m fastchat.serve.cli --model bofenghuang/vigogne-2-7b-instruct

# Infer Vigogne-Chat models
python -m fastchat.serve.cli --model bofenghuang/vigogne-2-7b-chat

vLLM

vLLM is an open-source library for fast LLM inference and serving, powered by PagedAttention. It also provides a server that mimics the OpenAI API protocol, allowing it to be used as a drop-in replacement for applications built on the OpenAI API.

To set up an OpenAI-compatible server, use the following command:

# Install vLLM
# This may take 5-10 minutes.
# pip install vllm

# Start server for Vigogne-Instruct models
# python -m vllm.entrypoints.openai.api_server --model bofenghuang/vigogne-2-7b-instruct

# Start server for Vigogne-Chat models
python -m vllm.entrypoints.openai.api_server --model bofenghuang/vigogne-2-7b-chat

# List models
# curl http://localhost:8000/v1/models

You can then query the server using the openai Python package (this example uses the pre-1.0 openai client interface):

import openai

# Modify OpenAI's API key and API base to use vLLM's API server.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

# Pick the first model returned by the server
models = openai.Model.list()
model = models["data"][0]["id"]

# Chat completion API
chat_completion = openai.ChatCompletion.create(
    model=model,
    messages=[
        {"role": "user", "content": "Parle-moi de toi-même."},
    ],
    max_tokens=1024,
    temperature=0.7,
)
print("Chat completion results:", chat_completion)