A service that integrates vLLM with Ray Serve for fast and scalable LLM serving.
vLLM is an open source LLM inference engine that supports the following features:
- Efficient KV cache memory management with PagedAttention
- AWQ quantization
- Continuous batching
- Streaming output
- Efficient implementation of decoding strategies (parallel decoding, beam search, etc.)
- Multi-GPU support
- Integration with HuggingFace
Deploying vLLM instances with Ray Serve provides the following additional features (see the sketch after the list):
- Multi-server model deployment
- Autoscaling
- Failure recovery
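The sketch below illustrates how such a deployment is typically declared with Ray Serve. It is a minimal, hypothetical example (class and argument names are illustrative, not the actual code in ray_vllm_inference) showing where replica count, GPU resources and autoscaling are configured:
from ray import serve

# Hypothetical sketch of a Ray Serve deployment wrapping an LLM engine.
# Replica scaling and GPU resources are declared on the deployment decorator.
@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},  # scale with request load
)
class HypotheticalVLLMDeployment:
    def __init__(self, model: str):
        self.model = model  # the vLLM engine would be created here

    async def __call__(self, request):
        # A real deployment would run inference here; see vllm_serve.py in the repo.
        return {"model": self.model}

deployment = HypotheticalVLLMDeployment.bind(model="facebook/opt-125m")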
Requirements:
- OS: Linux
- Python: 3.8 or higher
- GPU: CUDA compute capability 7.0 or higher (V100, T4, A2, A16, A10, A100, H100, etc.)
- CUDA Toolkit 12.1 or later
If you deploy on RunPod, it is recommended to use the Docker image runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04.
Install from GitHub:
pip install git+https://github.com/asprenger/ray_vllm_inference
Install in develop mode:
git clone https://github.com/asprenger/ray_vllm_inference
cd ray_vllm_inference
pip install -e .
Launch the service with the facebook/opt-125m model from HuggingFace:
serve run ray_vllm_inference.vllm_serve:deployment model="facebook/opt-125m"
This command launches a local Ray cluster, downloads the model from HuggingFace and starts a Ray Serve instance on localhost port 8000.
Call the service with a simple prompt:
curl --header "Content-Type: application/json" --data '{ "prompt":"The capital of France is ", "max_tokens":32, "temperature":0}' http://127.0.0.1:8000/generate
Note that facebook/opt-125m is a toy model and the output is often garbled.
See ray_vllm_inference/protocol.py::GenerateRequest for a list of supported request parameters.
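The same request can also be sent from Python, for example with the requests library (a minimal sketch; the parameter names follow the curl example above, and the exact response schema is defined in protocol.py):
import requests

# Same payload as the curl example above.
payload = {
    "prompt": "The capital of France is ",
    "max_tokens": 32,
    "temperature": 0,
}
response = requests.post("http://127.0.0.1:8000/generate", json=payload)
response.raise_for_status()
print(response.text)  # response schema: see ray_vllm_inference/protocol.py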
The official Llama-2 models on HuggingFace are gated and require access permission. To use them you need a HuggingFace access token with READ permission.
export HUGGING_FACE_HUB_TOKEN={YOUR_HF_TOKEN}
serve run ray_vllm_inference.vllm_serve:deployment model="meta-llama/Llama-2-7b-chat-hf"
This command launches a local Ray cluster and starts a Ray Serve instance that listens on localhost port 8000.
Call the service with a system prompt and a user message:
curl --header "Content-Type: application/json" --data '{ "messages":[{"role":"system", "content":"You are an expert assistant. Always give a short reply."}, {"role":"user", "content":"What is the capital of France?"}], "max_tokens":32, "temperature":0}' http://127.0.0.1:8000/generate
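For reference, Llama-2 chat models conventionally expect the system and user messages folded into a single prompt using [INST] and <<SYS>> markers. The sketch below shows that conventional formatting (the service is assumed to apply an equivalent template internally; see the repository for the exact logic):
def build_llama2_chat_prompt(system: str, user: str) -> str:
    # Conventional Llama-2 chat format: the system prompt is wrapped in <<SYS>> tags
    # inside the first [INST] block, followed by the user message.
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

print(build_llama2_chat_prompt(
    "You are an expert assistant. Always give a short reply.",
    "What is the capital of France?",
))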
Activation-aware Weight Quantization (AWQ) is a 4-bit quantization method for LLMs.
Launch the service with a quantized Llama-2-7b model:
serve run ray_vllm_inference.vllm_serve:deployment model="asprenger/meta-llama-Llama-2-7b-chat-hf-gemm-w4-g128-awq" quantization="awq"
Call the service with a system prompt and a user message:
curl --header "Content-Type: application/json" --data '{ "messages":[{"role":"system", "content":"You are an expert assistant. Always give a short reply."}, {"role":"user", "content":"What is the capital of France?"}], "max_tokens":32, "temperature":0}' http://127.0.0.1:8000/generate
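As a quick sanity check outside Ray Serve, the same AWQ checkpoint can be loaded with vLLM's offline LLM API (a short sketch; the quantization argument mirrors the one passed to serve run above):
from vllm import LLM, SamplingParams

# Load the AWQ-quantized checkpoint directly with vLLM's offline API.
llm = LLM(model="asprenger/meta-llama-Llama-2-7b-chat-hf-gemm-w4-g128-awq",
          quantization="awq")
params = SamplingParams(temperature=0, max_tokens=32)
outputs = llm.generate(["The capital of France is "], params)
print(outputs[0].outputs[0].text)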
Test streaming response:
python -m ray_vllm_inference.streaming_client --max-tokens 2048 --user-message "What can I do on a weekend trip to London?"
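The bundled streaming_client is the reference implementation. The sketch below shows what a minimal hand-rolled streaming client might look like, assuming the /generate endpoint streams text chunks when a stream flag is set in the request (that flag and the chunked plain-text response are assumptions about the protocol):
import requests

payload = {
    "prompt": "What can I do on a weekend trip to London?",
    "max_tokens": 2048,
    "temperature": 0,
    "stream": True,  # assumption: the service streams chunks when this is set
}
with requests.post("http://127.0.0.1:8000/generate", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)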
ApacheBench (ab) is a simple tool for benchmarking HTTP services.
Install Apache Benchmark:
sudo apt update
sudo apt -y install apache2-utils
Create a file postdata.json with a POST request payload. For example:
{"prompt":"TEST_PROMPT with length N tokens", "max_tokens":128, "temperature":0, "ignore_eos":true}
When benchmarking an LLM you usually want to fix the length of the input prompt and the length of the generated output. The ignore_eos flag forces the LLM to always generate max_tokens tokens.
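To build a prompt with a known token length for postdata.json, the model's tokenizer can be used to measure it. A short sketch with the HuggingFace tokenizer (the filler prompt text is just an example):
from transformers import AutoTokenizer

# The tokenizer must match the model being benchmarked (gated models such as
# Llama-2 require the HuggingFace access token, see above).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

prompt = "Describe what to do on a weekend trip to London. " * 16  # example filler
num_tokens = len(tokenizer(prompt)["input_ids"])
print(f"Prompt length: {num_tokens} tokens")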
Run a benchmark with 1000 requests and 4 concurrent clients:
ab -T "application/json" -n 1000 -c 4 -p postdata.json http://127.0.0.1:8000/generate
Benchmark Llama-2-7b:
python benchmark_throughput.py --model="meta-llama/Llama-2-7b-chat-hf" --num-prompts 1000
Output on A100:
Total time: 131.23s
Requests: 1000
Input tokens: 557060, output tokens: 149589
Throughput: 7.62 requests/s, 5384.72 tokens/s
Benchmark Llama-2-7b with AWQ:
python benchmark_throughput.py --model="asprenger/meta-llama-Llama-2-7b-chat-hf-gemm-w4-g128-awq" --quantization="awq" --num-prompts 1000
Output on A100:
Total time: 250.26s
Requests: 1000
Input tokens: 557060, output tokens: 149589
Throughput: 4.00 requests/s, 2823.66 tokens/s
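The reported throughput values follow directly from the totals above; a quick arithmetic check for the FP16 run (the same formula applies to the AWQ run):
# Throughput = work / wall-clock time, using the FP16 Llama-2-7b numbers above.
total_time_s = 131.23
num_requests = 1000
input_tokens, output_tokens = 557060, 149589

print(num_requests / total_time_s)                    # ~7.62 requests/s
print((input_tokens + output_tokens) / total_time_s)  # ~5384.8 tokens/s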
Benchmark Llama-2-7b with SqueezeLLM (w4):
python benchmark_throughput.py --model="squeeze-ai-lab/sq-llama-2-7b-w4-s0" --quantization="squeezellm" --num-prompts 1000