A service that integrates vLLM with Ray Serve for fast and scalable LLM serving.
vLLM is an open source LLM inference engine that supports the following features:
- Efficient KV cache memory management with PagedAttention
- AWQ quantization
- Continuous batching
- Streaming output
- Efficient implementation of decoding strategies (parallel decoding, beam search, etc.)
- Multi-GPU support
- Integration with HuggingFace
Deploying vLLM instances with Ray Serve provides the following additional features (see the sketch after the list):
- Multi-server model deployment
- Autoscaling
- Failure recovery
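The sketch below illustrates how such a deployment is typically declared with Ray Serve. It is a minimal, hypothetical example (class and argument names are illustrative, not the actual code in ray_vllm_inference) showing where replica count, GPU resources and autoscaling are configured:
from ray import serve

# Hypothetical sketch of a Ray Serve deployment wrapping an LLM engine.
# Replica scaling and GPU resources are declared on the deployment decorator.
@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},  # scale with request load
)
class HypotheticalVLLMDeployment:
    def __init__(self, model: str):
        self.model = model  # the vLLM engine would be created here

    async def __call__(self, request):
        # A real deployment would run inference here; see vllm_serve.py in the repo.
        return {"model": self.model}

deployment = HypotheticalVLLMDeployment.bind(model="facebook/opt-125m")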
Requirements:
- OS: Linux
- Python: 3.8 or higher
- GPU: CUDA compute capability 7.0 or higher (V100, T4, A2, A16, A10, A100, H100, etc.)
- CUDA Toolkit 12.1 or later
If you deploy on RunPod, it is recommended to use the Docker image runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04.
Install from GitHub:
pip install git+https://github.com/asprenger/ray_vllm_inference
Install in develop mode:
git clone https://github.com/asprenger/ray_vllm_inference
cd ray_vllm_inference
pip install -e .
Launch the service with the facebook/opt-125m model from HuggingFace:
serve run ray_vllm_inference.vllm_serve:deployment model="facebook/opt-125m"
This command launches a local Ray cluster, downloads the model from HuggingFace and starts a Ray Serve instance on localhost port 8000.
Call the service with a simple prompt:
curl --header "Content-Type: application/json" --data '{ "prompt":"The capital of France is ", "max_tokens":32, "temperature":0}' http://127.0.0.1:8000/generate
Note that facebook/opt-125m is a toy model and the output is often garbled.
See ray_vllm_inference/protocol.py::GenerateRequest for a list of supported request parameters.
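The same request can also be sent from Python, for example with the requests library (a minimal sketch; the parameter names follow the curl example above, and the exact response schema is defined in protocol.py):
import requests

# Same payload as the curl example above.
payload = {
    "prompt": "The capital of France is ",
    "max_tokens": 32,
    "temperature": 0,
}
response = requests.post("http://127.0.0.1:8000/generate", json=payload)
response.raise_for_status()
print(response.text)  # response schema: see ray_vllm_inference/protocol.py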
The official Llama-2 models on HuggingFace are gated and require access permission. To use them you need a HuggingFace access token with READ permission.
export HUGGING_FACE_HUB_TOKEN={YOUR_HF_TOKEN}
serve run ray_vllm_inference.vllm_serve:deployment model="meta-llama/Llama-2-7b-chat-hf"
This command launches a local Ray cluster and starts a Ray Serve instance that listens on localhost port 8000.
Call the service with a system prompt and a user message:
curl --header "Content-Type: application/json" --data '{ "messages":[{"role":"system", "content":"You are an expert assistant. Always give a short reply."}, {"role":"user", "content":"What is the capital of France?"}], "max_tokens":32, "temperature":0}' http://127.0.0.1:8000/generate
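For reference, Llama-2 chat models conventionally expect the system and user messages folded into a single prompt using [INST] and <<SYS>> markers. The sketch below shows that conventional formatting (the service is assumed to apply an equivalent template internally; see the repository for the exact logic):
def build_llama2_chat_prompt(system: str, user: str) -> str:
    # Conventional Llama-2 chat format: the system prompt is wrapped in <<SYS>> tags
    # inside the first [INST] block, followed by the user message.
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

print(build_llama2_chat_prompt(
    "You are an expert assistant. Always give a short reply.",
    "What is the capital of France?",
))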
Activation-aware Weight Quantization (AWQ) is a 4-bit quantization method for LLMs.
Launch the service with a quantized Llama-2-7b model:
serve run ray_vllm_inference.vllm_serve:deployment model="asprenger/meta-llama-Llama-2-7b-chat-hf-gemm-w4-g128-awq" quantization="awq"
Call the service with a system prompt and a user message:
curl --header "Content-Type: application/json" --data '{ "messages":[{"role":"system", "content":"You are an expert assistant. Always give a short reply."}, {"role":"user", "content":"What is the capital of France?"}], "max_tokens":32, "temperature":0}' http://127.0.0.1:8000/generate
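As a quick sanity check outside Ray Serve, the same AWQ checkpoint can be loaded with vLLM's offline LLM API (a short sketch; the quantization argument mirrors the one passed to serve run above):
from vllm import LLM, SamplingParams

# Load the AWQ-quantized checkpoint directly with vLLM's offline API.
llm = LLM(model="asprenger/meta-llama-Llama-2-7b-chat-hf-gemm-w4-g128-awq",
          quantization="awq")
params = SamplingParams(temperature=0, max_tokens=32)
outputs = llm.generate(["The capital of France is "], params)
print(outputs[0].outputs[0].text)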
Test streaming response:
python -m ray_vllm_inference.streaming_client --max-tokens 2048 --user-message "What can I do on a weekend trip to London?"
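The bundled streaming_client is the reference implementation. The sketch below shows what a minimal hand-rolled streaming client might look like, assuming the /generate endpoint streams text chunks when a stream flag is set in the request (that flag and the chunked plain-text response are assumptions about the protocol):
import requests

payload = {
    "prompt": "What can I do on a weekend trip to London?",
    "max_tokens": 2048,
    "temperature": 0,
    "stream": True,  # assumption: the service streams chunks when this is set
}
with requests.post("http://127.0.0.1:8000/generate", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)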
ApacheBench (ab) is a simple tool for benchmarking HTTP services.
Install Apache Benchmark:
sudo apt update
sudo apt -y install apache2-utils
Create a file postdata.json with a POST request payload. For example:
{"prompt":"TEST_PROMPT with length N tokens", "max_tokens":128, "temperature":0, "ignore_eos":true}
When benchmarking an LLM you usually want to fix the length of the input prompt and the length of the generated output. The ignore_eos flag forces the LLM to always generate max_tokens tokens.
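To build a prompt with a known token length for postdata.json, the model's tokenizer can be used to measure it. A short sketch with the HuggingFace tokenizer (the filler prompt text is just an example):
from transformers import AutoTokenizer

# The tokenizer must match the model being benchmarked (gated models such as
# Llama-2 require the HuggingFace access token, see above).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

prompt = "Describe what to do on a weekend trip to London. " * 16  # example filler
num_tokens = len(tokenizer(prompt)["input_ids"])
print(f"Prompt length: {num_tokens} tokens")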
Run a benchmark with 1000 requests and 4 concurrent clients:
ab -T "application/json" -n 1000 -c 4 -p postdata.json http://127.0.0.1:8000/generate
Benchmark Llama-2-7b:
python benchmark_throughput.py --model="meta-llama/Llama-2-7b-chat-hf" --num-prompts 1000
Output on A100:
Total time: 131.23s
Requests: 1000
Input tokens: 557060, output tokens: 149589
Throughput: 7.62 requests/s, 5384.72 tokens/s
Benchmark Llama-2-7b with AWQ:
python benchmark_throughput.py --model="asprenger/meta-llama-Llama-2-7b-chat-hf-gemm-w4-g128-awq" --quantization="awq" --num-prompts 1000
Output on A100:
Total time: 250.26s
Requests: 1000
Input tokens: 557060, output tokens: 149589
Throughput: 4.00 requests/s, 2823.66 tokens/s
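The reported throughput values follow directly from the totals above; a quick arithmetic check for the FP16 run (the same formula applies to the AWQ run):
# Throughput = work / wall-clock time, using the FP16 Llama-2-7b numbers above.
total_time_s = 131.23
num_requests = 1000
input_tokens, output_tokens = 557060, 149589

print(num_requests / total_time_s)                    # ~7.62 requests/s
print((input_tokens + output_tokens) / total_time_s)  # ~5384.8 tokens/s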
Benchmark Llama-2-7b with SqueezeLLM (w4):
python benchmark_throughput.py --model="squeeze-ai-lab/sq-llama-2-7b-w4-s0" --quantization="squeezellm" --num-prompts 1000