vLLM is a fast and easy-to-use library for LLM inference and serving. You can find detailed information on its homepage.
IPEX-LLM can be integrated into vLLM so that users can use IPEX-LLM to boost the performance of the vLLM engine on Intel GPUs (e.g., a local PC with a discrete GPU such as Arc, Flex, or Max).
Currently, vLLM integrated with IPEX-LLM supports only the following models:
- Qwen series models
- Llama series models
- ChatGLM series models
- Baichuan series models
- Install IPEX-LLM for vLLM
- Install vLLM
- Offline Inference/Service
- About Tensor Parallel
- Performing Benchmark
This quickstart guide walks you through installing and running vLLM with ipex-llm.
IPEX-LLM's support for vLLM is currently available only on Linux.
Visit Install IPEX-LLM on Linux with Intel GPU and follow the instructions in the section Install Prerequisites to install the prerequisites needed for running code on Intel GPUs.
Then, follow the instructions in the section Install ipex-llm to install ipex-llm[xpu] and set up the recommended runtime configurations.
After the installation, you should have a conda environment (named ipex-vllm, for instance) for running vLLM commands with IPEX-LLM.
Currently, we maintain a specific branch of vLLM, which only works on Intel GPUs.
Activate the ipex-vllm conda environment and install vLLM by executing the commands below.
conda activate ipex-vllm
source /opt/intel/oneapi/setvars.sh
git clone -b sycl_xpu https://github.com/analytics-zoo/vllm.git
cd vllm
pip install -r requirements-xpu.txt
pip install --no-deps xformers
VLLM_BUILD_XPU_OPS=1 pip install --no-build-isolation -v -e .
pip install outlines==0.0.34 --no-deps
pip install interegular cloudpickle diskcache joblib lark nest-asyncio numba scipy
# For Qwen model support
pip install transformers_stream_generator einops tiktoken
Now you are all set to use vLLM with IPEX-LLM.
To run offline inference using vLLM for a quick impression, use the following example.
Note
Please modify MODEL_PATH in offline_inference.py to use your chosen model.
You can try changing load_in_low_bit to a different value in [sym_int4, fp6, fp8, fp8_e4m3, fp16] to use a different quantization dtype.
#!/bin/bash
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/vLLM-Serving/offline_inference.py
python offline_inference.py
To change the load_in_low_bit value in offline_inference.py, refer to the following example:
# LLM is the class already imported at the top of offline_inference.py
llm = LLM(model="YOUR_MODEL",
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          # Simply change here for the desired load_in_low_bit value
          load_in_low_bit="sym_int4",
          tensor_parallel_size=1,
          trust_remote_code=True)
The result of running the Baichuan2-7B-Chat model with the sym_int4 low-bit format is shown below:
Prompt: 'Hello, my name is', Generated text: ' [Your Name] and I am a [Your Job Title] at [Your'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government in the United States. The president leads'
Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.'
Prompt: 'The future of AI is', Generated text: " bright, but it's not without challenges. As AI continues to evolve,"
Note
Because kernels are JIT-compiled, we recommend sending a few warm-up requests before using the service to get the best performance.
To fully utilize the continuous batching feature of vLLM, you can send requests to the service using curl or other similar methods. The requests sent to the engine will be batched at the token level: queries are executed in the same forward step of the LLM and are removed as soon as they finish, instead of waiting for all sequences to finish.
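To make the benefit of token-level batching concrete, here is a minimal, self-contained simulation. It is only a toy model and not vLLM's scheduler: the function names are hypothetical, each sequence is reduced to the number of tokens it still has to generate, and prompt processing is ignored.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Forward steps when every batch runs until its longest sequence ends."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        steps += max(batch)
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Forward steps when finished sequences leave and waiting ones join at every step."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each running sequence
    steps = 0
    while waiting or running:
        # Fill any free slots before the next forward step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running sequence emits one token; finished ones are dropped.
        running = [left - 1 for left in running if left > 1]
    return steps


lengths = [2, 8, 2, 8, 2, 8]
print("static:", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In this toy workload the short sequences free their slot early, so the continuous variant needs fewer forward steps than the static one for the same requests.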
For vLLM, you can start the service using the following command:
#!/bin/bash
model="YOUR_MODEL_PATH"
served_model_name="YOUR_MODEL_NAME"
# You may need to adjust the value of
# --max-model-len, --max-num-batched-tokens, --max-num-seqs
# to acquire the best performance
# Change value --load-in-low-bit to [fp6, fp8, fp8_e4m3, fp16] to use different low-bit formats
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name "$served_model_name" \
  --port 8000 \
  --model "$model" \
  --trust-remote-code \
  --gpu-memory-utilization 0.75 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 4096 \
  --max-num-batched-tokens 10240 \
  --max-num-seqs 12 \
  --tensor-parallel-size 1
You can tune the service using these four arguments:
- --gpu-memory-utilization: The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, the default value of 0.9 is used.
- --max-model-len: Model context length. If unspecified, it is automatically derived from the model config.
- --max-num-batched-tokens: Maximum number of batched tokens per iteration.
- --max-num-seqs: Maximum number of sequences per iteration. Default: 256.
For longer input prompts, we suggest using --max-num-batched-tokens to restrict the service. The reason is that peak GPU memory usage occurs when generating the first token, and --max-num-batched-tokens restricts the input size at that stage.
--max-num-seqs restricts generation of both the first token and the remaining tokens by capping the maximum batch size at the value it is set to.
When an out-of-memory error occurs, the most obvious solution is to reduce --gpu-memory-utilization. Other ways to resolve this error are to set --max-num-batched-tokens if peak memory occurs when generating the first token, or --max-num-seqs if peak memory occurs when generating the remaining tokens.
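The interaction of the two limits can be illustrated with a small sketch. It is a simplified, first-come-first-served model of one first-token step, not the actual vLLM scheduling code; admit_prefill is a hypothetical helper and its arguments merely mirror the --max-num-batched-tokens and --max-num-seqs flags.

```python
def admit_prefill(prompt_lengths, max_num_batched_tokens, max_num_seqs):
    """Pick waiting prompts, in arrival order, for one first-token step."""
    admitted = []
    used_tokens = 0
    for length in prompt_lengths:
        if len(admitted) >= max_num_seqs:
            break  # batch-size cap reached
        if used_tokens + length > max_num_batched_tokens:
            break  # token budget for this iteration would be exceeded
        admitted.append(length)
        used_tokens += length
    return admitted


waiting = [4000, 4000, 4000, 500]  # prompt lengths in tokens
print(admit_prefill(waiting, max_num_batched_tokens=10240, max_num_seqs=12))
print(admit_prefill(waiting, max_num_batched_tokens=10240, max_num_seqs=1))
```

With a budget of 10240 tokens only the first two prompts fit into the same step, and with a sequence cap of 1 only the first prompt is admitted. Lowering either value therefore shrinks the largest step the service can attempt.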
If the service has booted successfully, the console will display messages similar to the following:
[Figure: console output after the service has booted successfully]
After the service has booted successfully, you can send a test request using curl. Here, YOUR_MODEL should be set to the value of $served_model_name in your booting script, e.g. Qwen1.5.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "YOUR_MODEL",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' | jq '.choices[0].text'
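The same request can also be sent from Python using only the standard library. The sketch below assumes the service is reachable at http://localhost:8000 and returns the JSON shape that the jq filter above reads (choices[0].text); the helper names build_completion_request, complete, and warm_up are illustrative and not part of IPEX-LLM or vLLM.

```python
import json
import urllib.request


def build_completion_request(model, prompt, base_url="http://localhost:8000",
                             max_tokens=128, temperature=0):
    """Build the POST request for the /v1/completions endpoint without sending it."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, **kwargs):
    """Send the request and return the first generated text (needs a running service)."""
    request = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


def warm_up(model, rounds=3):
    """Send a few short requests so that JIT-compiled kernels are ready."""
    for _ in range(rounds):
        complete(model, "Hello", max_tokens=8)
```

Once the service is running, calling warm_up("YOUR_MODEL") followed by complete("YOUR_MODEL", "San Francisco is a") mirrors the curl command, with YOUR_MODEL again being the served model name.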
[Figure: example output using Qwen1.5-7B-Chat with the low-bit format sym_int4]
Tip
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionally set the following environment variable for optimal performance before starting the service:
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
Note
We recommend using Docker for tensor parallel deployment. Check our serving Docker image intelanalytics/ipex-llm-serving-xpu.
Tensor parallelism across multiple Intel GPU cards is also supported. To enable tensor parallel, you will need to install libfabric-dev in your environment. On Ubuntu, you can install it with:
sudo apt-get install libfabric-dev
To deploy your model across multiple cards, simply change the value of --tensor-parallel-size to the desired value.
For instance, if you have two Arc A770 cards in your environment, you can set this value to 2. Some oneCCL environment variables also need to be set; check the following example:
#!/bin/bash
model="YOUR_MODEL_PATH"
served_model_name="YOUR_MODEL_NAME"
# CCL needed environment variables
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
# You may need to adjust the value of
# --max-model-len, --max-num-batched-tokens, --max-num-seqs
# to acquire the best performance
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name "$served_model_name" \
  --port 8000 \
  --model "$model" \
  --trust-remote-code \
  --gpu-memory-utilization 0.75 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 4096 \
  --max-num-batched-tokens 10240 \
  --max-num-seqs 12 \
  --tensor-parallel-size 2
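If you prefer to start the service from Python, for example from a test harness, the script above can be expressed as an argument list plus an environment mapping. This is only a sketch: build_server_command and build_ccl_env are hypothetical helpers, and the flags and variables are copied from the script above without adding any new ones.

```python
import os
import shlex


def build_server_command(model, served_model_name, tensor_parallel_size=1,
                         load_in_low_bit="sym_int4"):
    """Return the api_server command from the script above as an argument list."""
    return [
        "python", "-m", "ipex_llm.vllm.xpu.entrypoints.openai.api_server",
        "--served-model-name", served_model_name,
        "--port", "8000",
        "--model", model,
        "--trust-remote-code",
        "--gpu-memory-utilization", "0.75",
        "--device", "xpu",
        "--dtype", "float16",
        "--enforce-eager",
        "--load-in-low-bit", load_in_low_bit,
        "--max-model-len", "4096",
        "--max-num-batched-tokens", "10240",
        "--max-num-seqs", "12",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


def build_ccl_env(base_env=None):
    """Copy the environment and add the CCL variables used for tensor parallel."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "CCL_WORKER_COUNT": "2",
        "FI_PROVIDER": "shm",
        "CCL_ATL_TRANSPORT": "ofi",
        "CCL_ZE_IPC_EXCHANGE": "sockets",
        "CCL_ATL_SHM": "1",
    })
    return env


command = build_server_command("YOUR_MODEL_PATH", "YOUR_MODEL_NAME", tensor_parallel_size=2)
print(shlex.join(command))
```

The resulting list and environment can be passed to subprocess.run(command, env=build_ccl_env(), check=True) from a shell in which the oneAPI variables have already been sourced.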
If the service has booted successfully, you should see output similar to the following figure:
[Figure: console output after the tensor parallel service has booted successfully]
To run a benchmark, you can use the benchmark_throughput script originally provided by the vLLM repo.
conda activate ipex-vllm
source /opt/intel/oneapi/setvars.sh
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/docker/llm/serving/xpu/docker/benchmark_vllm_throughput.py -O benchmark_throughput.py
export MODEL="YOUR_MODEL"
# You can change load-in-low-bit from values in [sym_int4, fp6, fp8, fp8_e4m3, fp16]
python3 ./benchmark_throughput.py \
--backend vllm \
--dataset ./ShareGPT_V3_unfiltered_cleaned_split.json \
--model $MODEL \
--num-prompts 1000 \
--seed 42 \
--trust-remote-code \
--enforce-eager \
--dtype float16 \
--device xpu \
--load-in-low-bit sym_int4 \
--gpu-memory-utilization 0.85
The following figure shows the result of benchmarking Llama-2-7b-chat-hf using 50 prompts:
[Figure: benchmark result for Llama-2-7b-chat-hf]
Tip
To find the best configuration for your workload, you may need to start the service and use tools like wrk or jmeter to perform a stress test.
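For a quick look before reaching for a dedicated tool, a rough load test can be sketched in a few lines of Python. It is not a replacement for wrk or jmeter: run_load is a hypothetical helper that calls any function you supply from a thread pool and reports simple statistics, and the fake_request stand-in below only sleeps so that the example runs without a service.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send_request, num_requests, concurrency):
    """Call send_request(i) num_requests times using `concurrency` worker threads."""
    def timed_call(index):
        start = time.perf_counter()
        send_request(index)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(num_requests)))
    wall_time = time.perf_counter() - wall_start
    return {
        "requests": num_requests,
        "mean_latency_s": sum(latencies) / len(latencies),
        "max_latency_s": max(latencies),
        "requests_per_s": num_requests / wall_time,
    }


def fake_request(index):
    """Stand-in for a real HTTP call to the service; replace it with your own client."""
    time.sleep(0.01)


print(run_load(fake_request, num_requests=20, concurrency=4))
```

Replacing fake_request with a function that posts to the /v1/completions endpoint, and repeating the run for several concurrency levels and service settings, gives a first impression of how latency and throughput change.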