SiLLM is a high-performance asynchronous inference engine that optimizes model execution through two parallelism mechanisms:
- GPU-CPU Overlapping
  - Fully asynchronous inference scheduling
  - Fully asynchronous input processing
  - Fully asynchronous output processing
- Sequence-Parallel Sampling
  - Fully parallel sampling across GPUs
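
To make the GPU-CPU overlapping idea concrete, here is a minimal, purely illustrative sketch using Python's `asyncio`. It is not SiLLM's implementation: `prepare_input`, `run_model`, `process_output`, and `pipeline` are hypothetical stand-ins, and the "GPU" step is simulated with a sleep. It only shows the general pattern of preparing the next batch and post-processing the previous result while the current model step is in flight.

```python
import asyncio


async def prepare_input(step: int) -> str:
    # Hypothetical stand-in for CPU-side input processing.
    await asyncio.sleep(0.01)
    return f"batch-{step}"


async def run_model(batch: str) -> str:
    # Hypothetical stand-in for the GPU forward pass (simulated with a sleep).
    await asyncio.sleep(0.01)
    return f"logits({batch})"


async def process_output(logits: str, results: list) -> None:
    # Hypothetical stand-in for CPU-side output processing.
    await asyncio.sleep(0.01)
    results.append(logits)


async def pipeline(num_steps: int) -> list:
    results: list = []
    if num_steps <= 0:
        return results

    pending_output = None
    next_input = asyncio.create_task(prepare_input(0))

    for step in range(num_steps):
        batch = await next_input
        # Start preparing the next batch so it overlaps with the model step.
        if step + 1 < num_steps:
            next_input = asyncio.create_task(prepare_input(step + 1))

        logits = await run_model(batch)

        # Wait for the previous output task so results stay in step order,
        # then post-process this step's output in the background.
        if pending_output is not None:
            await pending_output
        pending_output = asyncio.create_task(process_output(logits, results))

    await pending_output
    return results


def run_pipeline(num_steps: int) -> list:
    return asyncio.run(pipeline(num_steps))
```

Calling `run_pipeline(3)` returns the three simulated outputs in step order. In a real engine the overlap comes from the GPU running independently of the CPU thread; the sleeps here only mimic that concurrency.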
SiLLM is built on top of vLLM, utilizing vLLM's front end for model loading and leveraging PagedAttention for model execution. Additionally, it integrates custom plugins to enable asynchronous scheduling, asynchronous input/output processing, and parallel sampling.
Step 1: Install vLLM from pip
# Install vLLM
pip install openai==1.45.0 gputil aioprometheus psutil transformers termcolor ipywidgets
pip install vllm==0.6.0
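
Because the steps above pin exact versions, it can help to confirm what actually got installed before building the plugin. The following is an optional sketch, not part of the project: `EXPECTED` and `check_versions` are hypothetical names, and it relies only on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in Step 1 (hypothetical helper data, copied from the commands above).
EXPECTED = {"vllm": "0.6.0", "openai": "1.45.0"}


def check_versions(expected: dict) -> dict:
    """Map each package name to (installed_version_or_None, matches_expected)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment.
            found = None
        report[name] = (found, found == wanted)
    return report


for name, (found, ok) in check_versions(EXPECTED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={found} expected={EXPECTED[name]} [{status}]")
```

A `MISMATCH` line usually means another package pulled in a different version, in which case re-running the pinned `pip install` command is a reasonable first step.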
Step 2: Install Albireo plugin from source
# Install Albireo Plugin
python3 python_only_dev.py
apt-get install libboost-all-dev
cd albireo
pip install -v .
This library is licensed under the Apache 2.0 License.