COMB is a plug-and-play caching system for long-context LLM serving.
```
COMB
├── benchmarks                  # For benchmarking
├── comb
│   ├── entrypoints
│   │   ├── api_server.py       # For the online server
│   │   └── comb.py             # For offline inference
│   ├── integration
│   │   ├── hf                  # HF Transformers backend
│   │   ├── vllm                # vLLM backend
│   │   └── __init__.py
│   ├── storage
│   │   ├── chunk_processor.py  # For generating PIC
│   │   ├── pic_allocator.py    # For allocating memory
│   │   ├── pic_manager.py      # For managing PIC
│   │   └── pic_utils.py
│   ├── transfer
│   │   └── cuda_ipc_utils.py   # For inter-process communication
│   ├── __init__.py
│   ├── output.py
│   └── supported_models.py
├── data
├── examples                    # Usage examples
├── training                    # For training
├── environment.yml
└── requirements.txt
```
Run the following commands to prepare the environment. We recommend appending the two export commands to the end of `~/.bashrc`.

```bash
export PYTHONPATH=~/Comb:$PYTHONPATH
export TOKENIZERS_PARALLELISM=true
pip install -r requirements.txt
```

Install vLLM (recommended for efficiency and benchmarking):

```bash
pip install vllm
```

Currently we only support meta-llama/Llama-3.1-8B-Instruct and deepseek-ai/DeepSeek-V2-Lite-Chat. If you want to use another model, you can train a Comb model yourself by following our instructions.
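You can sanity-check the setup by confirming the package is importable. This assumes the repository is cloned at `~/Comb`, matching the `PYTHONPATH` export above:

```python
# If PYTHONPATH includes ~/Comb, this import should succeed.
import comb
print("Comb is importable")
```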
You can find examples in the `examples` folder:
- `basic.py` for offline inference.
- `online_serving.py` for the online server.
See Instructions.
In this example, we simulate two requests with different prefixes. Both requests contain the same question and retrieved context, which allows the KV cache of the shared chunk to be reused through PIC.
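A sketch of what such an offline run could look like is shown below. The `Comb` class, its constructor arguments, and the `generate()` signature are hypothetical stand-ins, not the documented API; see `examples/basic.py` and `comb/entrypoints/comb.py` for the actual usage.

```python
# Sketch of two requests that share a retrieved context but differ in prefix.
# NOTE: the Comb class and generate() signature below are hypothetical;
# see examples/basic.py for the real offline-inference API.
from comb.entrypoints.comb import Comb  # assumed import path

context = (
    "Retrieved context: PIC stores the KV cache of a context chunk "
    "so that later requests can reuse it without re-prefilling."
)
question = "What benefit does PIC provide?"

# Different prefixes, identical context and question.
prompts = [
    f"You are assistant A.\n{context}\n{question}",
    f"You are assistant B.\n{context}\n{question}",
]

llm = Comb(model="meta-llama/Llama-3.1-8B-Instruct")  # a supported model

for prompt in prompts:
    # The first call builds and stores the PIC for the shared chunk;
    # the second call reuses it, skipping the redundant prefill.
    print(llm.generate(prompt))
```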
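For the online path, a client request might look like the sketch below. It assumes `api_server.py` exposes an OpenAI-style completions endpoint on port 8000; this is an assumption rather than a documented interface, so check `examples/online_serving.py` for the real request format.

```python
# Hypothetical client for the online server; the endpoint path, port, and
# payload fields are assumptions. See examples/online_serving.py for the
# actual client code.
import requests

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "You are assistant A.\nRetrieved context...\nWhat benefit does PIC provide?",
    "max_tokens": 128,
}

resp = requests.post("http://localhost:8000/v1/completions", json=payload)
print(resp.json())
```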