2 changes: 1 addition & 1 deletion docs/source/getting-started/installation_gpu.md
@@ -48,8 +48,8 @@ After installation, please apply the patch to ensure uc_connector can be used:

```bash
cd $(pip show vllm | grep Location | awk '{print $2}')
# Ensure that the unified-cache-management repository you cloned is located under /vllm-workspace.
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt-sparse.patch
```

Refer to this [issue](https://github.com/vllm-project/vllm/issues/21702) to see details of this patch's changes.
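
Before applying, `git apply --check` can be used as a dry run to confirm the patch will apply cleanly without modifying the vLLM tree. This is only a minimal sketch, reusing the paths from the block above; adjust them if your clone lives elsewhere:
```bash
cd $(pip show vllm | grep Location | awk '{print $2}')
PATCH=/vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt-sparse.patch
# --check only reports whether the patch applies; it does not change any files.
git apply --check "$PATCH" && git apply "$PATCH"
```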
18 changes: 10 additions & 8 deletions docs/source/getting-started/installation_npu.md
@@ -39,14 +39,6 @@ docker run --rm \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
The vLLM and vLLM Ascend source code is placed in /vllm-workspace; refer to [vLLM-Ascend Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more information. After installation, please apply the patches below to ensure uc_connector can be used:
```bash
cd /vllm-workspace/vllm
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
cd /vllm-workspace/vllm-ascend
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-ascend-adapt.patch
```
Refer to these issues, [vllm-issue](https://github.com/vllm-project/vllm/issues/21702) and [vllm-ascend-issue](https://github.com/vllm-project/vllm-ascend/issues/2057), for details of the patches' changes.

### Build from source code
Follow commands below to install unified-cache-management:
@@ -59,6 +51,16 @@ pip install -v -e . --no-build-isolation
cd ..
```
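
To confirm the editable install succeeded, a quick import check can help. This is only a sketch: it assumes the project installs a top-level `ucm` package and that the distribution is named `unified-cache-management`; adjust the names if they differ in your environment:
```bash
# Assumption: the editable install exposes a top-level `ucm` package.
python3 -c "import ucm; print('ucm loaded from', ucm.__file__)"
pip show unified-cache-management  # assumed distribution name; check `pip list` if it differs
```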

The vLLM and vLLM Ascend source code is placed in /vllm-workspace; refer to [vLLM-Ascend Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more information. After installation, please apply the patches below to ensure uc_connector can be used:
```bash
cd /vllm-workspace/vllm
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
cd /vllm-workspace/vllm-ascend
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-ascend-adapt.patch
```
Refer to these issues, [vllm-issue](https://github.com/vllm-project/vllm/issues/21702) and [vllm-ascend-issue](https://github.com/vllm-project/vllm-ascend/issues/2057), for details of the patches' changes.
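
If a patch ever needs to be re-applied (for example after updating the unified-cache-management checkout), the previously applied patch can be reverted first. A minimal sketch using standard git options, shown here for the vLLM patch only:
```bash
cd /vllm-workspace/vllm
PATCH=/vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
# -R reverses a previously applied patch; --check verifies the operation without touching files.
git apply -R --check "$PATCH" && git apply -R "$PATCH"
git apply "$PATCH"  # re-apply the (possibly updated) patch
```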


## Setup from docker
Download the pre-built docker image provided or build unified-cache-management docker image by commands below:
```bash
8 changes: 5 additions & 3 deletions docs/source/getting-started/quick_start.md
@@ -21,14 +21,16 @@ Before you start with UCM, please make sure that you have installed UCM correctl

## Features Overview

UCM supports two key features: **Prefix Cache** and **GSA Sparsity**.
UCM supports two key features: **Prefix Cache** and **Sparse Attention**.

Each feature supports both **Offline Inference** and **Online API** modes.

For a quick start, follow the [usage](./quick_start.md) guide below to launch your own inference run;

For further research, click on the links below to see more details of each feature:
For further reading on Prefix Cache, see the link below:
- [Prefix Cache](../user-guide/prefix-cache/index.md)

Various Sparse Attention features are now available; try GSA Sparsity via the link below:
- [GSA Sparsity](../user-guide/sparse-attention/gsa.md)

## Usage
@@ -47,7 +49,7 @@ python offline_inference.py

</details>

<details>
<details open>
<summary><b>OpenAI-Compatible Online API</b></summary>

For online inference, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.
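
Once such a server is running, any OpenAI-compatible client can talk to it. The snippet below is only an illustrative sketch; the host, port, and model path are placeholders and should match your own `vllm serve` invocation:
```bash
# Assumptions: server already listening on localhost:8000 and serving Qwen2.5-7B-Instruct.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/models/Qwen2.5-7B-Instruct",
        "prompt": "Explain KV cache reuse in one sentence.",
        "max_tokens": 64
      }'
```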
17 changes: 9 additions & 8 deletions docs/source/user-guide/pd-disaggregation/1p1d.md
@@ -5,16 +5,17 @@ This example demonstrates how to run unified-cache-management with disaggregated

## Prerequisites
- UCM: installed by following the Installation documentation.
- Hardware: At least 2 GPUs
- Hardware: At least 2 GPUs or 2 NPUs

## Start disaggregated service
For illustration purposes, let us assume that the model used is Qwen2.5-7B-Instruct.
For illustration purposes, let us take GPU as an example and assume the model used is Qwen2.5-7B-Instruct. On the Ascend platform, use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify visible devices when starting the service.
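
Concretely, the only per-platform difference is the device-visibility variable; a minimal sketch (device indices here are just examples):
```bash
# NVIDIA GPU:
export CUDA_VISIBLE_DEVICES=0
# Ascend NPU (same value semantics, different variable name):
export ASCEND_RT_VISIBLE_DEVICES=0
```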

### Run prefill server
Prefiller Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=0
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -41,8 +42,9 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
### Run decode server
Decoder Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
export PYTHONHASHSEED=123456
export CUDA_VISIBLE_DEVICES=0
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -68,7 +70,7 @@ CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
### Run proxy server
Make sure prefill nodes and decode nodes can connect to each other.
```bash
cd vllm-workspace/unified-cache-management/ucm/pd
cd /vllm-workspace/unified-cache-management/ucm/pd
python3 toy_proxy_server.py --pd-disaggregation --host localhost --port 7802 --prefiller-host <prefill-node-ip> --prefiller-port 7800 --decoder-host <decode-node-ip> --decoder-port 7801
```
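
If the proxy cannot reach one of its backends, first confirm that both vLLM servers are ready. vLLM's OpenAI-compatible server exposes a `/health` endpoint, so a minimal readiness check might look like this (host names are placeholders, ports match the launch commands above):
```bash
# Wait until both the prefill and decode servers report healthy.
for endpoint in http://<prefill-node-ip>:7800/health http://<decode-node-ip>:7801/health; do
  until curl --silent --fail "$endpoint" > /dev/null; do
    echo "waiting for $endpoint ..."
    sleep 5
  done
done
```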

@@ -88,8 +90,7 @@ curl http://localhost:7802/v1/completions \
### Benchmark Test
Use the benchmarking tool provided by vLLM.
```bash
cd /vllm-workspace/vllm/benchmarks
python3 benchmark_serving.py \
vllm bench serve \
--backend vllm \
--dataset-name random \
--random-input-len 4096 \
8 changes: 4 additions & 4 deletions docs/source/user-guide/pd-disaggregation/npgd.md
@@ -50,7 +50,8 @@ vllm serve /home/models/Qwen2.5-7B-Instruct \
Decoder Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=0
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -77,7 +78,7 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
### Run proxy server
Make sure prefill nodes and decode nodes can connect to each other.
```bash
cd vllm-workspace/unified-cache-management/ucm/pd
cd /vllm-workspace/unified-cache-management/ucm/pd
python3 toy_proxy_server.py --host localhost --port 7802 --prefiller-host <prefill-node-ip> --prefiller-port 7800 --decoder-host <decode-node-ip> --decoder-port 7801
```
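
To double-check that all three processes are actually listening before sending traffic, a quick socket check can help. This sketch assumes the default ports used above (7800 prefiller, 7801 decoder, 7802 proxy) and the standard `ss` utility:
```bash
# Expect one listener each for the prefiller, decoder, and proxy.
ss -ltn | grep -E ':(7800|7801|7802)\b'
```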

@@ -97,8 +98,7 @@ curl http://localhost:7802/v1/completions \
### Benchmark Test
Use the benchmarking tool provided by vLLM.
```bash
cd /vllm-workspace/vllm/benchmarks
python3 benchmark_serving.py \
vllm bench serve \
--backend vllm \
--dataset-name random \
--random-input-len 4096 \
22 changes: 13 additions & 9 deletions docs/source/user-guide/pd-disaggregation/xpyd.md
@@ -5,15 +5,17 @@ This example demonstrates how to run unified-cache-management with disaggregated

## Prerequisites
- UCM: installed by following the Installation documentation.
- Hardware: At least 4 GPUs (At least 2 GPUs for prefiller + 2 for decoder in 2d2p setup)
- Hardware: At least 4 GPUs or 4 NPUs (at least 2 devices for the prefillers + 2 for the decoders in the 2d2p setup)
Contributor comment: "At least 4 GPUs" does not match "or 2 NPUs for prefiller + 2 for decoder in 2d2p setup".


## Start disaggregated service
For illustration purposes, let us assume that the model used is Qwen2.5-7B-Instruct.
For illustration purposes, let us take GPU as an example and assume the model used is Qwen2.5-7B-Instruct. On the Ascend platform, use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify visible devices when starting the service.

### Run prefill servers
Prefiller1 Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=0
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -40,7 +42,8 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
Prefiller2 Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=1
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -68,7 +71,8 @@ CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
Decoder1 Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=2 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=2
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -94,7 +98,8 @@ CUDA_VISIBLE_DEVICES=2 vllm serve /home/models/Qwen2.5-7B-Instruct \
Decoder2 Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=3 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=3
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -121,7 +126,7 @@ CUDA_VISIBLE_DEVICES=3 vllm serve /home/models/Qwen2.5-7B-Instruct \
### Run proxy server
Make sure prefill nodes and decode nodes can connect to each other. The number of prefill/decode hosts must equal the number of prefill/decode ports.
```bash
cd vllm-workspace/unified-cache-management/ucm/pd
cd /vllm-workspace/unified-cache-management/ucm/pd
python3 toy_proxy_server.py --pd-disaggregation --host localhost --port 7805 --prefiller-hosts <prefill-node-ip-1> <prefill-node-ip-2> --prefiller-port 7800 7801 --decoder-hosts <decoder-node-ip-1> <decoder-node-ip-2> --decoder-ports 7802 7803
```
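
Since each prefiller/decoder host must line up with its port one-to-one, building the same command from paired bash arrays keeps the mapping explicit. This is only a sketch with placeholder IPs and the ports used above:
```bash
# Run from /vllm-workspace/unified-cache-management/ucm/pd as above.
# Keep hosts and ports as parallel arrays so the i-th host always pairs with the i-th port.
PREFILL_HOSTS=(<prefill-node-ip-1> <prefill-node-ip-2>)
PREFILL_PORTS=(7800 7801)
DECODE_HOSTS=(<decoder-node-ip-1> <decoder-node-ip-2>)
DECODE_PORTS=(7802 7803)

python3 toy_proxy_server.py --pd-disaggregation --host localhost --port 7805 \
  --prefiller-hosts "${PREFILL_HOSTS[@]}" --prefiller-port "${PREFILL_PORTS[@]}" \
  --decoder-hosts "${DECODE_HOSTS[@]}" --decoder-ports "${DECODE_PORTS[@]}"
```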

@@ -141,8 +146,7 @@ curl http://localhost:7805/v1/completions \
### Benchmark Test
Use the benchmarking tool provided by vLLM.
```bash
cd /vllm-workspace/vllm/benchmarks
python3 benchmark_serving.py \
vllm bench serve \
--backend vllm \
--dataset-name random \
--random-input-len 4096 \