2 changes: 1 addition & 1 deletion docs/source/getting-started/installation_gpu.md
@@ -48,8 +48,8 @@ After installation, please apply the patch to ensure uc_connector can be used:

```bash
cd $(pip show vllm | grep Location | awk '{print $2}')
# Ensure that the unified-cache-management repository you cloned is located under /vllm-workspace.
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt-sparse.patch
```

Refer to this [issue](https://github.com/vllm-project/vllm/issues/21702) to see details of this patch's changes.
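
Before applying, `git apply --check` can be used as a dry run to confirm the patch will apply cleanly without modifying the vLLM tree. This is only a minimal sketch, reusing the paths from the block above; adjust them if your clone lives elsewhere:
```bash
cd $(pip show vllm | grep Location | awk '{print $2}')
PATCH=/vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt-sparse.patch
# --check only reports whether the patch applies; it does not change any files.
git apply --check "$PATCH" && git apply "$PATCH"
```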
18 changes: 10 additions & 8 deletions docs/source/getting-started/installation_npu.md
@@ -39,14 +39,6 @@ docker run --rm \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
The vLLM and vLLM Ascend source code is placed in /vllm-workspace; refer to [vLLM-Ascend Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more information. After installation, please apply the patches below to ensure uc_connector can be used:
```bash
cd /vllm-workspace/vllm
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
cd /vllm-workspace/vllm-ascend
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-ascend-adapt.patch
```
Refer to these issues, [vllm-issue](https://github.com/vllm-project/vllm/issues/21702) and [vllm-ascend-issue](https://github.com/vllm-project/vllm-ascend/issues/2057), for details of the patches' changes.

### Build from source code
Follow commands below to install unified-cache-management:
@@ -59,6 +51,16 @@ pip install -v -e . --no-build-isolation
cd ..
```
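
To confirm the editable install succeeded, a quick import check can help. This is only a sketch: it assumes the project installs a top-level `ucm` package and that the distribution is named `unified-cache-management`; adjust the names if they differ in your environment:
```bash
# Assumption: the editable install exposes a top-level `ucm` package.
python3 -c "import ucm; print('ucm loaded from', ucm.__file__)"
pip show unified-cache-management  # assumed distribution name; check `pip list` if it differs
```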

The vLLM and vLLM Ascend source code is placed in /vllm-workspace; refer to [vLLM-Ascend Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more information. After installation, please apply the patches below to ensure uc_connector can be used:
```bash
cd /vllm-workspace/vllm
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
cd /vllm-workspace/vllm-ascend
git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-ascend-adapt.patch
```
Refer to these issues, [vllm-issue](https://github.com/vllm-project/vllm/issues/21702) and [vllm-ascend-issue](https://github.com/vllm-project/vllm-ascend/issues/2057), for details of the patches' changes.
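
If a patch ever needs to be re-applied (for example after updating the unified-cache-management checkout), the previously applied patch can be reverted first. A minimal sketch using standard git options, shown here for the vLLM patch only:
```bash
cd /vllm-workspace/vllm
PATCH=/vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
# -R reverses a previously applied patch; --check verifies the operation without touching files.
git apply -R --check "$PATCH" && git apply -R "$PATCH"
git apply "$PATCH"  # re-apply the (possibly updated) patch
```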


## Setup from docker
Download the pre-built docker image provided or build unified-cache-management docker image by commands below:
```bash
8 changes: 5 additions & 3 deletions docs/source/getting-started/quick_start.md
@@ -21,14 +21,16 @@ Before you start with UCM, please make sure that you have installed UCM correctl

## Features Overview

UCM supports two key features: **Prefix Cache** and **GSA Sparsity**.
UCM supports two key features: **Prefix Cache** and **Sparse Attention**.

Each feature supports both **Offline Inference** and **Online API** modes.

For a quick start, follow the [usage](./quick_start.md) guide below to launch your own inference run;

For further research, click on the links below to see more details of each feature:
For further reading on Prefix Cache, see the link below:
- [Prefix Cache](../user-guide/prefix-cache/index.md)

Various Sparse Attention features are now available; try GSA Sparsity via the link below:
- [GSA Sparsity](../user-guide/sparse-attention/gsa.md)

## Usage
@@ -47,7 +49,7 @@ python offline_inference.py

</details>

<details>
<details open>
<summary><b>OpenAI-Compatible Online API</b></summary>

For online inference, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.
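
Once such a server is running, any OpenAI-compatible client can talk to it. The snippet below is only an illustrative sketch; the host, port, and model path are placeholders and should match your own `vllm serve` invocation:
```bash
# Assumptions: server already listening on localhost:8000 and serving Qwen2.5-7B-Instruct.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/models/Qwen2.5-7B-Instruct",
        "prompt": "Explain KV cache reuse in one sentence.",
        "max_tokens": 64
      }'
```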
17 changes: 9 additions & 8 deletions docs/source/user-guide/pd-disaggregation/1p1d.md
@@ -5,16 +5,17 @@ This example demonstrates how to run unified-cache-management with disaggregated

## Prerequisites
- UCM: installed by following the Installation documentation.
- Hardware: At least 2 GPUs
- Hardware: At least 2 GPUs or 2 NPUs

## Start disaggregated service
For illustration purposes, let us assume that the model used is Qwen2.5-7B-Instruct.
For illustration purposes, let us take GPU as an example and assume the model used is Qwen2.5-7B-Instruct. On the Ascend platform, use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify visible devices when starting the service.
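
Concretely, the only per-platform difference is the device-visibility variable; a minimal sketch (device indices here are just examples):
```bash
# NVIDIA GPU:
export CUDA_VISIBLE_DEVICES=0
# Ascend NPU (same value semantics, different variable name):
export ASCEND_RT_VISIBLE_DEVICES=0
```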

### Run prefill server
Prefiller Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=0
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -41,8 +42,9 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
### Run decode server
Decoder Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
export PYTHONHASHSEED=123456
export CUDA_VISIBLE_DEVICES=0
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -68,7 +70,7 @@ CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
### Run proxy server
Make sure prefill nodes and decode nodes can connect to each other.
```bash
cd vllm-workspace/unified-cache-management/ucm/pd
cd /vllm-workspace/unified-cache-management/ucm/pd
python3 toy_proxy_server.py --pd-disaggregation --host localhost --port 7802 --prefiller-host <prefill-node-ip> --prefiller-port 7800 --decoder-host <decode-node-ip> --decoder-port 7801
```
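
If the proxy cannot reach one of its backends, first confirm that both vLLM servers are ready. vLLM's OpenAI-compatible server exposes a `/health` endpoint, so a minimal readiness check might look like this (host names are placeholders, ports match the launch commands above):
```bash
# Wait until both the prefill and decode servers report healthy.
for endpoint in http://<prefill-node-ip>:7800/health http://<decode-node-ip>:7801/health; do
  until curl --silent --fail "$endpoint" > /dev/null; do
    echo "waiting for $endpoint ..."
    sleep 5
  done
done
```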

@@ -88,8 +90,7 @@ curl http://localhost:7802/v1/completions \
### Benchmark Test
Use the benchmarking tool provided by vLLM.
```bash
cd /vllm-workspace/vllm/benchmarks
python3 benchmark_serving.py \
vllm bench serve \
--backend vllm \
--dataset-name random \
--random-input-len 4096 \
8 changes: 4 additions & 4 deletions docs/source/user-guide/pd-disaggregation/npgd.md
@@ -50,7 +50,8 @@ vllm serve /home/models/Qwen2.5-7B-Instruct \
Decoder Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=0
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -77,7 +78,7 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
### Run proxy server
Make sure prefill nodes and decode nodes can connect to each other.
```bash
cd vllm-workspace/unified-cache-management/ucm/pd
cd /vllm-workspace/unified-cache-management/ucm/pd
python3 toy_proxy_server.py --host localhost --port 7802 --prefiller-host <prefill-node-ip> --prefiller-port 7800 --decoder-host <decode-node-ip> --decoder-port 7801
```
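
To double-check that all three processes are actually listening before sending traffic, a quick socket check can help. This sketch assumes the default ports used above (7800 prefiller, 7801 decoder, 7802 proxy) and the standard `ss` utility:
```bash
# Expect one listener each for the prefiller, decoder, and proxy.
ss -ltn | grep -E ':(7800|7801|7802)\b'
```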

@@ -97,8 +98,7 @@ curl http://localhost:7802/v1/completions \
### Benchmark Test
Use the benchmarking tool provided by vLLM.
```bash
cd /vllm-workspace/vllm/benchmarks
python3 benchmark_serving.py \
vllm bench serve \
--backend vllm \
--dataset-name random \
--random-input-len 4096 \
22 changes: 13 additions & 9 deletions docs/source/user-guide/pd-disaggregation/xpyd.md
@@ -5,15 +5,17 @@ This example demonstrates how to run unified-cache-management with disaggregated

## Prerequisites
- UCM: installed by following the Installation documentation.
- Hardware: At least 4 GPUs (At least 2 GPUs for prefiller + 2 for decoder in 2d2p setup)
- Hardware: At least 4 GPUs or 4 NPUs (at least 2 devices for the prefillers + 2 for the decoders in the 2d2p setup)
Contributor comment: "At least 4 GPUs" does not match "or 2 NPUs for prefiller + 2 for decoder in 2d2p setup".


## Start disaggregated service
For illustration purposes, let us assume that the model used is Qwen2.5-7B-Instruct.
For illustration purposes, let us take GPU as an example and assume the model used is Qwen2.5-7B-Instruct. On the Ascend platform, use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify visible devices when starting the service.

### Run prefill servers
Prefiller1 Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=0
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -40,7 +42,8 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
Prefiller2 Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=1
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -68,7 +71,8 @@ CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
Decoder1 Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=2 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=2
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -94,7 +98,8 @@ CUDA_VISIBLE_DEVICES=2 vllm serve /home/models/Qwen2.5-7B-Instruct \
Decoder2 Launch Command:
```bash
export PYTHONHASHSEED=123456
CUDA_VISIBLE_DEVICES=3 vllm serve /home/models/Qwen2.5-7B-Instruct \
export CUDA_VISIBLE_DEVICES=3
vllm serve /home/models/Qwen2.5-7B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
@@ -121,7 +126,7 @@ CUDA_VISIBLE_DEVICES=3 vllm serve /home/models/Qwen2.5-7B-Instruct \
### Run proxy server
Make sure prefill nodes and decode nodes can connect to each other. The number of prefill/decode hosts must equal the number of prefill/decode ports.
```bash
cd vllm-workspace/unified-cache-management/ucm/pd
cd /vllm-workspace/unified-cache-management/ucm/pd
python3 toy_proxy_server.py --pd-disaggregation --host localhost --port 7805 --prefiller-hosts <prefill-node-ip-1> <prefill-node-ip-2> --prefiller-port 7800 7801 --decoder-hosts <decoder-node-ip-1> <decoder-node-ip-2> --decoder-ports 7802 7803
```
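
Since each prefiller/decoder host must line up with its port one-to-one, building the same command from paired bash arrays keeps the mapping explicit. This is only a sketch with placeholder IPs and the ports used above:
```bash
# Run from /vllm-workspace/unified-cache-management/ucm/pd as above.
# Keep hosts and ports as parallel arrays so the i-th host always pairs with the i-th port.
PREFILL_HOSTS=(<prefill-node-ip-1> <prefill-node-ip-2>)
PREFILL_PORTS=(7800 7801)
DECODE_HOSTS=(<decoder-node-ip-1> <decoder-node-ip-2>)
DECODE_PORTS=(7802 7803)

python3 toy_proxy_server.py --pd-disaggregation --host localhost --port 7805 \
  --prefiller-hosts "${PREFILL_HOSTS[@]}" --prefiller-port "${PREFILL_PORTS[@]}" \
  --decoder-hosts "${DECODE_HOSTS[@]}" --decoder-ports "${DECODE_PORTS[@]}"
```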

@@ -141,8 +146,7 @@ curl http://localhost:7805/v1/completions \
### Benchmark Test
Use the benchmarking tool provided by vLLM.
```bash
cd /vllm-workspace/vllm/benchmarks
python3 benchmark_serving.py \
vllm bench serve \
--backend vllm \
--dataset-name random \
--random-input-len 4096 \