diff --git a/docs/source/getting-started/installation_npu.md b/docs/source/getting-started/installation_npu.md
index 97b7afab..59b13d90 100644
--- a/docs/source/getting-started/installation_npu.md
+++ b/docs/source/getting-started/installation_npu.md
@@ -62,6 +62,16 @@ pip install -v -e . --no-build-isolation
 cd ..
 ```
+The source code of vLLM and vLLM Ascend is located in /vllm-workspace; refer to [vLLM-Ascend Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more information. After installation, apply the following patches so that uc_connector can be used:
+```bash
+cd /vllm-workspace/vllm
+git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
+cd /vllm-workspace/vllm-ascend
+git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-ascend-adapt.patch
+```
+Refer to [vllm-issue](https://github.com/vllm-project/vllm/issues/21702) and [vllm-ascend-issue](https://github.com/vllm-project/vllm-ascend/issues/2057) for details of the changes made by these patches.
+
+
 ## Setup from docker
 Download the pre-built docker image provided or build unified-cache-management docker image by commands below:
 ```bash
diff --git a/docs/source/getting-started/quick_start.md b/docs/source/getting-started/quick_start.md
index 606441b6..9e7630e1 100644
--- a/docs/source/getting-started/quick_start.md
+++ b/docs/source/getting-started/quick_start.md
@@ -21,14 +21,16 @@ Before you start with UCM, please make sure that you have installed UCM correctl
 ## Features Overview
-UCM supports two key features: **Prefix Cache** and **GSA Sparsity**.
+UCM supports two key features: **Prefix Cache** and **Sparse Attention**.
 Each feature supports both **Offline Inference** and **Online API** modes.
 For quick start, just follow the [usage](./quick_start.md) guide below to launch your own inference experience;
-For further research, click on the links blow to see more details of each feature:
+For further research on Prefix Cache, see the link below for more details:
 - [Prefix Cache](../user-guide/prefix-cache/index.md)
+
+Various Sparse Attention features are now available; try GSA Sparsity via the link below:
 - [GSA Sparsity](../user-guide/sparse-attention/gsa.md)
 ## Usage
@@ -47,7 +49,7 @@ python offline_inference.py
-
+
 OpenAI-Compatible Online API
 For online inference , vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.
diff --git a/docs/source/user-guide/pd-disaggregation/1p1d.md b/docs/source/user-guide/pd-disaggregation/1p1d.md
index 53debeb2..fb3f4d05 100644
--- a/docs/source/user-guide/pd-disaggregation/1p1d.md
+++ b/docs/source/user-guide/pd-disaggregation/1p1d.md
@@ -5,16 +5,17 @@ This example demonstrates how to run unified-cache-management with disaggregated
 ## Prerequisites
 - UCM: Installed with reference to the Installation documentation.
-- Hardware: At least 2 GPUs
+- Hardware: At least 2 GPUs or 2 NPUs
 ## Start disaggregated service
-For illustration purposes, let us assume that the model used is Qwen2.5-7B-Instruct.
+For illustration purposes, let us take GPU as an example and assume the model used is Qwen2.5-7B-Instruct. On the Ascend platform, use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify the visible devices when starting the service.
 ### Run prefill server
 Prefiller Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=0
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -41,8 +42,9 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
 ### Run decode server
 Decoder Launch Command:
 ```bash
-export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export PYTHONHASHSEED=123456
+export CUDA_VISIBLE_DEVICES=1
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -68,7 +70,7 @@ CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
 ### Run proxy server
 Make sure prefill nodes and decode nodes can connect to each other.
 ```bash
-cd vllm-workspace/unified-cache-management/ucm/pd
+cd /vllm-workspace/unified-cache-management/ucm/pd
 python3 toy_proxy_server.py --pd-disaggregation --host localhost --port 7802 --prefiller-host --prefiller-port 7800 --decoder-host --decoder-port 7801
 ```
@@ -88,8 +90,7 @@ curl http://localhost:7802/v1/completions \
 ### Benchmark Test
 Use the benchmark scripts provided by vLLM.
 ```bash
-cd /vllm-workspace/vllm/benchmarks
-python3 benchmark_serving.py \
+vllm bench serve \
 --backend vllm \
 --dataset-name random \
 --random-input-len 4096 \
diff --git a/docs/source/user-guide/pd-disaggregation/npgd.md b/docs/source/user-guide/pd-disaggregation/npgd.md
index e35b2aef..c4919779 100644
--- a/docs/source/user-guide/pd-disaggregation/npgd.md
+++ b/docs/source/user-guide/pd-disaggregation/npgd.md
@@ -50,7 +50,8 @@ vllm serve /home/models/Qwen2.5-7B-Instruct \
 Decoder Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=0
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -77,7 +78,7 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
 ### Run proxy server
 Make sure prefill nodes and decode nodes can connect to each other.
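+For example, you can first check that both servers are reachable (the hosts below are placeholders, and the ports follow the proxy command below):
+```bash
+# Expect HTTP 200 from each vLLM server's health endpoint before starting the proxy.
+curl -s -o /dev/null -w "%{http_code}\n" http://<prefiller-host>:7800/health
+curl -s -o /dev/null -w "%{http_code}\n" http://<decoder-host>:7801/health
+```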
 ```bash
-cd vllm-workspace/unified-cache-management/ucm/pd
+cd /vllm-workspace/unified-cache-management/ucm/pd
 python3 toy_proxy_server.py --host localhost --port 7802 --prefiller-host --prefiller-port 7800 --decoder-host --decoder-port 7801
 ```
@@ -97,8 +98,7 @@ curl http://localhost:7802/v1/completions \
 ### Benchmark Test
 Use the benchmark scripts provided by vLLM.
 ```bash
-cd /vllm-workspace/vllm/benchmarks
-python3 benchmark_serving.py \
+vllm bench serve \
 --backend vllm \
 --dataset-name random \
 --random-input-len 4096 \
diff --git a/docs/source/user-guide/pd-disaggregation/xpyd.md b/docs/source/user-guide/pd-disaggregation/xpyd.md
index c5b9d705..a57ab5d2 100644
--- a/docs/source/user-guide/pd-disaggregation/xpyd.md
+++ b/docs/source/user-guide/pd-disaggregation/xpyd.md
@@ -5,15 +5,17 @@ This example demonstrates how to run unified-cache-management with disaggregated
 ## Prerequisites
 - UCM: Installed with reference to the Installation documentation.
-- Hardware: At least 4 GPUs (At least 2 GPUs for prefiller + 2 for decoder in 2d2p setup)
+- Hardware: At least 4 GPUs or 4 NPUs (at least 2 devices for prefillers + 2 for decoders in the 2d2p setup)
 ## Start disaggregated service
-For illustration purposes, let us assume that the model used is Qwen2.5-7B-Instruct.
+For illustration purposes, let us take GPU as an example and assume the model used is Qwen2.5-7B-Instruct. On the Ascend platform, use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify the visible devices when starting the service.
+
 ### Run prefill servers
 Prefiller1 Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=0
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -40,7 +42,8 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
 Prefiller2 Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=1
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -68,7 +71,8 @@ CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
 Decoder1 Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=2 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=2
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -94,7 +98,8 @@ CUDA_VISIBLE_DEVICES=2 vllm serve /home/models/Qwen2.5-7B-Instruct \
 Decoder2 Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=3 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=3
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -121,7 +126,7 @@ CUDA_VISIBLE_DEVICES=3 vllm serve /home/models/Qwen2.5-7B-Instruct \
 ### Run proxy server
 Make sure prefill nodes and decode nodes can connect to each other. the number of prefill/decode hosts should be equal to the number of prefill/decode ports.
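+For example, with two prefillers on ports 7800/7801 and two decoders on ports 7802/7803 as in the proxy command below (hosts are placeholders), you can check every instance first:
+```bash
+# Each host:port pair should return HTTP 200 from vLLM's health endpoint.
+for hp in <prefiller1-host>:7800 <prefiller2-host>:7801 <decoder1-host>:7802 <decoder2-host>:7803; do
+  curl -s -o /dev/null -w "$hp %{http_code}\n" "http://$hp/health"
+done
+```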
 ```bash
-cd vllm-workspace/unified-cache-management/ucm/pd
+cd /vllm-workspace/unified-cache-management/ucm/pd
 python3 toy_proxy_server.py --pd-disaggregation --host localhost --port 7805 --prefiller-hosts --prefiller-port 7800 7801 --decoder-hosts --decoder-ports 7802 7803
 ```
@@ -141,8 +146,7 @@ curl http://localhost:7805/v1/completions \
 ### Benchmark Test
 Use the benchmark scripts provided by vLLM.
 ```bash
-cd /vllm-workspace/vllm/benchmarks
-python3 benchmark_serving.py \
+vllm bench serve \
 --backend vllm \
 --dataset-name random \
 --random-input-len 4096 \