Merge pull request #67 from ShangmingCai/fix_doc_to_enable_cuda_graph_again

[Doc] Re-enable cuda graph to improve inference performance.
alogfans authored Jan 7, 2025
2 parents 295d094 + 27a18ac commit 7f26353
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion doc/en/vllm-integration-v0.2.md
@@ -89,7 +89,7 @@ MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.en
 - The `--model` parameter specifies the model to use.
 - The `--port` parameter specifies the vllm service port on which to listen.
 - The `--max-model-len` parameter specifies the maximum length of the model.
-- Option `--tensor_parallel_size` \ `-tp` is supported now. But you need to set up `--enforce_eager` to disable cuda graph. Example: append `-tp 2 --enforce_eager` to the run command.
+- Option `--tensor_parallel_size` \ `-tp` is supported now. Example: append `-tp 2` to the run command to run vllm with multiple GPUs.
 - If you want to run the prefill instance and decode instance on the same node, please set up different `CUDA_VISIBLE_DEVICES`. For example, `CUDA_VISIBLE_DEVICES=0,1` for the prefill instance and `CUDA_VISIBLE_DEVICES=2,3` for the decode instance.

 - The `--kv-transfer-config` parameter specifies the connector and its config to be used.
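The updated bullet means `--enforce_eager` is no longer required alongside tensor parallelism, so CUDA graphs stay enabled. A minimal sketch of what the resulting launch command could look like with `-tp 2` appended; the entrypoint module, model name, port value, and kv-transfer-config JSON below are illustrative assumptions, not taken from this hunk:

# Sketch: prefill instance on GPUs 0,1 with tensor parallelism; CUDA graphs left enabled (no --enforce_eager).
# The entrypoint, model, port, and kv-transfer-config values are assumed for illustration only.
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True \
CUDA_VISIBLE_DEVICES=0,1 \
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --port 8100 \
    --max-model-len 10000 \
    -tp 2 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'

Since `-tp 2` occupies two GPUs per instance, a prefill and a decode instance on the same node need disjoint `CUDA_VISIBLE_DEVICES` sets, as noted in the context lines above.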
