Merge pull request #67 from ShangmingCai/fix_doc_to_enable_cuda_graph_again

[Doc] Re-enable cuda graph to improve inference performance.
alogfans authored Jan 7, 2025
2 parents 295d094 + 27a18ac commit 7f26353
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion doc/en/vllm-integration-v0.2.md
@@ -89,7 +89,7 @@ MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.en
 - The `--model` parameter specifies the model to use.
 - The `--port` parameter specifies the vllm service port on which to listen.
 - The `--max-model-len` parameter specifies the maximum length of the model.
-- Option `--tensor_parallel_size` \ `-tp` is supported now. But you need to set up `--enforce_eager` to disable cuda graph. Example: append `-tp 2 --enforce_eager` to the run command.
+- Option `--tensor_parallel_size` \ `-tp` is supported now. Example: append `-tp 2` to the run command to run vllm with multiple GPUs.
 - If you want to run the prefill instance and decode instance on the same node, please set up different `CUDA_VISIBLE_DEVICES`. For example, `CUDA_VISIBLE_DEVICES=0,1` for the prefill instance and `CUDA_VISIBLE_DEVICES=2,3` for the decode instance.

 - The `--kv-transfer-config` parameter specifies the connector and its config to be used.
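The updated bullet means `--enforce_eager` is no longer required alongside tensor parallelism, so CUDA graphs stay enabled. A minimal sketch of what the resulting launch command could look like with `-tp 2` appended; the entrypoint module, model name, port value, and kv-transfer-config JSON below are illustrative assumptions, not taken from this hunk:

# Sketch: prefill instance on GPUs 0,1 with tensor parallelism; CUDA graphs left enabled (no --enforce_eager).
# The entrypoint, model, port, and kv-transfer-config values are assumed for illustration only.
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True \
CUDA_VISIBLE_DEVICES=0,1 \
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --port 8100 \
    --max-model-len 10000 \
    -tp 2 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'

Since `-tp 2` occupies two GPUs per instance, a prefill and a decode instance on the same node need disjoint `CUDA_VISIBLE_DEVICES` sets, as noted in the context lines above.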
