Unexpected single request slowness on 16x GH200 setup for Deepseek R1 #3504
Unanswered
siddartha-RE asked this question in Q&A
We have tested vLLM and SGLang serving DeepSeek R1 on a 16x1 GH200 setup (16 nodes connected over InfiniBand, each with a single GH200 with 96 GB of GPU memory).
With vLLM we get ~25 tokens/sec of generation for a single request (TP=16).
With SGLang the most we can get for a single request is 6.5 tokens/sec (TP=16, with --enable-dp-attention).
With 20 concurrent requests SGLang scales linearly in throughput, but we cannot find an explanation for the large single-request gap between vLLM and SGLang. NCCL appears to be detected correctly and serving otherwise works, so a difference this large seems explainable only by slow node-to-node communication. Are there additional steps we should take to ensure NCCL is operating correctly?
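For reference, this is the kind of check we have in mind: a minimal sketch using NVIDIA's nccl-tests all_reduce_perf benchmark over MPI. The hostfile contents, the MPI install path, and the mlx5 HCA name are placeholders for our cluster, not values from this thread.

```bash
# Build nccl-tests with MPI support (MPI_HOME is a placeholder path).
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi

# hosts.txt lists the 16 node hostnames, one per line (placeholders).
# One rank per node, one GPU per rank. NCCL_DEBUG=INFO logs which
# transport NCCL picks; a fallback from NET/IB to NET/Socket would
# explain a very slow cross-node all-reduce.
mpirun -np 16 --hostfile hosts.txt \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_HCA=mlx5 \
  ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```

If the reported bus bandwidth sits far below the InfiniBand line rate, the problem is the interconnect rather than either serving framework.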
-
Did you try removing --enable-dp-attention? That option improves throughput but hurts latency, so enable it only for high-concurrency scenarios. To reduce latency, try adding --enable-torch-compile and --torch-compile-max-bs 8.
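Concretely, a launch along these lines is a reasonable starting point. This is a sketch, not a verified command for your cluster: the model path, the head node address in --dist-init-addr, the port, and the NODE_RANK variable are placeholders you would substitute per node.

```bash
# Run on each of the 16 nodes; NODE_RANK is 0 on the head node, 1..15 elsewhere.
# --enable-dp-attention is deliberately omitted to favor single-request latency.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --trust-remote-code \
  --tp 16 \
  --nnodes 16 \
  --node-rank "$NODE_RANK" \
  --dist-init-addr 10.0.0.1:5000 \
  --enable-torch-compile \
  --torch-compile-max-bs 8 \
  --host 0.0.0.0 --port 30000
```

torch.compile mainly speeds up small-batch decoding, which is exactly the single-request case; capping compilation at batch size 8 with --torch-compile-max-bs keeps startup time and memory overhead bounded.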