Unexpected single request slowness on 16x GH200 setup for Deepseek R1 #3504
Unanswered
siddartha-RE asked this question in Q&A
We have tested vLLM and SGLang serving DeepSeek R1 on a 16x1 GH200 setup (16 nodes connected over InfiniBand, each with a single GH200 with 96 GB of GPU memory).
With vLLM we get ~25 tokens/sec of generation for a single request (TP=16).
With SGLang the most we can get for a single request is 6.5 tokens/sec (TP=16, with --enable-dp-attention).
With 20 concurrent requests SGLang scales linearly in throughput, but we cannot find an explanation for the large single-request gap between vLLM and SGLang. NCCL appears to be detected correctly and serving otherwise works, so a difference this large seems explainable only by slow node-to-node communication. Are there additional steps we should take to ensure NCCL is operating correctly?
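For reference, this is the kind of check we have in mind: a minimal sketch using NVIDIA's nccl-tests all_reduce_perf benchmark over MPI. The hostfile contents, the MPI install path, and the mlx5 HCA name are placeholders for our cluster, not values from this thread.

```bash
# Build nccl-tests with MPI support (MPI_HOME is a placeholder path).
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi

# hosts.txt lists the 16 node hostnames, one per line (placeholders).
# One rank per node, one GPU per rank. NCCL_DEBUG=INFO logs which
# transport NCCL picks; a fallback from NET/IB to NET/Socket would
# explain a very slow cross-node all-reduce.
mpirun -np 16 --hostfile hosts.txt \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_HCA=mlx5 \
  ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```

If the reported bus bandwidth sits far below the InfiniBand line rate, the problem is the interconnect rather than either serving framework.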
-
Did you try removing --enable-dp-attention? That option improves throughput but hurts latency, so enable it only for high-concurrency scenarios. To reduce latency, try adding --enable-torch-compile and --torch-compile-max-bs 8.
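Concretely, a launch along these lines is a reasonable starting point. This is a sketch, not a verified command for your cluster: the model path, the head node address in --dist-init-addr, the port, and the NODE_RANK variable are placeholders you would substitute per node.

```bash
# Run on each of the 16 nodes; NODE_RANK is 0 on the head node, 1..15 elsewhere.
# --enable-dp-attention is deliberately omitted to favor single-request latency.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --trust-remote-code \
  --tp 16 \
  --nnodes 16 \
  --node-rank "$NODE_RANK" \
  --dist-init-addr 10.0.0.1:5000 \
  --enable-torch-compile \
  --torch-compile-max-bs 8 \
  --host 0.0.0.0 --port 30000
```

torch.compile mainly speeds up small-batch decoding, which is exactly the single-request case; capping compilation at batch size 8 with --torch-compile-max-bs keeps startup time and memory overhead bounded.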