
Update FP8 kernel configuration for 4xGPU support on AMD #3850

Open — wants to merge 2 commits into main
Conversation


@Eliovp commented on Feb 25, 2025

Motivation

This PR restores and improves FP8 quantization kernel functionality (specific to the DeepSeek R1 model) on systems using 4 GPUs, particularly AMD GPUs. The changes ensure that the proper configuration files are selected based on the number of available GPUs, and that the default settings for the matrix multiplication function are adjusted to better support MFMA instructions at lower GPU counts.

Modifications

  • Added GPU Count Helper: Introduced a new helper function get_num_gpus() to dynamically determine the number of GPUs using torch.cuda.device_count().
  • Updated Config File Selection: Modified get_w8a8_block_fp8_configs to adjust the JSON configuration file naming logic when using HIP and fewer than 8 GPUs.
  • Adjusted Default FP8 MatMul Configuration: Revised w8a8_block_fp8_matmul to provide an alternative configuration for AMD GPUs:
    • For systems with 4 or fewer GPUs, the block sizes and stage count are reduced to ensure compatibility with MFMA instructions.
    • For other cases, the original configuration remains unchanged.
  • Enhanced Compatibility: These updates make it possible to run inference on the DeepSeek R1 model on systems with 4 GPUs without compromising performance on other setups.
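The selection logic described above can be sketched as follows. The function names (`get_num_gpus`, the config-file selection in `get_w8a8_block_fp8_configs`, the defaults in `w8a8_block_fp8_matmul`) come from this PR's description, but the bodies, the JSON naming scheme, and the specific block-size values below are illustrative assumptions, not the actual sglang implementation:

```python
def get_num_gpus() -> int:
    """Count visible GPUs; in sglang this calls torch.cuda.device_count().

    Stubbed to avoid a hard torch dependency in this sketch.
    """
    try:
        import torch
        return torch.cuda.device_count()
    except ImportError:
        return 0


def select_fp8_config_name(base_name: str, is_hip: bool, num_gpus: int) -> str:
    """Pick the JSON tuning-config filename for w8a8 block-FP8 kernels.

    On HIP (AMD) with fewer than 8 GPUs, a GPU-count-specific config is
    chosen; the ",num_gpus=N" naming scheme here is an assumption.
    """
    if is_hip and num_gpus < 8:
        return f"{base_name},num_gpus={num_gpus}.json"
    return f"{base_name}.json"


def default_fp8_matmul_config(is_hip: bool, num_gpus: int) -> dict:
    """Fallback Triton-style tile config when no tuned JSON file exists.

    For AMD systems with 4 or fewer GPUs, block sizes and the stage count
    are reduced so tiles stay compatible with MFMA instructions; the
    concrete numbers are placeholders, not the PR's tuned values.
    """
    if is_hip and num_gpus <= 4:
        return {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 32,
                "BLOCK_SIZE_K": 128, "num_stages": 2}
    # Other setups keep the original, larger configuration.
    return {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64,
            "BLOCK_SIZE_K": 128, "num_stages": 3}
```

For example, on a 4-GPU HIP system `select_fp8_config_name("cfg", True, 4)` would resolve to a 4-GPU-specific file, while CUDA systems and 8-GPU HIP systems keep the unmodified name and the original defaults.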

Example

HSA_NO_SCRATCH_RECLAIM=1 HIP_VISIBLE_DEVICES=4,5,6,7 python3 -m sglang.bench_offline_throughput --model-path deepseek-ai/DeepSeek-R1 --tp 4 --num-prompts 10 --trust-remote-code

====== Offline Throughput Benchmark Result =======
Backend:                                 engine
Successful requests:                     10
Benchmark duration (s):                  24.74
Total input tokens:                      1972
Total generated tokens:                  2784
Request throughput (req/s):              0.40
Input token throughput (tok/s):          79.71
Output token throughput (tok/s):         112.54
Total token throughput (tok/s):          192.25
==================================================
