[Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) #3888

Open · wants to merge 11 commits into main

HandH1998 (Collaborator)

Motivation

Support channel-wise INT8 quantization for DeepSeek V3/R1.
INT8 is well supported on most hardware platforms.

Modifications

Co-authors: @yych0745, @sleepcoo, @b0urnee

  • MoE: the fused MoE Triton kernel supports channel-wise INT8 quantization.
  • Normal (non-MoE) Linear: use the CUTLASS W8A8 INT8 implementation.
  • Quantization config: support W8A8Int8MoEMethod in the w8a8 int8 config.
  • Unit test: add a unit test for the channel-wise INT8 fused MoE Triton kernel.
  • Available weights: the channel-wise INT8 weights of DeepSeek-R1 are released at https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8.
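For intuition, here is a minimal PyTorch sketch of the channel-wise W8A8 INT8 scheme (illustrative only, not the fused Triton/CUTLASS kernels added in this PR): weights carry one symmetric scale per output channel, activations one scale per token, the GEMM accumulates in INT32, and the result is dequantized with both scale vectors.

# Illustrative reference for channel-wise W8A8 INT8 (not the PR's kernel code).
import torch

def quantize_weight_channelwise(w):
    # w: [out_features, in_features]; one symmetric scale per output channel (row)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    w_q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return w_q, scale                                       # scale: [out_features, 1]

def quantize_activation_per_token(x):
    # x: [num_tokens, in_features]; one symmetric scale per token (row)
    scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    x_q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_q, scale                                       # scale: [num_tokens, 1]

def w8a8_int8_linear_reference(x, w):
    x_q, x_s = quantize_activation_per_token(x)
    w_q, w_s = quantize_weight_channelwise(w)
    acc = x_q.to(torch.int32) @ w_q.to(torch.int32).t()     # INT32 accumulation
    return acc.to(torch.float32) * x_s * w_s.t()            # dequantize with both scales

x, w = torch.randn(4, 256), torch.randn(512, 256)
print((w8a8_int8_linear_reference(x, w) - x @ w.t()).abs().max())  # small quantization error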

Performance

As in #3730, we benchmarked on 32 A100s using TP only. We observe no accuracy loss and up to a 50% throughput improvement. Changing the accumulator dtype of the fused MoE Triton kernel from FP32 to INT32 introduces a small accuracy loss but raises the throughput gain to 78%.

| Model | Config | Accuracy (GSM8K) | Accuracy (MMLU) | Output Throughput (QPS=128) |
|---|---|---|---|---|
| BF16 R1 | A100 TP32 | 95.5 | 87.1 | 3342.29 |
| Channel-INT8 R1 (FP32 accumulator) | (A100 TP16) × 2 | 95.8 (+0.3) | 87.2 (+0.1) | 5035.82 = 2517.91 × 2 (+50.6%) |
| Channel-INT8 R1 (INT32 accumulator) | (A100 TP16) × 2 | 95.1 (-0.4) | 86.9 (-0.2) | 5957.48 = 2978.74 × 2 (+78.2%) |

NOTE:

16 A100s are sufficient for INT8 deployment, so we assume two servers are launched for a fair comparison. Without load balancing, we simply estimate the total throughput by doubling the throughput of a single server.

Reproduce

  1. Launch

Remember to add --quantization w8a8_int8. (A quick sanity-check request is sketched after the throughput command below.)

# node0
SGLANG_IS_FLASHINFER_AVAILABLE=false python3 -m sglang.launch_server \
--model meituan/DeepSeek-R1-Channel-INT8 --tp 16 --dist-init-addr \
HEAD_IP:5000 --nnodes 2 --node-rank 0 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8 \
--quantization w8a8_int8
  
# node1
SGLANG_IS_FLASHINFER_AVAILABLE=false python3 -m sglang.launch_server \
--model meituan/DeepSeek-R1-Channel-INT8 --tp 16 --dist-init-addr \
HEAD_IP:5000 --nnodes 2 --node-rank 1 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8 \
--quantization w8a8_int8
  2. Accuracy
# gsm8k
python3 /path/to/sglang/benchmark/gsm8k/bench_sglang.py --num-questions 1400 --parallel 200

# mmlu
python3 /path/to/sglang/benchmark/mmlu/bench_sglang.py --parallel 200
  3. Throughput
# qps=128
python3 -m sglang.bench_serving --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random  --random-input 128 --random-output 128 --num-prompts 1000 --request-rate 128 --random-range-ratio 1.0
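
After both nodes are up, a quick sanity check against the server's native /generate endpoint can confirm that the INT8 model responds before running the benchmarks. This is a sketch that assumes the default sglang HTTP port 30000; replace HEAD_IP (and the port, if you passed --port) with your own values.

# Sanity-check request to the launched server (assumes default port 30000).
import requests

resp = requests.post(
    "http://HEAD_IP:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(resp.json()["text"])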


yych0745 and others added 10 commits February 20, 2025 18:12
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: thomas-zhu-2006 <thomas.zhu.2006@gmail.com>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: b0urnee <2769086541@qq.com>
Co-authored-by: yych0745 <1398089567@qq.com>