[Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) #3888

Open · wants to merge 11 commits into main

HandH1998 (Collaborator)

Motivation

Support channel-wise INT8 quantization for DeepSeek V3/R1.
INT8 is well supported on most hardware platforms.

Modifications

Co-authors: @yych0745, @sleepcoo, @b0urnee

  • MoE: the fused MoE Triton kernel supports channel-wise INT8 quantization.
  • Normal (non-MoE) Linear: use the CUTLASS W8A8 INT8 implementation.
  • Quantization config: support W8A8Int8MoEMethod in the w8a8 int8 config.
  • Unit test: add a unit test for the channel-wise INT8 fused MoE Triton kernel.
  • Available weights: the channel-wise INT8 weights of DeepSeek-R1 are released at https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8.
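For intuition, here is a minimal PyTorch sketch of the channel-wise W8A8 INT8 scheme (illustrative only, not the fused Triton/CUTLASS kernels added in this PR): weights carry one symmetric scale per output channel, activations one scale per token, the GEMM accumulates in INT32, and the result is dequantized with both scale vectors.

# Illustrative reference for channel-wise W8A8 INT8 (not the PR's kernel code).
import torch

def quantize_weight_channelwise(w):
    # w: [out_features, in_features]; one symmetric scale per output channel (row)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    w_q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return w_q, scale                                       # scale: [out_features, 1]

def quantize_activation_per_token(x):
    # x: [num_tokens, in_features]; one symmetric scale per token (row)
    scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    x_q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_q, scale                                       # scale: [num_tokens, 1]

def w8a8_int8_linear_reference(x, w):
    x_q, x_s = quantize_activation_per_token(x)
    w_q, w_s = quantize_weight_channelwise(w)
    acc = x_q.to(torch.int32) @ w_q.to(torch.int32).t()     # INT32 accumulation
    return acc.to(torch.float32) * x_s * w_s.t()            # dequantize with both scales

x, w = torch.randn(4, 256), torch.randn(512, 256)
print((w8a8_int8_linear_reference(x, w) - x @ w.t()).abs().max())  # small quantization error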

Performance

As in #3730, we benchmarked on 32 A100s using TP only. We observe no accuracy loss and up to a 50% throughput improvement. Changing the accumulator dtype of the fused MoE Triton kernel from FP32 to INT32 introduces a small accuracy loss but raises the throughput gain to 78%.

| Model | Config | Accuracy (GSM8K) | Accuracy (MMLU) | Output Throughput (QPS=128) |
|---|---|---|---|---|
| BF16 R1 | A100 TP32 | 95.5 | 87.1 | 3342.29 |
| Channel-INT8 R1 (FP32 accumulator) | (A100 TP16) × 2 | 95.8 (+0.3) | 87.2 (+0.1) | 5035.82 = 2517.91 × 2 (+50.6%) |
| Channel-INT8 R1 (INT32 accumulator) | (A100 TP16) × 2 | 95.1 (-0.4) | 86.9 (-0.2) | 5957.48 = 2978.74 × 2 (+78.2%) |

NOTE:

16 A100s are sufficient for INT8 deployment, so we assume two servers are launched for a fair comparison. Without load balancing, we simply estimate the total throughput by doubling the throughput of a single server.

Reproduce

  1. Launch

Remember to add --quantization w8a8_int8. (A quick sanity-check request is sketched after the throughput command below.)

# node0
SGLANG_IS_FLASHINFER_AVAILABLE=false python3 -m sglang.launch_server \
--model meituan/DeepSeek-R1-Channel-INT8 --tp 16 --dist-init-addr \
HEAD_IP:5000 --nnodes 2 --node-rank 0 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8 \
--quantization w8a8_int8
  
# node1
SGLANG_IS_FLASHINFER_AVAILABLE=false python3 -m sglang.launch_server \
--model meituan/DeepSeek-R1-Channel-INT8 --tp 16 --dist-init-addr \
HEAD_IP:5000 --nnodes 2 --node-rank 1 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8 \
--quantization w8a8_int8
  2. Accuracy
# gsm8k
python3 /path/to/sglang/benchmark/gsm8k/bench_sglang.py --num-questions 1400 --parallel 200

# mmlu
python3 /path/to/sglang/benchmark/mmlu/bench_sglang.py --parallel 200
  3. Throughput
# qps=128
python3 -m sglang.bench_serving --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random  --random-input 128 --random-output 128 --num-prompts 1000 --request-rate 128 --random-range-ratio 1.0
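
After both nodes are up, a quick sanity check against the server's native /generate endpoint can confirm that the INT8 model responds before running the benchmarks. This is a sketch that assumes the default sglang HTTP port 30000; replace HEAD_IP (and the port, if you passed --port) with your own values.

# Sanity-check request to the launched server (assumes default port 30000).
import requests

resp = requests.post(
    "http://HEAD_IP:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(resp.json()["text"])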


yych0745 and others added 10 commits February 20, 2025 18:12
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: thomas-zhu-2006 <thomas.zhu.2006@gmail.com>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: b0urnee <2769086541@qq.com>
Co-authored-by: yych0745 <1398089567@qq.com>