[Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) #3888
+362
−20
Motivation
Support channel-wise INT8 quantization for DeepSeek V3/R1.
INT8 is widely supported across hardware platforms, which makes these models easier to deploy.
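As a minimal NumPy sketch (illustrative only, not SGLang's implementation), channel-wise quantization gives each output channel of a weight matrix its own INT8 scale, so an outlier in one channel does not degrade the precision of the others:

```python
import numpy as np

def quantize_channelwise_int8(w: np.ndarray):
    """Quantize a 2-D weight matrix to INT8 with one scale per output channel (row)."""
    # Per-row max-abs determines that channel's scale; int8 range is [-127, 127].
    max_abs = np.abs(w).max(axis=1, keepdims=True)   # shape (out_channels, 1)
    scale = max_abs / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Round-trip error stays small because each channel's scale adapts to its own range.
w = np.random.randn(8, 16).astype(np.float32)
q, s = quantize_channelwise_int8(w)
w_hat = dequantize(q, s)
```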
Modifications
Co-authors: @yych0745 @sleepcoo @b0urnee
Add W8A8Int8MoEMethod in the w8a8 int8 config.
Performance
Same as #3730, we benchmarked on 32 A100s with TP only. We observe no accuracy loss and up to 50% higher throughput. Changing the accumulator dtype of the fused MoE Triton kernel from fp32 to int32 introduces a small accuracy loss but improves throughput by a further 78%.
NOTE: 16 A100s are sufficient for INT8 deployment, so we assume two servers are launched for a fair comparison. Without load balancing, we estimate total throughput by simply doubling the performance of one server.
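The compute pattern behind these numbers can be sketched in NumPy (a simplified stand-in for the fused MoE Triton kernel, not the PR's code): with an int32 accumulator, raw int8 products are summed exactly and the per-token and per-channel scales are applied once at the end, instead of dequantizing inside the inner loop.

```python
import numpy as np

def int8_matmul(a_q, a_scale, w_q, w_scale):
    """W8A8 matmul sketch: accumulate int8 products in int32, scale once at the end.

    a_q: (m, k) int8 activations with per-token scales a_scale of shape (m, 1)
    w_q: (n, k) int8 weights with per-channel scales w_scale of shape (n, 1)
    """
    # int32 accumulation of int8 products is exact as long as k * 127^2
    # stays below 2^31, so the only rounding happens in the final scaling.
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32).T   # (m, n), exact
    return acc.astype(np.float32) * a_scale * w_scale.T   # apply both scales once

# Hypothetical shapes and scale values, for illustration only.
m, k, n = 4, 64, 8
a_q = np.random.randint(-127, 128, (m, k), dtype=np.int8)
w_q = np.random.randint(-127, 128, (n, k), dtype=np.int8)
a_scale = np.full((m, 1), 0.02, dtype=np.float32)
w_scale = np.full((n, 1), 0.01, dtype=np.float32)
y = int8_matmul(a_q, a_scale, w_q, w_scale)
```

Deferring the scaling is what makes the int32 path faster on hardware with int8 tensor cores; the small accuracy difference reported above comes from where the scales and rounding are applied, not from the integer accumulation itself.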
Reproduce
Remember to add `--quantization w8a8_int8` when launching the server.
Checklist