[WIP] gemm block quantization for llm decoder style #6439
nihui wants to merge 11 commits into Tencent:master
Conversation
Codecov Report

❌ Patch coverage is

@@            Coverage Diff            @@
##           master    #6439     +/-  ##
==========================================
- Coverage   93.22%   93.18%   -0.04%
==========================================
  Files         844      847       +3
  Lines      266236   266651     +415
==========================================
+ Hits       248196   248478     +282
- Misses      18040    18173     +133
The binary size change of libncnn.so (bytes)
Pull request overview
This WIP pull request implements block quantization for GEMM layers to support 4-bit, 6-bit, and 8-bit quantization for LLM decoder-style models. The changes introduce a new quantization tool and corresponding dequantization logic in the GEMM layer implementation.
Key changes:
- New ncnnllm2int468 tool for quantizing GEMM weight matrices with configurable block sizes and bit widths
- Block-based quantization scheme using per-block scaling factors stored in B_data_quantize_scales
- Dequantization logic in gemm.cpp that converts quantized weights back to fp32 during model loading
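The per-block scheme described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the 8-bit path is assumed, and `QuantizedBlock`, `quantize_block`, and `dequantize_block` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical illustration of per-block scale quantization:
// each block of weights shares one scale = absmax / 127 and the
// weights themselves are stored as int8 (the 8-bit case).
struct QuantizedBlock
{
    float scale;
    std::vector<int8_t> q;
};

QuantizedBlock quantize_block(const float* w, int block_size)
{
    float absmax = 0.f;
    for (int i = 0; i < block_size; i++)
        absmax = std::max(absmax, (float)std::fabs(w[i]));

    QuantizedBlock b;
    b.scale = absmax / 127.f;
    b.q.resize(block_size);
    for (int i = 0; i < block_size; i++)
        b.q[i] = (int8_t)(b.scale == 0.f ? 0 : std::lround(w[i] / b.scale));
    return b;
}

// Dequantization mirrors what happens at model load time: multiply
// each stored integer by its block's scale to recover fp32 weights.
void dequantize_block(const QuantizedBlock& b, float* out)
{
    for (size_t i = 0; i < b.q.size(); i++)
        out[i] = b.q[i] * b.scale;
}
```

The round trip is lossy by at most half a quantization step per weight, which is the usual trade-off of absmax scaling.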
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/quantize/ncnnllm2int468.cpp | New quantization tool implementing 4/6/8-bit block quantization with custom bit-packed storage formats |
| tools/quantize/CMakeLists.txt | Build configuration adding the new ncnnllm2int468 executable |
| tools/modelwriter.h | Extended serialization to save block quantization scales for int8_scale_term values 4/5/6 |
| src/layer/gemm.h | Added B_data_quantize_scales member to store per-block scaling factors |
| src/layer/gemm.cpp | Implemented loading and dequantization logic for 4/6/8-bit block-quantized weights |
add_executable(ncnnllm2int468 ncnnllm2int468.cpp)
target_link_libraries(ncnnllm2int468 PRIVATE ncnn)
The new ncnnllm2int468 executable is not added to the virtual project group or installed via ncnn_install_tool(), unlike ncnn2int8 above. This creates inconsistency in how tools are organized and installed.
union i6x4_t
{
    signed char i6[3];
    struct
    {
        signed char i6_a : 6;
        signed char i6_b : 6;
        signed char i6_c : 6;
        signed char i6_d : 6;
    } __attribute__((packed));
};
The i6x4_t union definition is duplicated in both the quantization tool and the dequantization code in gemm.cpp (lines 193-203). Consider moving this to a shared header to maintain consistency and avoid duplication.
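Rather than sharing the bitfield union, the 6-bit packing could also be expressed with explicit shifts and masks, which sidesteps both the duplication and the implementation-defined bitfield layout. A sketch under that assumption (the function names are invented, not from the PR):

```cpp
#include <cstdint>

// Pack four signed 6-bit values (-32..31) into 3 bytes using explicit
// shifts and masks, as a portable alternative to the bitfield union.
void pack_i6x4(const int8_t v[4], uint8_t out[3])
{
    uint32_t bits = 0;
    for (int i = 0; i < 4; i++)
        bits |= (uint32_t)(v[i] & 0x3f) << (6 * i);
    out[0] = bits & 0xff;
    out[1] = (bits >> 8) & 0xff;
    out[2] = (bits >> 16) & 0xff;
}

void unpack_i6x4(const uint8_t in[3], int8_t v[4])
{
    uint32_t bits = in[0] | ((uint32_t)in[1] << 8) | ((uint32_t)in[2] << 16);
    for (int i = 0; i < 4; i++)
    {
        int8_t x = (int8_t)((bits >> (6 * i)) & 0x3f);
        v[i] = (int8_t)((x ^ 0x20) - 0x20); // sign-extend the 6-bit value
    }
}
```

A helper pair like this could live in one shared header and serve both the quantizer and gemm.cpp.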
{
    signed char i4_low : 4;
    signed char i4_high : 4;
} __attribute__((packed));
The __attribute__((packed)) attribute is GCC/Clang-specific and not portable; it will fail to compile on MSVC. Consider using #pragma pack for cross-platform compatibility, or conditionally compiling based on the compiler.
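One conditional-compilation approach is sketched below. The macro names are invented for illustration, and note that bitfield layout itself remains implementation-defined across compilers even once packing is handled:

```cpp
// Hypothetical portability wrapper: MSVC uses #pragma pack, GCC/Clang
// use the packed attribute. Macro names are illustrative, not ncnn API.
#if defined(_MSC_VER)
#define PACKED_STRUCT_BEGIN __pragma(pack(push, 1))
#define PACKED_STRUCT_END   __pragma(pack(pop))
#define PACKED_ATTR
#else
#define PACKED_STRUCT_BEGIN
#define PACKED_STRUCT_END
#define PACKED_ATTR __attribute__((packed))
#endif

PACKED_STRUCT_BEGIN
union i4x2_portable
{
    signed char i4;
    struct
    {
        signed char i4_low : 4;  // holds -8..7
        signed char i4_high : 4; // holds -8..7
    } PACKED_ATTR pair;
};
PACKED_STRUCT_END
```

On both compiler families the union then occupies a single byte, matching the on-disk format.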
union i4x2_t
{
    signed char i4;
    struct
    {
        signed char i4_low : 4;
        signed char i4_high : 4;
    } __attribute__((packed));
};
The i4x2_t union definition is duplicated in both the quantization tool and the dequantization code in gemm.cpp (lines 264-272). Consider moving this to a shared header to maintain consistency and avoid duplication.
const int block_size = 64; // FIXME hardcode
// const int nbits = 8; // FIXME hardcode
const int nbits = 6; // FIXME hardcode
The block_size value of 64 is hardcoded here and also duplicated in gemm.cpp (lines 143, 183, 254). Consider making it a named constant or passing it as a configurable parameter to avoid inconsistencies if this value needs to change.
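One way to lift the hardcoded values would be to accept them as tool arguments and validate them in one place. The helper below is a hypothetical sketch, not the PR's CLI; only the bit widths 4/6/8 mirror what the PR implements:

```cpp
#include <cstdlib>

// Hypothetical helper: derive block_size and nbits from command-line
// arguments instead of hardcoding them, rejecting unsupported values.
int parse_quant_args(int argc, char** argv, int* block_size, int* nbits)
{
    *block_size = argc > 1 ? std::atoi(argv[1]) : 64;
    *nbits = argc > 2 ? std::atoi(argv[2]) : 8;

    if (*block_size <= 0 || *block_size % 2 != 0)
        return -1; // 4-bit packing needs an even number of values per block
    if (*nbits != 4 && *nbits != 6 && *nbits != 8)
        return -1; // only the bit widths the PR supports
    return 0;
}
```

A single validated source of truth like this would keep the tool and the gemm.cpp loader from drifting apart.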
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>