[WIP] gemm block quantization for llm decoder style #6439
nihui wants to merge 11 commits into Tencent:master
Conversation
Codecov Report

❌ Patch coverage is

@@            Coverage Diff            @@
##           master    #6439     +/-  ##
==========================================
- Coverage   93.22%   93.18%   -0.04%
==========================================
  Files         844      847       +3
  Lines      266236   266651     +415
==========================================
+ Hits       248196   248478     +282
- Misses      18040    18173     +133
The binary size change of libncnn.so (bytes)
Pull request overview
This WIP pull request implements block quantization for GEMM layers to support 4-bit, 6-bit, and 8-bit quantization for LLM decoder-style models. The changes introduce a new quantization tool and corresponding dequantization logic in the GEMM layer implementation.
Key changes:
- New ncnnllm2int468 tool for quantizing GEMM weight matrices with configurable block sizes and bit widths
- Block-based quantization scheme using per-block scaling factors stored in B_data_quantize_scales
- Dequantization logic in gemm.cpp that converts quantized weights back to fp32 during model loading
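The per-block scheme described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the 8-bit path is assumed, and `QuantizedBlock`, `quantize_block`, and `dequantize_block` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical illustration of per-block scale quantization:
// each block of weights shares one scale = absmax / 127 and the
// weights themselves are stored as int8 (the 8-bit case).
struct QuantizedBlock
{
    float scale;
    std::vector<int8_t> q;
};

QuantizedBlock quantize_block(const float* w, int block_size)
{
    float absmax = 0.f;
    for (int i = 0; i < block_size; i++)
        absmax = std::max(absmax, (float)std::fabs(w[i]));

    QuantizedBlock b;
    b.scale = absmax / 127.f;
    b.q.resize(block_size);
    for (int i = 0; i < block_size; i++)
        b.q[i] = (int8_t)(b.scale == 0.f ? 0 : std::lround(w[i] / b.scale));
    return b;
}

// Dequantization mirrors what happens at model load time: multiply
// each stored integer by its block's scale to recover fp32 weights.
void dequantize_block(const QuantizedBlock& b, float* out)
{
    for (size_t i = 0; i < b.q.size(); i++)
        out[i] = b.q[i] * b.scale;
}
```

The round trip is lossy by at most half a quantization step per weight, which is the usual trade-off of absmax scaling.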
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/quantize/ncnnllm2int468.cpp | New quantization tool implementing 4/6/8-bit block quantization with custom bit-packed storage formats |
| tools/quantize/CMakeLists.txt | Build configuration adding the new ncnnllm2int468 executable |
| tools/modelwriter.h | Extended serialization to save block quantization scales for int8_scale_term values 4/5/6 |
| src/layer/gemm.h | Added B_data_quantize_scales member to store per-block scaling factors |
| src/layer/gemm.cpp | Implemented loading and dequantization logic for 4/6/8-bit block-quantized weights |
add_executable(ncnnllm2int468 ncnnllm2int468.cpp)
target_link_libraries(ncnnllm2int468 PRIVATE ncnn)
The new ncnnllm2int468 executable is not added to the virtual project group or installed via ncnn_install_tool(), unlike ncnn2int8 above. This creates inconsistency in how tools are organized and installed.
union i6x4_t
{
    signed char i6[3];
    struct
    {
        signed char i6_a : 6;
        signed char i6_b : 6;
        signed char i6_c : 6;
        signed char i6_d : 6;
    } __attribute__((packed));
};
The i6x4_t union definition is duplicated in both the quantization tool and the dequantization code in gemm.cpp (lines 193-203). Consider moving this to a shared header to maintain consistency and avoid duplication.
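Rather than sharing the bitfield union, the 6-bit packing could also be expressed with explicit shifts and masks, which sidesteps both the duplication and the implementation-defined bitfield layout. A sketch under that assumption (the function names are invented, not from the PR):

```cpp
#include <cstdint>

// Pack four signed 6-bit values (-32..31) into 3 bytes using explicit
// shifts and masks, as a portable alternative to the bitfield union.
void pack_i6x4(const int8_t v[4], uint8_t out[3])
{
    uint32_t bits = 0;
    for (int i = 0; i < 4; i++)
        bits |= (uint32_t)(v[i] & 0x3f) << (6 * i);
    out[0] = bits & 0xff;
    out[1] = (bits >> 8) & 0xff;
    out[2] = (bits >> 16) & 0xff;
}

void unpack_i6x4(const uint8_t in[3], int8_t v[4])
{
    uint32_t bits = in[0] | ((uint32_t)in[1] << 8) | ((uint32_t)in[2] << 16);
    for (int i = 0; i < 4; i++)
    {
        int8_t x = (int8_t)((bits >> (6 * i)) & 0x3f);
        v[i] = (int8_t)((x ^ 0x20) - 0x20); // sign-extend the 6-bit value
    }
}
```

A helper pair like this could live in one shared header and serve both the quantizer and gemm.cpp.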
{
    signed char i4_low : 4;
    signed char i4_high : 4;
} __attribute__((packed));
The __attribute__((packed)) attribute is GCC/Clang-specific and not portable; it will fail to compile on MSVC. Consider using #pragma pack for cross-platform compatibility, or conditionally compiling based on the compiler.
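One conditional-compilation approach is sketched below. The macro names are invented for illustration, and note that bitfield layout itself remains implementation-defined across compilers even once packing is handled:

```cpp
// Hypothetical portability wrapper: MSVC uses #pragma pack, GCC/Clang
// use the packed attribute. Macro names are illustrative, not ncnn API.
#if defined(_MSC_VER)
#define PACKED_STRUCT_BEGIN __pragma(pack(push, 1))
#define PACKED_STRUCT_END   __pragma(pack(pop))
#define PACKED_ATTR
#else
#define PACKED_STRUCT_BEGIN
#define PACKED_STRUCT_END
#define PACKED_ATTR __attribute__((packed))
#endif

PACKED_STRUCT_BEGIN
union i4x2_portable
{
    signed char i4;
    struct
    {
        signed char i4_low : 4;  // holds -8..7
        signed char i4_high : 4; // holds -8..7
    } PACKED_ATTR pair;
};
PACKED_STRUCT_END
```

On both compiler families the union then occupies a single byte, matching the on-disk format.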
union i4x2_t
{
    signed char i4;
    struct
    {
        signed char i4_low : 4;
        signed char i4_high : 4;
    } __attribute__((packed));
};
The i4x2_t union definition is duplicated in both the quantization tool and the dequantization code in gemm.cpp (lines 264-272). Consider moving this to a shared header to maintain consistency and avoid duplication.
const int block_size = 64; // FIXME hardcode
// const int nbits = 8; // FIXME hardcode
const int nbits = 6; // FIXME hardcode
The block_size value of 64 is hardcoded here and also duplicated in gemm.cpp (lines 143, 183, 254). Consider making it a named constant or passing it as a configurable parameter to avoid inconsistencies if this value needs to change.
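One way to lift the hardcoded values would be to accept them as tool arguments and validate them in one place. The helper below is a hypothetical sketch, not the PR's CLI; only the bit widths 4/6/8 mirror what the PR implements:

```cpp
#include <cstdlib>

// Hypothetical helper: derive block_size and nbits from command-line
// arguments instead of hardcoding them, rejecting unsupported values.
int parse_quant_args(int argc, char** argv, int* block_size, int* nbits)
{
    *block_size = argc > 1 ? std::atoi(argv[1]) : 64;
    *nbits = argc > 2 ? std::atoi(argv[2]) : 8;

    if (*block_size <= 0 || *block_size % 2 != 0)
        return -1; // 4-bit packing needs an even number of values per block
    if (*nbits != 4 && *nbits != 6 && *nbits != 8)
        return -1; // only the bit widths the PR supports
    return 0;
}
```

A single validated source of truth like this would keep the tool and the gemm.cpp loader from drifting apart.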
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>