Releases: ModelCloud/GPTQModel
GPTQModel v1.9.0
What's Changed
⚡ Offload tokenizer fixes to Toke(n)icer pkg.
⚡ Optimized lm_head quant time and vram usage.
⚡ Optimized DeepSeek v3/R1 model quant vram usage.
⚡ 3x speed-up for Torch kernel when using PyTorch >= 2.5.0 with model.compile().
⚡ New calibration_dataset_concat_size option to enable calibration data concat mode, which mimics the original GPTQ data packing strategy and may improve quant speed and accuracy for datasets like wikitext2 (see the sketch after these highlights).
🐛 Fixed Optimum compat and XPU/IPEX auto kernel selection regression in v1.8.1.
- Fix init arg order and optimum compat by @CSY-ModelCloud in #1240
- [FIX][Optimize] lm_head quantize by @ZX-ModelCloud in #1239
- [Model] [DeepSeek] un-merge gate_proj and up_proj by @LRL-ModelCloud in #1241
- Use Toke(n)icer by @CL-ModelCloud in #1242
- #1244
- Add Tokenicer Test by @CL-ModelCloud in #1245
- prepare for 1.8.2 release by @Qubitium in #1243
- simplify calls to tokenicer by @CL-ModelCloud in #1246
- Update requirements.txt by @Qubitium in #1248
- fix trust_remote was lost by @CSY-ModelCloud in #1249
- fix trust_remote was lost by @CSY-ModelCloud in #1250
- prepare for 1.8.5 release by @Qubitium in #1251
- fix unit tests & tweak logic for selecting backends by @CSY-ModelCloud in #1253
- install tokenicer form git & do ruff by @CSY-ModelCloud in #1254
- fix k,v is not a dict by @CSY-ModelCloud in #1255
- fix not enough values to unpack (expected 2, got 1) by @CSY-ModelCloud in #1256
- fix sglang test requires numpy<2.0 by @CSY-ModelCloud in #1258
- fix ipex backend by @jiqing-feng in #1259
- ipex should be packable, reverted pr #1259 importer.py changes by @CSY-ModelCloud in #1260
- remove sentencepiece by @CSY-ModelCloud in #1261
- speed up torch dequantize by @Qubitium in #1262
- Add calibration_dataset_concat_size option/mode by @LRL-ModelCloud in #1257
- add transformers test by @CSY-ModelCloud in #1264
- Add kernel torch.compile hook by @Qubitium in #1265
- [FIX]fix vl model prepare_dataset by @LRL-ModelCloud in #1266
Full Changelog: v1.8.1...v1.9.0
GPTQModel v1.8.1
What's Changed
⚡ DeepSeek v3/R1 model support.
⚡ New flexible weight packing: allow quantized weights to be packed to [int32, int16, int8] dtypes. Triton and Torch kernels support the full range of the new QuantizeConfig.pack_dtype (see the sketch after these highlights).
⚡ Over 50% speedup for vl model quantization (Qwen 2.5-VL + Ovis).
⚡ New auto_gc: bool control in quantize() which can reduce quantization time for small models with no risk of oom.
⚡ New GPTQModel.push_to_hub() api for easy quant model upload to HF repo.
⚡ New buffered_fwd: bool control in model.quantize().
🐛 Fixed bits=3 packing and group_size=-1 regression in v1.7.4.
🐛 Fixed Google Colab install requiring two install passes.
🐛 Fixed Python 3.10 compatibility.
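A minimal sketch of the flexible pack dtype and the auto_gc toggle. The assumption here is that pack_dtype accepts a torch dtype and that auto_gc is a quantize() keyword; the model id and calibration text are placeholders.

```python
# Minimal sketch: flexible pack dtype + auto_gc toggle (assumed argument forms;
# pack_dtype may also accept a string such as "int16" - verify against the API).
import torch
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = ["gptqmodel is an llm quantization toolkit."] * 64  # placeholder data

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    pack_dtype=torch.int16,  # pack quantized weights into int16 instead of the int32 default
)

model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct", quant_config)
model.quantize(
    calibration_dataset,
    auto_gc=False,  # skip per-step garbage collection to shorten small-model quant time (assumed kwarg)
)
model.save("Qwen2.5-0.5B-gptq-int16-packed")
```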
- Flexible Pack DType by @Qubitium in #1158
- cuda needs to declare pack dtypes by @Qubitium in #1169
- fix pass pack dtype by @Qubitium in #1172
- Pass dtype by @Qubitium in #1173
- move in/out features and grop_size init to base by @Qubitium in #1174
- move self.maxq to base class by @Qubitium in #1175
- consolidate pack() into packer cls by @Qubitium in #1176
- Add pack_dtype to dynamic config and fix validate by @Qubitium in #1178
- Refract 4 by @Qubitium in #1180
- Refractor and simplify multi-kernel selection/init by @Qubitium in #1183
- Update/Refractor Bitblas/Marlin/Cuda by @Qubitium in #1184
- push bitblas logic down by @Qubitium in #1185
- Revert Bitblas to 0.0.1-dev13 by @Qubitium in #1186
- Do not export config.key if value is None by @Qubitium in #1187
- Fix examples/perplexity by @Qubitium in #1191
- [MODEL] add deepseek v3 support by @LRL-ModelCloud in #1127
- Push register buffer down to base class and rename all in/out features by @Qubitium in #1193
- Fix #1196 hf_transfer not accepting max_memory arg by @Qubitium in #1197
- reduce peak memory and reduce quant time by @Qubitium in #1198
- skip zero math by @Qubitium in #1199
- fix test_packing_speed by @Qubitium in #1202
- Update test_quant_time.py by @Qubitium in #1203
- experimental buffered_fwd quantize control by @Qubitium in #1205
- Fix dynamic regression on quant save by @Qubitium in #1208
- Python 3.10 type-hint compt bug by @Qubitium in #1213
- Fix colab install by @Qubitium in #1215
- add GPTQModel.push_to_hub() support by @Qubitium in #1216
- default to 8GB shard-size for model save by @Qubitium in #1217
- Auto gc toggle by @Qubitium in #1219
- fix 3bit packing and inference by @Qubitium in #1218
- fix merge error by @CSY-ModelCloud in #1234
- fix var name by @CSY-ModelCloud in #1235
- fix visual llm slow forward by @LRL-ModelCloud in #1232
Full Changelog: v1.7.4...v1.8.1
GPTQModel v1.8.0
What's Changed
⚡ DeepSeek v3/R1 model support.
⚡ New flexible weight packing: allow quantized weights to be packed to [int32, int16, int8] dtypes. Triton and Torch kernels support the full range of the new QuantizeConfig.pack_dtype.
⚡ New auto_gc: bool control in quantize() which can reduce quantization time for small models with no risk of oom.
⚡ New GPTQModel.push_to_hub() api for easy quant model upload to HF repo (see the sketch after these highlights).
⚡ New buffered_fwd: bool control in model.quantize().
🐛 Fixed bits=3 packing regression in v1.7.4.
🐛 Fixed Google Colab install requiring two install passes.
🐛 Fixed Python 3.10 compatibility.
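A minimal sketch of the experimental buffered_fwd control together with the new push_to_hub() upload helper. The keyword name buffered_fwd comes from these notes; the push_to_hub() argument names (repo_id, quantized_path) and the repo id are assumptions for illustration only.

```python
# Minimal sketch: buffered forward during quantization + hub upload
# (push_to_hub() argument names are assumed; check the current signature).
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = ["gptqmodel quantizes llms with gptq."] * 64  # placeholder data

model = GPTQModel.load("facebook/opt-125m", QuantizeConfig(bits=4, group_size=128))
model.quantize(
    calibration_dataset,
    buffered_fwd=True,  # experimental: buffer layer forwards while quantizing
)
model.save("opt-125m-gptq-4bit")

# Upload the saved quantized folder to the Hugging Face Hub.
GPTQModel.push_to_hub(
    repo_id="your-org/opt-125m-gptq-4bit",  # hypothetical repo id
    quantized_path="opt-125m-gptq-4bit",    # assumed parameter name
)
```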
- start 1.8.0-dev cycle by @Qubitium in #1168
- Flexible Pack DType by @Qubitium in #1158
- cuda needs to declare pack dtypes by @Qubitium in #1169
- fix pass pack dtype by @Qubitium in #1172
- Pass dtype by @Qubitium in #1173
- move in/out features and grop_size init to base by @Qubitium in #1174
- move self.maxq to base class by @Qubitium in #1175
- consolidate pack() into packer cls by @Qubitium in #1176
- Add pack_dtype to dynamic config and fix validate by @Qubitium in #1178
- format by @Qubitium in #1179
- Refract 4 by @Qubitium in #1180
- Refractor and simplify multi-kernel selection/init by @Qubitium in #1183
- Update/Refractor Bitblas/Marlin/Cuda by @Qubitium in #1184
- push bitblas logic down by @Qubitium in #1185
- Revert Bitblas to 0.0.1-dev13 by @Qubitium in #1186
- Do not export config.key if value is None by @Qubitium in #1187
- Fix examples/perplexity by @Qubitium in #1191
- [MODEL] add deepseek v3 support by @LRL-ModelCloud in #1127
- Push register buffer down to base class and rename all in/out features by @Qubitium in #1193
- Fix #1196 hf_transfer not accepting max_memory arg by @Qubitium in #1197
- reduce peak memory and reduce quant time by @Qubitium in #1198
- skip zero math by @Qubitium in #1199
- fix test_packing_speed by @Qubitium in #1202
- Update test_quant_time.py by @Qubitium in #1203
- experimental buffered_fwd quantize control by @Qubitium in #1205
- Fix dynamic regression on quant save by @Qubitium in #1208
- Python 3.10 type-hint compt bug by @Qubitium in #1213
- Fix colab install by @Qubitium in #1215
- add GPTQModel.push_to_hub() support by @Qubitium in #1216
- default to 8GB shard-size for model save by @Qubitium in #1217
- Auto gc toggle by @Qubitium in #1219
- fix 3bit packing and inference by @Qubitium in #1218
Full Changelog: v1.7.4...v1.8.0
GPTQModel v1.7.4
What's Changed
⚡ Faster packing for post-quantization model weight save.
⚡ Triton kernel now validated for Intel/XPU when Intel Triton package is installed.
⚡ New compile() api that allows torch to improve tps by ~4-8%. May need to disable flash_attention for some kernels (see the sketch after these highlights).
🐛 Fix HF Transformers bug of downcasting fast tokenizer class on save.
🐛 Fix inaccurate bpw calculations.
🐛 Fix ROCm compile with setup.py.
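A minimal sketch of the new compile() call on an already-quantized model. The repo id is a placeholder, and the assumption is that model.tokenizer and model.device are available on the loaded model; no compile() options are assumed.

```python
# Minimal sketch: torch-compile a loaded quantized model for faster inference.
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/some-gptq-4bit-model")  # placeholder quantized repo
model.compile()  # wraps the model with torch.compile; requires a recent PyTorch

prompt = "gptqmodel is"
inputs = model.tokenizer(prompt, return_tensors="pt").to(model.device)  # model.device assumed to proxy the HF model
print(model.tokenizer.decode(model.generate(**inputs)[0]))
```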
- Fix exllama slow pack() by @CSY-ModelCloud in #1128
- use optimized torch.round() codes by @CSY-ModelCloud in #1131
- fix shape mismatch for packing by @CSY-ModelCloud in #1132
- Speed up triton dequant by @Qubitium in #1136
- add torch compile with backend aot_ts by @CSY-ModelCloud in #1139
- disable sampling by @Qubitium in #1141
- mod triton-xpu by @CL-ModelCloud in #1135
- supress dynamo error by @CSY-ModelCloud in #1143
- fix bpw by @CL-ModelCloud in #1150
- [FIX] fix incorrectly saved the slow tokenizer by @LRL-ModelCloud in #1151
- Add mod chat by @CL-ModelCloud in #1154
- optimize pack by @Qubitium in #1153
- add quant time test by @CL-ModelCloud in #1155
- Export to hf model by @LRL-ModelCloud in #1157
- Fix bpw calculation by @Qubitium in #1163
- Inference speed test by @CL-ModelCloud in #1159
Full Changelog: v1.7.3...v1.7.4
GPTQModel v1.7.3
What's Changed
⚡ Telechat2 (China Telecom) model support
⚡ PhiMoE model support
🐛 Fix lm_head weights duplicated in post-quantize save() for models with tied embeddings.
- Add util.tensor_parameters() by @ZX-ModelCloud in #1107
- add require_dtype by @LRL-ModelCloud in #1109
- [MODEL] Add Telechat2 (China Telecom) by @1096125073 in #1106
- [FIX] Filter weight-sharing tensors when save by @ZX-ModelCloud in #1112
- Add telechat test by @LRL-ModelCloud in #1111
- [FIX] fix convert_gptq_to_mlx_weights by @LRL-ModelCloud in #1113
- add test_parameter_count.py by @ZX-ModelCloud in #1115
- Add gpqa eval task by @CL-ModelCloud in #1117
- [FIX] Call tied_weights() after load_checkpoint_in_model() by @ZX-ModelCloud in #1119
- add phimoe support by @CSY-ModelCloud in #1118
New Contributors
- @1096125073 made their first contribution in #1106
Full Changelog: v1.7.2...v1.7.3
GPTQModel v1.7.2
What's Changed
⚡ Effective BPW (bits per weight) will now be logged during load().
⚡ Reduce loading time on Intel Arc A770/B580 XPU by 3.3x.
⚡ Reduce memory usage in MLX conversion.
🐛 Fix Marlin kernel auto-select not checking CUDA compute version.
- remove catching module error by @CSY-ModelCloud in #1088
- [FIX] monkey patch GPTQShuffle.convert_idx to use fixed convert_idx by @LRL-ModelCloud in #1090
- [FIX] monkey patch only once by @LRL-ModelCloud in #1091
- check CC >= 8 for marlin, fixed #1092 by @CSY-ModelCloud in #1093
- check compute capability for marlin in validate_device() by @CSY-ModelCloud in #1095
- torch get device with index of CUDA_VISIBLE_DEVICES, not value of it by @CSY-ModelCloud in #1096
- fix local model path & marlin test by @CSY-ModelCloud in #1097
- mod bits info by @CL-ModelCloud in #1100
- Reduce memory usage in mlx conversion by @Qubitium in #1099
- cleanup mlx code by @Qubitium in #1101
Full Changelog: v1.7.0...v1.7.2
GPTQModel v1.7.0
What's Changed
⚡ backend.MLX added for runtime conversion and execution of GPTQ models on Apple's MLX framework on Apple Silicon (M1+). Export of gptq models to mlx is also now possible (see the sketch after these highlights). We have added mlx exported models to huggingface.co/ModelCloud.
⚡ lm_head quantization is now fully supported by GPTQModel without external pkg dependency.
🐛 Fixed setup.py not correctly detecting incompatible setuptools/wheel pkgs.
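A minimal sketch of two features from this release. The BACKEND.MLX enum value follows the backend.MLX mention above, the lm_head flag on QuantizeConfig is assumed from PR #1037, and the repo id is a placeholder.

```python
# Minimal sketch (field/enum names assumed where noted).
from gptqmodel import GPTQModel, QuantizeConfig, BACKEND

# 1) Quantize lm_head along with the rest of the model
#    (assumes QuantizeConfig exposes an lm_head flag per PR #1037).
quant_config = QuantizeConfig(bits=4, group_size=128, lm_head=True)

# 2) Load an already-quantized model through the MLX backend on Apple Silicon (M1+);
#    conversion happens at load time. Repo id is a placeholder.
mlx_model = GPTQModel.load("ModelCloud/some-gptq-4bit-model", backend=BACKEND.MLX)
```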
- [CI] run tests with linux tag by @CSY-ModelCloud in #1067
- Add backend.MLX by @LRL-ModelCloud in #1061
- add mlx generate test by @CL-ModelCloud in #1069
- [CI] upload source in build step by @CSY-ModelCloud in #1070
- code review by @CL-ModelCloud in #1072
- [CI] install mlx by @CSY-ModelCloud in #1071
- Add option to quantize lm_head by @ZX-ModelCloud in #1037
- fix test_packing by @LRL-ModelCloud in #1073
- [CI] add mlx test by @CSY-ModelCloud in #1074
- [CI] fix ci relase env name by @CSY-ModelCloud in #1078
- update mlx test by @CSY-ModelCloud in #1079
- convert to mlx support desc_act true by @LRL-ModelCloud in #1082
- [CI] add extra-index-url for pip install by @CSY-ModelCloud in #1083
- catch module error for setup.py by @CSY-ModelCloud in #1084
Full Changelog: v1.6.1...v1.7.0
GPTQModel v1.6.1
What's Changed
🎉 New OpenAI api compatible end-point via model.serve(host, port) (see the sketch after these highlights).
⚡ Auto-enable flash-attention2 for inference.
🐛 Fixed sym=False loading regression.
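A minimal sketch of serving a loaded model behind the OpenAI-compatible endpoint; the repo id, host, and port are placeholders, and serve() is called positionally as written in the highlight above.

```python
# Minimal sketch: expose a quantized model via an OpenAI-compatible endpoint.
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/some-gptq-4bit-model")  # placeholder quantized repo
model.serve("0.0.0.0", 8000)  # blocks and serves an OpenAI-compatible API on port 8000
```

Once the server is up, any OpenAI-compatible client should be able to point its base URL at the host and port passed to serve().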
- code opt by @CL-ModelCloud in #1038
- fix marlin validate rocm & do validate() if backend not AUTO by @CSY-ModelCloud in #1040
- add global rocm check by @CSY-ModelCloud in #1043
- [FIX] pass sym to make_quant by @LRL-ModelCloud in #1046
- enable flash attn for loading quantized by @CSY-ModelCloud in #1045
- add flash_attn2 test by @CSY-ModelCloud in #1047
- enable flash_attention only when device is cuda by @CSY-ModelCloud in #1050
- move flash attn test to correct folder by @CSY-ModelCloud in #1052
- Expose openai server api by @CL-ModelCloud in #1048
- update openai server by @CL-ModelCloud in #1058
- don't download whl for xpu env by @CSY-ModelCloud in #1059
- remove build tag for normal release by @CSY-ModelCloud in #1063
- disable flash attn 2 for internlm by @CSY-ModelCloud in #1065
Full Changelog: v1.6.0...v1.6.1
GPTQModel v1.6.0
What's Changed
⚡ 25% faster quantization. 35% reduction in vram usage vs v1.5. 👀
🎉 AMD ROCm (6.2+) support added and validated for 7900XT+ GPU.
💫 Auto-tokenizer loader via load() api. For most models you no longer need to manually init a tokenizer for both inference and quantization.
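A minimal sketch of the auto-tokenizer loader: after load(), the tokenizer is assumed to be reachable as model.tokenizer, and model.device is assumed to proxy the underlying HF model; the repo id is a placeholder.

```python
# Minimal sketch: load() wires up the tokenizer automatically (assumed attribute names).
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/some-gptq-4bit-model")  # placeholder quantized repo

inputs = model.tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(model.tokenizer.decode(model.generate(**inputs)[0]))
```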
- note about batch_size to speed up quant by @Qubitium in #992
- Add ROCm support by @CSY-ModelCloud in #993
- Add bits test by @ZX-ModelCloud in #995
- note about rocm support by @Qubitium in #998
- [FIX] wrong variable name by @ZX-ModelCloud in #997
- update rocm version tag by @CSY-ModelCloud in #999
- Auto-tokenizer will be called within load() by @LRL-ModelCloud in #996
- update transformers by @Qubitium in #1001
- [FIX] torch qlinear forward by @ZX-ModelCloud in #1002
- cleanup marlin info by @Qubitium in #1004
- Use custom forward hook by @LRL-ModelCloud in #1003
- fix hooked linear init by @LRL-ModelCloud in #1011
- add HookedConv1D by @LRL-ModelCloud in #1012
- record fwd time by @LRL-ModelCloud in #1013
- add PYTORCH_CUDA_ALLOC_CONF for global & do ruff by @CSY-ModelCloud in #1015
- [FIX] quantize_config could not read from config.json by @ZX-ModelCloud in #1022
- Fix quant time by @LRL-ModelCloud in #1025
- fix forward hook by @LRL-ModelCloud in #1027
- Fix hooked conv2d by @LRL-ModelCloud in #1030
- clean cache by @CL-ModelCloud in #1032
Full Changelog: v1.5.1...v1.6.0
GPTQModel v1.5.1
What's Changed
🎉 2025!
⚡ Added QuantizeConfig.device to clearly define which device is used for quantization: default = auto. Non-quantized models are always loaded on cpu by default and each layer is moved to QuantizeConfig.device during quantization to minimize vram usage (see the sketch after these highlights).
💫 Improve QuantLinear selection from optimum.
🐛 Fix attn_implementation_autoset compat in latest transformers.
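A minimal sketch of pinning quantization work to a specific device via QuantizeConfig.device; the assumption is that device is accepted as a QuantizeConfig constructor argument, and the "cuda:0" value, model id, and calibration text are placeholders.

```python
# Minimal sketch: explicit quantization device (constructor kwarg assumed).
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = ["layers stay on cpu and move to the quant device one at a time."] * 64

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    device="cuda:0",  # default is auto; layers are moved here one by one during quantization
)

model = GPTQModel.load("facebook/opt-125m", quant_config)
model.quantize(calibration_dataset)
model.save("opt-125m-gptq-4bit")
```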
- Add QuantizeConfig.device and use. by @Qubitium in #950
- fix hf_select_quant_linear by @LRL-ModelCloud in #966
- update vllm gptq_marlin code by @ZX-ModelCloud in #967
- fix cuda:0 not a enum device by @CSY-ModelCloud in #968
- fix marlin info for non-cuda device by @Qubitium in #972
- fix backend str bug by @CL-ModelCloud in #973
- hf select quant_linear with pack by @LRL-ModelCloud in #969
- remove auto select BACKEND.IPEX by @CSY-ModelCloud in #975
- fix autoround received a device_map by @CSY-ModelCloud in #976
- use enum instead of magic number by @CSY-ModelCloud in #979
- use new ci docker images by @CSY-ModelCloud in #980
- fix flash attntion was auto loaded on cpu for pretrained model by @CSY-ModelCloud in #981
- fix old transformer doesn't have _attn_implementation_autoset by @CSY-ModelCloud in #982
- fix gptbigcode test temporally by @CSY-ModelCloud in #983
- fix version parsing by @CSY-ModelCloud in #985
Full Changelog: v1.5.0...v1.5.1