Releases: ModelCloud/GPTQModel
GPTQModel v1.9.0
What's Changed
⚡ Offload tokenizer fixes to Toke(n)icer pkg.
⚡ Optimized lm_head quant time and vram usage.
⚡ Optimized DeepSeek v3/R1 model quant vram usage.
⚡ 3x speed-up for Torch kernel when using PyTorch >= 2.5.0 with model.compile().
⚡ New calibration_dataset_concat_size option to enable calibration data concat mode, which mimics the original GPTQ data packing strategy and may improve quant speed and accuracy for datasets like wikitext2 (see the sketch after these highlights).
🐛 Fixed Optimum compat and XPU/IPEX auto kernel selection regression in v1.8.1.
- Fix init arg order and optimum compat by @CSY-ModelCloud in #1240
- [FIX][Optimize] lm_head quantize by @ZX-ModelCloud in #1239
- [Model] [DeepSeek] un-merge gate_proj and up_proj by @LRL-ModelCloud in #1241
- Use Toke(n)icer by @CL-ModelCloud in #1242
- #1244
- Add Tokenicer Test by @CL-ModelCloud in #1245
- prepare for 1.8.2 release by @Qubitium in #1243
- simplify calls to tokenicer by @CL-ModelCloud in #1246
- Update requirements.txt by @Qubitium in #1248
- fix trust_remote was lost by @CSY-ModelCloud in #1249
- fix trust_remote was lost by @CSY-ModelCloud in #1250
- prepare for 1.8.5 release by @Qubitium in #1251
- fix unit tests & tweak logic for selecting backends by @CSY-ModelCloud in #1253
- install tokenicer form git & do ruff by @CSY-ModelCloud in #1254
- fix k,v is not a dict by @CSY-ModelCloud in #1255
- fix not enough values to unpack (expected 2, got 1) by @CSY-ModelCloud in #1256
- fix sglang test requires numpy<2.0 by @CSY-ModelCloud in #1258
- fix ipex backend by @jiqing-feng in #1259
- ipex should be packable, reverted pr #1259 importer.py changes by @CSY-ModelCloud in #1260
- remove sentencepiece by @CSY-ModelCloud in #1261
- speed up torch dequantize by @Qubitium in #1262
- Add calibration_dataset_concat_size option/mode by @LRL-ModelCloud in #1257
- add transformers test by @CSY-ModelCloud in #1264
- Add kernel torch.compile hook by @Qubitium in #1265
- [FIX]fix vl model prepare_dataset by @LRL-ModelCloud in #1266
Full Changelog: v1.8.1...v1.9.0
GPTQModel v1.8.1
What's Changed
⚡ DeepSeek v3/R1 model support.
⚡ New flexible weight packing: allow quantized weights to be packed to [int32, int16, int8] dtypes. Triton and Torch kernels support the full range of the new QuantizeConfig.pack_dtype (see the sketch after these highlights).
⚡ Over 50% speedup for vl model quantization (Qwen 2.5-VL + Ovis).
⚡ New auto_gc: bool control in quantize() which can reduce quantization time for small models with no risk of oom.
⚡ New GPTQModel.push_to_hub() api for easy quant model upload to HF repo.
⚡ New buffered_fwd: bool control in model.quantize().
🐛 Fixed bits=3 packing and group_size=-1 regression in v1.7.4.
🐛 Fixed Google Colab install requiring two install passes.
🐛 Fixed Python 3.10 compatibility.
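A minimal sketch of the flexible pack dtype and the auto_gc toggle. The assumption here is that pack_dtype accepts a torch dtype and that auto_gc is a quantize() keyword; the model id and calibration text are placeholders.

```python
# Minimal sketch: flexible pack dtype + auto_gc toggle (assumed argument forms;
# pack_dtype may also accept a string such as "int16" - verify against the API).
import torch
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = ["gptqmodel is an llm quantization toolkit."] * 64  # placeholder data

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    pack_dtype=torch.int16,  # pack quantized weights into int16 instead of the int32 default
)

model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct", quant_config)
model.quantize(
    calibration_dataset,
    auto_gc=False,  # skip per-step garbage collection to shorten small-model quant time (assumed kwarg)
)
model.save("Qwen2.5-0.5B-gptq-int16-packed")
```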
- Flexible Pack DType by @Qubitium in #1158
- cuda needs to declare pack dtypes by @Qubitium in #1169
- fix pass pack dtype by @Qubitium in #1172
- Pass dtype by @Qubitium in #1173
- move in/out features and grop_size init to base by @Qubitium in #1174
- move self.maxq to base class by @Qubitium in #1175
- consolidate pack() into packer cls by @Qubitium in #1176
- Add pack_dtype to dynamic config and fix validate by @Qubitium in #1178
- Refract 4 by @Qubitium in #1180
- Refractor and simplify multi-kernel selection/init by @Qubitium in #1183
- Update/Refractor Bitblas/Marlin/Cuda by @Qubitium in #1184
- push bitblas logic down by @Qubitium in #1185
- Revert Bitblas to 0.0.1-dev13 by @Qubitium in #1186
- Do not export config.key if value is None by @Qubitium in #1187
- Fix examples/perplexity by @Qubitium in #1191
- [MODEL] add deepseek v3 support by @LRL-ModelCloud in #1127
- Push register buffer down to base class and rename all in/out features by @Qubitium in #1193
- Fix #1196 hf_transfer not accepting max_memory arg by @Qubitium in #1197
- reduce peak memory and reduce quant time by @Qubitium in #1198
- skip zero math by @Qubitium in #1199
- fix test_packing_speed by @Qubitium in #1202
- Update test_quant_time.py by @Qubitium in #1203
- experimental buffered_fwd quantize control by @Qubitium in #1205
- Fix dynamic regression on quant save by @Qubitium in #1208
- Python 3.10 type-hint compt bug by @Qubitium in #1213
- Fix colab install by @Qubitium in #1215
- add GPTQModel.push_to_hub() support by @Qubitium in #1216
- default to 8GB shard-size for model save by @Qubitium in #1217
- Auto gc toggle by @Qubitium in #1219
- fix 3bit packing and inference by @Qubitium in #1218
- fix merge error by @CSY-ModelCloud in #1234
- fix var name by @CSY-ModelCloud in #1235
- fix visual llm slow forward by @LRL-ModelCloud in #1232
Full Changelog: v1.7.4...v1.8.1
GPTQModel v1.8.0
What's Changed
⚡ DeepSeek v3/R1 model support.
⚡ New flexible weight packing: allow quantized weights to be packed to [int32, int16, int8] dtypes. Triton and Torch kernels support the full range of the new QuantizeConfig.pack_dtype.
⚡ New auto_gc: bool control in quantize() which can reduce quantization time for small models with no risk of oom.
⚡ New GPTQModel.push_to_hub() api for easy quant model upload to HF repo (see the sketch after these highlights).
⚡ New buffered_fwd: bool control in model.quantize().
🐛 Fixed bits=3 packing regression in v1.7.4.
🐛 Fixed Google Colab install requiring two install passes.
🐛 Fixed Python 3.10 compatibility.
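A minimal sketch of the experimental buffered_fwd control together with the new push_to_hub() upload helper. The keyword name buffered_fwd comes from these notes; the push_to_hub() argument names (repo_id, quantized_path) and the repo id are assumptions for illustration only.

```python
# Minimal sketch: buffered forward during quantization + hub upload
# (push_to_hub() argument names are assumed; check the current signature).
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = ["gptqmodel quantizes llms with gptq."] * 64  # placeholder data

model = GPTQModel.load("facebook/opt-125m", QuantizeConfig(bits=4, group_size=128))
model.quantize(
    calibration_dataset,
    buffered_fwd=True,  # experimental: buffer layer forwards while quantizing
)
model.save("opt-125m-gptq-4bit")

# Upload the saved quantized folder to the Hugging Face Hub.
GPTQModel.push_to_hub(
    repo_id="your-org/opt-125m-gptq-4bit",  # hypothetical repo id
    quantized_path="opt-125m-gptq-4bit",    # assumed parameter name
)
```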
- start 1.8.0-dev cycle by @Qubitium in #1168
- Flexible Pack DType by @Qubitium in #1158
- cuda needs to declare pack dtypes by @Qubitium in #1169
- fix pass pack dtype by @Qubitium in #1172
- Pass dtype by @Qubitium in #1173
- move in/out features and grop_size init to base by @Qubitium in #1174
- move self.maxq to base class by @Qubitium in #1175
- consolidate pack() into packer cls by @Qubitium in #1176
- Add pack_dtype to dynamic config and fix validate by @Qubitium in #1178
- format by @Qubitium in #1179
- Refract 4 by @Qubitium in #1180
- Refractor and simplify multi-kernel selection/init by @Qubitium in #1183
- Update/Refractor Bitblas/Marlin/Cuda by @Qubitium in #1184
- push bitblas logic down by @Qubitium in #1185
- Revert Bitblas to 0.0.1-dev13 by @Qubitium in #1186
- Do not export config.key if value is None by @Qubitium in #1187
- Fix examples/perplexity by @Qubitium in #1191
- [MODEL] add deepseek v3 support by @LRL-ModelCloud in #1127
- Push register buffer down to base class and rename all in/out features by @Qubitium in #1193
- Fix #1196 hf_transfer not accepting max_memory arg by @Qubitium in #1197
- reduce peak memory and reduce quant time by @Qubitium in #1198
- skip zero math by @Qubitium in #1199
- fix test_packing_speed by @Qubitium in #1202
- Update test_quant_time.py by @Qubitium in #1203
- experimental buffered_fwd quantize control by @Qubitium in #1205
- Fix dynamic regression on quant save by @Qubitium in #1208
- Python 3.10 type-hint compt bug by @Qubitium in #1213
- Fix colab install by @Qubitium in #1215
- add GPTQModel.push_to_hub() support by @Qubitium in #1216
- default to 8GB shard-size for model save by @Qubitium in #1217
- Auto gc toggle by @Qubitium in #1219
- fix 3bit packing and inference by @Qubitium in #1218
Full Changelog: v1.7.4...v1.8.0
GPTQModel v1.7.4
What's Changed
⚡ Faster packing for post-quantization model weight save.
⚡ Triton kernel now validated for Intel/XPU when Intel Triton package is installed.
⚡ New compile() api that allows torch to improve tps by ~4-8%. May need to disable flash_attention for some kernels (see the sketch after these highlights).
🐛 Fix HF Transformers bug of downcasting fast tokenizer class on save.
🐛 Fix inaccurate bpw calculations.
🐛 Fix ROCm compile with setup.py.
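A minimal sketch of the new compile() call on an already-quantized model. The repo id is a placeholder, and the assumption is that model.tokenizer and model.device are available on the loaded model; no compile() options are assumed.

```python
# Minimal sketch: torch-compile a loaded quantized model for faster inference.
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/some-gptq-4bit-model")  # placeholder quantized repo
model.compile()  # wraps the model with torch.compile; requires a recent PyTorch

prompt = "gptqmodel is"
inputs = model.tokenizer(prompt, return_tensors="pt").to(model.device)  # model.device assumed to proxy the HF model
print(model.tokenizer.decode(model.generate(**inputs)[0]))
```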
- Fix exllama slow pack() by @CSY-ModelCloud in #1128
- use optimized torch.round() codes by @CSY-ModelCloud in #1131
- fix shape mismatch for packing by @CSY-ModelCloud in #1132
- Speed up triton dequant by @Qubitium in #1136
- add torch compile with backend aot_ts by @CSY-ModelCloud in #1139
- disable sampling by @Qubitium in #1141
- mod triton-xpu by @CL-ModelCloud in #1135
- supress dynamo error by @CSY-ModelCloud in #1143
- fix bpw by @CL-ModelCloud in #1150
- [FIX] fix incorrectly saved the slow tokenizer by @LRL-ModelCloud in #1151
- Add mod chat by @CL-ModelCloud in #1154
- optimize pack by @Qubitium in #1153
- add quant time test by @CL-ModelCloud in #1155
- Export to hf model by @LRL-ModelCloud in #1157
- Fix bpw calculation by @Qubitium in #1163
- Inference speed test by @CL-ModelCloud in #1159
Full Changelog: v1.7.3...v1.7.4
GPTQModel v1.7.3
What's Changed
⚡ Telechat2 (China Telecom) model support
⚡ PhiMoE model support
🐛 Fix lm_head weights duplicated in post-quantize save() for models with tied embeddings.
- Add util.tensor_parameters() by @ZX-ModelCloud in #1107
- add require_dtype by @LRL-ModelCloud in #1109
- [MODEL] Add Telechat2 (China Telecom) by @1096125073 in #1106
- [FIX] Filter weight-sharing tensors when save by @ZX-ModelCloud in #1112
- Add telechat test by @LRL-ModelCloud in #1111
- [FIX] fix convert_gptq_to_mlx_weights by @LRL-ModelCloud in #1113
- add test_parameter_count.py by @ZX-ModelCloud in #1115
- Add gpqa eval task by @CL-ModelCloud in #1117
- [FIX] Call tied_weights() after load_checkpoint_in_model() by @ZX-ModelCloud in #1119
- add phimoe support by @CSY-ModelCloud in #1118
New Contributors
- @1096125073 made their first contribution in #1106
Full Changelog: v1.7.2...v1.7.3
GPTQModel v1.7.2
What's Changed
⚡ Effective BPW (bits per weight) will now be logged during load().
⚡ Reduce loading time on Intel Arc A770/B580 XPU by 3.3x.
⚡ Reduce memory usage in MLX conversion.
🐛 Fix Marlin kernel auto-select not checking CUDA compute version.
- remove catching module error by @CSY-ModelCloud in #1088
- [FIX] monkey patch GPTQShuffle.convert_idx to use fixed convert_idx by @LRL-ModelCloud in #1090
- [FIX] monkey patch only once by @LRL-ModelCloud in #1091
- check CC >= 8 for marlin, fixed #1092 by @CSY-ModelCloud in #1093
- check compute capability for marlin in validate_device() by @CSY-ModelCloud in #1095
- torch get device with index of CUDA_VISIBLE_DEVICES, not value of it by @CSY-ModelCloud in #1096
- fix local model path & marlin test by @CSY-ModelCloud in #1097
- mod bits info by @CL-ModelCloud in #1100
- Reduce memory usage in mlx conversion by @Qubitium in #1099
- cleanup mlx code by @Qubitium in #1101
Full Changelog: v1.7.0...v1.7.2
GPTQModel v1.7.0
What's Changed
⚡ backend.MLX added for runtime conversion and execution of GPTQ models on Apple's MLX framework on Apple Silicon (M1+). Export of gptq models to mlx is also now possible (see the sketch after these highlights). We have added mlx exported models to huggingface.co/ModelCloud.
⚡ lm_head quantization is now fully supported by GPTQModel without external pkg dependency.
🐛 Fixed setup.py not correctly detecting incompatible setuptools/wheel pkgs.
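A minimal sketch of two features from this release. The BACKEND.MLX enum value follows the backend.MLX mention above, the lm_head flag on QuantizeConfig is assumed from PR #1037, and the repo id is a placeholder.

```python
# Minimal sketch (field/enum names assumed where noted).
from gptqmodel import GPTQModel, QuantizeConfig, BACKEND

# 1) Quantize lm_head along with the rest of the model
#    (assumes QuantizeConfig exposes an lm_head flag per PR #1037).
quant_config = QuantizeConfig(bits=4, group_size=128, lm_head=True)

# 2) Load an already-quantized model through the MLX backend on Apple Silicon (M1+);
#    conversion happens at load time. Repo id is a placeholder.
mlx_model = GPTQModel.load("ModelCloud/some-gptq-4bit-model", backend=BACKEND.MLX)
```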
- [CI] run tests with linux tag by @CSY-ModelCloud in #1067
- Add backend.MLX by @LRL-ModelCloud in #1061
- add mlx generate test by @CL-ModelCloud in #1069
- [CI] upload source in build step by @CSY-ModelCloud in #1070
- code review by @CL-ModelCloud in #1072
- [CI] install mlx by @CSY-ModelCloud in #1071
- Add option to quantize lm_head by @ZX-ModelCloud in #1037
- fix test_packing by @LRL-ModelCloud in #1073
- [CI] add mlx test by @CSY-ModelCloud in #1074
- [CI] fix ci relase env name by @CSY-ModelCloud in #1078
- update mlx test by @CSY-ModelCloud in #1079
- convert to mlx support desc_act true by @LRL-ModelCloud in #1082
- [CI] add extra-index-url for pip install by @CSY-ModelCloud in #1083
- catch module error for setup.py by @CSY-ModelCloud in #1084
Full Changelog: v1.6.1...v1.7.0
GPTQModel v1.6.1
What's Changed
🎉 New OpenAI api compatible end-point via model.serve(host, port) (see the sketch after these highlights).
⚡ Auto-enable flash-attention2 for inference.
🐛 Fixed sym=False loading regression.
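A minimal sketch of serving a loaded model behind the OpenAI-compatible endpoint; the repo id, host, and port are placeholders, and serve() is called positionally as written in the highlight above.

```python
# Minimal sketch: expose a quantized model via an OpenAI-compatible endpoint.
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/some-gptq-4bit-model")  # placeholder quantized repo
model.serve("0.0.0.0", 8000)  # blocks and serves an OpenAI-compatible API on port 8000
```

Once the server is up, any OpenAI-compatible client should be able to point its base URL at the host and port passed to serve().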
- code opt by @CL-ModelCloud in #1038
- fix marlin validate rocm & do validate() if backend not AUTO by @CSY-ModelCloud in #1040
- add global rocm check by @CSY-ModelCloud in #1043
- [FIX] pass sym to make_quant by @LRL-ModelCloud in #1046
- enable flash attn for loading quantized by @CSY-ModelCloud in #1045
- add flash_attn2 test by @CSY-ModelCloud in #1047
- enable flash_attention only when device is cuda by @CSY-ModelCloud in #1050
- move flash attn test to correct folder by @CSY-ModelCloud in #1052
- Expose openai server api by @CL-ModelCloud in #1048
- update openai server by @CL-ModelCloud in #1058
- don't download whl for xpu env by @CSY-ModelCloud in #1059
- remove build tag for normal release by @CSY-ModelCloud in #1063
- disable flash attn 2 for internlm by @CSY-ModelCloud in #1065
Full Changelog: v1.6.0...v1.6.1
GPTQModel v1.6.0
What's Changed
⚡ 25% faster quantization. 35% reduction in vram usage vs v1.5. 👀
🎉 AMD ROCm (6.2+) support added and validated for 7900XT+ GPU.
💫 Auto-tokenizer loader via load() api. For most models you no longer need to manually init a tokenizer for both inference and quantization.
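A minimal sketch of the auto-tokenizer loader: after load(), the tokenizer is assumed to be reachable as model.tokenizer, and model.device is assumed to proxy the underlying HF model; the repo id is a placeholder.

```python
# Minimal sketch: load() wires up the tokenizer automatically (assumed attribute names).
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/some-gptq-4bit-model")  # placeholder quantized repo

inputs = model.tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(model.tokenizer.decode(model.generate(**inputs)[0]))
```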
- note about batch_size to speed up quant by @Qubitium in #992
- Add ROCm support by @CSY-ModelCloud in #993
- Add bits test by @ZX-ModelCloud in #995
- note about rocm support by @Qubitium in #998
- [FIX] wrong variable name by @ZX-ModelCloud in #997
- update rocm version tag by @CSY-ModelCloud in #999
- Auto-tokenizer will be called within load() by @LRL-ModelCloud in #996
- update transformers by @Qubitium in #1001
- [FIX] torch qlinear forward by @ZX-ModelCloud in #1002
- cleanup marlin info by @Qubitium in #1004
- Use custom forward hook by @LRL-ModelCloud in #1003
- fix hooked linear init by @LRL-ModelCloud in #1011
- add HookedConv1D by @LRL-ModelCloud in #1012
- record fwd time by @LRL-ModelCloud in #1013
- add PYTORCH_CUDA_ALLOC_CONF for global & do ruff by @CSY-ModelCloud in #1015
- [FIX] quantize_config could not read from config.json by @ZX-ModelCloud in #1022
- Fix quant time by @LRL-ModelCloud in #1025
- fix forward hook by @LRL-ModelCloud in #1027
- Fix hooked conv2d by @LRL-ModelCloud in #1030
- clean cache by @CL-ModelCloud in #1032
Full Changelog: v1.5.1...v1.6.0
GPTQModel v1.5.1
What's Changed
🎉 2025!
⚡ Added QuantizeConfig.device to clearly define which device is used for quantization: default = auto. Non-quantized models are always loaded on cpu by default and each layer is moved to QuantizeConfig.device during quantization to minimize vram usage (see the sketch after these highlights).
💫 Improve QuantLinear selection from optimum.
🐛 Fix attn_implementation_autoset compat in latest transformers.
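A minimal sketch of pinning quantization work to a specific device via QuantizeConfig.device; the assumption is that device is accepted as a QuantizeConfig constructor argument, and the "cuda:0" value, model id, and calibration text are placeholders.

```python
# Minimal sketch: explicit quantization device (constructor kwarg assumed).
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = ["layers stay on cpu and move to the quant device one at a time."] * 64

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    device="cuda:0",  # default is auto; layers are moved here one by one during quantization
)

model = GPTQModel.load("facebook/opt-125m", quant_config)
model.quantize(calibration_dataset)
model.save("opt-125m-gptq-4bit")
```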
- Add QuantizeConfig.device and use. by @Qubitium in #950
- fix hf_select_quant_linear by @LRL-ModelCloud in #966
- update vllm gptq_marlin code by @ZX-ModelCloud in #967
- fix cuda:0 not a enum device by @CSY-ModelCloud in #968
- fix marlin info for non-cuda device by @Qubitium in #972
- fix backend str bug by @CL-ModelCloud in #973
- hf select quant_linear with pack by @LRL-ModelCloud in #969
- remove auto select BACKEND.IPEX by @CSY-ModelCloud in #975
- fix autoround received a device_map by @CSY-ModelCloud in #976
- use enum instead of magic number by @CSY-ModelCloud in #979
- use new ci docker images by @CSY-ModelCloud in #980
- fix flash attntion was auto loaded on cpu for pretrained model by @CSY-ModelCloud in #981
- fix old transformer doesn't have _attn_implementation_autoset by @CSY-ModelCloud in #982
- fix gptbigcode test temporally by @CSY-ModelCloud in #983
- fix version parsing by @CSY-ModelCloud in #985
Full Changelog: v1.5.0...v1.5.1