TorchAO 0.1.0: First Release
Highlights
We’re excited to announce the release of TorchAO v0.1.0! TorchAO is a repository that hosts architecture optimization techniques, such as quantization and sparsity, together with performance kernels for backends such as CUDA and CPU. This release adds support for several quantization techniques, including int4 weight-only GPTQ quantization, an `nf4` dtype to support QLoRA, sparsity features such as `WandaSparsifier`, and an autotuner for Triton integer matrix multiplication kernels on CUDA.
Note: TorchAO is currently in a pre-release state and under extensive development, so the public APIs should not be considered stable. We nevertheless welcome you to try out our APIs and offerings and to provide feedback on your experience.
torchao 0.1.0 is compatible with PyTorch 2.2.2 and 2.3.0, ExecuTorch 0.2.0, and TorchTune 0.1.0.
New Features
Quantization
- Added tensor subclass based quantization APIs: `change_linear_weights_to_int8_dqtensors`, `change_linear_weights_to_int8_woqtensors` and `change_linear_weights_to_int4_woqtensors` (#1)
- Added module based quantization APIs for int8 dynamic and weight-only quantization: `apply_weight_only_int8_quant` and `apply_dynamic_quant` (#1)
- Added a module swap version of int4 weight-only quantization: `Int4WeightOnlyQuantizer` and `Int4WeightOnlyGPTQQuantizer`, used in TorchTune (#119, #116)
- Added int8 dynamic activation and int4 weight quantization: `Int8DynActInt4WeightQuantizer` and `Int8DynActInt4WeightGPTQQuantizer`, used in ExecuTorch (#74; available with torch 2.3.0 and later); see the usage sketch after this list
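A minimal usage sketch of the three API styles above, assuming these names are importable from `torchao.quantization` as in the 0.1.0 `quant_api`; the toy models and sizes are illustrative:

```python
# A minimal sketch of the 0.1.0 quantization APIs; models/sizes are illustrative.
import torch
from torchao.quantization import (
    Int8DynActInt4WeightQuantizer,
    apply_weight_only_int8_quant,
    change_linear_weights_to_int8_woqtensors,
)

# Tensor subclass style: swaps each nn.Linear weight for an int8
# weight-only quantized tensor subclass, modifying the model in place.
subclass_model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).eval()
change_linear_weights_to_int8_woqtensors(subclass_model)

# Module swap style: replaces each nn.Linear with an int8 weight-only
# quantized linear module, also in place.
swap_model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).eval()
apply_weight_only_int8_quant(swap_model)

# Quantizer style (torch 2.3.0 and later): returns a model whose linears
# use int8 dynamic activation quantization with int4 grouped weights.
quantizer = Int8DynActInt4WeightQuantizer(groupsize=32)
dynact_model = quantizer.quantize(
    torch.nn.Sequential(torch.nn.Linear(1024, 1024)).eval()
)

out = subclass_model(torch.randn(8, 1024))
```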
Sparsity
- Added `WandaSparsifier`, which prunes both weights and activations (#22); see the sketch below
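A minimal sketch of `WandaSparsifier` usage, assuming it follows the `torch.ao.pruning` `BaseSparsifier` flow (`prepare`/`step`/`squash_mask`); the model, config, and calibration data are illustrative:

```python
# A minimal sketch of WandaSparsifier; model, config and data are illustrative.
import torch
from torchao.sparsity import WandaSparsifier

model = torch.nn.Sequential(torch.nn.Linear(128, 128))

# Target 50% unstructured sparsity on the first linear's weight.
sparsifier = WandaSparsifier(sparsity_level=0.5)
sparsifier.prepare(model, config=[{"tensor_fqn": "0.weight"}])

# Wanda scores weights by |weight| * input activation norm, so it needs
# calibration batches forwarded through the prepared model.
for _ in range(10):
    model(torch.randn(16, 128))

sparsifier.step()         # compute and apply the pruning masks
sparsifier.squash_mask()  # fold the masks into the weights
```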
Kernels
- Added an `autotuner` for int mm Triton kernels (#41); see the sketch below
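A minimal sketch of the autotuned int matmul path; the `torchao.kernel.intmm.int_matmul` entry point and the `TORCHAO_AUTOTUNER_ENABLE` environment variable are assumptions about the 0.1.0 layout, not a documented contract:

```python
# A sketch of autotuned int8 matmul; the entry point and environment
# variable are assumptions about the 0.1.0 layout (see lead-in above).
import os

os.environ["TORCHAO_AUTOTUNER_ENABLE"] = "1"  # opt in before importing torchao

import torch
from torchao.kernel import intmm

a = torch.randint(-128, 127, (256, 512), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (512, 256), dtype=torch.int8, device="cuda")

# The first call for a given shape tunes Triton int mm configurations and
# caches the best one; later calls with that shape reuse the cached config.
c = intmm.int_matmul(a, b)  # int32 accumulator output
```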
dtypes
- Added the `nf4` dtype as a tensor subclass (`NF4Tensor`), used to support QLoRA in TorchTune; see the sketch below
- Added a `uint4` tensor subclass
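A minimal sketch of the `nf4` tensor subclass, assuming the `NF4Tensor.from_tensor(tensor, block_size, scaler_block_size)` constructor and the `get_original_weight()` helper from `torchao.dtypes.nf4tensor`; the block sizes are illustrative:

```python
# A minimal sketch of the nf4 tensor subclass; block sizes are illustrative.
import torch
from torchao.dtypes.nf4tensor import NF4Tensor

weight = torch.randn(512, 512, dtype=torch.bfloat16)

# Quantize to 4-bit NormalFloat, with block-wise scales that are themselves
# quantized per scaler block (double quantization).
nf4_weight = NF4Tensor.from_tensor(weight, 64, 256)

# Dequantize back to bfloat16 to inspect the quantization error.
restored = nf4_weight.get_original_weight()
print((weight - restored).abs().max())
```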
Improvements
- Set up a GitHub workflow for regression testing (#50)
- Set up a GitHub workflow for the `torchao-nightly` release (#54)
Documentation
- Added a tutorial for quantizing a vision transformer model (#60)
- Added a tutorial on how to add an op for `NF4Tensor` (#54)
Notes
- We are still debugging an accuracy problem with `Int8DynActInt4WeightGPTQQuantizer`
- Save and load does not work well for the tensor subclass based APIs yet
- We will consolidate tensor subclass and module swap based quantization APIs later
- The `uint4` tensor subclass is going to be merged into PyTorch core in the future
- Quantization ops in `quant_primitives.py` will be deduplicated with similar quantize/dequantize ops in PyTorch later