This repository contains the code to jointly perform pruning and channel-wise mixed-precision quantization with a differentiable algorithm. Check out our paper "Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks" (arXiv) for more details about the algorithm and the implementation.
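As a rough illustration of the idea (this is not the actual code or API of this repository), one way to make channel-wise precision selection differentiable is to keep trainable logits over a set of candidate bit-widths for each output channel and combine the corresponding fake-quantized weights through a softmax; including a 0-bit candidate lets the search prune entire channels. All names in the sketch below (PrecisionSearchConv2d, CANDIDATE_BITS, fake_quantize) are illustrative.

```python
# Minimal sketch (not this repository's actual API) of a differentiable
# channel-wise precision search. Each output channel holds trainable logits
# over candidate bit-widths; the effective weight is a softmax-weighted mix of
# the fake-quantized versions, and a 0-bit candidate prunes the whole channel.
import torch
import torch.nn as nn
import torch.nn.functional as F

CANDIDATE_BITS = (0, 2, 4, 8)  # 0-bit == pruned channel


def fake_quantize(w, bits):
    """Symmetric fake quantization with a straight-through estimator (STE)."""
    if bits == 0:
        return torch.zeros_like(w)  # pruned: no gradient flows to these weights
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax + 1e-12
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (q - w).detach()  # forward uses q, backward passes gradients to w


class PrecisionSearchConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, **kw):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)
        # One logit per (output channel, candidate bit-width) pair.
        self.alpha = nn.Parameter(torch.zeros(out_ch, len(CANDIDATE_BITS)))

    def forward(self, x):
        probs = F.softmax(self.alpha, dim=-1)                      # (out_ch, n_bits)
        w = self.conv.weight                                       # (out_ch, in_ch, kh, kw)
        versions = torch.stack(
            [fake_quantize(w, b) for b in CANDIDATE_BITS], dim=0)  # (n_bits, out_ch, ...)
        mixed_w = (probs.t()[:, :, None, None, None] * versions).sum(dim=0)
        return F.conv2d(x, mixed_w, self.conv.bias,
                        self.conv.stride, self.conv.padding)

    def expected_bits(self):
        """Differentiable per-channel proxy of the precision assignment."""
        bits = torch.tensor(CANDIDATE_BITS, dtype=torch.float32,
                            device=self.alpha.device)
        return (F.softmax(self.alpha, dim=-1) * bits).sum(dim=-1)  # (out_ch,)
```

At the end of the search, each channel would typically be frozen to its highest-probability bit-width (with 0-bit channels removed) and the resulting network fine-tuned; the paper describes the actual selection and training procedure.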
For this project, the cost models of MPIC and of the NE16 DNN accelerator have been used; they are located in the hardware_models folder.
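For intuition only, the sketch below shows the general shape of a per-layer cost model: a differentiable function mapping layer geometry and precision assignment to an estimated cycle count, which can then be added to the training loss. The actual MPIC and NE16 models in the hardware_models folder are different and more detailed; layer_cycles and its proportionality to bit-width are purely illustrative.

```python
# Hypothetical shape of a per-layer hardware cost model (the real MPIC/NE16
# models in hardware_models differ): it maps layer geometry and a possibly
# fractional, differentiable precision assignment to an estimated cycle count.
def layer_cycles(out_h, out_w, eff_in_ch, eff_out_ch, k, eff_bits,
                 cycles_per_mac_at_8bit=1.0):
    """Toy analytical cost: MAC count scaled by the average weight precision.

    eff_in_ch, eff_out_ch and eff_bits may be differentiable tensors produced
    by the precision-search layers, so gradients reach the search parameters.
    """
    macs = out_h * out_w * eff_in_ch * eff_out_ch * k * k
    return macs * cycles_per_mac_at_8bit * (eff_bits / 8.0)
```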
We report below the results on the CIFAR-10 benchmark when employing the hardware-agnostic size regularizer. We compare with various state-of-the-art approaches and with the sequential application of a pruning algorithm (PIT) followed by a channel-wise mixed-precision quantization technique (denoted as "MixPrec" in the plot).
More details and experiments on different benchmarks can be found in our paper.
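As a minimal sketch of how a hardware-agnostic size regularizer of this kind can enter training (again, not the repository's actual loss), the snippet below penalizes the expected model size in bits, reusing the illustrative PrecisionSearchConv2d.expected_bits() defined above; size_lambda is a hypothetical trade-off hyper-parameter.

```python
# Sketch of a hardware-agnostic size regularizer added to the task loss,
# reusing the illustrative PrecisionSearchConv2d from the snippet above.
import torch
import torch.nn.functional as F


def expected_model_size_bits(model):
    """Differentiable estimate of the total weight memory, in bits."""
    size = torch.zeros((), device=next(model.parameters()).device)
    for m in model.modules():
        if isinstance(m, PrecisionSearchConv2d):
            w_per_ch = m.conv.weight[0].numel()        # in_ch * kh * kw
            size = size + (m.expected_bits() * w_per_ch).sum()
    return size


def train_step(model, x, y, optimizer, size_lambda=1e-7):
    """One optimization step: task loss plus the size penalty."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) \
        + size_lambda * expected_model_size_bits(model)
    loss.backward()   # gradients reach both the weights and the precision logits
    optimizer.step()
    return loss.item()
```

Replacing expected_model_size_bits with a sum of per-layer hardware costs (in the spirit of the layer_cycles sketch above) would give a hardware-aware variant of the same loss.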
We have evaluated our approach on CIFAR-10 with the Mixed Precision Inference Core (MPIC) and Neural Engine 16 (NE16) accelerator hardware cost models. We then evaluated the obtained architectures on both hardware platforms, to assess the importance of a well-tailored cost model during training for obtaining good architectures.
We refer to our paper for more details on the cost models and on the conducted experiments.
We have also considered the ImageNet dataset to assess the behavior of the algorithm for large models. We adopted the same training protocol and quantization schemes used in the other experiments of our manuscript (note that the results could be improved by exploring more advanced quantization algorithms and training hyperparameters, which are fully orthogonal to our optimization method).
Our proposed algorithm obtained a Pareto front of architectures in the accuracy vs. number of inference cycles space, surpassing the fixed-precision baselines, especially in the low-cycles regime. These results confirm that our method remains effective on larger-scale datasets and models.
Moreover, as expected, well-tailored hardware cost models have a stronger impact when the optimization is applied to tiny neural networks. This happens because the relative impact of a non-ideal precision assignment decreases as the layers' size increases.