Hi,
Recently I found out that TensorRT's QAT (https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization) is slower than PTQ and fp16.
For QAT, they quantize only the inputs of convolution and fully-connected layers.
In the thread NVIDIA/TensorRT#993, they said TensorRT PTQ quantizes the outputs of all layers.
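To illustrate what I mean by "only the input is quantized", here is a minimal sketch using the pytorch-quantization package linked above (the exact attribute names `_input_quantizer` / `_weight_quantizer` are my assumption from reading the code, so treat them as illustrative):

```python
# Minimal sketch, assuming pytorch-quantization is installed as in the linked repo.
# A QuantConv2d fake-quantizes the activations entering the conv and its weights;
# it has no output quantizer, so the conv output stays in higher precision until
# the next quantized layer quantizes its own input.
from pytorch_quantization import nn as quant_nn

conv = quant_nn.QuantConv2d(3, 16, kernel_size=3, padding=1)
print(conv._input_quantizer)                 # TensorQuantizer for the incoming activations
print(conv._weight_quantizer)                # TensorQuantizer for the conv weights
print(hasattr(conv, "_output_quantizer"))    # expected False: the output is not quantized here
```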
From nni (https://github.com/microsoft/nni/blob/master/nni/compression/pytorch/quantization_speedup/integrated_tensorrt.py#L213), you also quantize each layer's output, and I don't understand why the output needs to be quantized in addition to the input. For example, with
conv1 -> relu1 -> conv2 -> relu2
quantizing the output of relu1 is equivalent to quantizing the input of conv2, and quantizing the output of relu2 is not necessary. What is the advantage of quantizing a layer's output?
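To make the equivalence concrete, here is a toy sketch; the `fake_quant` helper and the scale value are made up for illustration, standing in for whatever calibration would produce:

```python
# Toy sketch of the argument above: with the same per-tensor scale, fake-quantizing
# the output of relu1 and fake-quantizing the input of conv2 act on the same tensor,
# so they produce identical results and doing both is redundant.
import torch

def fake_quant(t, scale, zero_point=0, qmin=-128, qmax=127):
    """Simulated int8 quantize-dequantize (per-tensor)."""
    q = torch.clamp(torch.round(t / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

scale = 0.05                        # hypothetical calibration result
x = torch.relu(torch.randn(1, 8))   # stands in for the output of relu1

as_relu1_output = fake_quant(x, scale)   # "quantize the output of relu1"
as_conv2_input  = fake_quant(x, scale)   # "quantize the input of conv2"
assert torch.equal(as_relu1_output, as_conv2_input)
```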
It seems that QAT should be much faster than PTQ, but in practice it is not. Any ideas?