Hi,
Recently I found out that TensorRT's QAT (https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization) is slower than PTQ and fp16.
For QAT, they quantize only the inputs of convolution and fully-connected layers.
In the thread NVIDIA/TensorRT#993, they said TensorRT PTQ quantizes the outputs of all layers.
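To illustrate what I mean by "only the input is quantized", here is a minimal sketch using the pytorch-quantization package linked above (the exact attribute names `_input_quantizer` / `_weight_quantizer` are my assumption from reading the code, so treat them as illustrative):

```python
# Minimal sketch, assuming pytorch-quantization is installed as in the linked repo.
# A QuantConv2d fake-quantizes the activations entering the conv and its weights;
# it has no output quantizer, so the conv output stays in higher precision until
# the next quantized layer quantizes its own input.
from pytorch_quantization import nn as quant_nn

conv = quant_nn.QuantConv2d(3, 16, kernel_size=3, padding=1)
print(conv._input_quantizer)                 # TensorQuantizer for the incoming activations
print(conv._weight_quantizer)                # TensorQuantizer for the conv weights
print(hasattr(conv, "_output_quantizer"))    # expected False: the output is not quantized here
```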
From nni (https://github.com/microsoft/nni/blob/master/nni/compression/pytorch/quantization_speedup/integrated_tensorrt.py#L213), you also quantize each layer's output, and I don't understand why the output needs to be quantized in addition to the input. For example, with
conv1 -> relu1 -> conv2 -> relu2
quantizing the output of relu1 is equivalent to quantizing the input of conv2, and quantizing the output of relu2 is not necessary. What is the advantage of quantizing a layer's output?
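To make the equivalence concrete, here is a toy sketch; the `fake_quant` helper and the scale value are made up for illustration, standing in for whatever calibration would produce:

```python
# Toy sketch of the argument above: with the same per-tensor scale, fake-quantizing
# the output of relu1 and fake-quantizing the input of conv2 act on the same tensor,
# so they produce identical results and doing both is redundant.
import torch

def fake_quant(t, scale, zero_point=0, qmin=-128, qmax=127):
    """Simulated int8 quantize-dequantize (per-tensor)."""
    q = torch.clamp(torch.round(t / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

scale = 0.05                        # hypothetical calibration result
x = torch.relu(torch.randn(1, 8))   # stands in for the output of relu1

as_relu1_output = fake_quant(x, scale)   # "quantize the output of relu1"
as_conv2_input  = fake_quant(x, scale)   # "quantize the input of conv2"
assert torch.equal(as_relu1_output, as_conv2_input)
```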
It seems that QAT should be much faster than PTQ, but in practice it is not. Any ideas?