
Ryzen AI Quantization Tutorial

1. Introduction
2. ONNX Quantization Tutorial
3. Pytorch Quantization Tutorial
4. Tensorflow1.x Quantization Tutorial
5. Tensorflow2.x Quantization Tutorial

3. Pytorch Quantization Tutorial

Introduction

This tutorial takes the ResNet-50 PyTorch model as an example and shows how to generate a quantized ONNX model with the Ryzen AI quantizer. You can then run the quantized model with ONNX Runtime on Ryzen AI PCs.

Setup

The quantizer is released in the Ryzen AI tool Docker image. The Docker container runs on a Linux host; optionally, you can run it on WSL2 on a Windows PC. The quantizer can run on GPUs that support ROCm or CUDA, as well as on x86 CPUs. A discrete GPU is recommended because it runs quantization much faster than a CPU.

Make sure Docker is installed on the Linux host or in WSL2. Refer to the official Docker documentation for installation instructions.

If you use an x86 CPU, run

docker pull xilinx/vitis-ai-pytorch-cpu:latest

If you use a ROCm GPU, run

docker pull xilinx/vitis-ai-pytorch-rocm:latest

If you use a CUDA GPU, you need to build the Docker image yourself:

cd docker
./docker_build.sh -t gpu -f pytorch

and run

docker pull xilinx/vitis-ai-pytorch-gpu:latest

You can now start the Docker using the following command depending on your device:

./docker_run.sh xilinx/vitis-ai-pytorch-<cpu|rocm|gpu>:latest

Quick Start in Docker environment

You should be in the conda environment "vitis-ai-pytorch", in which the vai_q_pytorch package is already installed. In this conda environment, the Python version is 3.7, the PyTorch version is 1.12, and the torchvision version is 0.13.

  • Copy resnet50_quant.py to docker environment

  • Download pre-trained Resnet50 model

    wget https://download.pytorch.org/models/resnet50-19c8e357.pth 
  • Prepare the ImageNet validation images. See the PyTorch examples repo for reference.

  • Modify default data_dir and model_dir in resnet50_quant.py

  • Data Preparation (optional: for accuracy evaluation)

  1. Download the ImageNet dataset
  2. Organise the dataset directory as follows
imagenet/
├── train
│   ├── n01440764
│   ├── n01443537
│   ├── n01484850
│   ├── n01491361
│   ├── n01494475
│   ├── n01496331
│   ├── n01498041
│   ├── n01514668
|
└── val
    ├── n01440764
    ├── n01443537
    ├── n01484850
    ├── n01491361
    ├── n01494475
    ├── n01496331
  • [Optional] Evaluate float model
    python resnet50_quant.py --quant_mode float
  • [Optional] Inspect float model
    python resnet50_quant.py --quant_mode float --inspect --target AMD_AIE2_Nx4_Overlay
  • Quantize, using a subset (200 images) of the validation data for calibration. Because this is the quantize calibration process, the displayed loss and accuracy are meaningless.
    python resnet50_quant.py --quant_mode calib --subset_len 200
    
  • Evaluate quantized model
    python resnet50_quant.py --quant_mode test
  • Export onnx model
    python resnet50_quant.py --quant_mode test --subset_len 1 --batch_size 1 --deploy

vai_q_pytorch Tool Usage

vai_q_pytorch is designed to work as a PyTorch plugin. We provide simple APIs to introduce our FPGA-friendly quantization feature. For a well-defined model, users only need to add 2-3 lines to get a quantized model object.

Model pre-requirements for quantizer
  • The model to be quantized should include the forward method only. All other functions should be moved outside or moved to a derived class. These functions usually work as pre-processing and post-processing. If they are not moved outside, our API will remove them in the quantized module, which will cause unexpected behavior when forwarding the quantized module.
  • The float model should pass the "jit trace test". First set the float module to evaluation mode, then use the "torch.jit.trace" function to test the float model. Make sure the float module can pass the trace test (a minimal sketch follows this list). For more details, please refer to example/jupyter_notebook/jit_trace_test/jit_trace_test.ipynb.
  • The most common operators in PyTorch are supported by the quantizer; please refer to support_op.md for details.
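The following is a minimal sketch of the trace test mentioned above, assuming the pre-trained torchvision ResNet-50 weights downloaded in the Quick Start step:

import torch
from torchvision.models import resnet50

model = resnet50()
model.load_state_dict(torch.load("resnet50-19c8e357.pth"))  # weights from the Quick Start step
model.eval()  # the float module must be in evaluation mode before tracing

dummy_input = torch.randn([1, 3, 224, 224])
traced = torch.jit.trace(model, dummy_input)  # should finish without trace errors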
Inspect float model before quantization

vai_q_pytorch provides a function called the inspector to help users diagnose neural network (NN) models under different device architectures. The inspector can predict target device assignments based on hardware constraints. The generated inspection report can be used to guide users in modifying or optimizing the NN model, greatly reducing the difficulty and time of deployment. It is recommended to inspect float models before quantization.

We take resnet50_quant.py to demonstrate how to apply this feature.

  1. Import vai_q_pytorch modules
     from pytorch_nndct.apis import Inspector
  2. Create an inspector with a target name or fingerprint.
     inspector = Inspector(target) 
  3. Inspect float model.
     input = torch.randn([batch_size, 3, 224, 224])
     inspector.inspect(model, input)

Run command with "--quant_mode float --inspect --target {target_name}" to inspect model.

    python resnet50_quant.py --quant_mode float --inspect --target AMD_AIE2_Nx4_Overlay
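Putting the three steps together, here is a minimal self-contained inspection sketch (the model and target names follow the example above; output_dir and image_format values are illustrative):

import torch
from torchvision.models import resnet50
from pytorch_nndct.apis import Inspector

target = "AMD_AIE2_Nx4_Overlay"
inspector = Inspector(target)            # step 2: create an inspector from the target name

model = resnet50().eval()
dummy_input = torch.randn([1, 3, 224, 224])

# step 3: inspect; writes inspect_{target}.txt (and .svg/.gv if image_format is set)
# under ./quantize_result
inspector.inspect(model, dummy_input,
                  device=torch.device("cpu"),
                  output_dir="quantize_result",
                  image_format="svg")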

The inspector displays messages on screen in a special color with the special keyword prefix "VAIQ_*", according to the verbose_level setting. Note the messages displayed between "[VAIQ_NOTE]: =>Start to inspect model..." and "[VAIQ_NOTE]: =>Finish inspecting."

If the inspector runs successfully, three important files are usually generated under the output directory "./quantize_result".

    inspect_{target}.txt: Target information and all the details of operations in the float model.
    inspect_{target}.svg: Generated if image_format is not None; a visualization of the inspection result.
    inspect_{target}.gv: Generated if image_format is not None; Dot source code of the inspection result.

Note:

  • The inspector relies on the 'xcompiler' package. In the conda env vitis-ai-pytorch in the Vitis-AI docker, xcompiler is ready. But if vai_q_pytorch is installed from source code, xcompiler needs to be installed in advance.
  • Visualization of inspection results relies on the dot engine. If dot is not installed successfully, set 'image_format = None' when inspecting.
  • If you need more detailed guidance, you can refer to ./example/jupyter_notebook/inspector/inspector_tutorial.ipynb. Please install jupyter notebook in advance. Run the following command:
jupyter notebook example/jupyter_notebook/inspector/inspector_tutorial.ipynb
Add vai_q_pytorch APIs to float scripts

Before quantization, suppose there is a trained float model and some Python scripts to evaluate the model's accuracy/mAP. The quantizer API will replace the float module with a quantized module, and the normal evaluate function will drive quantized module forwarding. Quantize calibration determines the "quantize" op parameters during the evaluation process if we set the flag quant_mode to "calib". After calibration, we can evaluate the quantized model by setting quant_mode to "test".

We take resnet50_quant.py to demonstrate how to add vai_q_pytorch APIs to float code; the four steps below are also combined into a single sketch after the list. The Xilinx PyTorch Model Zoo includes both float and quantized models. It is a good idea to check the difference between a float script and a quantized script, such as "code/test.py" and "quantize/quant.py" in ENet.

  1. Import vai_q_pytorch modules
     from pytorch_nndct.apis import torch_quantizer, dump_xmodel
  2. Generate a quantizer with the input needed for quantization and get the converted model.
     input = torch.randn([batch_size, 3, 224, 224])
     quantizer = torch_quantizer(quant_mode, model, (input))
     quant_model = quantizer.quant_model
  3. Forward the converted model.
     acc1_gen, acc5_gen, loss_gen = evaluate(quant_model, val_loader, loss_fn)
  4. Output quantization result and deploy model.
    if quant_mode == 'calib':
      quantizer.export_quant_config()
    if deploy:
      quantizer.export_torch_script()
      quantizer.export_onnx_model()
      quantizer.export_xmodel()
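
For reference, the steps above can be combined into one small helper. This is a sketch only; evaluate, val_loader and loss_fn are the evaluation pieces from the user's own float script.

import torch
from pytorch_nndct.apis import torch_quantizer

def quantize(quant_mode, model, evaluate, val_loader, loss_fn,
             batch_size=32, deploy=False):
    # step 2: create the quantizer and get the converted model
    dummy_input = torch.randn([batch_size, 3, 224, 224])
    quantizer = torch_quantizer(quant_mode, model, (dummy_input,))
    quant_model = quantizer.quant_model

    # step 3: forward the converted model with the normal evaluation flow
    acc1, acc5, loss = evaluate(quant_model, val_loader, loss_fn)

    # step 4: output quantization results / export deployable models
    if quant_mode == 'calib':
        quantizer.export_quant_config()   # writes Quant_info.json
    if deploy:
        quantizer.export_torch_script()
        quantizer.export_onnx_model()
        quantizer.export_xmodel()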
Run and output results

Before running the commands, let's introduce the log messages in vai_q_pytorch. vai_q_pytorch log messages have a special color and the special keyword prefix "VAIQ_*". vai_q_pytorch log message types include "error", "warning" and "note". Pay attention to the vai_q_pytorch log messages to check the flow status.

  • Run command with "--quant_mode calib" to quantize model.
    python resnet50_quant.py --quant_mode calib --subset_len 200

When doing the calibration forward pass, we borrow the float evaluation flow to minimize code changes relative to the float script. So loss and accuracy are displayed at the end. They are meaningless; just skip them. Pay more attention to the colorful log messages with the special keyword prefix "VAIQ_*".

Another important thing is to control the number of iterations during quantization and evaluation. Generally, 100-1000 images are enough for quantization, while the whole validation set is required for evaluation. The iteration count can be controlled in the data loading part. In this case, the argument "subset_len" controls how many images are used for network forwarding. If the float evaluation script doesn't have an argument with a similar role, it is better to add one; otherwise the count must be changed manually.

If this quantization command runs successfully, two important files will be generated under output directory “./quantize_result”.

    ResNet.py: the converted vai_q_pytorch format model.
    Quant_info.json: the quantization steps of the tensors. (Keep it for evaluation of the quantized model.)
  • To evaluate quantized model, run the following command:
    python resnet50_quant.py --quant_mode test 

When this command finishes, the displayed accuracy is the correct accuracy of the quantized model.

  • To export the xmodel, batch size 1 is a must for compilation, and subset_len=1 avoids redundant iterations. Run the following command:
    python resnet50_quant.py --quant_mode test --subset_len 1 --batch_size 1 --deploy

Skip the loss and accuracy displayed in the log for this run. The xmodel file for the Vitis AI compiler will be generated under the output directory "./quantize_result". It will later be used to deploy this model to the FPGA.

    ResNet_int.xmodel: deployed XIR format model
    ResNet_int.onnx:   deployed onnx format model
    ResNet_int.pt:     deployed torch script format model

In the conda env vitis-ai-pytorch in the Vitis-AI docker, XIR is ready. But if vai_q_pytorch is installed from source code, XIR needs to be installed in advance.
If XIR is not installed, the xmodel file can't be generated and this command will raise an error at the end.
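
As mentioned in the introduction, the exported ResNet_int.onnx can then be run with ONNX Runtime on a Ryzen AI PC. Below is a minimal sketch; the execution provider name and its availability depend on the installed Ryzen AI runtime and are assumptions here.

import numpy as np
import onnxruntime as ort

# Load the quantized ONNX model exported above; fall back to CPU if the
# Vitis AI execution provider is not available.
session = ort.InferenceSession(
    "quantize_result/ResNet_int.onnx",
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"])

dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)   # (1, 1000) class scores for ResNet-50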

Module partial quantization

Sometimes not all the sub-modules in a model should be quantized. Besides calling the general vai_q_pytorch APIs, a QuantStub/DeQuantStub OP pair can be used to achieve this.
Example code for quantizing subm0 and subm2 but not subm1:

import torch
from pytorch_nndct.nn import QuantStub, DeQuantStub

class WholeModule(torch.nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.subm0 = ...
        self.subm1 = ...
        self.subm2 = ...

        # define QuantStub/DeQuantStub submodules
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, input):
        input = self.quant(input)       # begin of part to be quantized
        output0 = self.subm0(input)
        output0 = self.dequant(output0) # end of part to be quantized

        output1 = self.subm1(output0)

        output1 = self.quant(output1)   # begin of part to be quantized
        output2 = self.subm2(output1)
        output2 = self.dequant(output2) # end of part to be quantized
        return output2
Register Custom Operation

The XIR Op library has a well-defined set of operators that covers the widely used deep learning frameworks, e.g. TensorFlow, PyTorch and Caffe, and all of the built-in operators for the DPU. This enhances the expression ability and achieves one of the core goals, which is eliminating the differences between these frameworks and providing a unified representation for users and developers. However, this Op library can't cover all of the Ops in the upstream frameworks, so XIR provides a new op definition to express all other unknown Ops for the Op library. In order to convert a quantized model to an xmodel, vai_q_pytorch provides a decorator to register an operation or a group of operations as a custom operation, which is unknown to XIR.

# Decorator API
def register_custom_op(op_type: str, attrs_list: Optional[List[str]] = None):
  """The decorator is used to register the function as a custom operation.
  Args:
  op_type(str) - the operator type registered into quantizer. 
                 The type should not conflict with pytorch_nndct
                
  attrs_list(Optional[List[str]], optional) - 
  the name list of attributes that define operation flavor. 
  For example, Convolution operation has such attributes as padding, dilation, stride and groups. 
  The order of name in attrs_list should be consistent with that of the arguments list. 
  Default: None
  
  """

How to use:

  • Aggregate some operations into a function. The first argument name of this function should be ctx. The meaning of ctx is the same as that in torch.autograd.Function.
  • Decorate this function with the decorator described above.
  • Example model code is in example/resnet50_quant_custom_op.py. For how to run it, please refer to example/resnet50_quant.py.
import torch
from pytorch_nndct.utils import register_custom_op

@register_custom_op(op_type="MyOp", attrs_list=["scale_1", "scale_2"])
def custom_op(ctx, x: torch.Tensor, y: torch.Tensor, scale_1: float, scale_2: float) -> torch.Tensor:
    return scale_1 * x + scale_2 * y

class MyModule(torch.nn.Module):
    def __init__(self):
        ...

    def forward(self, x, y):
        return custom_op(x, y, 0.5, 0.5)

Limitations:

  • Loop operations are not allowed in a custom operation.
  • A custom operation can have only one return value.
Fast finetune model

Sometimes the direct quantization accuracy is not high enough, and the model parameters then need to be finetuned.

  • Fast finetuning is not real training of the model and only needs a limited number of iterations. For classification models on the ImageNet dataset, 5120 images are generally enough.
  • It only requires some modifications based on the evaluation model script and does not need an optimizer set up for training.
  • A function for model forwarding iteration is needed and will be called during fast finetuning.
  • Re-calibration with the original inference code is highly recommended.
  • Example code in example/resnet50_quant.py:
# fast finetune model or load finetuned parameter before test 
  if finetune == True:
      ft_loader, _ = load_data(
          subset_len=5120,
          train=False,
          batch_size=batch_size,
          sample_method='random',
          data_dir=args.data_dir,
          model_name=model_name)
      if quant_mode == 'calib':
        quantizer.fast_finetune(evaluate, (quant_model, ft_loader, loss_fn))
      elif quant_mode == 'test':
        quantizer.load_ft_param()
  • For example/resnet50_quant.py, command line to do parameter fast finetuning and re-calibration,
    python resnet50_quant.py --quant_mode calib --fast_finetune
    
  • command line to test fast finetuned quantized model accuracy,
    python resnet50_quant.py --quant_mode test --fast_finetune
    
  • command line to deploy fast finetuned quantized model,
    python resnet50_quant.py --quant_mode test --fast_finetune --subset_len 1 --batch_size 1 --deploy
    
Finetune quantized model
  • This mode can be used to finetune a quantized model (loading float model parameters), and it can also be used to do quantization-aware training (QAT) from scratch.
  • It requires adding some vai_q_pytorch interface functions based on the float model training script.
  • This mode requires that the trained model not use the +/- operators in the model forwarding code. Replace them with the torch.add/torch.sub modules.
  • Example code in example/resnet50_qat.py (a training-loop sketch follows the snippet):
  # create a quantizer that can do QAT
  inputs = torch.randn([batch_size, 3, 224, 224], dtype=torch.float32)
  from pytorch_nndct import QatProcessor
  qat_processor = QatProcessor(model, inputs, bitwidth=8)
  quantized_model = qat_processor.trainable_model()
  # get the deployable model and test it
  output_dir = 'qat_result'
  deployable_model = qat_processor.to_deployable(quantized_model, output_dir)
  validate(val_loader, deployable_model, criterion, gpu)
  # export xmodel from deployable model
  # need at least 1 iteration of inference with batch_size=1 
  # Use cpu mode to export xmodel.
  deployable_model.cpu()
  val_subset = torch.utils.data.Subset(val_dataset, list(range(1)))
  subset_loader = torch.utils.data.DataLoader(
    val_subset,
    batch_size=1,
    shuffle=False,
    num_workers=8,
    pin_memory=True)
  # Must forward deployable model at least 1 iteration with batch_size=1
  for images, _ in subset_loader:
    deployable_model(images)
  qat_processor.export_xmodel(output_dir)
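
The snippet above obtains the trainable model and converts it to a deployable one, but the actual QAT training loop follows the user's normal float training recipe. A minimal sketch is shown below; train_loader, the optimizer settings and the loss are placeholders from the user's own training script.

import torch

quantized_model = qat_processor.trainable_model()
optimizer = torch.optim.SGD(quantized_model.parameters(), lr=1e-4, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

quantized_model.train()
for images, targets in train_loader:      # train_loader: the user's ImageNet training loader
    optimizer.zero_grad()
    loss = criterion(quantized_model(images), targets)
    loss.backward()
    optimizer.step()

# then convert and export as shown above
deployable_model = qat_processor.to_deployable(quantized_model, output_dir)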
Configuration of Quantization Strategy

For multiple quantization strategy configurations, vai_q_pytorch supports a quantization configuration file in JSON format. We only need to pass the configuration file to the torch_quantizer API. Example code in resnet50_quant.py:

config_file = "./pytorch_quantize_config.json"
quantizer = torch_quantizer(quant_mode=quant_mode, 
                            module=model, 
                            input_args=(input), 
                            device=device, 
                            quant_config_file=config_file)

For detailed information on the JSON file contents, please refer to Quant_Config.md.

Hardware-Aware Quantization Strategy

The inspector provides device assignments for the operators in the neural network based on the target device. vai_q_pytorch can use the power of the inspector to perform hardware-aware quantization. Example code in resnet50_quant.py:

quantizer = torch_quantizer(quant_mode=quant_mode, 
                            module=model, 
                            input_args=(input), 
                            device=device, 
                            quant_config_file=config_file, target=target)
  • For example/resnet50_quant.py, command line to do hardware-aware calibration,
    python resnet50_quant.py --quant_mode calib --target DPUCAHX8L_ISA0_SP
    
  • command line to test hardware-aware quantized model accuracy,
    python resnet50_quant.py --quant_mode test --target DPUCAHX8L_ISA0_SP
    
  • command line to deploy quantized model,
    python resnet50_quant.py --quant_mode test --target DPUCAHX8L_ISA0_SP --subset_len 1 --batch_size 1 --deploy
    

vai_q_pytorch main APIs

The APIs for CNN are in module pytorch_binding/pytorch_nndct/apis.py:

Function torch_quantizer will create a quantizer.
class torch_quantizer(): 
  def __init__(self,
               quant_mode: str, # ['calib', 'test']
               module: torch.nn.Module,
               input_args: Union[torch.Tensor, Sequence[Any]] = None,
               state_dict_file: Optional[str] = None,
               output_dir: str = "quantize_result",
               bitwidth: int = 8,
               device: torch.device = torch.device("cuda"),
               quant_config_file: Optional[str] = None,
               target: Optional[str]=None):
quant_mode: A string that indicates which quantization mode the process is using: "calib" for calibration of quantization, "test" for evaluation of the quantized model.
module: Float module to be quantized.
input_args: Input tensor with the same shape as the real input of the float module to be quantized, but the values can be random numbers.
state_dict_file: Float module pretrained parameters file. If the float module has already loaded its parameters, this does not need to be set.
output_dir: Directory for quantization results and intermediate files. Default is "quantize_result".
bitwidth: Global quantization bit width. Default is 8.
device: Run the model on GPU or CPU.
quant_config_file: JSON file path for the quantization strategy configuration.
target: If a target device is specified, hardware-aware quantization is enabled. Default is None.
Export quantization steps information
  def export_quant_config(self):
Export xmodel and dump OPs output data for detailed data comparison
  def export_xmodel(self, output_dir, deploy_check):
output_dir: Directory for quantization results and intermediate files. Default is "quantize_result".
deploy_check: Flag to control dumping of data for detailed data comparison. Default is False. If set to True, binary format data will be dumped to output_dir/deploy_check_data_int/.
Export onnx format quantized model
  def export_onnx_model(self, output_dir, verbose):
output_dir: Directory for quantization results and intermediate files. Default is "quantize_result".
verbose: Flag to control whether to show a verbose log or not.
Export torchscript format quantized model
  def export_torch_script(self, output_dir="quantize_result", verbose=False):
output_dir: Directory for quantization results and intermediate files. Default is "quantize_result".
verbose: Flag to control whether to show a verbose log or not.
Create an inspector
class Inspector():
  def __init__(self, name_or_fingerprint: str):
name_or_fingerprint: Specify the hardware target name or fingerprint
Inspect float model
  def inspect(self, 
              module: torch.nn.Module, 
              input_args: Union[torch.Tensor, Tuple[Any]], 
              device: torch.device = torch.device("cuda"),
              output_dir: str = "quantize_result",
              verbose_level: int = 1,
              image_format: Optional[str] = None):
module: Float module to be deployed.
input_args: Input tensor with the same shape as the real input of the float module, but the values can be random numbers.
device: Trace the model on GPU or CPU.
output_dir: Directory for inspection results.
verbose_level: Control the level of detail of the inspection results displayed on screen. Default: 1
    0: turn off printing inspection results.
    1: print summary report of operations assigned to CPU.
    2: print summary report of device allocation of all operations.
image_format: Export a visualized inspection result. Supported image formats: 'svg' / 'png'.

Quantization Aware Training

This demonstration code provides quantization for ResNet-18 and MobileNet V2. It supplies several configurations under different quantization schemes, including methods such as TQT, LSQ, and statistics-based float/power-of-two scale quantization.

  • Train a quantized model
cd <path-to-RyzenSW/>
python main.py \
      --model_name=[model] \
      --pretrained=[model_weight_path] \
      --mode=train \
      --config_file=[configs_file_path]
  • Deploy a quantized model (not all types of quantized models support exporting a deployable model; this is under development)
$ cd <Path-to-RyzenAI_quant_tutorial>/pytorch_example/torch_qat
$ python main.py \
    --model_name=[model] \
    --qat_ckpt=[model_weight_path] \
    --mode=deploy \
    --config_file=[configs_file_path]
  • model_name: MobileNetV2, resnet18
  • pretrained: pre-trained fp32 model file path
  • qat_ckpt: quantized model weights saved after training
  • config_file: e.g. ./config_files/int_pof2_tqt.json (config files for different quantization methods can be found in the configs folder)

Auto module

In QAT, tensors are quantized by manipulating torch.nn.Module objects. For parameter quantization, the quantizer replaces modules that have parameters with quantized versions of those modules (for instance, replacing nn.Conv2d with QuantizedConv2d and nn.Linear with QuantizedLinear), in which quantizers for the parameters are inserted. For input and output quantization, the quantizer adds a quantization step in the module's forward_hook and forward_pre_hook. But for non-module operations, the quantizer cannot directly modify their behavior. For example, if we have a model:

import torch
from torch import nn

class MyModel(nn.Module):
  def __init__(self):
    super().__init__()
  
  def forward(self, x, y, z):
    tmp = x + y
    ret = tmp * z
    return ret

How do we quantize the inputs and outputs of the operators "+" and "*" in this model? One way is to replace the native operators with modules manually, like below.

import torch
from torch import nn

class Add(nn.Module):
  def __init__(self):
    super().__init__()

  def forward(self, x, y):
    return x + y

class Mul(nn.Module):
  def __init__(self):
    super().__init__()

  def forward(self, x, y):
    return x * y

class MyModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.add = Add()
    self.mul = Mul()

  def forward(self, x, y, z):
    tmp = self.add(x, y)
    ret = self.mul(tmp, z)
    return ret

After the replacement, QAT can insert quantizers for the tensors x, y, tmp and z via forward_hook and forward_pre_hook.

QAT now provides a tool called auto_module, which can be used to replace the non-module operators with modules automatically.

  • Usage

Users can use auto_module with a simple function call. Here's an example.

import torch
from torch import nn
from pytorch_nndct.quantization.auto_module import wrap

class MyModel(nn.Module):
  def __init__(self):
    super().__init__()
  
  def forward(self, x, y, z):
    tmp = x + y
    ret = tmp * z
    return ret

model = MyModel()
x = torch.rand((2, 2))
y = torch.rand((2, 2))
z = torch.rand((2, 2))
wrapped_model = wrap(model, x, y, z)

wrapped_model is the model with all native operators ("+" and "*" in this example) replaced with modules, and it can be processed by QatProcessor like normal models.

The signature of function wrap is as follows.

def wrap(model: nn.Module, *args, **kwargs) -> nn.Module
  """
  Args:
    model (nn.Module): The model to be processed by auto_module
    args, kwargs: The inputs of model. There is model inference in this function. 
      `args` and `kwargs` will be fed to model directly during model inference.
  """

▶️Next Topic: 4. Tensorflow1.x Quantization Tutorial

◀️Previous Topic: 2. ONNX Quantization Tutorial