This article covers the use of NVIDIA TensorRT for deployment of PyTorch models.
NVIDIA TensorRT is an SDK for high-performance deep learning inference on NVIDIA GPU devices. It includes an inference engine as well as parsers for various input network specification formats. TensorRT provides application programming interfaces (APIs) for C++ and Python; this article presents example programs using both languages.
To deploy PyTorch models using TensorRT, we will export them in ONNX format. ONNX stands for Open Neural Network Exchange and is an open format built to represent deep learning models in a framework-agnostic way. TensorRT provides a specialized parser for importing ONNX models.
We assume that you will continue using the Genesis Cloud GPU-enabled instance that you created and configured while studying Article 1.
In particular, the following software must be installed and configured as described in that article:
- CUDA 11.3.1
- cuDNN 8.2.1
- Python 3.x interpreter and the pip package installer
- PyTorch 1.10.1 with torchvision 0.11.2
Various assets (source code, shell scripts, and data files) used in this article can be found in the supporting GitHub repository.
To run the examples described in this article, we recommend cloning the entire repository on your Genesis Cloud instance. The subdirectory art03 must be made your current directory.
The version of TensorRT must be compatible with the chosen versions of CUDA and cuDNN. For our choice of CUDA 11.3.1 and cuDNN 8.2.1 we will need TensorRT 8.0.3. (The actual support matrix for TensorRT 8.x is available here.)
To access TensorRT, you should register as a member of the NVIDIA Developer Program.
To download the TensorRT distribution, visit the official download site. Choose "TensorRT 8", then agree to the "NVIDIA TensorRT License Agreement" and choose "TensorRT 8.0 GA Update 1" ("GA" stands for "General Availability"). Select and download "TensorRT 8.0.3 GA for Ubuntu 20.04 and CUDA 11.3 DEB local repo package". You will get a DEB repo file; at the time of writing this article its name was:
nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.3.4-ga-20210831_1-1_amd64.deb
Place it in a scratch directory on your instance (we use ~/transit in this series of articles), then proceed with the installation by entering these commands:
sudo dpkg -i nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.3.4-ga-20210831_1-1_amd64.deb
sudo apt-key add /var/nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.3.4-ga-20210831/7fa2af80.pub
sudo apt-get update
sudo apt-get install tensorrt
Then install Python bindings for TensorRT API:
python3 -m pip install numpy
sudo apt-get install python3-libnvinfer-dev
Verify the installation using the command:
dpkg -l | grep TensorRT
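You can additionally check that the TensorRT Python bindings are importable and report their version:
python3 -c "import tensorrt; print(tensorrt.__version__)"
This should print the installed TensorRT version.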
Detailed installation instructions can be found on the official "Installing TensorRT" page.
PyCUDA is a Python package that provides access to the CUDA API from Python. The Python programs described in this article require PyCUDA for basic CUDA functionality such as managing CUDA device memory buffers.
Before starting the PyCUDA installation, make sure that the NVIDIA CUDA compiler driver nvcc is accessible by entering the command:
nvcc --version
If this command fails, update the PATH
environment variable:
export PATH=/usr/local/cuda/bin:$PATH
To install PyCUDA, enter the command:
python3 -m pip install pycuda
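To check that PyCUDA was built correctly and can see the GPU, a short test like the following can be used (a minimal sketch; the reported device name depends on your instance type):

import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context on the default device

# report the number of visible CUDA devices and the name of the first one
print("CUDA devices:", cuda.Device.count())
print("Device 0:", cuda.Device(0).name())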
We will continue using the torchvision image classification models for our examples. As the first step, we will demonstrate conversion of the already familiar ResNet50 model to ONNX format.
The Python program generate_onnx_resnet50.py
serves this purpose.
import torch
import torchvision.models as models
input = torch.rand(1, 3, 224, 224)
model = models.resnet50(pretrained=True)
model.eval()
output = model(input)
torch.onnx.export(model, input, "./onnx/resnet50.onnx", export_params=True)
This program:
- creates a dummy input tensor
- creates a pretrained ResNet50 model
- sets the model in evaluation (inference) mode
- runs dummy inference for the model
- exports the model to ONNX format and saves the result in a file
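If you want to inspect or validate the exported file, the onnx Python package (not otherwise required in this article; it can be installed with python3 -m pip install onnx) provides a structural checker. A minimal sketch:

import onnx

# load the exported model and run the ONNX structural checker
model = onnx.load("./onnx/resnet50.onnx")
onnx.checker.check_model(model)
print("Inputs: ", [i.name for i in model.graph.input])
print("Outputs:", [o.name for o in model.graph.output])

This step is purely optional and is not part of the deployment workflow.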
We store generated ONNX files in the subdirectory onnx
which
must be created before running the program:
mkdir -p onnx
To run this program, use the command:
python3 generate_onnx_resnet50.py
The program will produce a file resnet50.onnx
containing the ONNX model representation.
The Python program generate_onnx_all.py
can be used to produce ONNX descriptions
for all considered torchvision image classification models.
import torch
import torchvision.models as models
MODELS = [
('alexnet', models.alexnet),
('densenet121', models.densenet121),
('densenet161', models.densenet161),
('densenet169', models.densenet169),
('densenet201', models.densenet201),
('mnasnet0_5', models.mnasnet0_5),
('mnasnet1_0', models.mnasnet1_0),
('mobilenet_v2', models.mobilenet_v2),
('mobilenet_v3_large', models.mobilenet_v3_large),
('mobilenet_v3_small', models.mobilenet_v3_small),
('resnet18', models.resnet18),
('resnet34', models.resnet34),
('resnet50', models.resnet50),
('resnet101', models.resnet101),
('resnet152', models.resnet152),
('resnext50_32x4d', models.resnext50_32x4d),
('resnext101_32x8d', models.resnext101_32x8d),
('shufflenet_v2_x0_5', models.shufflenet_v2_x0_5),
('shufflenet_v2_x1_0', models.shufflenet_v2_x1_0),
('squeezenet1_0', models.squeezenet1_0),
('squeezenet1_1', models.squeezenet1_1),
('vgg11', models.vgg11),
('vgg11_bn', models.vgg11_bn),
('vgg13', models.vgg13),
('vgg13_bn', models.vgg13_bn),
('vgg16', models.vgg16),
('vgg16_bn', models.vgg16_bn),
('vgg19', models.vgg19),
('vgg19_bn', models.vgg19_bn),
('wide_resnet50_2', models.wide_resnet50_2),
('wide_resnet101_2', models.wide_resnet101_2),
]
def generate_model(name, builder):
    print('Generate', name)
    input = torch.rand(1, 3, 224, 224)
    model = builder(pretrained=True)
    model.eval()
    output = model(input)
    onnx_path = './onnx/' + name + '.onnx'
    torch.onnx.export(model, input, onnx_path, export_params=True)

for name, model in MODELS:
    generate_model(name, model)
To run this program, enter the following commands:
mkdir -p onnx
python3 generate_onnx_all.py
To perform inference on an ONNX model using TensorRT, the model must first be processed by the TensorRT ONNX parser. We will start with conversion of the ONNX representation to a TensorRT plan. The TensorRT plan is a serialized form of a TensorRT engine. The TensorRT engine represents the model optimized for execution on a chosen CUDA device.
The Python program trt_onnx_parser.py
serves this purpose.
import sys
import tensorrt as trt
def main():
    if len(sys.argv) != 3:
        sys.exit("Usage: python3 trt_onnx_parser.py <input_onnx_path> <output_plan_path>")
    onnx_path = sys.argv[1]
    plan_path = sys.argv[2]
    logger = trt.Logger()
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()
    config.max_workspace_size = 256 * 1024 * 1024
    config.set_flag(trt.BuilderFlag.DISABLE_TIMING_CACHE)
    parser = trt.OnnxParser(network, logger)
    ok = parser.parse_from_file(onnx_path)
    if not ok:
        sys.exit("ONNX parse error")
    plan = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as fp:
        fp.write(plan)
    print("DONE")

main()
The Python package tensorrt
implements TensorRT Python API and
provides a collection of Python object classes used to handle
various aspects of TensorRT inference and model parsing.
This program uses the following TensorRT API object classes:
- Logger - logger used by several other object classes
- Builder - a factory used to create several other classes
- INetworkDefinition - representation of TensorRT networks (models)
- IBuilderConfig - a class used to hold configuration parameters for Builder
- OnnxParser - a class used for parsing ONNX models into TensorRT network definitions
- IHostMemory - representation of buffers in a host memory
The program performs the following steps:
- creates logger: Logger representing a logger instance
- creates builder: Builder representing a builder instance
- uses builder to create network: INetworkDefinition representing an empty network instance
- uses builder to create config: IBuilderConfig representing a builder configuration instance
- sets the max_workspace_size configuration parameter representing the maximum workspace size that can be used by inference algorithms
- disables the timing cache
- creates parser: OnnxParser representing an ONNX parser instance; a reference to the previously created empty network definition is attached to the parser
- uses parser to parse the input ONNX file and convert it to the TensorRT network definition; assigns the parsing result to the attached network definition object
- uses builder to create plan: IHostMemory representing a serialized network (plan) stored in a host memory buffer
- saves the plan in the output file
The program has two command line arguments: a path to the input ONNX file and a path to the output TensorRT plan file.
We store generated plan files in the subdirectory plan
which
must be created before running the program:
mkdir -p plan
To run this program for conversion of ResNet50 ONNX representation, use the command:
python3 trt_onnx_parser.py ./onnx/resnet50.onnx ./plan/resnet50.plan
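As a side note, TensorRT distributions also ship with the trtexec command-line tool (for DEB installations it is typically located in /usr/src/tensorrt/bin), which can perform a similar ONNX-to-plan conversion without writing any code, for example:
/usr/src/tensorrt/bin/trtexec --onnx=./onnx/resnet50.onnx --saveEngine=./plan/resnet50.plan
We use our own parser programs in this article because they make the involved API objects explicit.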
The Python program trt_onnx_parser_all.py can be used to produce TensorRT plans for all considered torchvision image classification models.
import sys
import tensorrt as trt
MODELS = [
'alexnet',
'densenet121',
'densenet161',
'densenet169',
'densenet201',
'mnasnet0_5',
'mnasnet1_0',
'mobilenet_v2',
'mobilenet_v3_large',
'mobilenet_v3_small',
'resnet18',
'resnet34',
'resnet50',
'resnet101',
'resnet152',
'resnext50_32x4d',
'resnext101_32x8d',
'shufflenet_v2_x0_5',
'shufflenet_v2_x1_0',
'squeezenet1_0',
'squeezenet1_1',
'vgg11',
'vgg11_bn',
'vgg13',
'vgg13_bn',
'vgg16',
'vgg16_bn',
'vgg19',
'vgg19_bn',
'wide_resnet50_2',
'wide_resnet101_2',
]
def setup_builder():
    logger = trt.Logger()
    builder = trt.Builder(logger)
    return (logger, builder)

def generate_plan(logger, builder, name):
    print('Generate TensorRT plan for ' + name)
    onnx_path = './onnx/' + name + '.onnx'
    plan_path = './plan/' + name + '.plan'
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()
    config.max_workspace_size = 256 * 1024 * 1024
    config.set_flag(trt.BuilderFlag.DISABLE_TIMING_CACHE)
    parser = trt.OnnxParser(network, logger)
    ok = parser.parse_from_file(onnx_path)
    if not ok:
        sys.exit('ONNX parse error')
    plan = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as fp:
        fp.write(plan)

def main():
    logger, builder = setup_builder()
    for name in MODELS:
        generate_plan(logger, builder, name)
    print('DONE')

main()
To run this program, enter the following commands:
mkdir -p plan
python3 trt_onnx_parser_all.py
Conversion of the ONNX representation to a TensorRT plan can also be implemented using the TensorRT C++ API.
The C++ program trt_onnx_parser.cpp
serves this purpose.
#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include "common.h"
// wrapper class for ONNX parser
class OnnxParser {
public:
OnnxParser();
~OnnxParser();
public:
void Init();
void Parse(const char *onnxPath, const char *planPath);
private:
bool m_active;
Logger m_logger;
UniquePtr<nvinfer1::IBuilder> m_builder;
UniquePtr<nvinfer1::INetworkDefinition> m_network;
UniquePtr<nvinfer1::IBuilderConfig> m_config;
UniquePtr<nvonnxparser::IParser> m_parser;
};
OnnxParser::OnnxParser(): m_active(false) { }
OnnxParser::~OnnxParser() { }
void OnnxParser::Init() {
assert(!m_active);
m_builder.reset(nvinfer1::createInferBuilder(m_logger));
if (m_builder == nullptr) {
Error("Error creating infer builder");
}
auto networkFlags = 1 << int(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
m_network.reset(m_builder->createNetworkV2(networkFlags));
if (m_network == nullptr) {
Error("Error creating network");
}
m_config.reset(m_builder->createBuilderConfig());
if (m_config == nullptr) {
Error("Error creating builder config");
}
m_config->setMaxWorkspaceSize(256 * 1024 * 1024);
m_config->setFlag(nvinfer1::BuilderFlag::kDISABLE_TIMING_CACHE);
m_parser.reset(nvonnxparser::createParser(*m_network, m_logger));
if (m_parser == nullptr) {
Error("Error creating ONNX parser");
}
}
void OnnxParser::Parse(const char *onnxPath, const char *planPath) {
bool ok = m_parser->parseFromFile(onnxPath, static_cast<int>(m_logger.SeverityLevel()));
if (!ok) {
Error("ONNX parse error");
}
UniquePtr<nvinfer1::IHostMemory> plan(m_builder->buildSerializedNetwork(*m_network, *m_config));
if (plan == nullptr) {
Error("Network serialization error");
}
const void *data = plan->data();
size_t size = plan->size();
FILE *fp = fopen(planPath, "wb");
if (fp == nullptr) {
Error("Failed to create file %s", planPath);
}
fwrite(data, 1, size, fp);
fclose(fp);
}
// main program
int main(int argc, char *argv[]) {
if (argc != 3) {
fprintf(stderr, "Usage: trt_onnx_parser <input_onnx_path> <output_plan_path>\n");
return 1;
}
const char *onnxPath = argv[1];
const char *planPath = argv[2];
printf("Generate TensorRT plan for %s\n", onnxPath);
OnnxParser parser;
parser.Init();
parser.Parse(onnxPath, planPath);
return 0;
}
The program is functionally similar to the previously described Python program trt_onnx_parser.py. Plans generated by the Python and C++ versions are interchangeable; either can be used for the subsequent inference with the Python and C++ programs described in this article.
The program uses the TensorRT C++ API specified in two header files:
- NvInfer.h defines the interface to the TensorRT inference engine, encapsulated in the nvinfer1 namespace
- NvOnnxParser.h defines the interface to the TensorRT ONNX parser, encapsulated in the nvonnxparser namespace
This program uses the following TensorRT API object classes:
- nvinfer1::ILogger - logger used by several other object classes
- nvinfer1::IBuilder - a factory used to create several other classes
- nvinfer1::INetworkDefinition - representation of TensorRT networks (models)
- nvinfer1::IBuilderConfig - a class used to hold configuration parameters for IBuilder
- nvonnxparser::IParser - a class used for parsing ONNX models into TensorRT network definitions
- nvinfer1::IHostMemory - representation of buffers in a host memory
Class OnnxParser holds smart pointers to instances of these objects. It exposes two principal public methods: Init and Parse.
The Init method performs the following steps:
- creates m_builder representing a builder instance
- uses m_builder to create m_network representing an empty network instance
- uses m_builder to create m_config representing a builder configuration instance
- sets the maxWorkspaceSize configuration parameter representing the maximum workspace size that can be used by inference algorithms
- disables the timing cache
- creates m_parser representing an ONNX parser instance; a reference to the previously created empty network definition is attached to the parser
The Parse method performs the following steps:
- uses m_parser to parse the input ONNX file and convert it to the TensorRT network definition; assigns the parsing result to the attached network definition object
- uses m_builder to create plan representing a serialized network (plan) stored in a host memory buffer
- saves the plan in the output file
The shell script build_trt_onnx_parser.sh
must be used to compile and link this program:
#!/bin/bash
mkdir -p ./bin
g++ -o ./bin/trt_onnx_parser \
-I /usr/local/cuda/include \
trt_onnx_parser.cpp common.cpp \
-L /usr/local/cuda/lib64 -lnvonnxparser -lnvinfer -lcudart
Running this script is straightforward:
./build_trt_onnx_parser.sh
The program has two command line arguments: a path to the input ONNX file and a path to the output TensorRT plan file.
To run this program for conversion of ResNet50 ONNX representation, use the command:
./bin/trt_onnx_parser ./onnx/resnet50.onnx ./plan/resnet50.plan
The shell script trt_onnx_parser_all.sh uses the C++ program to generate TensorRT plans for all considered torchvision image classification models:
#!/bin/bash
./bin/trt_onnx_parser ./onnx/alexnet.onnx ./plan/alexnet.plan
./bin/trt_onnx_parser ./onnx/densenet121.onnx ./plan/densenet121.plan
./bin/trt_onnx_parser ./onnx/densenet161.onnx ./plan/densenet161.plan
./bin/trt_onnx_parser ./onnx/densenet169.onnx ./plan/densenet169.plan
./bin/trt_onnx_parser ./onnx/densenet201.onnx ./plan/densenet201.plan
./bin/trt_onnx_parser ./onnx/mnasnet0_5.onnx ./plan/mnasnet0_5.plan
./bin/trt_onnx_parser ./onnx/mnasnet1_0.onnx ./plan/mnasnet1_0.plan
./bin/trt_onnx_parser ./onnx/mobilenet_v2.onnx ./plan/mobilenet_v2.plan
./bin/trt_onnx_parser ./onnx/mobilenet_v3_large.onnx ./plan/mobilenet_v3_large.plan
./bin/trt_onnx_parser ./onnx/mobilenet_v3_small.onnx ./plan/mobilenet_v3_small.plan
./bin/trt_onnx_parser ./onnx/resnet101.onnx ./plan/resnet101.plan
./bin/trt_onnx_parser ./onnx/resnet152.onnx ./plan/resnet152.plan
./bin/trt_onnx_parser ./onnx/resnet18.onnx ./plan/resnet18.plan
./bin/trt_onnx_parser ./onnx/resnet34.onnx ./plan/resnet34.plan
./bin/trt_onnx_parser ./onnx/resnet50.onnx ./plan/resnet50.plan
./bin/trt_onnx_parser ./onnx/resnext101_32x8d.onnx ./plan/resnext101_32x8d.plan
./bin/trt_onnx_parser ./onnx/resnext50_32x4d.onnx ./plan/resnext50_32x4d.plan
./bin/trt_onnx_parser ./onnx/shufflenet_v2_x0_5.onnx ./plan/shufflenet_v2_x0_5.plan
./bin/trt_onnx_parser ./onnx/shufflenet_v2_x1_0.onnx ./plan/shufflenet_v2_x1_0.plan
./bin/trt_onnx_parser ./onnx/squeezenet1_0.onnx ./plan/squeezenet1_0.plan
./bin/trt_onnx_parser ./onnx/squeezenet1_1.onnx ./plan/squeezenet1_1.plan
./bin/trt_onnx_parser ./onnx/vgg11.onnx ./plan/vgg11.plan
./bin/trt_onnx_parser ./onnx/vgg11_bn.onnx ./plan/vgg11_bn.plan
./bin/trt_onnx_parser ./onnx/vgg13.onnx ./plan/vgg13.plan
./bin/trt_onnx_parser ./onnx/vgg13_bn.onnx ./plan/vgg13_bn.plan
./bin/trt_onnx_parser ./onnx/vgg16.onnx ./plan/vgg16.plan
./bin/trt_onnx_parser ./onnx/vgg16_bn.onnx ./plan/vgg16_bn.plan
./bin/trt_onnx_parser ./onnx/vgg19.onnx ./plan/vgg19.plan
./bin/trt_onnx_parser ./onnx/vgg19_bn.onnx ./plan/vgg19_bn.plan
./bin/trt_onnx_parser ./onnx/wide_resnet101_2.onnx ./plan/wide_resnet101_2.plan
./bin/trt_onnx_parser ./onnx/wide_resnet50_2.onnx ./plan/wide_resnet50_2.plan
Running this script is straightforward:
mkdir -p ./plan
./trt_onnx_parser_all.sh
The inference programs in Python and C++ described in the rest of this article reuse several files introduced in Articles 1 and 2. These include:
- imagenet_classes.txt - class descriptions for ImageNet labels (Article 1)
- ./data/husky01.dat - pre-processed input tensor for the husky image (Article 2)
See the respective articles for details on obtaining these files.
The Python program trt_infer_plan.py
implements TensorRT inference using
the previously generated TensorRT plan and a pre-processed input image.
import sys
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
def softmax(x):
    y = np.exp(x)
    sum = np.sum(y)
    y /= sum
    return y

def topk(x, k):
    idx = np.argsort(x)
    idx = idx[::-1][:k]
    return (idx, x[idx])

def main():
    if len(sys.argv) != 3:
        sys.exit("Usage: python3 trt_infer_plan.py <plan_path> <input_path>")
    plan_path = sys.argv[1]
    input_path = sys.argv[2]
    print("Start " + plan_path)
    # read the plan
    with open(plan_path, "rb") as fp:
        plan = fp.read()
    # read the pre-processed image
    input = np.fromfile(input_path, np.float32)
    # read the categories
    with open("imagenet_classes.txt", "r") as f:
        categories = [s.strip() for s in f.readlines()]
    # initialize the TensorRT objects
    logger = trt.Logger()
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(plan)
    context = engine.create_execution_context()
    # create device buffers and TensorRT bindings
    output = np.zeros((1000), dtype=np.float32)
    d_input = cuda.mem_alloc(input.nbytes)
    d_output = cuda.mem_alloc(output.nbytes)
    bindings = [int(d_input), int(d_output)]
    # copy input to device, run inference, copy output to host
    cuda.memcpy_htod(d_input, input)
    context.execute_v2(bindings=bindings)
    cuda.memcpy_dtoh(output, d_output)
    # apply softmax and get Top-5 results
    output = softmax(output)
    top5p, top5v = topk(output, 5)
    # print results
    print("Top-5 results")
    for ind, val in zip(top5p, top5v):
        print(" {0} {1:.2f}%".format(categories[ind], val * 100))

main()
This program uses the following TensorRT API object classes:
- Logger - logger used by several other object classes
- Runtime - used to deserialize TensorRT plans to TensorRT CUDA engines
- ICudaEngine - engine for executing inference on built networks
- IExecutionContext - context for executing inference using a CUDA engine
The program performs the following steps:
- reads the plan
- reads the pre-processed image
- reads the ImageNet categories
- creates logger: Logger representing a logger instance
- creates runtime: Runtime representing a runtime instance
- uses runtime to deserialize the plan into engine: ICudaEngine
- creates context: IExecutionContext for the engine
- allocates a NumPy array to hold the output data on the host
- allocates device memory buffers for the input and output tensors
- specifies input/output bindings as a list holding addresses of all input and output buffers
- copies the input tensor from host to device
- runs inference for the context with the specified bindings
- copies the output tensor from device to host
- applies the softmax transformation to the output
- gets labels and probabilities for the top 5 results
- prints the top 5 classes and probabilities in a human-readable form
The program has two command line arguments: a path to the TensorRT plan file and a path to the file containing the pre-processed input image.
To run this program for the previously created ResNet50 plan and husky image, use the command:
python3 trt_infer_plan.py ./plan/resnet50.plan ./data/husky01.dat
The program output will look like:
Siberian husky 49.52%
Eskimo dog 42.90%
malamute 5.87%
dogsled 1.22%
Saint Bernard 0.32%
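The C++ version of this program, described below, additionally prints information about the engine bindings. If a similar diagnostic is desired in the Python program, it could be added after the engine is deserialized; a minimal sketch using the TensorRT 8.x Python API:

# assumes 'engine' is the ICudaEngine returned by runtime.deserialize_cuda_engine(plan)
for i in range(engine.num_bindings):
    name = engine.get_binding_name(i)
    kind = "input" if engine.binding_is_input(i) else "output"
    shape = engine.get_binding_shape(i)
    print("[{0}] {1} {2} {3}".format(i, name, kind, shape))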
Inference with TensorRT models can also be implemented using the TensorRT C++ API.
The C++ program trt_infer_plan.cpp
serves this purpose.
#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <NvInfer.h>
#include "common.h"
// wrapper class for inference engine
class Engine {
public:
Engine();
~Engine();
public:
void Init(const std::vector<char> &plan);
void Infer(const std::vector<float> &input, std::vector<float> &output);
void DiagBindings();
private:
bool m_active;
Logger m_logger;
UniquePtr<nvinfer1::IRuntime> m_runtime;
UniquePtr<nvinfer1::ICudaEngine> m_engine;
};
Engine::Engine(): m_active(false) { }
Engine::~Engine() { }
void Engine::Init(const std::vector<char> &plan) {
assert(!m_active);
m_runtime.reset(nvinfer1::createInferRuntime(m_logger));
if (m_runtime == nullptr) {
Error("Error creating infer runtime");
}
m_engine.reset(m_runtime->deserializeCudaEngine(plan.data(), plan.size(), nullptr));
if (m_engine == nullptr) {
Error("Error deserializing CUDA engine");
}
m_active = true;
}
void Engine::Infer(const std::vector<float> &input, std::vector<float> &output) {
assert(m_active);
UniquePtr<nvinfer1::IExecutionContext> context;
context.reset(m_engine->createExecutionContext());
if (context == nullptr) {
Error("Error creating execution context");
}
CudaBuffer<float> inputBuffer;
inputBuffer.Init(3 * 224 * 224);
assert(inputBuffer.Size() == input.size());
inputBuffer.Put(input.data());
CudaBuffer<float> outputBuffer;
outputBuffer.Init(1000);
void *bindings[2];
bindings[0] = inputBuffer.Data();
bindings[1] = outputBuffer.Data();
bool ok = context->executeV2(bindings);
if (!ok) {
Error("Error executing inference");
}
output.resize(outputBuffer.Size());
outputBuffer.Get(output.data());
}
void Engine::DiagBindings() {
int nbBindings = static_cast<int>(m_engine->getNbBindings());
printf("Bindings: %d\n", nbBindings);
for (int i = 0; i < nbBindings; i++) {
const char *name = m_engine->getBindingName(i);
bool isInput = m_engine->bindingIsInput(i);
nvinfer1::Dims dims = m_engine->getBindingDimensions(i);
std::string fmtDims = FormatDims(dims);
printf(" [%d] \"%s\" %s [%s]\n", i, name, isInput ? "input" : "output", fmtDims.c_str());
}
}
// I/O utilities
void ReadClasses(const char *path, std::vector<std::string> &classes) {
std::string line;
std::ifstream ifs(path, std::ios::in);
if (!ifs.is_open()) {
Error("Cannot open %s", path);
}
while (std::getline(ifs, line)) {
classes.push_back(line);
}
ifs.close();
}
void ReadPlan(const char *path, std::vector<char> &plan) {
std::ifstream ifs(path, std::ios::in | std::ios::binary);
if (!ifs.is_open()) {
Error("Cannot open %s", path);
}
ifs.seekg(0, ifs.end);
size_t size = ifs.tellg();
plan.resize(size);
ifs.seekg(0, ifs.beg);
ifs.read(plan.data(), size);
ifs.close();
}
void ReadInput(const char *path, std::vector<float> &input) {
std::ifstream ifs(path, std::ios::in | std::ios::binary);
if (!ifs.is_open()) {
Error("Cannot open %s", path);
}
size_t size = 3 * 224 * 224;
input.resize(size);
ifs.read(reinterpret_cast<char *>(input.data()), size * sizeof(float));
ifs.close();
}
void PrintOutput(const std::vector<float> &output, const std::vector<std::string> &classes) {
int top5p[5];
float top5v[5];
TopK(static_cast<int>(output.size()), output.data(), 5, top5p, top5v);
printf("Top-5 results\n");
for (int i = 0; i < 5; i++) {
std::string label = classes[top5p[i]];
float prob = 100.0f * top5v[i];
printf(" [%d] %s %.2f%%\n", i, label.c_str(), prob);
}
}
// main program
int main(int argc, char *argv[]) {
if (argc != 3) {
fprintf(stderr, "Usage: trt_infer_plan <plan_path> <input_path>\n");
return 1;
}
const char *planPath = argv[1];
const char *inputPath = argv[2];
printf("Start %s\n", planPath);
std::vector<std::string> classes;
ReadClasses("imagenet_classes.txt", classes);
std::vector<char> plan;
ReadPlan(planPath, plan);
std::vector<float> input;
ReadInput(inputPath, input);
std::vector<float> output;
Engine engine;
engine.Init(plan);
engine.DiagBindings();
engine.Infer(input, output);
Softmax(static_cast<int>(output.size()), output.data());
PrintOutput(output, classes);
return 0;
}
This program uses the following TensorRT API object classes:
- nvinfer1::ILogger - logger used by several other object classes
- nvinfer1::IRuntime - used to deserialize TensorRT plans to TensorRT CUDA engines
- nvinfer1::ICudaEngine - engine for executing inference on built networks
- nvinfer1::IExecutionContext - context for executing inference using a CUDA engine
Class Engine holds smart pointers to instances of these objects. It exposes two principal public methods: Init and Infer.
The Init method performs the following steps:
- creates m_runtime representing a runtime instance
- uses m_runtime to deserialize the plan into m_engine
The Infer method performs the following steps:
- creates context for the m_engine
- allocates a CUDA memory buffer for the input tensor
- copies the input tensor from host to device
- allocates a CUDA memory buffer for the output tensor
- specifies input/output bindings as an array holding addresses of all input and output buffers
- runs inference for the context with the specified bindings
- copies the output tensor from device to host
The program performs the following steps:
- reads the plan
- reads the pre-processed image
- reads the ImageNet categories
- creates the engine and initializes it using the Init method
- runs inference with the engine using the Infer method
- applies the softmax transformation to the output
- gets labels and probabilities for the top 5 results
- prints the top 5 classes and probabilities in a human-readable form
NOTE: In this program we intentionally use the deprecated version of the IRuntime::deserializeCudaEngine method, which requires a trailing nullptr argument, because, at the time of writing, using the new version without this argument sometimes caused unexpected program behavior on the considered GPU devices. The root cause of this problem has not yet been clarified; there might be an undocumented bug in the TensorRT inference library.
The shell script build_trt_infer_plan.sh
must be used to compile and link this program:
#!/bin/bash
mkdir -p ./bin
g++ -o ./bin/trt_infer_plan \
-I /usr/local/cuda/include \
trt_infer_plan.cpp common.cpp \
-L /usr/local/cuda/lib64 -lnvinfer -lcudart
Running this script is straightforward:
./build_trt_infer_plan.sh
The program has two command line arguments: a path to the TensorRT plan file and a path to the file containing the pre-processed input image.
To run this program for the previously created ResNet50 plan and husky image, use the command:
./bin/trt_infer_plan ./plan/resnet50.plan ./data/husky01.dat
The program output will look like:
Bindings: 2
[0] "input.1" input [1 3 224 224]
[1] "495" output [1 1000]
Top-5 results
[0] Siberian husky 49.53%
[1] Eskimo dog 42.90%
[2] malamute 5.87%
[3] dogsled 1.22%
[4] Saint Bernard 0.32%
The Python program trt_bench_plan.py
implements inference benchmarking using
the previously generated TensorRT plan and a pre-processed input image.
import sys
from time import perf_counter
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
def softmax(x):
    y = np.exp(x)
    sum = np.sum(y)
    y /= sum
    return y

def topk(x, k):
    idx = np.argsort(x)
    idx = idx[::-1][:k]
    return (idx, x[idx])

def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python3 trt_bench_plan.py <plan_path>")
    plan_path = sys.argv[1]
    print("Start " + plan_path)
    # read the plan
    with open(plan_path, "rb") as fp:
        plan = fp.read()
    # generate random input
    np.random.seed(1234)
    input = np.random.random(3 * 224 * 224)
    input = input.astype(np.float32)
    # initialize the TensorRT objects
    logger = trt.Logger()
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(plan)
    context = engine.create_execution_context()
    # create device buffers and TensorRT bindings
    output = np.zeros((1000), dtype=np.float32)
    d_input = cuda.mem_alloc(input.nbytes)
    d_output = cuda.mem_alloc(output.nbytes)
    bindings = [int(d_input), int(d_output)]
    # copy input to device, run inference
    cuda.memcpy_htod(d_input, input)
    # warm up
    for i in range(10):
        context.execute_v2(bindings=bindings)
    # benchmark
    start = perf_counter()
    for i in range(100):
        context.execute_v2(bindings=bindings)
    end = perf_counter()
    elapsed = ((end - start) / 100) * 1000
    print('Model {0}: elapsed time {1:.2f} ms'.format(plan_path, elapsed))
    # record for automated extraction
    print('#{0};{1:f}'.format(plan_path, elapsed))
    # copy output to host
    cuda.memcpy_dtoh(output, d_output)
    # apply softmax and get Top-5 results
    output = softmax(output)
    top5p, top5v = topk(output, 5)
    # print results
    print("Top-5 results")
    for ind, val in zip(top5p, top5v):
        print(" {0} {1:.2f}%".format(ind, val * 100))

main()
This program uses the following TensorRT API object classes:
- Logger - logger used by several other object classes
- Runtime - used to deserialize TensorRT plans to TensorRT CUDA engines
- ICudaEngine - engine for executing inference on built networks
- IExecutionContext - context for executing inference using a CUDA engine
The program performs the following steps:
- reads the plan
- generates random input
- creates logger: Logger representing a logger instance
- creates runtime: Runtime representing a runtime instance
- uses runtime to deserialize the plan into engine: ICudaEngine
- creates context: IExecutionContext for the engine
- allocates a NumPy array to hold the output data on the host
- allocates device memory buffers for the input and output tensors
- specifies input/output bindings as a list holding addresses of all input and output buffers
- copies the input tensor from host to device
- measures performance by repeated execution of inference for the context with the specified bindings
- copies the output tensor from device to host
- applies the softmax transformation to the output
- gets class indices and probabilities for the top 5 results
- prints the top 5 class indices and probabilities in a human-readable form
The program prints a specially formatted line starting with "#" that will later be used for automated extraction of performance metrics.
The program uses a path to the TensorRT plan file as its single command line argument.
To run this program for the previously created ResNet50 plan, use the command:
python3 trt_bench_plan.py ./plan/resnet50.plan
The program output will look like:
Model resnet50_py.plan: elapsed time 1.59 ms
Top-5 results
610 6.29%
549 5.21%
446 5.00%
783 3.20%
892 2.93%
The shell script bench_plan_all_py.sh
performs benchmarking of all supported torchvision
models:
#!/bin/bash
echo "#head;TensorRT (Python)"
python3 trt_bench_plan.py ./plan/alexnet.plan
python3 trt_bench_plan.py ./plan/densenet121.plan
python3 trt_bench_plan.py ./plan/densenet161.plan
python3 trt_bench_plan.py ./plan/densenet169.plan
python3 trt_bench_plan.py ./plan/densenet201.plan
python3 trt_bench_plan.py ./plan/mnasnet0_5.plan
python3 trt_bench_plan.py ./plan/mnasnet1_0.plan
python3 trt_bench_plan.py ./plan/mobilenet_v2.plan
python3 trt_bench_plan.py ./plan/mobilenet_v3_large.plan
python3 trt_bench_plan.py ./plan/mobilenet_v3_small.plan
python3 trt_bench_plan.py ./plan/resnet101.plan
python3 trt_bench_plan.py ./plan/resnet152.plan
python3 trt_bench_plan.py ./plan/resnet18.plan
python3 trt_bench_plan.py ./plan/resnet34.plan
python3 trt_bench_plan.py ./plan/resnet50.plan
python3 trt_bench_plan.py ./plan/resnext101_32x8d.plan
python3 trt_bench_plan.py ./plan/resnext50_32x4d.plan
python3 trt_bench_plan.py ./plan/shufflenet_v2_x0_5.plan
python3 trt_bench_plan.py ./plan/shufflenet_v2_x1_0.plan
python3 trt_bench_plan.py ./plan/squeezenet1_0.plan
python3 trt_bench_plan.py ./plan/squeezenet1_1.plan
python3 trt_bench_plan.py ./plan/vgg11.plan
python3 trt_bench_plan.py ./plan/vgg11_bn.plan
python3 trt_bench_plan.py ./plan/vgg13.plan
python3 trt_bench_plan.py ./plan/vgg13_bn.plan
python3 trt_bench_plan.py ./plan/vgg16.plan
python3 trt_bench_plan.py ./plan/vgg16_bn.plan
python3 trt_bench_plan.py ./plan/vgg19.plan
python3 trt_bench_plan.py ./plan/vgg19_bn.plan
python3 trt_bench_plan.py ./plan/wide_resnet101_2.plan
python3 trt_bench_plan.py ./plan/wide_resnet50_2.plan
Running this script is straightforward:
./bench_plan_all_py.sh >bench_trt_py.log
The benchmarking log will be saved in bench_trt_py.log, which will later be used for performance comparison of various deployment methods.
Benchmarking of TensorRT models can also be implemented using the TensorRT C++ API.
The C++ program trt_bench_plan.cpp
serves this purpose.
#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <vector>
#include <iostream>
#include <fstream>
#include <NvInfer.h>
#include "common.h"
// wrapper class for inference engine
class Engine {
public:
Engine();
~Engine();
public:
void Init(const std::vector<char> &plan);
void StartInfer(const std::vector<float> &input);
void RunInfer();
void EndInfer(std::vector<float> &output);
private:
bool m_active;
Logger m_logger;
UniquePtr<nvinfer1::IRuntime> m_runtime;
UniquePtr<nvinfer1::ICudaEngine> m_engine;
UniquePtr<nvinfer1::IExecutionContext> m_context;
CudaBuffer<float> m_inputBuffer;
CudaBuffer<float> m_outputBuffer;
};
Engine::Engine(): m_active(false) { }
Engine::~Engine() { }
void Engine::Init(const std::vector<char> &plan) {
assert(!m_active);
m_runtime.reset(nvinfer1::createInferRuntime(m_logger));
if (m_runtime == nullptr) {
Error("Error creating infer runtime");
}
m_engine.reset(m_runtime->deserializeCudaEngine(plan.data(), plan.size(), nullptr));
if (m_engine == nullptr) {
Error("Error deserializing CUDA engine");
}
m_active = true;
}
void Engine::StartInfer(const std::vector<float> &input) {
assert(m_active);
m_context.reset(m_engine->createExecutionContext());
if (m_context == nullptr) {
Error("Error creating execution context");
}
m_inputBuffer.Init(3 * 224 * 224);
assert(m_inputBuffer.Size() == input.size());
m_inputBuffer.Put(input.data());
m_outputBuffer.Init(1000);
}
void Engine::RunInfer() {
void *bindings[2];
bindings[0] = m_inputBuffer.Data();
bindings[1] = m_outputBuffer.Data();
bool ok = m_context->executeV2(bindings);
if (!ok) {
Error("Error executing inference");
}
}
void Engine::EndInfer(std::vector<float> &output) {
output.resize(m_outputBuffer.Size());
m_outputBuffer.Get(output.data());
}
// I/O utilities
void ReadPlan(const char *path, std::vector<char> &plan) {
std::ifstream ifs(path, std::ios::in | std::ios::binary);
if (!ifs.is_open()) {
Error("Cannot open %s", path);
}
ifs.seekg(0, ifs.end);
size_t size = ifs.tellg();
plan.resize(size);
ifs.seekg(0, ifs.beg);
ifs.read(plan.data(), size);
ifs.close();
}
void GenerateInput(std::vector<float> &input) {
int size = 3 * 224 * 224;
input.resize(size);
float *p = input.data();
std::srand(1234);
for (int i = 0; i < size; i++) {
p[i] = static_cast<float>(std::rand()) / RAND_MAX;
}
}
void PrintOutput(const std::vector<float> &output) {
int top5p[5];
float top5v[5];
TopK(static_cast<int>(output.size()), output.data(), 5, top5p, top5v);
printf("Top-5 results\n");
for (int i = 0; i < 5; i++) {
int label = top5p[i];
float prob = 100.0f * top5v[i];
printf(" [%d] %d %.2f%%\n", i, label, prob);
}
}
// main program
int main(int argc, char *argv[]) {
if (argc != 2) {
fprintf(stderr, "Usage: trt_bench_plan <plan_path>\n");
return 1;
}
const char *planPath = argv[1];
printf("Start %s\n", planPath);
int repeat = 100;
std::vector<char> plan;
ReadPlan(planPath, plan);
std::vector<float> input;
GenerateInput(input);
std::vector<float> output;
Engine engine;
engine.Init(plan);
engine.StartInfer(input);
for (int i = 0; i < 10; i++) {
engine.RunInfer();
}
Timer timer;
timer.Start();
for (int i = 0; i < repeat; i++) {
engine.RunInfer();
}
timer.Stop();
float t = timer.Elapsed();
printf("Model %s: elapsed time %f ms / %d = %f\n", planPath, t, repeat, t / float(repeat));
// record for automated extraction
printf("#%s;%f\n", planPath, t / float(repeat));
engine.EndInfer(output);
Softmax(static_cast<int>(output.size()), output.data());
PrintOutput(output);
return 0;
}
This program uses the following TensorRT API object classes:
- nvinfer1::ILogger - logger used by several other object classes
- nvinfer1::IRuntime - used to deserialize TensorRT plans to TensorRT CUDA engines
- nvinfer1::ICudaEngine - engine for executing inference on built networks
- nvinfer1::IExecutionContext - context for executing inference using a CUDA engine
Class Engine holds smart pointers to instances of these objects. It exposes four principal public methods: Init, StartInfer, RunInfer, and EndInfer.
The Init method performs the following steps:
- creates m_runtime representing a runtime instance
- uses m_runtime to deserialize the plan into m_engine
The StartInfer method performs the following steps:
- creates m_context for the m_engine
- allocates a CUDA memory buffer for the input tensor
- copies the input tensor from host to device
- allocates a CUDA memory buffer for the output tensor
The RunInfer method performs the following steps:
- specifies input/output bindings as an array holding addresses of all input and output buffers
- runs inference for the m_context with the specified bindings
The EndInfer
method performs the following step:
- copies the output tensor from device to host
The program performs the following steps:
- reads the plan
- generates random input
- creates the engine
- initializes inference on the engine using the StartInfer method
- measures performance by repeated execution of inference with the engine using the RunInfer method
- completes inference on the engine using the EndInfer method
- applies the softmax transformation to the output
- gets class indices and probabilities for the top 5 results
- prints the top 5 class indices and probabilities in a human-readable form
The program prints a specially formatted line starting with "#" that will later be used for automated extraction of performance metrics.
NOTE: In this program we intentionally use the deprecated version of the IRuntime::deserializeCudaEngine method, which requires a trailing nullptr argument, because, at the time of writing, using the new version without this argument sometimes caused unexpected program behavior on the considered GPU devices. The root cause of this problem has not yet been clarified; there might be an undocumented bug in the TensorRT inference library.
The shell script build_trt_bench_plan.sh
must be used to compile and link this program:
#!/bin/bash
mkdir -p ./bin
g++ -o ./bin/trt_bench_plan \
-I /usr/local/cuda/include \
trt_bench_plan.cpp common.cpp \
-L /usr/local/cuda/lib64 -lnvinfer -lcudart
Running this script is straightforward:
./build_trt_bench_plan.sh
The program has a single command line argument: a path to the TensorRT plan file.
To run this program for the previously created ResNet50 plan, use the command:
./bin/trt_bench_plan ./plan/resnet50.plan
The program output will look like:
Model resnet50.plan: elapsed time 179.491653 ms / 100 = 1.794917
Top-5 results
[0] 610 4.25%
[1] 549 3.90%
[2] 783 3.64%
[3] 892 3.51%
[4] 446 3.18%
The shell script bench_plan_all.sh
performs benchmarking of all supported torchvision
models:
#!/bin/bash
echo "#head;TensorRT (C++)"
./bin/trt_bench_plan ./plan/alexnet.plan
./bin/trt_bench_plan ./plan/densenet121.plan
./bin/trt_bench_plan ./plan/densenet161.plan
./bin/trt_bench_plan ./plan/densenet169.plan
./bin/trt_bench_plan ./plan/densenet201.plan
./bin/trt_bench_plan ./plan/mnasnet0_5.plan
./bin/trt_bench_plan ./plan/mnasnet1_0.plan
./bin/trt_bench_plan ./plan/mobilenet_v2.plan
./bin/trt_bench_plan ./plan/mobilenet_v3_large.plan
./bin/trt_bench_plan ./plan/mobilenet_v3_small.plan
./bin/trt_bench_plan ./plan/resnet101.plan
./bin/trt_bench_plan ./plan/resnet152.plan
./bin/trt_bench_plan ./plan/resnet18.plan
./bin/trt_bench_plan ./plan/resnet34.plan
./bin/trt_bench_plan ./plan/resnet50.plan
./bin/trt_bench_plan ./plan/resnext101_32x8d.plan
./bin/trt_bench_plan ./plan/resnext50_32x4d.plan
./bin/trt_bench_plan ./plan/shufflenet_v2_x0_5.plan
./bin/trt_bench_plan ./plan/shufflenet_v2_x1_0.plan
./bin/trt_bench_plan ./plan/squeezenet1_0.plan
./bin/trt_bench_plan ./plan/squeezenet1_1.plan
./bin/trt_bench_plan ./plan/vgg11.plan
./bin/trt_bench_plan ./plan/vgg11_bn.plan
./bin/trt_bench_plan ./plan/vgg13.plan
./bin/trt_bench_plan ./plan/vgg13_bn.plan
./bin/trt_bench_plan ./plan/vgg16.plan
./bin/trt_bench_plan ./plan/vgg16_bn.plan
./bin/trt_bench_plan ./plan/vgg19.plan
./bin/trt_bench_plan ./plan/vgg19_bn.plan
./bin/trt_bench_plan ./plan/wide_resnet101_2.plan
./bin/trt_bench_plan ./plan/wide_resnet50_2.plan
Running this script is straightforward:
./bench_plan_all.sh >bench_trt.log
The benchmarking log will be saved in bench_trt.log, which will later be used for performance comparison of various deployment methods.
The Python program merge_perf.py introduced in Article 2 extracts performance metrics from multiple benchmarking log files and merges them into a single CSV file in a form suitable for further analysis.
The program takes two or more command line arguments, each specifying a path to a log file.
The program extracts special records starting with "#"
from all input files,
merges the extracted information, and saves it as a single CSV file.
Each line of the output file corresponds to one model and each column corresponds to
one deployment method.
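The extraction logic itself is straightforward. A minimal sketch of the idea, assuming that each log begins with a '#head;<column title>' record and contains one '#<model>;<time>' record per model as emitted by the benchmarking programs (the actual merge_perf.py from Article 2 may differ in details such as value formatting), might look like this:

import os
import sys

def model_name(raw):
    # reduce a record key such as './plan/resnet50.plan' to 'resnet50'
    return os.path.splitext(os.path.basename(raw))[0]

heads = []
tables = []
for path in sys.argv[1:]:
    head = None
    table = {}
    with open(path, "r") as fp:
        for line in fp:
            if not line.startswith("#"):
                continue
            key, value = line[1:].strip().split(";")
            if key == "head":
                head = value  # column title, e.g. 'TensorRT (Python)'
            else:
                table[model_name(key)] = value
    heads.append(head)
    tables.append(table)

# one CSV line per model, one column per input log
print("Model;" + ";".join(heads))
for model in sorted(tables[0]):
    print(model + ";" + ";".join(t.get(model, "") for t in tables))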
Assuming that the benchmarking described in Articles 1, 2, and 3 has been performed in the sibling directories art01, art02, and art03 respectively, and that the current directory is art03, the following command can be used to merge the five log files considered so far:
python3 merge_perf.py ../art01/bench_torch.log ../art02/bench_ts_py.log ../art02/bench_ts.log bench_trt_py.log bench_trt.log >perf03.csv
The output file perf03.csv
will look like:
Model;PyTorch;TorchScript (Python);TorchScript (C++);TensorRT (Python);TensorRT (C++)
alexnet;1.23;1.05;1.04;0.58;0.60
densenet121;19.79;13.65;13.34;3.73;3.67
densenet161;29.43;20.83;20.70;7.99;7.40
densenet169;28.47;19.33;20.11;8.17;7.32
densenet201;33.48;22.44;22.70;12.24;10.96
mnasnet0_5;5.45;3.63;3.67;0.64;0.61
mnasnet1_0;5.66;3.79;3.95;0.80;0.80
mobilenet_v2;6.19;4.12;4.02;0.77;0.76
mobilenet_v3_large;8.07;5.22;5.18;0.98;0.91
mobilenet_v3_small;6.37;4.20;4.19;0.74;0.67
resnet101;15.80;11.01;10.81;3.12;3.18
resnet152;23.66;16.65;16.37;4.57;4.57
resnet18;3.39;2.39;2.30;1.08;1.04
resnet34;6.11;4.22;4.11;1.84;1.79
resnet50;7.99;5.53;5.47;1.75;1.75
resnext101_32x8d;21.69;17.34;16.66;8.06;8.11
resnext50_32x4d;6.45;4.32;4.41;2.13;2.08
shufflenet_v2_x0_5;6.33;4.03;4.01;0.47;0.49
shufflenet_v2_x1_0;6.84;4.58;4.44;0.88;0.86
squeezenet1_0;3.05;2.28;2.33;0.41;0.42
squeezenet1_1;3.03;2.28;2.31;0.31;0.31
vgg11;1.91;1.81;1.84;1.74;1.75
vgg11_bn;2.37;1.93;1.96;1.75;1.75
vgg13;2.26;2.31;2.27;2.16;2.15
vgg13_bn;2.62;2.45;2.43;2.14;2.17
vgg16;2.82;2.75;2.88;2.64;2.61
vgg16_bn;3.23;3.10;3.06;2.61;2.65
vgg19;3.29;3.40;3.40;3.17;3.14
vgg19_bn;3.72;3.64;3.64;3.07;3.13
wide_resnet101_2;15.50;10.89;10.55;5.58;5.45
wide_resnet50_2;7.88;5.52;5.35;2.83;2.95
The Python program tab_perf.py introduced in Article 2 can be used to display the CSV data in tabular format.
To run this program, use the following command line:
python3 tab_perf.py perf03.csv >perf03.txt
The output file perf03.txt
will look like:
Model PyTorch TorchScript (Python) TorchScript (C++) TensorRT (Python) TensorRT (C++)
----------------------------------------------------------------------------------------------------------------------
alexnet 1.23 1.05 1.04 0.58 0.60
densenet121 19.79 13.65 13.34 3.73 3.67
densenet161 29.43 20.83 20.70 7.99 7.40
densenet169 28.47 19.33 20.11 8.17 7.32
densenet201 33.48 22.44 22.70 12.24 10.96
mnasnet0_5 5.45 3.63 3.67 0.64 0.61
mnasnet1_0 5.66 3.79 3.95 0.80 0.80
mobilenet_v2 6.19 4.12 4.02 0.77 0.76
mobilenet_v3_large 8.07 5.22 5.18 0.98 0.91
mobilenet_v3_small 6.37 4.20 4.19 0.74 0.67
resnet101 15.80 11.01 10.81 3.12 3.18
resnet152 23.66 16.65 16.37 4.57 4.57
resnet18 3.39 2.39 2.30 1.08 1.04
resnet34 6.11 4.22 4.11 1.84 1.79
resnet50 7.99 5.53 5.47 1.75 1.75
resnext101_32x8d 21.69 17.34 16.66 8.06 8.11
resnext50_32x4d 6.45 4.32 4.41 2.13 2.08
shufflenet_v2_x0_5 6.33 4.03 4.01 0.47 0.49
shufflenet_v2_x1_0 6.84 4.58 4.44 0.88 0.86
squeezenet1_0 3.05 2.28 2.33 0.41 0.42
squeezenet1_1 3.03 2.28 2.31 0.31 0.31
vgg11 1.91 1.81 1.84 1.74 1.75
vgg11_bn 2.37 1.93 1.96 1.75 1.75
vgg13 2.26 2.31 2.27 2.16 2.15
vgg13_bn 2.62 2.45 2.43 2.14 2.17
vgg16 2.82 2.75 2.88 2.64 2.61
vgg16_bn 3.23 3.10 3.06 2.61 2.65
vgg19 3.29 3.40 3.40 3.17 3.14
vgg19_bn 3.72 3.64 3.64 3.07 3.13
wide_resnet101_2 15.50 10.89 10.55 5.58 5.45
wide_resnet50_2 7.88 5.52 5.35 2.83 2.95
Analysis of these performance data reveals that using TensorRT provides a substantial performance increase compared to all previously considered deployment methods.
Differences between the TensorRT performance figures for Python and C++ are within the experimental error; Python and C++ can be considered equally good for running TensorRT inference.
The official NVIDIA TensorRT 8.0.3 documentation can be used for further reference. At the time of writing, detailed API information was available only for version 8.0.1. The index of documents covering all TensorRT versions is available here.
All recommendations and examples described in the Articles 1, 2, and 3 are also applicable to Genesis Cloud instances equipped with NVIDIA RTX 3090 GPUs. We have conducted benchmarking of inference for the image classification models on the RTX 3090 instance. Here are the results:
Model PyTorch TorchScript (Python) TorchScript (C++) TensorRT (Python) TensorRT (C++)
----------------------------------------------------------------------------------------------------------------------
alexnet 1.30 0.97 1.01 0.52 0.52
densenet121 19.91 13.76 13.80 3.65 3.58
densenet161 29.76 19.78 21.43 7.23 7.19
densenet169 28.93 19.06 19.67 7.03 6.91
densenet201 34.34 21.90 23.97 10.56 10.47
mnasnet0_5 5.55 3.44 3.78 0.63 0.61
mnasnet1_0 5.87 3.68 3.88 0.81 0.79
mobilenet_v2 6.21 3.90 4.21 0.73 0.71
mobilenet_v3_large 7.87 5.38 5.65 0.95 1.03
mobilenet_v3_small 6.49 4.38 4.43 0.70 0.79
resnet101 16.09 10.39 11.19 3.26 3.07
resnet152 24.34 15.38 17.10 4.54 4.55
resnet18 3.37 2.21 2.35 1.05 1.08
resnet34 6.11 4.03 4.25 1.84 1.75
resnet50 8.21 5.67 5.82 1.72 1.75
resnext101_32x8d 22.09 16.38 17.64 7.85 7.97
resnext50_32x4d 6.53 4.13 4.28 2.05 2.12
shufflenet_v2_x0_5 6.53 3.90 4.22 0.49 0.57
shufflenet_v2_x1_0 7.08 4.36 4.68 0.89 0.91
squeezenet1_0 3.16 2.21 2.40 0.40 0.38
squeezenet1_1 3.09 2.15 2.26 0.31 0.31
vgg11 1.92 1.55 1.58 1.50 1.51
vgg11_bn 2.32 1.78 1.84 1.49 1.49
vgg13 2.28 1.99 1.99 1.84 1.85
vgg13_bn 2.71 2.12 2.16 1.84 1.85
vgg16 2.68 2.46 2.50 2.27 2.25
vgg16_bn 3.27 2.72 3.55 2.27 2.28
vgg19 3.01 3.03 3.12 2.69 2.70
vgg19_bn 8.21 3.11 3.18 2.74 2.72
wide_resnet101_2 15.34 10.02 10.65 5.36 5.21
wide_resnet50_2 7.98 5.22 5.63 2.80 2.75
For all considered models and a batch size of 1, there is almost no performance improvement compared to the RTX 3080 results listed above.