
Failure of TensorRT 10.7 to eliminate concatenation with upstream custom layer #4345

Open
jchia opened this issue Feb 4, 2025 · 3 comments
Assignees: kevinch-nv
Labels: Feature Request (Request for new functionality), Module:Documentation (Lack of clarity in documentation), triaged (Issue has been triaged by maintainers)

jchia commented Feb 4, 2025

Description

It seems that TensorRT cannot eliminate a concatenation layer if there is an upstream custom layer.

In a simple model that uses only standard operators, TensorRT engine building eliminates the concatenation, but after I replaced Add with a CustomAdd plugin that computes the same thing, engine building no longer eliminates it.

This failure to eliminate concatenation diminishes the benefit of using plugins whenever a plugin feeds a concatenation layer, especially in terms of reducing the number of kernels: instead of the producer writing directly into the concatenated output, TensorRT inserts kernels called copyVectorizedKernel to do the copying.

From the engine-building log, the failure appears to be related to a concept called "striding support", but I could not find any documentation on it, especially in relation to plugins.

My goal is for the concatenation to also be eliminated in the case involving custom layers, so that there are no unnecessary copyVectorizedKernel launches. If the current behavior is by design, the documentation should call out this caveat about using plugins.
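
For context, here is a minimal sketch of the two graph shapes I am comparing; it is illustrative only (the shapes, tensor names, and custom domain are made up), not the actual repro models, which are in the repo linked below.

# Illustrative only: hypothetical shapes, names, and custom domain; the real
# models are sac16.onnx / sac16c.onnx in the linked repo.
import onnx
from onnx import TensorProto, helper

def make_model(op_type, domain=""):
    x = helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 16])
    y = helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 16])
    out = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 32])
    # One branch goes through Add (or the CustomAdd plugin op) ...
    plus1 = helper.make_node(op_type, ["x", "y"], ["part0_plus1"], domain=domain)
    # ... and both branches are concatenated into the single graph output.
    concat = helper.make_node("Concat", ["part0_plus1", "y"], ["output"], axis=1)
    graph = helper.make_graph([plus1, concat], "sac", [x, y], [out])
    opsets = [helper.make_opsetid("", 17)]
    if domain:
        opsets.append(helper.make_opsetid(domain, 1))
    return helper.make_model(graph, opset_imports=opsets)

onnx.save(make_model("Add"), "add_concat.onnx")                        # concat eliminated
onnx.save(make_model("CustomAdd", "custom"), "customadd_concat.onnx")  # copy kernel inserted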

Environment

TensorRT Version: 10.7

NVIDIA GPU: RTX 3080

NVIDIA Driver Version: 565.57.01

CUDA Version: 12.7

CUDNN Version: N/A

Operating System: Ubuntu 24.04

Python Version (if applicable): 3.12 (but irrelevant)

Tensorflow Version (if applicable): N/A

PyTorch Version (if applicable): N/A

Baremetal or Container (if so, version): baremetal

Relevant Files

https://github.com/jchia/trt-copy contains everything needed to reproduce the problem.

Steps To Reproduce

Clone https://github.com/jchia/trt-copy and follow https://github.com/jchia/trt-copy/blob/master/README.md.

The steps are:

$ make plugin.so
$ trtexec --verbose --onnx=sac16.onnx --saveEngine=sac16.plan
$ trtexec --verbose --onnx=sac16c.onnx --saveEngine=sac16c.plan --dynamicPlugins=./plugin.so
$ /opt/nvidia/nsight-compute/2024.3.2/ncu --target-processes all /usr/src/tensorrt/bin/trtexec --loadEngine=sac16.plan
$ /opt/nvidia/nsight-compute/2024.3.2/ncu --target-processes all /usr/src/tensorrt/bin/trtexec --loadEngine=sac16c.plan --dynamicPlugins=./plugin.so

The output of the engine-building steps indicates that concatenation is eliminated when Add is used but not when CustomAdd is used. Details are explained in the README.md.

In particular, for the model with Add (sac16.onnx), there are these lines:

Eliminating concatenation node_of_output
Retargeting part0_plus1 to output

But for the model with CustomAdd (sac16c.onnx), there are these lines:

Eliminating concatenation node_of_output
Generating copy for part0_plus1 to output because input does not support striding.

Commands or scripts:

Have you tried the latest release?: No

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Haven't tried, but it runs on TensorRT, suboptimally.

kevinch-nv added the Feature Request, Module:Documentation, and triaged labels on Feb 5, 2025
kevinch-nv self-assigned this on Feb 5, 2025
kevinch-nv (Collaborator) commented:

You are correct; the concatenation elimination pass currently does not support plugin nodes. I'll look into updating the TensorRT developer guide to document this.

Do you have a motivating use case where the time spent on the copy dominates the time saved by using a custom plugin?


jchia commented Feb 13, 2025

Is there any trick to make concatenation elimination work even with plugin nodes? I don't mind a bit of hacking if it's not too complicated.

My network contains about 20 instances of a custom layer type implemented as a plugin. The application is latency-sensitive, with data and compute volumes low enough that the GPU is not saturated, and the batch size is 1.

The custom layer is stateful, meaning each layer takes two inputs: the currently observed data/features and the current state. It produces two outputs: the regular output and the next state.

Because each custom layer is stateful, the entire network is also stateful. During inference, the second network/graph input is the current network state, and the second network/graph output is the next network state, which is fed back into the network for the next inference step. (The first network/graph input and output are the regular features and output.)

The network state is simply the combination of all layer states. When constructing a model, I can choose the number of custom layers. To keep deployment simple, I concatenate all layer output states into a single network output state, so that the user of the engine does not have to manage a variable number of inputs and outputs dictated by internal implementation details; the user only has to accommodate varying state sizes across different models.
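
To make the wiring concrete, here is a rough NumPy sketch of the dataflow; the layer arithmetic and the per-layer state size are made up, only the structure matters.

import numpy as np

NUM_LAYERS = 21
STATE_SIZE = 16  # hypothetical per-layer state width

def custom_layer(features, state):
    # Stand-in for the plugin: returns (regular output, next state).
    out = features + state
    next_state = 0.5 * (features + state)
    return out, next_state

def network(features, network_state):
    # network_state is the concatenation of all per-layer states.
    layer_states = np.split(network_state, NUM_LAYERS)
    next_states = []
    x = features
    for s in layer_states:
        x, s_next = custom_layer(x, s)
        next_states.append(s_next)
    # All next layer states are concatenated into one network state output;
    # this is the Concat I would like TensorRT to eliminate.
    return x, np.concatenate(next_states)

features = np.zeros(STATE_SIZE, dtype=np.float32)
state = np.zeros(NUM_LAYERS * STATE_SIZE, dtype=np.float32)
for _ in range(3):
    output, state = network(features, state)  # feed the next state back each step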

On an L4, I observe the following:

- From ncu profiling: for a model with 21 custom layers, each inference involves 130 kernel launches (counting the ==PROF== Profiling "some_kernel_name" ... console lines), including 41 copyVectorizedKernel launches.
- Without profiling: each inference (including the H2D cudaMemcpyAsync(), cudaGraphLaunch(), and D2H cudaMemcpyAsync()) takes about 190µs.

@jchia
Copy link
Author

jchia commented Feb 13, 2025

I did further experimentation focused on Slice, which has a related problem: when a Slice feeds into a plugin, a copyVectorizedKernel is used to perform the Slice instead of letting the plugin read the Slice's input at an offset.

I changed my plugin to accept the Slice's data input instead of its output, together with an offset, thereby avoiding the copyVectorizedKernel. For a particular model I tested, this reduced the inference time from 184µs to 137µs.
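
Roughly, the graph change looks like the sketch below; the node names, the "offset" attribute, and the custom domain are hypothetical, and the real rewrite is specific to my plugin.

from onnx import helper

# Before: the plugin consumed the output of a Slice; with a plugin downstream,
# the Slice ends up as a copyVectorizedKernel launch.
slice_node = helper.make_node("Slice", ["full", "starts", "ends", "axes"], ["sliced"])
plugin_old = helper.make_node("CustomLayer", ["sliced", "state"],
                              ["out", "next_state"], domain="custom")

# After: the plugin takes the full tensor plus an offset attribute and applies
# the offset itself in enqueue(), so no separate copy kernel is needed.
# ("offset" is a hypothetical attribute name.)
plugin_new = helper.make_node("CustomLayer", ["full", "state"],
                              ["out", "next_state"], domain="custom", offset=32)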
