
trtexec: ONNX to TensorRT conversion fails with --fp16, reporting a segmentation fault #4111

Closed
demuxin opened this issue Sep 5, 2024 · 20 comments
Labels
triaged Issue has been triaged by maintainers

Comments

demuxin commented Sep 5, 2024

Description

I'm building an engine from the model with the TensorRT C++ API, but it reports a segmentation fault.


Then I used the trtexec command and the same error was reported:

time trtexec --onnx=model.onnx --fp16

But the engine builds successfully when I omit the --fp16 flag.

What can be done to troubleshoot this problem?

Environment

TensorRT Version: 9.3 / 8.6

NVIDIA GPU: RTX 3090

NVIDIA Driver Version: 535.183.01

CUDA Version: 11.8

Operating System: Ubuntu 22

lix19937 commented Sep 5, 2024

Add --verbose to get more info.
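For example:

    trtexec --onnx=model.onnx --fp16 --verbose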

demuxin commented Sep 6, 2024

Hi @lix19937, I added --verbose, but there's still nothing useful for debugging.


This is the complete log file:

trtexec.log

lix19937 commented Sep 6, 2024

Can you upload the ONNX file here?

demuxin commented Sep 6, 2024

Yes, but the ONNX file is quite large (about 1.5 GB). Can you use Baidu Cloud?

moraxu added the Precision: FP16 and triaged labels Sep 7, 2024
lix19937 commented Sep 7, 2024

@demuxin Can you upload it to Google Drive?

demuxin commented Sep 7, 2024

Hi @lix19937, here is the Google Drive link for the ONNX model:

https://drive.google.com/file/d/1YyBqO0GbskV-_3Wc2ljHh5s84bH9ldV_/view?usp=sharing

lix19937 commented Sep 7, 2024

@demuxin I can run past your breakpoint:

[09/07/2024-23:34:26] [V] [TRT] Tactic Name: sm80_xmma_fprop_implicit_gemm_indexed_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x128x64_stage3_warpsize2x2x1_g1_tensor16x8x16 Tactic: 0x866e7a5f6401b67f Time: 1.84978
[09/07/2024-23:34:26] [V] [TRT] Conv_13151 (CaskConvolution[0x80000009]) profiling completed in 0.896959 seconds. Fastest Tactic: 0xa9177bbe4e767df8 Time: 1.61924
[09/07/2024-23:34:26] [V] [TRT] --------------- Timing Runner: Conv_13151 (CaskFlattenConvolution[0x80000036])
[09/07/2024-23:34:26] [V] [TRT] CaskFlattenConvolution has no valid tactics for this config, skipping
[09/07/2024-23:34:26] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: CaskConvolution Tactic: 0xa9177bbe4e767df8
[09/07/2024-23:34:26] [V] [TRT] =============== Computing costs for {ForeignNode[onnx::Cast_19359...Slice_16473]}
[09/07/2024-23:34:26] [V] [TRT] *************** Autotuning format combination: Half(1198080,4680,90,1), Half(299520,1170,45,1), Half(4792320,18720,180,1), Half(76544,299,23,1), Half(19169280,74880,360,1) -> Bool(99749,1,1), Float(25535744,256,1), Float(6000,4,1), Float(30000,20,4,1), Float(96000,64,1), Float(96000,64,1), Float(96000,64,1), Float(96000,64,1), Float(96000,64,1), Float(96000,64,1), Float(96000,64,1), Float(96000,64,1) ***************
[09/07/2024-23:34:26] [V] [TRT] --------------- Timing Runner: {ForeignNode[onnx::Cast_19359...Slice_16473]} (Myelin[0x80000023])
[09/07/2024-23:34:44] [V] [TRT]  (foreignNode) Set user's cuda kernel library
[09/07/2024-23:34:44] [V] [TRT]  (foreignNode) Pass fuse_conv_padding is currently skipped for dynamic shapes
[09/07/2024-23:34:44] [V] [TRT]  (foreignNode) Pass pad_conv_channel is currently skipped for dynamic shapes
[09/07/2024-23:34:44] [V] [TRT]  (foreignNode) Padding large gemms
[09/07/2024-23:34:50] [V] [TRT] Tactic: 0x0000000000000000 Time: 622.841
[09/07/2024-23:34:50] [V] [TRT] {ForeignNode[onnx::Cast_19359...Slice_16473]} (Myelin[0x80000023]) profiling completed in 24.2853 seconds. Fastest Tactic: 0x0000000000000000 Time: 622.841
[09/07/2024-23:34:50] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: Myelin Tactic: 0x0000000000000000
[09/07/2024-23:34:50] [V] [TRT] =============== Computing costs for PWN(Sin_16474)
[09/07/2024-23:34:50] [V] [TRT] *************** Autotuning format combination: Float(96000,64,1) -> Float(96000,64,1) ***************
[09/07/2024-23:34:50] [V] [TRT] --------------- Timing Runner: PWN(Sin_16474) (PointWiseV2[0x80000028])
[09/07/2024-23:34:50] [V] [TRT] Tactic: 0x0000000000000000 Time: 0.00653436
[09/07/2024-23:34:50] [V] [TRT] Tactic: 0x0000000000000001 Time: 0.0166034
[09/07/2024-23:34:50] [V] [TRT] Tactic: 0x0000000000000002 Time: 0.00403073
[09/07/2024-23:34:51] [V] [TRT] Tactic: 0x0000000000000003 Time: 0.00662068
[09/07/2024-23:34:51] [V] [TRT] Tactic: 0x0000000000000004 Time: 0.0042763
[09/07/2024-23:34:51] [V] [TRT] Tactic: 0x0000000000000005 Time: 0.00272336
[09/07/2024-23:34:51] [V] [TRT] Tactic: 0x0000000000000006 Time: 0.00682754
[09/07/2024-23:34:51] [V] [TRT] Tactic: 0x0000000000000007 Time: 0.00331127
[09/07/2024-23:34:51] [V] [TRT] Tactic: 0x0000000000000008 Time: 0.00405803
[09/07/2024-23:34:52] [V] [TRT] Tactic: 0x0000000000000009 Time: 0.009352
[09/07/2024-23:34:52] [V] [TRT] Tactic: 0x000000000000001c Time: 0.0106684
[09/07/2024-23:34:52] [V] [TRT] PWN(Sin_16474) (PointWiseV2[0x80000028]) profiling completed in 1.90085 seconds. Fastest Tactic: 0x0000000000000005 Time: 0.00272336
[09/07/2024-23:34:52] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: PointWiseV2 Tactic: 0x0000000000000005
[09/07/2024-23:34:52] [V] [TRT] *************** Autotuning format combination: Float(1,64,1) -> Float(1,64,1) ***************
[09/07/2024-23:34:52] [V] [TRT] --------------- Timing Runner: PWN(Sin_16474) (PointWiseV2[0x80000028])

But when the build continues, it raises the following error:

[09/07/2024-23:36:08] [V] [TRT] Adding reformat layer: Reformatted Input Tensor 3 to {ForeignNode[(Unnamed Layer* 6844) [ElementWise]...Concat_20562]} (onnx::Expand_24379) from Half(2,1) to Float(2,1)
[09/07/2024-23:36:08] [V] [TRT] Formats and tactics selection completed in 404.959 seconds.
[09/07/2024-23:36:08] [V] [TRT] After reformat layers: 369 layers
[09/07/2024-23:36:08] [V] [TRT] Total number of blocks in pre-optimized block assignment: 463
[09/07/2024-23:36:08] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[09/07/2024-23:36:11] [V] [TRT] Deleting timing cache: 2775 entries, served 8732 hits since creation.
[09/07/2024-23:36:11] [E] Error[1]: Unexpected exception vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)
[09/07/2024-23:36:11] [E] Engine could not be created from network
[09/07/2024-23:36:11] [E] Building engine failed
[09/07/2024-23:36:11] [E] Failed to create engine from model or file.
[09/07/2024-23:36:11] [E] Engine set up failed

My TensorRT version is v8.6.0.1. About the "Unexpected exception vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)" error, I think it is a bug in TRT; TRT v10 may have fixed it.

Your segmentation fault may be due to insufficient memory; you could close other user processes/tasks while building the engine.
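
You can watch GPU memory usage during the build with, for example:

    watch -n 1 nvidia-smi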

demuxin commented Sep 8, 2024

Thank you for your prompt reply.

I've confirmed that I'm building the engine on an RTX 3090 with plenty of memory left, so it shouldn't be an out-of-memory issue. It might also be a TensorRT bug.

I tried again with TensorRT 10.3 and it can build the engine successfully, but the model outputs are completely different from FP32 mode.

Do you have any better suggestions for troubleshooting this problem?

lix19937 commented Sep 8, 2024

Do you have any better suggestions for troubleshooting this problem?

Can you upload the log?

demuxin commented Sep 8, 2024

Inference via the TensorRT C++ API runs successfully, without any error message.

What log do you want me to upload?

lix19937 commented Sep 8, 2024

I tried again with TensorRT 10.3 and it can build the engine successfully, but the model outputs are completely different from FP32 mode.

The log from the trtexec --fp16 --verbose build.

demuxin commented Sep 8, 2024

This is the complete trtexec log file on TensorRT 10.3:

trtexec_fp16_v103.log

lix19937 commented Sep 8, 2024

Your model has a LayerNorm layer after self-attention, which overflows in FP16, so you should run LayerNorm in FP32.
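
For intuition (the standard FP16 argument, not something specific to your log): LayerNorm computes y = (x - mean) / sqrt(var + eps), and the variance accumulates x^2 terms. FP16 tops out at 65504, so any activation with |x| > 256 already pushes x^2 out of the representable range, and post-attention activations can easily get that large.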

demuxin commented Sep 8, 2024

Thanks for your hard work.

I'm using Netron to visualize the ONNX model and can't find any LayerNorm layer. Do you know what's going on?

Also, do you know how to set the LayerNorm layers individually to FP32 using the TensorRT C++ API and trtexec?

lix19937 commented Sep 9, 2024

You can refer to trtexec --help for setting LayerNorm layers individually to FP32:

  --precisionConstraints=spec Control precision constraint setting. (default = none)
                                  Precision Constraints: spec ::= "none" | "obey" | "prefer"
                                  none = no constraints
                                  prefer = meet precision constraints set by --layerPrecisions/--layerOutputTypes if possible
                                  obey = meet precision constraints set by --layerPrecisions/--layerOutputTypes or fail
                                         otherwise
  --layerPrecisions=spec      Control per-layer precision constraints. Effective only when precisionConstraints is set to
                              "obey" or "prefer". (default = none)
                              The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
                              layerName to specify the default precision for all the unspecified layers.
                              Per-layer precision spec ::= layerPrecision[","spec]
                                                  layerPrecision ::= layerName":"precision
                                                  precision ::= "fp32"|"fp16"|"int32"|"int8"
  --layerOutputTypes=spec 

I'm using netron to visualize onnx model and no layernorm layer is found, do you know what's going on?

The ONNX model contains LayerNorm as a decomposed subgraph of primitive ops, and TRT fuses those nodes at build time; that's why Netron shows no explicit LayerNorm layer.
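
For the C++ API part of your question, a minimal sketch (assuming network and config are the nvinfer1::INetworkDefinition* and nvinfer1::IBuilderConfig* you already build with; the LayerType::kNORMALIZATION check only matches once LayerNorm is imported as a single INormalizationLayer, i.e. opset >= 17):

    #include "NvInfer.h"

    // Force LayerNorm layers to run in FP32 while the rest of the
    // engine is built in FP16.
    void forceLayerNormFp32(nvinfer1::INetworkDefinition* network,
                            nvinfer1::IBuilderConfig* config)
    {
        config->setFlag(nvinfer1::BuilderFlag::kFP16);
        // Without this flag, setPrecision()/setOutputType() are only hints.
        config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);

        for (int32_t i = 0; i < network->getNbLayers(); ++i)
        {
            nvinfer1::ILayer* layer = network->getLayer(i);
            if (layer->getType() == nvinfer1::LayerType::kNORMALIZATION)
            {
                layer->setPrecision(nvinfer1::DataType::kFLOAT);
                layer->setOutputType(0, nvinfer1::DataType::kFLOAT);
            }
        }
    }

The two setFlag calls are the C++ equivalents of --fp16 and --precisionConstraints=obey, and the per-layer calls correspond to --layerPrecisions/--layerOutputTypes. Note that the layerName in those trtexec specs must match real layer names from the verbose log; a literal "layernorm" will not match anything.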

demuxin commented Sep 10, 2024

Hi @lix19937, I used this command to build the TensorRT engine:

trtexec --onnx=codetr_sim.onnx --fp16 --verbose \
    --precisionConstraints=obey \
    --layerPrecisions=layernorm:fp32 \
    --layerOutputTypes=layernorm:fp32

But there is still the following warning:

[09/10/2024-03:23:17] [W] [TRT] Detected layernorm nodes in FP16
[09/10/2024-03:23:17] [W] [TRT] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.

This is the build log file:
trtexec_fp16_v103.log

It seems the setting did not take effect. How should I set it?

Also, do you know how to set the LayerNorm layers individually to FP32 using the TensorRT C++ API?

demuxin commented Sep 11, 2024

I updated torch to 1.13 and exported the ONNX model with opset 17; that solved the issue.

demuxin closed this as completed Sep 11, 2024
jinhonglu commented

I had successfully converted the ONNX model to an FP32 engine.

I tried the suggestion above and added --verbose for the FP16 conversion.

The log shows no overflow messages, yet the results differ from the FP32 engine.

Any suggestions for further investigation?

trt_fp16.log

lix19937 commented

Try the following to see where the FP32 and FP16 results diverge:

    polygraphy run $spec_onnx --onnxrt --trt

    polygraphy run $spec_onnx --onnxrt --trt --fp16

jinhonglu commented Jan 23, 2025

Running polygraphy run $spec_onnx --onnxrt --trt:

[I]         onnxrt-runner-N0-01/23/25-16:11:45: output | Stats: mean=0.0016378, std-dev=0.0089018, var=7.9242e-05, median=0.00059201, min=-0.2853 at (0, 0, 267, 988, 0), max=0.4366 at (0, 0, 268, 968, 0), avg-magnitude=0.0031753, p90=0.004987, p95=0.004987, p99=0.034112
[I]             ---- Histogram ----
                Bin Range          |  Num Elems | Visualization
                (-0.285 , -0.213 ) |          7 | 
                (-0.213 , -0.141 ) |         82 | 
                (-0.141 , -0.0687) |       2291 | 
                (-0.0687, 0.00353) |    3642282 | ########################################
                (0.00353, 0.0757 ) |     585707 | ######
                (0.0757 , 0.148  ) |       8049 | 
                (0.148  , 0.22   ) |        865 | 
                (0.22   , 0.292  ) |        104 | 
                (0.292  , 0.365  ) |          9 | 
                (0.365  , 0.437  ) |          4 | 
[I]         trt-runner-N0-01/23/25-16:11:45: output | Stats: mean=0.0016377, std-dev=0.0089008, var=7.9224e-05, median=0.00059199, min=-0.28508 at (0, 0, 267, 988, 0), max=0.43679 at (0, 0, 268, 968, 0), avg-magnitude=0.0031751, p90=0.0049865, p95=0.0049865, p99=0.034105
[I]             ---- Histogram ----
                Bin Range          |  Num Elems | Visualization
                (-0.285 , -0.213 ) |          7 | 
                (-0.213 , -0.141 ) |         80 | 
                (-0.141 , -0.0687) |       2288 | 
                (-0.0687, 0.00353) |    3642283 | ########################################
                (0.00353, 0.0757 ) |     585718 | ######
                (0.0757 , 0.148  ) |       8044 | 
                (0.148  , 0.22   ) |        864 | 
                (0.22   , 0.292  ) |        103 | 
                (0.292  , 0.365  ) |          9 | 
                (0.365  , 0.437  ) |          4 | 

Running polygraphy run $spec_onnx --onnxrt --trt --fp16:

[I]         onnxrt-runner-N0-01/23/25-16:07:05: output | Stats: mean=0.0016378, std-dev=0.0089018, var=7.9242e-05, median=0.00059201, min=-0.2853 at (0, 0, 267, 988, 0), max=0.4366 at (0, 0, 268, 968, 0), avg-magnitude=0.0031753, p90=0.004987, p95=0.004987, p99=0.034112
[I]             ---- Histogram ----
                Bin Range          |  Num Elems | Visualization
                (-0.285 , -0.14  ) |         90 | 
                (-0.14  , 0.00457) |    3776629 | ########################################
                (0.00457, 0.15   ) |     461750 | ####
                (0.15   , 0.294  ) |        919 | 
                (0.294  , 0.439  ) |         12 | 
                (0.439  , 0.584  ) |          0 | 
                (0.584  , 0.729  ) |          0 | 
                (0.729  , 0.874  ) |          0 | 
                (0.874  , 1.02   ) |          0 | 
                (1.02   , 1.16   ) |          0 | 
[I]         trt-runner-N0-01/23/25-16:07:05: output | Stats: mean=0.38736, std-dev=0.38967, var=0.15184, median=0.2944, min=-0.07782 at (0, 0, 1261, 271, 1), max=1.1641 at (0, 0, 2048, 887, 0), avg-magnitude=0.39337, p90=0.81445, p95=0.81445, p99=0.85742
[I]             ---- Histogram ----
                Bin Range          |  Num Elems | Visualization
                (-0.285 , -0.14  ) |          0 | 
                (-0.14  , 0.00457) |    1405988 | ################################
                (0.00457, 0.15   ) |     713712 | ################
                (0.15   , 0.294  ) |          0 | 
                (0.294  , 0.439  ) |          0 | 
                (0.439  , 0.584  ) |       2068 | 
                (0.584  , 0.729  ) |     382453 | ########
                (0.729  , 0.874  ) |    1722631 | ########################################
                (0.874  , 1.02   ) |       9447 | 
                (1.02   , 1.16   ) |       3101 | 

I still can't tell which layer produces the differing output.

I tried --trt-outputs mark all --onnx-outputs mark all, but Polygraphy throws an error: Mismatched type for tensor 'ONNXTRT_Broadcast_3477_output', f16 vs. expected type: f32.
