CUDA error: device-side assert triggered #12
Replies: 32 comments
-
Hello @JakeRobertBaker, I am sorry to hear that you're having errors. Can you please run it again after setting this env variable:
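A minimal sketch, assuming the variable meant here is CUDA_LAUNCH_BLOCKING, which is the usual way to turn an opaque device-side assert into a traceback that points at the failing op:

import os

# Assumption: CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the
# Python traceback points at the real call site instead of a later, unrelated op.
# It must be set before torch initializes CUDA.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch (and start training) only after the variable is set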
-
BTW the starting lines of the log are not visible; can you also upload/paste them here? They should look similar to:
sys.platform: linux
Python: 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda-9.0
NVCC: Cuda compilation tools, release 9.0, V9.0.176
GPU 0,1,2,3: GeForce GTX 1080 Ti
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
GCC 7.3
Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CUDA Runtime 10.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
CuDNN 7.6.3
Magma 2.5.1
Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.5.0
OpenCV: 4.2.0
MMCV: 0.2.16
MMDetection: 1.0.0+923b70a
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 9.0
-
See attached log. See attached error output. Thanks,
-
@JakeRobertBaker have you checked the GPU memory usage? Are you sure CUDA memory is not full? As far as I understand, it fails after 5 steps of training. Also, have you tried opening an issue at https://github.com/open-mmlab/mmdetection/issues with the log details you have posted here?
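If it helps, a quick generic way to watch CUDA memory from inside the training loop (just a sketch, not something from this repo):

import torch

def log_gpu_memory(step):
    # Print allocated / reserved CUDA memory in MiB for each visible GPU.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 2**20
        reserved = torch.cuda.memory_reserved(i) / 2**20
        print(f"step {step} gpu {i}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")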
-
I don't think the memory is full, because that gives a different error message, which I solved earlier. I could open the issue there; it is interesting that it says it's to do with the BCE loss function. Did you run training with the same command I did? I converted xView to COCO format with your script, and the analysis of your uploaded weights agreed exactly.
-
I have done some debugging and discovered that the predictions tensor has some NaN values. I added some debugging code to focal_loss.py.
Maybe I have a dataset problem. I plan to redownload xView and rerun your scripts.
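For reference, a minimal sketch of the kind of check that can be dropped into focal_loss.py; the placement and variable names are assumptions, not the exact code used above:

import torch

def assert_finite(pred, target, name="sigmoid_focal_loss"):
    # Fail fast with a readable message instead of a device-side assert.
    if torch.isnan(pred).any() or torch.isinf(pred).any():
        raise RuntimeError(f"{name}: predictions contain NaN/Inf values")
    if (target < 0).any():
        raise RuntimeError(f"{name}: targets contain negative class indices")

# called right before the loss computation, e.g.:
# assert_finite(pred, target)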
-
I am currently facing the same problem. I am using xView, but slightly customized: I have increased the resolution and adapted the labels accordingly. I have rechecked the images and the JSON file containing the annotations, and they are correct. Either my error is exactly the same as @JakeRobertBaker's, or my loss first decreases before exploding, and eventually the model no longer predicts bounding boxes.
-
Can you @FabienMerceron and @JakeRobertBaker report your mmcv, CUDA, and mmdetection versions? I will perform a detailed local inspection today.
-
I created a fresh conda environment and ran the repo's install commands, so my versions should match those. Would you like me to make further checks? Thank you for the assistance :)
-
One thing I also noticed is that the xView dataset can have negative x,y coordinates for objects on the image border. Did you clean these out of the data? I am wondering if this is a potential source of problems.
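A minimal sketch of how such boxes could be clipped during conversion; the [x, y, w, h] box format and the helper name are assumptions, not taken from the actual conversion script:

def clip_xywh(x, y, w, h, img_w, img_h):
    # Clamp the box to the image bounds: move the top-left corner inside the
    # image, then shrink width/height so the box stays within the image.
    x0 = max(0.0, min(x, img_w))
    y0 = max(0.0, min(y, img_h))
    x1 = max(0.0, min(x + w, img_w))
    y1 = max(0.0, min(y + h, img_h))
    return x0, y0, x1 - x0, y1 - y0

# Boxes that end up with zero width or height after clipping can simply be dropped.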
-
You are using this script to convert the xView data to COCO format, right? https://github.com/fcakyon/small-object-detection-benchmark/blob/main/xview/xview_to_coco.py
-
Yes, I am using that script.
-
@JakeRobertBaker I have recreated your issue in one environment and finished a full epoch without any error in another. If the environment turns out to be the cause, I will update the requirements file accordingly.
-
It now seems to be working for me. One other question: which torch version are you using? I had to fix an error by also running an additional install command.
-
What is your platform? With the updated requirements file, I am having no error after:
conda install pytorch=1.10.0 torchvision=0.11.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
-
@JakeRobertBaker are you also having this issue with the fcos and vfnet models?
-
Can you also try after updating the train_pipeline to:
train_pipeline = [
dict(type="LoadImageFromFile"),
dict(type="LoadAnnotations", with_bbox=True),
dict(type="RandomCrop", crop_type="absolute_range", crop_size=(300, 500), allow_negative_crop=True),
dict(type="Resize", img_scale=(1333, 800), keep_ratio=True),
dict(type="RandomFlip", flip_ratio=0.5),
dict(type="Normalize", **img_norm_cfg),
dict(type="Pad", size_divisor=32),
dict(type="DefaultFormatBundle"),
dict(type="Collect", keys=["img", "gt_bboxes", "gt_labels"]),
]
-
VFNet appears to have worked okay.
-
Awesome! Can you send your xView to COCO conversion script, including the update for removing negative coordinates?
-
Thank you, I will upload shortly. Could I also ask why you repeat the training data 50 times?
-
@JakeRobertBaker if you look at the preprocessing config, it randomly crops a region from the original image at each step. If there are 300 full-sized images in the training set, it iterates over 300 image crops per epoch, and 1 crop from each image per epoch is not enough for the model to learn. Repeating the training data 50 times means 50 randomly selected crops will be used from each full-sized training image per epoch. You can use a smaller multiplier if you want 👍
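In an mmdetection config this kind of repetition is typically expressed with the RepeatDataset wrapper; a rough sketch (the paths, dataset type, and batch settings below are placeholders, not the repo's exact values):

data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type="RepeatDataset",
        times=50,  # 50 random crops per full-sized image per epoch
        dataset=dict(
            type="CocoDataset",
            ann_file="path/to/xview_train_coco.json",
            img_prefix="path/to/xview/images/",
            pipeline=train_pipeline,
        ),
    ),
)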
-
I first trained TOOD on xView using exactly the same configuration you share in the repo, and the training worked. But then, as I said above, I increased the resolution of the xView dataset to try to reach better results. It was at this point that I encountered the instability problem we are discussing here, with the model predicting NaN values after a few epochs. My problem was finally solved when I reduced the learning rate from 1e-2 to 1e-3, but training is obviously much longer. Do you have any idea what property of TOOD makes the model more unstable when I increase the resolution?
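For anyone reproducing this, the change boils down to the optimizer line of the mmdetection config, roughly as below; the momentum and weight decay values are the usual defaults, assumed rather than taken from the config in question:

# before: lr=0.01 (1e-2); after: lr=0.001 (1e-3)
optimizer = dict(type="SGD", lr=0.001, momentum=0.9, weight_decay=0.0001)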
-
@FabienMerceron there are extra loss terms introduced in TOOD which make it more unstable and require a stricter parameter arrangement than FCOS or other simpler models 👍
-
Ok, I didn't know that. Thank you very much!
-
By the way, I don't know if this is a good place to mention it, but for people who have more than 2 CPU cores it's really worth increasing the workers_per_gpu parameter for training, because I've noticed that the dataloader is a strong bottleneck in terms of speed. For example, I have 1 GPU and 12 CPUs; setting workers_per_gpu=10 speeds up my training by a factor of 5.
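In the mmdetection config this is the workers_per_gpu field of the data dict; a minimal sketch (the samples_per_gpu value is a placeholder):

data = dict(
    samples_per_gpu=2,   # batch size per GPU
    workers_per_gpu=10,  # dataloader worker processes per GPU
)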
-
@FabienMerceron thanks a lot for the tip, it's very helpful!
-
Did you use --gpus=8 to train on multiple GPUs? I can't seem to get training to use multiple GPUs.
-
@JakeRobertBaker I have trained on a single A100 GPU.
-
@fcakyon you might be confusing us: I haven't tried filtering invalid annotations, @JakeRobertBaker did it.
-
You are right @FabienMerceron, I corrected my comment. Sorry for the confusion.
-
I am trying to run training to reproduce your results. Yes, I know I can download your model, but I would like to see if I can recreate the training.
Upon running
python mmdet_tools/train.py mmdet_configs/xview_tood/tood_crop_300_500_cls_60.py
Training is fine for a few epochs, then I receive the error:
RuntimeError: CUDA error: device-side assert triggered
I have attached the full error message as a txt file:
eror.txt
I have only changed the config file to use the filepath for the xView data on my machine.