CUDA error: device-side assert triggered #12
Replies: 32 comments
-
Hello @JakeRobertBaker, I am sorry to hear that you're having errors. Can you please run it again after setting this env variable:
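A minimal sketch, assuming the variable meant here is CUDA_LAUNCH_BLOCKING, which is the usual way to turn an opaque device-side assert into a traceback that points at the failing op:

import os

# Assumption: CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the
# Python traceback points at the real call site instead of a later, unrelated op.
# It must be set before torch initializes CUDA.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch (and start training) only after the variable is set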
-
BTW the starting lines of the log are not visible; can you also upload/paste them here? They should look similar to:
sys.platform: linux
Python: 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda-9.0
NVCC: Cuda compilation tools, release 9.0, V9.0.176
GPU 0,1,2,3: GeForce GTX 1080 Ti
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
GCC 7.3
Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CUDA Runtime 10.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
CuDNN 7.6.3
Magma 2.5.1
Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.5.0
OpenCV: 4.2.0
MMCV: 0.2.16
MMDetection: 1.0.0+923b70a
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 9.0
-
See attached log. See attached error output. Thanks,
-
@JakeRobertBaker have you checked the GPU memory usage? Are you sure CUDA memory is not full? As far as I understand, it fails after 5 steps of training. Also, have you tried opening an issue at https://github.com/open-mmlab/mmdetection/issues with the log details you have posted here?
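If it helps, a quick generic way to watch CUDA memory from inside the training loop (just a sketch, not something from this repo):

import torch

def log_gpu_memory(step):
    # Print allocated / reserved CUDA memory in MiB for each visible GPU.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 2**20
        reserved = torch.cuda.memory_reserved(i) / 2**20
        print(f"step {step} gpu {i}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")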
-
I don't think the memory is full, because that gives a different error message, which I solved earlier. I could open the issue there; it is interesting that it says it's to do with the BCE loss function. Did you run training with the same command I did? I converted xView to COCO format with your script, and the analysis of your uploaded weights agreed exactly.
-
I have done some debugging and discovered that the predictions tensor has some NaN values. I added some debugging code to focal_loss.py.
Maybe I have a dataset problem. I plan to redownload xView and rerun your scripts.
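For reference, a minimal sketch of the kind of check that can be dropped into focal_loss.py; the placement and variable names are assumptions, not the exact code used above:

import torch

def assert_finite(pred, target, name="sigmoid_focal_loss"):
    # Fail fast with a readable message instead of a device-side assert.
    if torch.isnan(pred).any() or torch.isinf(pred).any():
        raise RuntimeError(f"{name}: predictions contain NaN/Inf values")
    if (target < 0).any():
        raise RuntimeError(f"{name}: targets contain negative class indices")

# called right before the loss computation, e.g.:
# assert_finite(pred, target)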
-
I am currently facing the same problem. I am using xView, but slightly customized: I have increased the resolution and adapted the labels accordingly. I have rechecked the images and the JSON file containing the annotations, and they are correct. Either my error is exactly the same as @JakeRobertBaker's, or my loss first decreases before exploding, and eventually the model no longer predicts bounding boxes.
-
Can you @FabienMerceron and @JakeRobertBaker report your mmcv, CUDA, and mmdetection versions? I will perform a detailed local inspection today.
-
I created a fresh conda environment and ran the repo's install commands, so my versions should match those. Would you like me to make further checks? Thank you for the assistance :)
-
One thing I also noticed is that the xView dataset can have negative x,y coordinates for objects on the image border. Did you clean these out of the data? I am wondering if this is a potential source of problems.
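A minimal sketch of how such boxes could be clipped during conversion; the [x, y, w, h] box format and the helper name are assumptions, not taken from the actual conversion script:

def clip_xywh(x, y, w, h, img_w, img_h):
    # Clamp the box to the image bounds: move the top-left corner inside the
    # image, then shrink width/height so the box stays within the image.
    x0 = max(0.0, min(x, img_w))
    y0 = max(0.0, min(y, img_h))
    x1 = max(0.0, min(x + w, img_w))
    y1 = max(0.0, min(y + h, img_h))
    return x0, y0, x1 - x0, y1 - y0

# Boxes that end up with zero width or height after clipping can simply be dropped.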
-
You are using this script to convert the xView data to COCO format, right? https://github.com/fcakyon/small-object-detection-benchmark/blob/main/xview/xview_to_coco.py
-
Yes, I am using that script.
-
@JakeRobertBaker I have recreated your issue in one environment and finished a full epoch without any error in another. If the environment turns out to be the cause, I will update the requirements file accordingly.
-
It now seems to be working for me. One other question: which torch version are you using? I had to fix an error by also running an additional install command.
-
What is your platform? With the updated requirements file, I am having no error after:
conda install pytorch=1.10.0 torchvision=0.11.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
-
@JakeRobertBaker are you also having this issue with the fcos and vfnet models?
-
Can you also try after updating the train_pipeline to:
train_pipeline = [
dict(type="LoadImageFromFile"),
dict(type="LoadAnnotations", with_bbox=True),
dict(type="RandomCrop", crop_type="absolute_range", crop_size=(300, 500), allow_negative_crop=True),
dict(type="Resize", img_scale=(1333, 800), keep_ratio=True),
dict(type="RandomFlip", flip_ratio=0.5),
dict(type="Normalize", **img_norm_cfg),
dict(type="Pad", size_divisor=32),
dict(type="DefaultFormatBundle"),
dict(type="Collect", keys=["img", "gt_bboxes", "gt_labels"]),
]
-
VFNet appears to have worked okay.
-
Awesome! Can you send your xView to COCO conversion script, including the update for removing negative coordinates?
-
Thank you, I will upload shortly. Could I also ask why you repeat the training data 50 times?
-
@JakeRobertBaker if you look at the preprocessing config, it randomly crops a region from the original image at each step. If there are 300 full-sized images in the training set, it iterates over 300 image crops per epoch, and 1 crop from each image per epoch is not enough for the model to learn. Repeating the training data 50 times means 50 randomly selected crops will be used from each full-sized training image per epoch. You can use a smaller multiplier if you want 👍
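In an mmdetection config this kind of repetition is typically expressed with the RepeatDataset wrapper; a rough sketch (the paths, dataset type, and batch settings below are placeholders, not the repo's exact values):

data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type="RepeatDataset",
        times=50,  # 50 random crops per full-sized image per epoch
        dataset=dict(
            type="CocoDataset",
            ann_file="path/to/xview_train_coco.json",
            img_prefix="path/to/xview/images/",
            pipeline=train_pipeline,
        ),
    ),
)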
-
I first trained TOOD on xView using exactly the same configuration you share in the repo, and the training worked. But then, as I said above, I increased the resolution of the xView dataset to try to reach better results. It was at this point that I encountered the instability problem we are discussing here, with the model predicting NaN values after a few epochs. My problem was finally solved when I reduced the learning rate from 1e-2 to 1e-3, but training is obviously much longer. Do you have any idea what property of TOOD makes the model more unstable when I increase the resolution?
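For anyone reproducing this, the change boils down to the optimizer line of the mmdetection config, roughly as below; the momentum and weight decay values are the usual defaults, assumed rather than taken from the config in question:

# before: lr=0.01 (1e-2); after: lr=0.001 (1e-3)
optimizer = dict(type="SGD", lr=0.001, momentum=0.9, weight_decay=0.0001)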
-
@FabienMerceron there are extra loss terms introduced in TOOD which make it more unstable and require a stricter parameter arrangement than FCOS or other simpler models 👍
-
Ok, I didn't know that. Thank you very much!
-
By the way, I don't know if this is a good place to mention it, but for people who have more than 2 CPU cores it's really worth increasing the workers_per_gpu parameter for training, because I've noticed that the dataloader is a strong bottleneck in terms of speed. For example, I have 1 GPU and 12 CPUs; setting workers_per_gpu=10 speeds up my training by a factor of 5.
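In the mmdetection config this is the workers_per_gpu field of the data dict; a minimal sketch (the samples_per_gpu value is a placeholder):

data = dict(
    samples_per_gpu=2,   # batch size per GPU
    workers_per_gpu=10,  # dataloader worker processes per GPU
)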
-
@FabienMerceron thanks a lot for the tip, it's very helpful!
-
Did you use --gpus=8 to train on multiple GPUs? I can't seem to get training to use multiple GPUs.
-
@JakeRobertBaker I have trained on a single A100 GPU.
-
@fcakyon you might be confusing us: I haven't tried filtering invalid annotations, @JakeRobertBaker did it.
-
You are right @FabienMerceron, I corrected my comment. Sorry for the confusion.
-
I am trying to run training to reproduce your results. Yes, I know I can download your model, but I would like to see if I can recreate the training.
Upon running
python mmdet_tools/train.py mmdet_configs/xview_tood/tood_crop_300_500_cls_60.py
Training is fine for a few epochs, then I receive the error:
RuntimeError: CUDA error: device-side assert triggered
I have attached the full error message as a txt file:
eror.txt
I have only changed the config file to use the filepath for the xView data on my machine.