how to train the model #16

Open
Sally551 opened this issue Jan 16, 2024 · 7 comments

Comments

@Sally551

Hi, I am training the model on the KITTI dataset only, and I ran into a problem. Training fails because ./ssd/kitti_scene/training/calib/000044.txt does not exist. Is there a calib folder for KITTI training, or is calib_cam_to_cam the calibration file? Here is a list of the files I can fetch in the training section. Is a calib file provided but missing from my list? If so, could you please share a link to it?
Please help.

@gengshan-y
Owner

Here you go calib.zip

@Sally551
Author

./ssd//kitti_scene/training/disp_occ_0_ganet/000044_10.png
What about this file? Where do the contents of disp_occ_0_ganet/ come from?
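
Both missing-file errors above only surface once training reaches the affected frame, so a pre-flight check can save time. The following is a minimal sketch, not part of this repository: the folder names and filename patterns are taken from the two paths quoted in this thread, while the helper name `find_missing` and the default of 200 frames (the size of the KITTI 2015 training split) are assumptions to adjust for your setup.

```python
from pathlib import Path

# Sub-folder -> filename pattern, copied from the two paths quoted in this thread.
EXPECTED = {
    "calib": "{idx:06d}.txt",
    "disp_occ_0_ganet": "{idx:06d}_10.png",
}


def find_missing(root, num_frames=200, expected=EXPECTED):
    """Return the expected per-frame files that do not exist under root.

    num_frames=200 is an assumption (KITTI 2015 training split size).
    """
    root = Path(root)
    missing = []
    for folder, pattern in expected.items():
        for idx in range(num_frames):
            path = root / folder / pattern.format(idx=idx)
            if not path.is_file():
                missing.append(path)
    return missing


if __name__ == "__main__":
    # Dataset root as used in this thread; change it to match your machine.
    missing = find_missing("./ssd/kitti_scene/training")
    print(f"{len(missing)} expected files are missing")
    for path in missing[:10]:
        print("  ", path)
```

Running it before training lists every absent file at once instead of one crash per missing frame.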

@Sally551
Author

image

@gengshan-y
Owner

@Sally551
Author

RuntimeError: All input tensors must be on the same device. Received cpu and cuda:0
How can I deal with this problem? I am currently training on a single GPU. Also, when I train with 4 GPUs, distributed data-parallel training gets stuck. How can I cope with that?

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729062494/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
(This is the error I get when running with 4 GPUs.)

@gengshan-y
Owner

Typically you can solve the first one by moving the tensor that is on the CPU to the GPU with .cuda().
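
A minimal sketch of that fix, assuming the error comes from combining a CPU tensor with a GPU tensor (for example inside `torch.cat`). The helper name `to_common_device` is hypothetical and not part of this repository. It uses `.to(device)` rather than `.cuda()` so the same code also runs on a machine without a GPU.

```python
import torch


def to_common_device(tensors, device=None):
    """Move every tensor to one device (GPU if available, else CPU)."""
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return [t.to(device) for t in tensors]


# Two tensors that may start out on different devices.
a = torch.zeros(2, 3)
b = torch.ones(2, 3)
if torch.cuda.is_available():
    b = b.cuda()  # now a is on cpu and b on cuda:0, as in the error message

# Align devices before any operation that combines the tensors.
a, b = to_common_device([a, b])
stacked = torch.cat([a, b], dim=0)
print(stacked.device, tuple(stacked.shape))
```

Printing `tensor.device` for each input just before the failing call is usually the quickest way to find which tensor was left on the CPU.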

I haven't seen the second issue before, but a Google search turns up some solutions like this

@Sally551
Author

Thanks. I have already run some iterations (niters), but not all of them because of a time limit. I used one of the checkpoints saved under the predefined logname (finetune_49999.pth, for example). I just want to evaluate on kitti-sceneflow (stereo, Tab. 6) and also generate results for the kitti-sceneflow benchmark (stereo setup, Tab. 3), but I failed to do so.
image
It reports a size mismatch, but the same command works with the pretrained weights. I don't know where the problem is. Here is my command:
CUDA_VISIBLE_DEVICES=1 python submission.py --dataset 2015test --datapath ./ssd/kitti_scene/testing/ --outdir ./weights/test1/ --loadmodel ./weights/test1/finetune_49999.pth --disp_path input/disp/kittisf-test-ganet-disp/ --fac 2 --maxdisp 512 --refine --sensor stereo

Here is my checkpoint file
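
One way to narrow down a size mismatch like the one described above is to compare parameter shapes between the fine-tuned checkpoint and a checkpoint that loads correctly. The sketch below is not from this repository: the function names and the example parameter names are hypothetical, and it works on plain `{name: shape}` dictionaries. With PyTorch these can typically be built as `{k: tuple(v.shape) for k, v in state_dict.items()}` after `torch.load(path, map_location="cpu")`; whether the weights are nested under an extra key depends on how the checkpoint was saved.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for shared names whose shapes differ."""
    report = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            report[name] = (tuple(ckpt_shape), tuple(model_shape))
    return report


def key_differences(checkpoint_shapes, model_shapes):
    """Return (names only in the checkpoint, names only in the model), both sorted."""
    only_ckpt = sorted(set(checkpoint_shapes) - set(model_shapes))
    only_model = sorted(set(model_shapes) - set(checkpoint_shapes))
    return only_ckpt, only_model


if __name__ == "__main__":
    # Hypothetical parameter names and shapes, for illustration only.
    finetuned = {"conv1.weight": (64, 3, 3, 3), "head.weight": (256, 64)}
    reference = {"conv1.weight": (64, 3, 3, 3), "head.weight": (512, 64)}
    for name, (got, want) in shape_mismatches(finetuned, reference).items():
        print(f"{name}: checkpoint {got} vs reference {want}")
```

The names that show up in the report indicate which layers were built with a different configuration at training time than at submission time.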
