Error while trying to run finetune on real data #26
Comments
Can you please share the full error trace along with the actual command you are running? Are you able to train fine on synthetic data, and do you only get this error when finetuning?
This is the actual command I tried to run:

./runner.sh net_train.py @configs/net_config_real_resume.txt --checkpoint configs/epoch=48.ckpt

I copied epoch=48.ckpt into the configs directory, but I also tried it from its original location (from synthetic training) and got the same error. I didn't try finetuning on synthetic data, but that shouldn't matter, since the problem here seems to be reading and parsing the checkpoint. Full output:

Samples per epoch 17272
Steps per epoch 539
Target steps: 240000
Actual steps: 240394
Epochs: 446
Using model class from: /home/jovyan/CenterSnap/simnet/lib/net/models/panoptic_net.py
Restoring from checkpoint: configs/epoch=48.ckpt
Validation sanity check: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'zstd.ZstdError'>: import of module 'zstd' failed

(the same traceback is printed three more times)
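A minimal standalone check along these lines (just a sketch; none of this code is from the repo, and the module name is taken from the error message above) would show whether zstd can be imported in the main process and in a freshly spawned child process. The traceback shows the multiprocessing queue feeder failing to pickle an object that references zstd.ZstdError because it cannot import the zstd module, so a broken zstd install in one of the processes is one possible cause.

# Sketch: check whether the module named in the error, zstd, imports cleanly
# both in the main process and in a spawned child process.
import multiprocessing as mp


def check_import(queue):
    try:
        import zstd
        queue.put("child: zstd imported from " + getattr(zstd, "__file__", "<builtin>"))
    except Exception as exc:
        queue.put(f"child: import failed: {exc!r}")


if __name__ == "__main__":
    try:
        import zstd
        print("main:  zstd imported from", getattr(zstd, "__file__", "<builtin>"))
    except Exception as exc:
        print("main:  import failed:", repr(exc))

    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=check_import, args=(queue,))
    proc.start()
    print(queue.get())
    proc.join()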
Thanks. Can you share the config file as well, and how are you setting the paths in the config file? Did you also generate the real data in the format required by our repo, as described in the README?
I don't think this error has anything to do with the dataset or the config file, since it happens after the config file is parsed, while the checkpoint is being loaded. Here is the config file for the finetune run anyway; I didn't change anything in it. The real data is also in the format from the README; in fact, I'm using the preprocessed Real data you provided. Do I need to do anything else to make it ready for finetuning?

--max_steps=240000
Thanks for answering my questions. Did you mean that you downloaded the data from the link we provided and untarred the real data file into the directory your config points to? If your file paths are correct, can you write two small standalone Python scripts to reproduce the error? Unfortunately, I don't know of any cause for this error other than wrong file paths, so if you have double-checked those, I would:
1. Try to load the checkpoint state dict the way we do in our inference notebooks and see whether it loads fine.
2. Try to load a small batch of data, maybe just one pickle.zstd file, and see whether you can load and inspect it fine.
BTW, how did you get the epoch=48 checkpoint? Is it the one that performed best on the synthetic validation set after training on synthetic data from scratch?
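A rough standalone sketch of those two checks might look like the following. The sample path is a placeholder, and the zstandard calls are only an assumption about how the .pickle.zstd files were written, so adapt them to whatever the data loader actually uses.

# Sketch of the two checks above; paths are placeholders.
import pickle

import torch
import zstandard

# Check 1: can the checkpoint be read and its state dict extracted?
ckpt = torch.load("configs/epoch=48.ckpt", map_location="cpu")
state_dict = ckpt["state_dict"]
print(f"checkpoint loaded, {len(state_dict)} tensors in state_dict")

# Check 2: can a single compressed sample be decompressed and unpickled?
sample_path = "/path/to/Real/train/<some_sample>.pickle.zstd"  # placeholder
with open(sample_path, "rb") as f:
    compressed = f.read()
decompressed = zstandard.ZstdDecompressor().decompressobj().decompress(compressed)
sample = pickle.loads(decompressed)
print("sample loaded:", type(sample))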
I ran the training for 50 epochs from scratch using the provided synthetic dataset. Is it possible that there is not enough memory for parsing the checkpoint when calling torch.load? The problem is that I'm running everything on Kubeflow, and there are other considerations and parts of the code that prevent me from figuring out what the issue is here.
I don't think GPU memory is the issue when loading the checkpoint (since you have already trained on synthetic data), and I haven't seen this error on my end. My guess with the zstd file was that the data is not being loaded properly, but it looks like your paths and everything else data-wise are correct. Unfortunately, I don't have any other insights beyond the two recommendations above: (1) try to load the checkpoint state dict the way we do in our inference notebooks, and (2) try to load a single pickle.zstd data file and inspect it.
I'm trying to run the finetuning code:
./runner.sh net_train.py @configs/net_config_real_resume.txt --checkpoint /path/to/best/checkpoint
But I get this error: Can't pickle <class 'zstd.ZstdError'>
This happens when the code tries to load the checkpoint in the common.py script: torch.load(hparams.checkpoint, map_location='cpu')['state_dict']
zstandard and all the other dependencies are correctly installed on my machine, and the checkpoint path is correct. It seems there is an issue with parsing the checkpoint, but I don't know what it is. Any suggestion or help would be appreciated.
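One quick sanity check on the file itself would be something like the sketch below, assuming the checkpoint was written by a recent PyTorch version (which stores checkpoints in a zip-based container); the path is a placeholder.

# Sketch: verify the checkpoint file has a plausible size and looks like the
# zip container recent PyTorch versions write. Path is a placeholder.
import os
import zipfile

checkpoint_path = "/path/to/best/checkpoint"  # placeholder
print("size (MB):", round(os.path.getsize(checkpoint_path) / 1e6, 1))
print("zip container:", zipfile.is_zipfile(checkpoint_path))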