
Training on Flickr Dataset Unexpectedly Hangs #6

Open
mhamzaerol opened this issue Jan 9, 2024 · 4 comments

Comments

@mhamzaerol

Hello,

First of all, thank you very much for this work and your efforts! The repository and guidelines are succinct and pretty effective!

I've encountered a recurring issue while training the large parallel model on the Flickr dataset. The training process unexpectedly hangs: no updates appear in the terminal or in the wandb logs. This occurred at approximately 2.7k steps during the first run and around 32k steps in the second. My Conda environment uses Python 3.10, and I am running the experiments on 4 A5000 GPUs.

As a workaround, whenever the training process halts, I currently resume from the latest checkpoint using the resume flag in the training script.
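
For reference, the resume step amounts to something like the sketch below. This is only an illustration: I am assuming a plain PyTorch-style checkpoint, and the key names (`state_dict`, `optimizer`, `global_step`), the model/optimizer stand-ins, and the checkpoint path are hypothetical placeholders rather than this repo's actual format.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the real model and optimizer used in the script.
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Load the most recent checkpoint written before the hang, restore state,
# and continue the training loop from the recorded step.
ckpt = torch.load("checkpoints/last.ckpt", map_location="cpu")
model.load_state_dict(ckpt["state_dict"])
optimizer.load_state_dict(ckpt["optimizer"])
start_step = ckpt.get("global_step", 0)
print(f"Resuming from step {start_step}")
```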

I am curious if this is a known issue. Are there components in the code that might cause such behavior, particularly with my setup? Additionally, is resuming training a recommended approach, or are there other flags/settings I should consider?

Any insights or suggestions you can provide would be greatly appreciated.

Thank you!

@atosystem
Owner

Hi @mhamzaerol!
Can you share a screenshot of what you see when the training process unexpectedly hangs?
Thanks

@mhamzaerol
Author

Hi,

Thank you very much for the quick response! I am attaching screenshots of:

  • Wandb logs
  • Terminal outputs

when an instance of a run hangs unexpectedly.

[Screenshots attached: wandb logs; terminal output]

Thank you!

@atosystem
Owner

I have not encountered this kind of problem before. My guess is that it has something to do with multi-GPU training. Does the problem also occur when you use a single GPU? (I believe Parallel Base can fit on a single GPU with a small batch size.)
If single-GPU training works, I suggest looking into the validation functions in the PyTorch training loop.
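
If the hang shows up again on multiple GPUs, a small diagnostic like the sketch below can show where each rank is stuck. This is only a sketch: it assumes you can add a few lines near the top of the training entry point, and the environment variables (`NCCL_DEBUG`, `TORCH_DISTRIBUTED_DEBUG`) are standard PyTorch/NCCL settings rather than anything specific to this repo.

```python
import faulthandler
import os
import sys

# Turn on verbose distributed / NCCL logging before any process group is created.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

# Periodically dump every thread's Python stack to stderr; if the run hangs,
# the repeated dumps reveal which call each rank is blocked in (for example,
# a collective op inside the validation loop).
faulthandler.dump_traceback_later(timeout=1800, repeat=True, file=sys.stderr)
```

Running on a single GPU without touching the code should also be possible by restricting device visibility (setting `CUDA_VISIBLE_DEVICES` to a single index) when launching the script.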

@mhamzaerol
Author

Thank you very much for the feedback!
I have actually never attempted training in a single-GPU setting. I will update you if I encounter a similar issue.
