Training on Flickr Dataset Unexpectedly Hangs #6
Comments
Hi @mhamzaerol!
I have not encountered this kind of problem before. I'm guessing it has something to do with multi-GPU training. Does the problem also occur if you use a single GPU? (I believe Parallel Base can fit on a single GPU with a small batch size.)
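For a quick single-GPU check, restricting the visible devices before anything touches CUDA should be enough. A minimal sketch (this is just the standard CUDA_VISIBLE_DEVICES mechanism, not anything specific to this repo's CLI):

```python
import os

# Restrict this process to one GPU; must be set before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # should report 1; launch training as usual from here
```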
Thank you very much for the feedback!
Hello,
First of all, thank you very much for this work and your efforts! The repository and guidelines are succinct and pretty effective!
I've encountered a recurring issue while training the large parallel model on the Flickr dataset. The training process unexpectedly hangs: no further updates appear in the terminal or in the wandb logs. This occurred at approximately 2.7k steps during the first run and around 32k steps in the second. My Conda environment uses Python 3.10, and I was running the experiments on 4 A5000 GPUs.
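For what it's worth, when debugging this I can enable more verbose distributed logging. A minimal sketch, assuming the training code initializes PyTorch's torch.distributed with the NCCL backend and is launched via torchrun (the environment variables are standard PyTorch/NCCL debugging knobs, not repo-specific):

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Standard NCCL/PyTorch debugging knobs; must be set before init_process_group.
os.environ["NCCL_DEBUG"] = "INFO"                     # verbose per-rank NCCL logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"      # report mismatched collectives
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"   # surface stuck collectives as errors

# A shorter timeout makes a hang fail loudly instead of blocking forever.
# Assumes rank/world-size env vars are provided by the launcher (e.g. torchrun).
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```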
Currently, whenever the training process halts, I work around it by resuming from the latest checkpoint via the resume flag in the training script.
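To avoid babysitting the runs, I've also been considering a small supervisor script that restarts training whenever the process exits or exceeds a wall-clock budget. A rough sketch; the command line and resume flag below are placeholders for whatever the repo's training script actually accepts:

```python
import subprocess
import sys

# Placeholder command: substitute the repo's real training script and flags.
CMD = ["python", "train.py", "--dataset", "flickr", "--resume"]

MAX_RESTARTS = 20

for attempt in range(MAX_RESTARTS):
    print(f"[supervisor] starting training (attempt {attempt + 1})", flush=True)
    try:
        # A wall-clock timeout is a crude stand-in for hang detection: if the
        # run doesn't exit within `timeout` seconds, kill it and resume from
        # the latest checkpoint on the next iteration.
        subprocess.run(CMD, check=True, timeout=6 * 3600)
        break  # clean exit: training finished
    except subprocess.TimeoutExpired:
        print("[supervisor] run timed out (possible hang); restarting", flush=True)
    except subprocess.CalledProcessError as err:
        print(f"[supervisor] run crashed with code {err.returncode}; restarting", flush=True)
else:
    sys.exit("[supervisor] giving up after too many restarts")
```

Note the trade-off: the timeout also kills healthy long runs, which is only tolerable because resuming from the latest checkpoint is cheap.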
I am curious if this is a known issue. Are there components in the code that might cause such behavior, particularly with my setup? Additionally, is resuming training a recommended approach, or are there other flags/settings I should consider?
Any insights or suggestions you can provide would be greatly appreciated.
Thank you!