
Training on Flickr Dataset Unexpectedly Hangs #6

Open
mhamzaerol opened this issue Jan 9, 2024 · 4 comments

Comments

@mhamzaerol

Hello,

First of all, thank you very much for this work and your efforts! The repository and guidelines are succinct and pretty effective!

I've encountered a recurring issue while training the large parallel model on the Flickr dataset. The training process unexpectedly hangs: no updates appear in the terminal or in the wandb logs. This occurred at approximately 2.7k steps during the first run and around 32k steps in the second. My Conda environment uses Python 3.10, and I am running the experiments on 4 A5000 GPUs.

As a workaround, whenever the training process halts, I currently resume from the latest checkpoint using the resume flag in the training script.
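
For reference, the resume step amounts to something like the sketch below. This is only an illustration: I am assuming a plain PyTorch-style checkpoint, and the key names (`state_dict`, `optimizer`, `global_step`), the model/optimizer stand-ins, and the checkpoint path are hypothetical placeholders rather than this repo's actual format.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the real model and optimizer used in the script.
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Load the most recent checkpoint written before the hang, restore state,
# and continue the training loop from the recorded step.
ckpt = torch.load("checkpoints/last.ckpt", map_location="cpu")
model.load_state_dict(ckpt["state_dict"])
optimizer.load_state_dict(ckpt["optimizer"])
start_step = ckpt.get("global_step", 0)
print(f"Resuming from step {start_step}")
```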

I am curious if this is a known issue. Are there components in the code that might cause such behavior, particularly with my setup? Additionally, is resuming training a recommended approach, or are there other flags/settings I should consider?

Any insights or suggestions you can provide would be greatly appreciated.

Thank you!

@atosystem
Owner

Hi @mhamzaerol!
Can you share a screenshot of what you see when the training process unexpectedly hangs?
Thanks

@mhamzaerol
Author

Hi,

Thank you very much for the quick response! I am attaching screenshots of:

  • Wandb logs
  • Terminal outputs

when an instance of a run hangs unexpectedly.

[Screenshots attached: wandb logs; terminal output]

Thank you!

@atosystem
Owner

I have not encountered this kind of problem before. My guess is that it has something to do with multi-GPU training. Does the problem also occur when you use a single GPU? (I believe Parallel Base can fit on a single GPU with a small batch size.)
If single-GPU training works, I suggest looking into the validation functions in the PyTorch training loop.
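
If the hang shows up again on multiple GPUs, a small diagnostic like the sketch below can show where each rank is stuck. This is only a sketch: it assumes you can add a few lines near the top of the training entry point, and the environment variables (`NCCL_DEBUG`, `TORCH_DISTRIBUTED_DEBUG`) are standard PyTorch/NCCL settings rather than anything specific to this repo.

```python
import faulthandler
import os
import sys

# Turn on verbose distributed / NCCL logging before any process group is created.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

# Periodically dump every thread's Python stack to stderr; if the run hangs,
# the repeated dumps reveal which call each rank is blocked in (for example,
# a collective op inside the validation loop).
faulthandler.dump_traceback_later(timeout=1800, repeat=True, file=sys.stderr)
```

Running on a single GPU without touching the code should also be possible by restricting device visibility (setting `CUDA_VISIBLE_DEVICES` to a single index) when launching the script.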

@mhamzaerol
Author

Thank you very much for the feedback!
I have actually never attempted training in a single-GPU setting. I will update you if I encounter a similar issue.
