OOM after automatic batch size finder #19811
Unanswered
JonathanDZiegler asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi and thanks in advance for reading!
I am running into a situation where a training run will occasionally hit an out-of-memory (OOM) error on the first training step after the automatic batch size finder has completed. The callback takes a fixed effective batch size and scales the per-GPU batch size and the number of gradient accumulation steps to match it. I have even reduced the batch size by a factor of 2 after the callback has run, to be on the safe side (or so I assumed). Does anyone have experience with this kind of behavior? As far as I understand, the batch size finder should handle all the garbage collection itself, and I am using the callback in a fairly vanilla way, running on an A100. Thanks for any pointers!
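The scaling described above can be sketched roughly as follows. This is an illustrative simplification, not the actual callback: the function name `split_effective_batch`, its parameters, and the choice to pick the largest per-GPU batch size that evenly divides the effective batch size are all assumptions made for the example.

```python
def split_effective_batch(effective_batch_size, max_gpu_batch_size, safety_factor=2):
    """Split a fixed effective batch size into (per-GPU batch size, accumulation steps).

    Hypothetical helper: `max_gpu_batch_size` stands for whatever the batch
    size finder reported, and `safety_factor` for the extra back-off applied
    afterwards (a factor of 2 in the setup described above).
    """
    if effective_batch_size < 1 or max_gpu_batch_size < 1 or safety_factor < 1:
        raise ValueError("all arguments must be >= 1")

    # Back off from the found maximum, but never go below 1 or above the
    # effective batch size itself.
    cap = max(1, min(max_gpu_batch_size // safety_factor, effective_batch_size))

    # Pick the largest batch size <= cap that divides the effective batch
    # size, so that gpu_batch_size * accumulation_steps matches it exactly.
    gpu_batch_size = next(
        b for b in range(cap, 0, -1) if effective_batch_size % b == 0
    )
    accumulation_steps = effective_batch_size // gpu_batch_size
    return gpu_batch_size, accumulation_steps


if __name__ == "__main__":
    # Example: the finder reports 96 samples per GPU, the target is 256.
    print(split_effective_batch(256, 96))
```

With a split like this, the product of the two returned values always equals the effective batch size, so the optimizer sees the same number of samples per update regardless of what the finder reports.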