OOM after automatic batch size finder #19811

JonathanDZiegler · 2024-04-25T12:06:17Z

Hi and thanks in advance for reading!

I am running into a situation where, on occasion, a training run will OOM on the first training step after the automatic batch size finder has completed. The callback takes a fixed effective batch size and scales the GPU batch size and the number of gradient accumulation steps to match it. To be on the safe side (or so I assumed), I even halve the batch size after the callback has run. Has anyone seen this kind of behavior? The batch size finder should handle all the garbage collection, and I am using the callback in a fairly vanilla way, running on an A100. Thanks for any pointers!

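import logging

import numpy as np
import lightning.pytorch as pl  # or `import pytorch_lightning as pl`, depending on the installed package
from lightning.pytorch.callbacks import BatchSizeFinder


class MinGradAccumulationStepsFinder(BatchSizeFinder):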
    """
    A class for finding the largest possible GPU batch size that can be used to train a model with a given
    effective batch size (achieved by gradient accumulation).
    """

    def __init__(self, effective_batch_size: int, init_val: int = 1, **kwargs):
        """
        Args:
            effective_batch_size: The effective batch size, which is defined by the GPU batch size and the
                number of gradient accumulation steps
            init_val: The initial value for the batch size finder
        """
        # pop (rather than get) so that `mode` is not passed twice to super().__init__ below
        if kwargs.pop("mode", "power") != "power":
            raise ValueError("MinGradAccumulationStepsFinder only supports power mode")

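        # n & (n - 1) is zero exactly when n is a power of 2, so reject the non-zero case
        if effective_batch_size < 1 or (effective_batch_size & (effective_batch_size - 1)) != 0:
            raise ValueError(f"Effective batch size must be a power of 2, got {effective_batch_size}")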

        # limit batch size to the maximum allowed value
        max_trials = int(np.log2(effective_batch_size) - np.log2(init_val) + 1)

        super().__init__(init_val=init_val, max_trials=max_trials, mode="power", **kwargs)
        self.effective_batch_size = effective_batch_size

    def on_fit_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.scale_batch_size(trainer, pl_module)

        # step back one scaling of batch size to prevent OOMing (never below 1, which would
        # cause a division by zero in the next statement)
        self.optimal_batch_size = max(1, self.optimal_batch_size // 2)

        trainer.accumulate_grad_batches = self.effective_batch_size // self.optimal_batch_size
        trainer.val_check_interval *= trainer.accumulate_grad_batches
        logging.debug(
            f"Setting accumulate_grad_batches to {trainer.accumulate_grad_batches}. "
            f"This matches the effective batch size of {self.effective_batch_size} "
            f"with the GPU batch size of {self.optimal_batch_size}"
        )
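
To make the intended arithmetic easier to check in isolation, here is a minimal sketch in plain Python. It is not part of the callback above: `split_effective_batch_size` is a hypothetical helper name, and the sketch assumes the finder reports a power-of-two batch size (as doubling from `init_val=1` would produce). It only mirrors the halving and the accumulation-step calculation and does not touch the trainer or the dataloader.

```python
from typing import Tuple


def split_effective_batch_size(effective_batch_size: int, found_batch_size: int) -> Tuple[int, int]:
    """Return (gpu_batch_size, accumulate_grad_batches).

    Hypothetical helper: assumes `found_batch_size` is a power of two, so that
    gpu_batch_size * accumulate_grad_batches == effective_batch_size.
    """
    # n & (n - 1) == 0 holds exactly for powers of two (for n >= 1).
    if effective_batch_size < 1 or effective_batch_size & (effective_batch_size - 1) != 0:
        raise ValueError(f"Effective batch size must be a power of 2, got {effective_batch_size}")
    if found_batch_size < 1:
        raise ValueError(f"Found batch size must be positive, got {found_batch_size}")

    # Step back one doubling as a safety margin, but never below 1.
    gpu_batch_size = max(1, found_batch_size // 2)
    # The per-step batch never needs to exceed the effective batch size.
    gpu_batch_size = min(gpu_batch_size, effective_batch_size)

    accumulate_grad_batches = effective_batch_size // gpu_batch_size
    return gpu_batch_size, accumulate_grad_batches


if __name__ == "__main__":
    gpu_bs, accum = split_effective_batch_size(effective_batch_size=64, found_batch_size=16)
    print(f"GPU batch size {gpu_bs}, accumulate_grad_batches {accum}")
```

With an effective batch size of 64 and a found batch size of 16, this gives a GPU batch size of 8 and 8 accumulation steps.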

Replies: 0 comments
