
Save a checkpoint when Stop Training is pressed, regardless of the Save Every N Epochs setting & training optimization #606

@MrGTAmodsgerman

Description

Is your feature request related to a problem? Please describe.
I have an old GPU (GTX 1080 Ti, 11 GB) plus 32 GB of RAM, and I tested how it performs when training a LoRA. It estimated 167 h 50 min for the training to finish. I didn't expect it to be this slow, so at some point I stopped the training before it had reached the first checkpoint save. As a result, I can't resume from what was already trained; that hour is lost.

Describe the solution you'd like
When I stop training, it should save a checkpoint. You never know when or why you will want to stop a training run, so it would be very helpful not to have to rely on guessing the right value for Save Every N Epochs.
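As a rough idea of what I mean, here is a minimal sketch of how this could look in a PyTorch-style training loop; the function and variable names (`train`, `ckpt_path`, the forward/loss call) are hypothetical placeholders, not the project's actual code:

```python
import torch

def train(model, optimizer, dataloader, epochs, ckpt_path="interrupt.ckpt"):
    # Sketch: always persist a checkpoint when training stops,
    # whether it finishes normally or is interrupted by the user.
    epoch = 0
    try:
        for epoch in range(epochs):
            for batch in dataloader:
                loss = model(batch)          # placeholder forward + loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    except KeyboardInterrupt:
        # "Stop Training" (here simulated as Ctrl+C) lands in this branch;
        # the partial progress below is still saved.
        pass
    finally:
        torch.save(
            {
                "epoch": epoch,
                "model_state_dict": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
            },
            ckpt_path,
        )
```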

Describe alternatives you've considered
Buying a new GPU

Additional context
I set 600 epochs for 70 instrumental song files.
I trained a Flux LoRA before, which took a night, so I think there is room for optimization.
