Checkpointing? #39

Open
ericphanson opened this issue Sep 9, 2021 · 1 comment

@ericphanson
Member

I think it would make sense for Lighthouse to support per-epoch checkpointing and automatic resumption from checkpoints. I imagine the interface could work like this (a rough sketch follows the list below):

MyClassifier <: AbstractClassifier may optionally provide methods for:

  • load_checkpoint(::Type{MyClassifier}, uri) -> MyClassifier
  • save_checkpoint(uri, ::MyClassifier) -> Nothing
  • It can also optionally specialize has_checkpoint(uri, ::AbstractClassifier) = isfile(uri) and checkpoint_extension(::Type{<:AbstractClassifier}) = ".checkpoint" (shown here with their proposed default definitions).
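
Here is a rough sketch of what those opt-in methods might look like for a hypothetical MyClassifier, assuming plain Serialization as the on-disk format (the struct fields and the serialization choice are placeholders, not part of the proposal):

```julia
# Hypothetical sketch only; `MyClassifier` and its serialization format are placeholders.
using Serialization
using Lighthouse: AbstractClassifier

struct MyClassifier <: AbstractClassifier
    model::Any
end

# Proposed fallbacks Lighthouse itself could define:
has_checkpoint(uri, ::AbstractClassifier) = isfile(uri)
checkpoint_extension(::Type{<:AbstractClassifier}) = ".checkpoint"

# Opt-in methods a classifier author would provide:
save_checkpoint(uri, classifier::MyClassifier) = (serialize(uri, classifier); nothing)
load_checkpoint(::Type{MyClassifier}, uri) = deserialize(uri)::MyClassifier
```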

Then learn! optionally takes a checkpoint_dir URI as a keyword argument (sketched after the list below).

  • If no URI is passed, no checkpointing is done.
  • If one is passed, the classifier must support the load and save methods.
  • At the start of each epoch, uri = joinpath(checkpoint_dir, string(epoch), "classifier_checkpoint$(ext)") (with ext = checkpoint_extension(MyClassifier)) is checked for an existing checkpoint via has_checkpoint.
  • If a checkpoint exists, we load it and immediately proceed to the next epoch (without calling callbacks; presumably those already ran the first time).
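
A simplified, hypothetical skeleton of how the epoch loop inside learn! might use that keyword (the real learn! takes more arguments; checkpoint_dir, has_checkpoint, save_checkpoint, and load_checkpoint are the proposed names from above, not existing API):

```julia
# Simplified, hypothetical skeleton; not the actual `learn!` signature.
function learn!(classifier::AbstractClassifier; epoch_limit=100, checkpoint_dir=nothing)
    ext = checkpoint_extension(typeof(classifier))
    for epoch in 1:epoch_limit
        if checkpoint_dir !== nothing
            uri = joinpath(checkpoint_dir, string(epoch), "classifier_checkpoint$(ext)")
            if has_checkpoint(uri, classifier)
                # Resume: load the checkpoint and skip straight to the next epoch,
                # without re-running callbacks (presumably they already ran).
                classifier = load_checkpoint(typeof(classifier), uri)
                continue
            end
        end
        # ... usual per-epoch training, evaluation, and callbacks ...
        if checkpoint_dir !== nothing
            dir = joinpath(checkpoint_dir, string(epoch))
            mkpath(dir)
            save_checkpoint(joinpath(dir, "classifier_checkpoint$(ext)"), classifier)
        end
    end
    return classifier
end
```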

Not sure what to do about logs, though: should we also load logs from a checkpoint? I'd like any model-managed state to be part of the AbstractClassifier and have its checkpointing handle that state, but I'm not sure about Lighthouse-managed state, and especially what guarantees callbacks should have about the availability of that state if we resumed from a checkpoint.

@ericphanson
Member Author

Why should this be in Lighthouse? After all, you can do whatever you want in post_epoch_callback or similar, and maybe you don't want to checkpoint every epoch, so you could choose to do it there, perhaps with an upon callback.

The reason I think it needs some Lighthouse interop is not the saving, but the resuming. One way to resume is: before you call learn!, choose a checkpoint, load it into your model, and then call learn! on that. But then the epoch numbers will be off, which would make the logs confusing (e.g. "is epoch 5 in TensorBoard really epoch 11, since it resumed from epoch 6?") and makes further checkpointing harder.

But one way around that would be: instead of hardcoding 1:epoch_limit as the for loop inside learn!, we could have a keyword argument epochs with a default value of 1:epoch_limit (or replace epoch_limit with something like this), and then the caller could number the epochs however they want (see the sketch below).
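
For concreteness, a hypothetical sketch of that keyword (names and defaults are illustrative):

```julia
# Hypothetical: replace the hardcoded 1:epoch_limit loop with an `epochs` keyword.
function learn!(classifier; epochs=1:100, kwargs...)
    for epoch in epochs
        # ... train, evaluate, log, and run callbacks for this `epoch` number ...
    end
    return classifier
end

# A caller resuming from a checkpoint saved after epoch 6 could then keep the numbering:
# learn!(restored_classifier; epochs=7:20)
```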

I think that might be enough to let users do their own checkpointing in the callbacks.

However, I think it could still be useful for Lighthouse to support checkpointing directly, because it could be an easier on-ramp to a fully set-up model, rather than requiring a bunch of ad-hoc code in the callbacks that you have to remember to write every time.
