Checkpointing? #39

Open
ericphanson opened this issue Sep 9, 2021 · 1 comment

@ericphanson
Member

I think it would make sense for Lighthouse to support per-epoch checkpointing and automatic resumption from checkpoints. I imagine the interface could work like this (a rough sketch follows the list below):

MyClassifier <: AbstractClassifier may optionally provide methods for:

  • load_checkpoint(::Type{MyClassifier}, uri) -> MyClassifier
  • save_checkpoint(uri, ::MyClassifier) -> Nothing
  • It can also optionally specialize has_checkpoint(uri, ::AbstractClassifier) = isfile(uri) and checkpoint_extension(::Type{<:AbstractClassifier}) = ".checkpoint" (shown here with their proposed default definitions).
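
Here is a rough sketch of what those opt-in methods might look like for a hypothetical MyClassifier, assuming plain Serialization as the on-disk format (the struct fields and the serialization choice are placeholders, not part of the proposal):

```julia
# Hypothetical sketch only; `MyClassifier` and its serialization format are placeholders.
using Serialization
using Lighthouse: AbstractClassifier

struct MyClassifier <: AbstractClassifier
    model::Any
end

# Proposed fallbacks Lighthouse itself could define:
has_checkpoint(uri, ::AbstractClassifier) = isfile(uri)
checkpoint_extension(::Type{<:AbstractClassifier}) = ".checkpoint"

# Opt-in methods a classifier author would provide:
save_checkpoint(uri, classifier::MyClassifier) = (serialize(uri, classifier); nothing)
load_checkpoint(::Type{MyClassifier}, uri) = deserialize(uri)::MyClassifier
```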

Then learn! optionally takes a checkpoint_dir URI as a keyword argument (sketched after the list below).

  • If no URI is passed, no checkpointing is done.
  • If one is passed, the classifier must support the load and save methods.
  • At the start of each epoch, uri = joinpath(checkpoint_dir, string(epoch), "classifier_checkpoint$(ext)") (with ext = checkpoint_extension(MyClassifier)) is checked for an existing checkpoint via has_checkpoint.
  • If a checkpoint exists, we load it and immediately proceed to the next epoch (without calling callbacks; presumably those already ran the first time).
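
A simplified, hypothetical skeleton of how the epoch loop inside learn! might use that keyword (the real learn! takes more arguments; checkpoint_dir, has_checkpoint, save_checkpoint, and load_checkpoint are the proposed names from above, not existing API):

```julia
# Simplified, hypothetical skeleton; not the actual `learn!` signature.
function learn!(classifier::AbstractClassifier; epoch_limit=100, checkpoint_dir=nothing)
    ext = checkpoint_extension(typeof(classifier))
    for epoch in 1:epoch_limit
        if checkpoint_dir !== nothing
            uri = joinpath(checkpoint_dir, string(epoch), "classifier_checkpoint$(ext)")
            if has_checkpoint(uri, classifier)
                # Resume: load the checkpoint and skip straight to the next epoch,
                # without re-running callbacks (presumably they already ran).
                classifier = load_checkpoint(typeof(classifier), uri)
                continue
            end
        end
        # ... usual per-epoch training, evaluation, and callbacks ...
        if checkpoint_dir !== nothing
            dir = joinpath(checkpoint_dir, string(epoch))
            mkpath(dir)
            save_checkpoint(joinpath(dir, "classifier_checkpoint$(ext)"), classifier)
        end
    end
    return classifier
end
```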

Not sure what to do about logs, though: should we also load logs from a checkpoint? I'd like any model-managed state to be part of the AbstractClassifier and have its checkpointing handle that state, but I'm not sure about Lighthouse-managed state, and especially what guarantees callbacks should have about the availability of that state if we resumed from a checkpoint.

@ericphanson
Member Author

Why should this be in Lighthouse? After all, you can do whatever you want in post_epoch_callback or similar, and maybe you don't want to checkpoint every epoch, so you could choose to do it there, perhaps with an upon callback.

The reason I think it needs some Lighthouse interop is not the saving, but the resuming. One way to resume is: before you call learn!, choose a checkpoint, load it into your model, and then call learn! on that. But then the epoch numbers will be off, which would make the logs confusing (e.g. "is epoch 5 in TensorBoard really epoch 11, since it resumed from epoch 6?") and makes further checkpointing harder.

But one way around that would be: instead of hardcoding 1:epoch_limit as the for loop inside learn!, we could have a keyword argument epochs with a default value of 1:epoch_limit (or replace epoch_limit with something like this), and then the caller could number the epochs however they want (see the sketch below).
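
For concreteness, a hypothetical sketch of that keyword (names and defaults are illustrative):

```julia
# Hypothetical: replace the hardcoded 1:epoch_limit loop with an `epochs` keyword.
function learn!(classifier; epochs=1:100, kwargs...)
    for epoch in epochs
        # ... train, evaluate, log, and run callbacks for this `epoch` number ...
    end
    return classifier
end

# A caller resuming from a checkpoint saved after epoch 6 could then keep the numbering:
# learn!(restored_classifier; epochs=7:20)
```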

I think that might be enough to let users do their own checkpointing in the callbacks.

However, I think it could still be useful for Lighthouse to support checkpointing directly, because it could be an easier on-ramp to a fully set-up model, rather than requiring a bunch of ad-hoc code in the callbacks that you have to remember to write every time.
