ModelCheckpoint Callback not working unless save_on_train_epoch_end is set to True
#20195
Unanswered · snknitin asked this question in code help: RL / MetaLearning
I'm using the pytorch-lightning + hydra template for my custom RL project. The model runs very fast on its own, but if I enable checkpointing it becomes very slow, almost 4x the time. There must be something wrong in my settings or in the way I am doing this; after wracking my brain for a whole day and running 30 trial-and-error experiments with different combinations, I am lost.

The loggers are set up appropriately to show me env_step and episode-level metrics based on the on_step and on_epoch parameters. Everything works perfectly except checkpointing, and early stopping never triggers either.

HELP NEEDED: Figure out how to make this work with just the save_last checkpoint flag, i.e. create the model checkpoint directory and save the last checkpoint of the run, instead of checking every step, delaying training, and then picking the best checkpoint. Since it is RL, I don't expect the monitored reward to show any degradation after convergence.
Context
There is a caveat here. My module only implements training_step and no other hooks except on_train_start. When I start my trainer and set max_epochs=1000, that also means trainer/global_step will go up to 1000 and training_step is called 1000 times; on_train_start is called just once. So each training_step is one epoch: 1000 epochs = 1000 steps = 1000 buffer updates = 10 episodes = 1000 batches sampled for training.
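Roughly, the trainer setup looks like this (a sketch only; everything besides max_epochs is a placeholder to illustrate the point above):

```python
import pytorch_lightning as pl

# Sketch of the loop described above: no validation loop, one training_step per
# "epoch", so current_epoch and global_step advance together up to max_epochs.
trainer = pl.Trainer(
    max_epochs=1000,       # -> 1000 calls to training_step, global_step goes to 1000
    limit_val_batches=0,   # placeholder: no validation loop in this RL setup
    log_every_n_steps=1,   # placeholder: log the step/episode metrics every step
)
trainer.fit(model)         # model = the RL LightningModule sketched further below
```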
Problem
- The checkpoint only gets saved when I set `save_on_train_epoch_end: True`.
- With just `save_last: True` it never triggers.
- If I set `every_n_epochs` and match it with the trainer's `check_val_every_n_epoch`, it still goes just as slow (same time), but gives me the 99th/499th/999th checkpoint depending on whether I choose 100/500/1000, rather than monitoring the metric I chose and giving me the best one, which I think is expected.

Basically, I do not know how to get it to trigger and save the checkpoint without the save_on_train_epoch_end flag, and how to get it to be fast and not check every epoch, which in my case is every time step, because I need to run this for 100000 epochs/steps. If I can at least save a checkpoint every 10000 steps, even without monitoring and getting the best model with the "moving avg reward across 10 episodes", that is fine, because it eventually converges and there's not much difference.
BEST CASE: If I can get it to create a checkpoint directory and just save the last epoch, without having to use the epoch-end flag and slow down the whole training and experimentation.
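Something along these lines is the behavior I'm describing (a sketch only, not my actual config; the dirpath, filename, and 10000-step interval are placeholders):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Sketch: save periodically by global step and keep a last.ckpt, without
# monitoring a metric and without hooking into every epoch end.
step_checkpoint = ModelCheckpoint(
    dirpath="checkpoints/",        # placeholder directory
    filename="step-{step}",
    save_last=True,                # always keep last.ckpt
    every_n_train_steps=10000,     # trigger by global step, not every epoch
    save_on_train_epoch_end=False,
)
```

This would then be passed to the trainer via `callbacks=[step_checkpoint]` (or the equivalent Hydra callbacks config).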
Code Snippets
This is the callback config I use.
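Roughly, it amounts to this (a sketch with placeholder metric names, paths, and intervals; shown as the Python equivalent of the Hydra YAML):

```python
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# Sketch of the current callback setup: checkpoint and early-stop on the
# moving-average reward, with the flags discussed above. All names/values are placeholders.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    filename="epoch-{epoch}",
    monitor="episode/avg_reward_10",   # placeholder for the moving-avg-reward metric
    mode="max",
    save_last=True,
    save_top_k=1,
    every_n_epochs=100,                # placeholder interval
    save_on_train_epoch_end=True,      # the flag that makes it trigger at all
)

early_stopping = EarlyStopping(
    monitor="episode/avg_reward_10",   # same placeholder metric
    mode="max",
    patience=50,                       # placeholder
    check_on_train_epoch_end=True,
)
```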
and my LightningModule is basically along the lines of
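the following skeleton (the environment, policy, and replay buffer names are placeholders, not my actual classes):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class RLModule(pl.LightningModule):
    """Skeleton of the RL module described above; env/policy/buffer are placeholders."""

    def __init__(self, env, policy, buffer, batch_size=256, lr=1e-3):
        super().__init__()
        self.env = env
        self.policy = policy
        self.buffer = buffer
        self.batch_size = batch_size
        self.lr = lr

    def on_train_start(self):
        # Called once at the start of training, e.g. to warm up the replay buffer.
        self.buffer.warm_up(self.env)

    def train_dataloader(self):
        # Dummy single-item loader so that one epoch == one training_step.
        return DataLoader(TensorDataset(torch.zeros(1)), batch_size=1)

    def training_step(self, batch, batch_idx):
        # One training_step per epoch: step the env, update the buffer,
        # sample a training batch, compute the loss (the `batch` arg is the dummy).
        transition = self.env.step(self.policy)
        self.buffer.add(transition)
        sample = self.buffer.sample(self.batch_size)
        loss = self.policy.loss(sample)

        # Step-level and episode-level metrics via on_step / on_epoch.
        self.log("env_step", float(self.global_step), on_step=True, on_epoch=False)
        self.log("episode/avg_reward_10", self.buffer.moving_avg_reward(10),
                 on_step=False, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.policy.parameters(), lr=self.lr)
```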