-
Notifications
You must be signed in to change notification settings - Fork 42
Rename epoch to mini_epoch #1190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks Till for taking this on. I'd appreciate if we (temporarily) could make the code backward compatible, such that old models can still be loaded. @clessig, do you think backward compatibility would be reasonable here? I have left comments at code-parts where we load .json or .ckpts files. Hope I cought all occurances.
If we agree against backward compatibility, I will go forward and test the code. Have only inspected it so far, not tested.
| """ | ||
| run_results = Path(_load_private_conf(None)["path_shared_working_dir"]) / f"results/{run_id}" | ||
| zarr_path = run_results / f"validation_epoch{epoch:05d}_rank{rank:04d}.zarr" | ||
| zarr_path = run_results / f"validation_chkpt{mini_epoch:05d}_rank{rank:04d}.zarr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have this backwards-compatible? Something like (not tested)
zarr_path_old = run_results / f"validation_epoch{mini_epoch:05d}_rank{rank:04d}.zarr"
zarr_path_new = run_results / f"validation_chkpt{mini_epoch:05d}_rank{rank:04d}.zarr"
zarr_path = zarr_path_new if zarr_path_new.exists() else zarr_path_oldThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check if this method is still in use? Ideally the knowledge of the specifics of the path should be encapsulated in get_output_path and anywhere this information is required get_output_path should be used instead. A simple backward compatibility mechanism such as @MatKbauer describes can be then implemented there.
|
|
||
| self.fname_zarr = self.results_dir.joinpath( | ||
| f"validation_epoch{self.epoch:05d}_rank{self.rank:04d}.zarr" | ||
| f"validation_chkpt{self.mini_epoch:05d}_rank{self.rank:04d}.zarr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backward compatibility
| def get_path_output(config: Config, mini_epoch: int) -> Path: | ||
| base_path = get_path_run(config) | ||
| fname = f"validation_epoch{epoch:05d}_rank{config.rank:04d}.zarr" | ||
| fname = f"validation_chkpt{mini_epoch:05d}_rank{config.rank:04d}.zarr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backward compatibility?
| mini_epoch_id = ( | ||
| f"chkpt{mini_epoch:05d}" if mini_epoch != -1 and mini_epoch is not None else "latest" | ||
| ) | ||
| filename = f"{run_id}_{mini_epoch_id}.chkpt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backward compatibility
| mini_epoch_id = ( | ||
| f"chkpt{mini_epoch:05d}" if mini_epoch != -1 and mini_epoch is not None else "latest" | ||
| ) | ||
| filename = f"{run_id}_{mini_epoch_id}.chkpt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backward compatibility
tjhunter
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am inclined (weak yes) to break compatibility with this PR:
- we are changing inputs (config) and outputs (checkpoints)
- we are still at a stage when it is easy to do, later stages will be harder/more painful
For the outputs, rather than the mini-epoch index, could we store the value of the mini-epoch? This way, we are more flexible in the future for having different increments such as every 500 optimizer steps etc. rather than being epoch based. Sorry if this was discussed in the design doc.
I don't think it's difficult to ensure backward compatability if we introduce a function that abstracts the checkpoint file name, and similar if we have a problem elsewhere. For the CI, we can just switch to the new name and properly publize it. |
Description
I renamed epoch to mini_epoch and used chkpt for file names.
Epoch didn't represent what was happening correctly. We never passed over the entire dataset, but just 4096 samples. Therefore we decided to rename it to mini_epoch.
Issue Number
Closes #515
partially addresses issue but doesn't fix it entirely
Checklist before asking for review
./scripts/actions.sh lint./scripts/actions.sh unit-test./scripts/actions.sh integration-testlaunch-slurm.py --time 60