Skip to content

Conversation

@TillHae
Copy link
Contributor

@TillHae TillHae commented Oct 31, 2025

Description

I renamed epoch to mini_epoch and used chkpt for file names.
Epoch didn't represent what was happening correctly. We never passed over the entire dataset, but just 4096 samples. Therefore we decided to rename it to mini_epoch.

Issue Number

Closes #515

partially addresses issue but doesn't fix it entirely

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hegdedoc in the github issue with all the configurations and runs for this experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

Copy link
Contributor

@MatKbauer MatKbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks Till for taking this on. I'd appreciate if we (temporarily) could make the code backward compatible, such that old models can still be loaded. @clessig, do you think backward compatibility would be reasonable here? I have left comments at code-parts where we load .json or .ckpts files. Hope I cought all occurances.

If we agree against backward compatibility, I will go forward and test the code. Have only inspected it so far, not tested.

"""
run_results = Path(_load_private_conf(None)["path_shared_working_dir"]) / f"results/{run_id}"
zarr_path = run_results / f"validation_epoch{epoch:05d}_rank{rank:04d}.zarr"
zarr_path = run_results / f"validation_chkpt{mini_epoch:05d}_rank{rank:04d}.zarr"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this backwards-compatible? Something like (not tested)

zarr_path_old = run_results / f"validation_epoch{mini_epoch:05d}_rank{rank:04d}.zarr"
zarr_path_new = run_results / f"validation_chkpt{mini_epoch:05d}_rank{rank:04d}.zarr"
zarr_path = zarr_path_new if zarr_path_new.exists() else zarr_path_old

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if this method is still in use? Ideally the knowledge of the specifics of the path should be encapsulated in get_output_path and anywhere this information is required get_output_path should be used instead. A simple backward compatibility mechanism such as @MatKbauer describes can be then implemented there.


self.fname_zarr = self.results_dir.joinpath(
f"validation_epoch{self.epoch:05d}_rank{self.rank:04d}.zarr"
f"validation_chkpt{self.mini_epoch:05d}_rank{self.rank:04d}.zarr"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backward compatibility

def get_path_output(config: Config, mini_epoch: int) -> Path:
base_path = get_path_run(config)
fname = f"validation_epoch{epoch:05d}_rank{config.rank:04d}.zarr"
fname = f"validation_chkpt{mini_epoch:05d}_rank{config.rank:04d}.zarr"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backward compatibility?

mini_epoch_id = (
f"chkpt{mini_epoch:05d}" if mini_epoch != -1 and mini_epoch is not None else "latest"
)
filename = f"{run_id}_{mini_epoch_id}.chkpt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backward compatibility

mini_epoch_id = (
f"chkpt{mini_epoch:05d}" if mini_epoch != -1 and mini_epoch is not None else "latest"
)
filename = f"{run_id}_{mini_epoch_id}.chkpt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backward compatibility

Copy link
Collaborator

@tjhunter tjhunter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am inclined (weak yes) to break compatibility with this PR:

  • we are changing inputs (config) and outputs (checkpoints)
  • we are still at a stage when it is easy to do, later stages will be harder/more painful

For the outputs, rather than the mini-epoch index, could we store the value of the mini-epoch? This way, we are more flexible in the future for having different increments such as every 500 optimizer steps etc. rather than being epoch based. Sorry if this was discussed in the design doc.

@clessig
Copy link
Collaborator

clessig commented Nov 6, 2025

I am inclined (weak yes) to break compatibility with this PR:

  • we are changing inputs (config) and outputs (checkpoints)
  • we are still at a stage when it is easy to do, later stages will be harder/more painful

For the outputs, rather than the mini-epoch index, could we store the value of the mini-epoch? This way, we are more flexible in the future for having different increments such as every 500 optimizer steps etc. rather than being epoch based. Sorry if this was discussed in the design doc.

I don't think it's difficult to ensure backward compatability if we introduce a function that abstracts the checkpoint file name, and similar if we have a problem elsewhere. For the CI, we can just switch to the new name and properly publize it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Use steps instead of epochs as a basic progress unit

5 participants