-
Notifications
You must be signed in to change notification settings - Fork 225
Disable checkpointing properly if openPMD is missing #5493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Disable checkpointing properly if openPMD is missing #5493
Conversation
| } | ||
|
|
||
| // Checkpoints are expected to be sorted chronologically. | ||
| bool const stepFound = std::binary_search(checkpoints.cbegin(), checkpoints.cend(), restartStep); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have this logic here
5b85557 to
24a14ee
Compare
Moves checkpointing simulation control in its own class Checkpoint class now has a compile time toggle to disable checkpoint/restart functionality Simplified restart state management and changed from using bools to an enum tracking state
24a14ee to
6bbde4f
Compare
chillenzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review.
| else | ||
| { | ||
| // Handle unknown enum values gracefully. | ||
| out << "UNKNOWN_RESTART_STATE(" << static_cast<int>(state) << ")"; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's hope, we've never gotta do this!
| static std::unordered_map<std::string, RestartState> const stringToState | ||
| = {{"disabled", RestartState::DISABLED}, | ||
| {"try", RestartState::TRY}, | ||
| {"force", RestartState::FORCE}, | ||
| {"success", RestartState::SUCCESS}, | ||
| {"failed", RestartState::FAILED}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we extract this? Having to update this in multiple places seems bound to fail. On the other hand, we'll hopefully not really have to update this ever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unified
| // If the token is not a valid state, set the stream's failbit. | ||
| // boost::program_options will catch this and report an error to the user. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow! Dark magic. Why not just throw yourself here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to leave the responsibility of IO to boost po
| bool doSoftRestart() | ||
| { | ||
| static uint32_t nthSoftRestart = 0; | ||
| if(nthSoftRestart <= softRestarts) | ||
| { | ||
| nthSoftRestart++; | ||
| return true; | ||
| } | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not guarded by checkpointingEnabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We allowed restart attempts, even without checkpoints (from the start of the simulation). This only maintains old functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to this class just to keep the user facing checkpoint.restart.loop input
| * @tparam checkpointingEnabled A boolean to enable/disable checkpointing features. | ||
| */ | ||
| template<bool checkpointingEnabled> | ||
| struct Checkpointing | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly dislike the approach of guarding each individual function body with an if constexpr because you have to repeat this over and over again. I would have expected to just provide a specialisation
template<>
struct Checkpointing<false> {
void registerHelp(...) {}
...
};Is there a strong reason for your design?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used your suggestion
| // clang-format off | ||
| desc.add_options() | ||
| ("checkpoint.restart.loop", po::value<uint32_t>(&softRestarts)->default_value(0), | ||
| "Number of times to restart the simulation after simulation has finished (for presentations). " | ||
| "Note: does not yet work with all plugins, see issue #1305") | ||
| ("checkpoint.restart", po::value<RestartState>(&restartState)->zero_tokens()->implicit_value(RestartState::FORCE), | ||
| "Restart simulation from a checkpoint. Requires a valid checkpoint.") | ||
| ("checkpoint.tryRestart", po::value<RestartState>(&restartState)->zero_tokens()->implicit_value(RestartState::TRY), | ||
| "Try to restart if a checkpoint is available else start the simulation from scratch.") | ||
| ("checkpoint.restart.directory", po::value<std::string>(&restartDirectory)->default_value(restartDirectory), | ||
| "Directory containing checkpoints for a restart") | ||
| ("checkpoint.restart.step", po::value<int32_t>(&restartStep), | ||
| "Checkpoint step to restart from") | ||
| ("checkpoint.period", po::value<std::string>(&checkpointPeriod), | ||
| "Period for checkpoint creation [interval(s) based on steps]") | ||
| ("checkpoint.timePeriod", po::value<std::uint64_t>(&checkpointPeriodMinutes), | ||
| "Time periodic checkpoint creation [period in minutes]") | ||
| ("checkpoint.directory", po::value<std::string>(&checkpointDirectory)->default_value(checkpointDirectory), | ||
| "Directory for checkpoints"); | ||
| // clang-format on | ||
| // translate checkpointPeriod string into checkpoint intervals | ||
| seqCheckpointPeriod = pluginSystem::toTimeSlice(checkpointPeriod); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Diff'ed against the removed parts from SimulationHelper and looks correct.
| if(checkpointTimeThread.joinable()) | ||
| checkpointTimeThread.join(); | ||
|
|
||
| checkpointing.endTimeBasedCheckpointing(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, camel case is not great here. I had to dig into the code to realise that it's not an apocalytic "end-time-based checkpointing" but it's just "ending the time-based checkpointing". Could we rename to "finish"/"cancel"/...?
(I would have been curious what an apocalyptic "end-time-based checkpointing" could look like, though, if you ever want to implement one...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to finishTimeBasedCheckpointing
chillenzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay to me otherwise. Please do the marked renamings.
| bool doSoftRestart() | ||
| { | ||
| static uint32_t nthSoftRestart = 0; | ||
| if(nthSoftRestart <= softRestarts) | ||
| { | ||
| nthSoftRestart++; | ||
| return true; | ||
| } | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename this function. Starting with do... suggests heavily that it actually does something. In the while context it's used in it doesn't read too bad but out of context it's weird. You might want to try something along the lines of moreSoftRestartAttemptsAllowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to canSoftRestart. But as I type it now i think this name has the empty problem. It is not clear if the caller asking or telling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now changed to the more verbose tryConsumeRestartAttempt
| void doTimeBasedCheckpointing() | ||
| { | ||
| if constexpr(checkpointingEnabled) | ||
| { | ||
| // register concurrent thread to perform checkpointing periodically after a user defined time | ||
| if(checkpointPeriodMinutes != 0) | ||
| checkpointTimeThread = std::thread( | ||
| [&, this]() | ||
| { | ||
| std::unique_lock<std::mutex> lk(this->concurrentThreadMutex); | ||
| while(exitConcurrentThreads.wait_until( | ||
| lk, | ||
| std::chrono::system_clock::now() + std::chrono::minutes(checkpointPeriodMinutes)) | ||
| == std::cv_status::timeout) | ||
| { | ||
| signal::detail::setCreateCheckpoint(1); | ||
| } | ||
| }); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: do... suggests that it does something while it actually rather start...s or activate...s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to startTimeBasedCheckpointing to complement the finishTimeBasedCheckpointing
|
@ikbuibui what is the status of this PR? |
|
I still need to address the requested changes, will happen sometime this week. |
Specialized the Checkpointing class template for disabled, i.e. false type Did some renamings Unified the enum to string mapping
e9e750f to
b2f4e59
Compare
Moves checkpointing simulation control in its own class
Checkpoint class now has a compile time toggle to disable checkpoint/restart functionality (fixes #5480)
Simplified restart state management and changed from using bools to an enum tracking state