Conversation

@ikbuibui
Contributor

@ikbuibui ikbuibui commented Sep 29, 2025

Moves checkpointing simulation control into its own class
Checkpoint class now has a compile-time toggle to disable checkpoint/restart functionality (fixes #5480)
Simplified restart state management and changed from using bools to an enum that tracks the state

}

// Checkpoints are expected to be sorted chronologically.
bool const stepFound = std::binary_search(checkpoints.cbegin(), checkpoints.cend(), restartStep);
Contributor Author

I have this logic here
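
For readers following the quoted line: the lookup is only valid because the checkpoint list is sorted. A minimal sketch of that check, assuming the steps are held in a `std::vector<std::uint32_t>` (the container type and the helper name `hasCheckpoint` are illustrative assumptions, not taken from the PR):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns true if `step` is among the available checkpoints.
// Precondition: `checkpoints` is sorted in ascending order. On unsorted input
// std::binary_search may miss elements that are present.
bool hasCheckpoint(std::vector<std::uint32_t> const& checkpoints, std::uint32_t step)
{
    return std::binary_search(checkpoints.cbegin(), checkpoints.cend(), step);
}
```

For example, with checkpoints at steps 100, 200 and 300, a restart from step 200 is found while step 250 is not.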

@ikbuibui force-pushed the disable_checkpointing branch from 5b85557 to 24a14ee on October 2, 2025 10:49
@ikbuibui added the "component: core in PIConGPU (core application)" label on Oct 2, 2025
Moves checkpointing simulation control in its own class
Checkpoint class now has a compile time toggle to disable
checkpoint/restart functionality
Simplified restart state management and changed from using bools to an
enum tracking state
@ikbuibui force-pushed the disable_checkpointing branch from 24a14ee to 6bbde4f on October 7, 2025 08:15
@ikbuibui changed the title from "Disable checkpointing" to "Disable checkpointing properly if openPMD is missing" on Oct 28, 2025
@ikbuibui marked this pull request as ready for review on October 30, 2025 07:27
Contributor

@chillenzer chillenzer left a comment

Partial review.

Comment on lines +80 to +84
else
{
    // Handle unknown enum values gracefully.
    out << "UNKNOWN_RESTART_STATE(" << static_cast<int>(state) << ")";
}
Contributor

Let's hope we never have to do this!

Comment on lines 96 to 101
static std::unordered_map<std::string, RestartState> const stringToState
    = {{"disabled", RestartState::DISABLED},
       {"try", RestartState::TRY},
       {"force", RestartState::FORCE},
       {"success", RestartState::SUCCESS},
       {"failed", RestartState::FAILED}};
Contributor

Should we extract this? Having to update this in multiple places seems bound to fail. On the other hand, we'll hopefully not really have to update this ever.

Contributor Author

Unified
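
One way such a unification can look is a single table that both conversion directions read from. This is only a sketch of the idea: the table name `restartStateNames` and the helpers `toString`/`fromString` are assumptions made for illustration and may not match what the PR ended up with.

```cpp
#include <array>
#include <optional>
#include <string_view>
#include <utility>

enum class RestartState { DISABLED, TRY, FORCE, SUCCESS, FAILED };

// Single source of truth: adding a state means touching only this table.
inline constexpr std::array<std::pair<std::string_view, RestartState>, 5> restartStateNames{{
    {"disabled", RestartState::DISABLED},
    {"try", RestartState::TRY},
    {"force", RestartState::FORCE},
    {"success", RestartState::SUCCESS},
    {"failed", RestartState::FAILED},
}};

// Enum -> name; empty optional for values that are not in the table.
std::optional<std::string_view> toString(RestartState state)
{
    for(auto const& [name, value] : restartStateNames)
        if(value == state)
            return name;
    return std::nullopt;
}

// Name -> enum; empty optional for unknown tokens.
std::optional<RestartState> fromString(std::string_view token)
{
    for(auto const& [name, value] : restartStateNames)
        if(name == token)
            return value;
    return std::nullopt;
}
```

With five entries a linear scan is adequate, and stream operators can be built on top of these two functions without repeating the names.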

Comment on lines +112 to +113
// If the token is not a valid state, set the stream's failbit.
// boost::program_options will catch this and report an error to the user.
Contributor

Wow! Dark magic. Why not just throw an exception here yourself?

Contributor Author

I prefer to leave the responsibility for I/O to boost::program_options.
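
The failbit convention under discussion can be shown with the standard library alone. The sketch below uses a reduced three-value enum and a hypothetical `parses` helper that stands in for whatever code performs the extraction; per the quoted code comment, boost::program_options plays that role in the PR and reports the error to the user.

```cpp
#include <istream>
#include <sstream>
#include <string>

enum class RestartState { DISABLED, TRY, FORCE };

// Extraction operator: on an unknown token it does not throw, it only marks
// the stream as failed and leaves error reporting to the caller.
std::istream& operator>>(std::istream& in, RestartState& state)
{
    std::string token;
    in >> token;
    if(token == "disabled")
        state = RestartState::DISABLED;
    else if(token == "try")
        state = RestartState::TRY;
    else if(token == "force")
        state = RestartState::FORCE;
    else
        in.setstate(std::ios_base::failbit);
    return in;
}

// Hypothetical caller: succeeds only if the whole extraction succeeded.
bool parses(std::string const& text)
{
    std::istringstream in(text);
    RestartState state{};
    return static_cast<bool>(in >> state);
}
```

Here `parses("force")` succeeds and `parses("sometimes")` fails, without the operator itself deciding how the failure is presented.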

Comment on lines 225 to 234
bool doSoftRestart()
{
    static uint32_t nthSoftRestart = 0;
    if(nthSoftRestart <= softRestarts)
    {
        nthSoftRestart++;
        return true;
    }
    return false;
}
Contributor

Not guarded by checkpointingEnabled.

Contributor Author

We allowed restart attempts, even without checkpoints (from the start of the simulation). This only maintains old functionality.

Contributor Author

I moved it to this class just to keep the user-facing checkpoint.restart.loop input.

Comment on lines +128 to +132
* @tparam checkpointingEnabled A boolean to enable/disable checkpointing features.
*/
template<bool checkpointingEnabled>
struct Checkpointing
{
Contributor

I strongly dislike the approach of guarding each individual function body with an if constexpr because you have to repeat this over and over again. I would have expected to just provide a specialisation

template<>
struct Checkpointing<false> {
    void registerHelp(...) {}
    ...
};

Is there a strong reason for your design?

Contributor Author

Used your suggestion
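
To illustrate the suggested design: the primary template carries the real implementation, and a full specialisation for `false` offers the same interface with empty behaviour, so no function body needs its own `if constexpr` guard. The members `isEnabled` and `describe` below are hypothetical placeholders, not the PR's actual interface.

```cpp
#include <string>

// Primary template: the full implementation, used when checkpointing is enabled.
template<bool checkpointingEnabled>
struct Checkpointing
{
    std::string checkpointDirectory = "checkpoints";

    bool isEnabled() const
    {
        return true;
    }

    std::string describe() const
    {
        return "writing checkpoints to " + checkpointDirectory;
    }
};

// Full specialisation for the disabled case: same call sites compile,
// but the class holds no state and does no work.
template<>
struct Checkpointing<false>
{
    bool isEnabled() const
    {
        return false;
    }

    std::string describe() const
    {
        return "checkpointing disabled";
    }
};
```

The trade-off is that the two definitions must be kept in sync by hand: a member added to the primary template and used by callers also needs a counterpart in the specialisation.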

Comment on lines 139 to 161
    // clang-format off
    desc.add_options()
        ("checkpoint.restart.loop", po::value<uint32_t>(&softRestarts)->default_value(0),
         "Number of times to restart the simulation after simulation has finished (for presentations). "
         "Note: does not yet work with all plugins, see issue #1305")
        ("checkpoint.restart", po::value<RestartState>(&restartState)->zero_tokens()->implicit_value(RestartState::FORCE),
         "Restart simulation from a checkpoint. Requires a valid checkpoint.")
        ("checkpoint.tryRestart", po::value<RestartState>(&restartState)->zero_tokens()->implicit_value(RestartState::TRY),
         "Try to restart if a checkpoint is available else start the simulation from scratch.")
        ("checkpoint.restart.directory", po::value<std::string>(&restartDirectory)->default_value(restartDirectory),
         "Directory containing checkpoints for a restart")
        ("checkpoint.restart.step", po::value<int32_t>(&restartStep),
         "Checkpoint step to restart from")
        ("checkpoint.period", po::value<std::string>(&checkpointPeriod),
         "Period for checkpoint creation [interval(s) based on steps]")
        ("checkpoint.timePeriod", po::value<std::uint64_t>(&checkpointPeriodMinutes),
         "Time periodic checkpoint creation [period in minutes]")
        ("checkpoint.directory", po::value<std::string>(&checkpointDirectory)->default_value(checkpointDirectory),
         "Directory for checkpoints");
    // clang-format on
    // translate checkpointPeriod string into checkpoint intervals
    seqCheckpointPeriod = pluginSystem::toTimeSlice(checkpointPeriod);
}
Contributor

Diff'ed against the removed parts from SimulationHelper and looks correct.

if(checkpointTimeThread.joinable())
    checkpointTimeThread.join();

checkpointing.endTimeBasedCheckpointing();
Contributor

Wow, camel case is not great here. I had to dig into the code to realise that it's not an apocalyptic "end-time-based checkpointing" but just "ending the time-based checkpointing". Could we rename to "finish"/"cancel"/...?

(I would have been curious what an apocalyptic "end-time-based checkpointing" could look like, though, if you ever want to implement one...)

Contributor Author

renamed to finishTimeBasedCheckpointing

Contributor

@chillenzer chillenzer left a comment

Looks okay to me otherwise. Please do the marked renamings.

Comment on lines 225 to 234
bool doSoftRestart()
{
    static uint32_t nthSoftRestart = 0;
    if(nthSoftRestart <= softRestarts)
    {
        nthSoftRestart++;
        return true;
    }
    return false;
}
Contributor

Please rename this function. Starting with do... heavily suggests that it actually does something. In the while condition where it's used, it doesn't read too badly, but out of context it's weird. You might want to try something along the lines of moreSoftRestartAttemptsAllowed.

Contributor Author

Renamed to canSoftRestart. But as I type this, I think the name has the "empty" problem: it is not clear whether the caller is asking or telling.

Contributor Author

now changed to the more verbose tryConsumeRestartAttempt
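
The new name says that the call both queries and mutates. A sketch of that contract follows, using a hypothetical `SoftRestartBudget` struct with a member counter in place of the function-local static in the quoted code:

```cpp
#include <cstdint>

// Hypothetical stand-in for the relevant part of the checkpointing class.
struct SoftRestartBudget
{
    std::uint32_t softRestarts = 0; // value given via checkpoint.restart.loop
    std::uint32_t consumed = 0; // attempts handed out so far

    // Returns true and uses up one attempt while attempts remain.
    // As in the quoted code, the initial run counts as an attempt,
    // so this returns true softRestarts + 1 times in total.
    bool tryConsumeRestartAttempt()
    {
        if(consumed <= softRestarts)
        {
            ++consumed;
            return true;
        }
        return false;
    }
};
```

Used as a loop condition, `while(budget.tryConsumeRestartAttempt())` runs the simulation once plus the requested number of restarts.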

Comment on lines 236 to 255
void doTimeBasedCheckpointing()
{
    if constexpr(checkpointingEnabled)
    {
        // register concurrent thread to perform checkpointing periodically after a user defined time
        if(checkpointPeriodMinutes != 0)
            checkpointTimeThread = std::thread(
                [&, this]()
                {
                    std::unique_lock<std::mutex> lk(this->concurrentThreadMutex);
                    while(exitConcurrentThreads.wait_until(
                              lk,
                              std::chrono::system_clock::now() + std::chrono::minutes(checkpointPeriodMinutes))
                          == std::cv_status::timeout)
                    {
                        signal::detail::setCreateCheckpoint(1);
                    }
                });
    }
}
Contributor

Same here: do... suggests that it does something, whereas it actually start...s or activate...s something.

Contributor Author

renamed to startTimeBasedCheckpointing to complement the finishTimeBasedCheckpointing
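
The start/finish pairing can be sketched as a small class that owns a timer thread. Everything here is an illustrative assumption: the class `PeriodicTrigger`, its `fired` counter (standing in for the checkpoint signal) and the explicit `stopRequested` flag are not taken from the PR. Unlike the quoted snippet, the sketch waits with a predicate, so a stop request issued before the thread starts waiting is not lost.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the time-based checkpoint trigger.
// finish() must be called before destruction if start() was called.
class PeriodicTrigger
{
public:
    // Starts a background thread that increments `fired` once per `period`.
    void start(std::chrono::milliseconds period)
    {
        worker = std::thread(
            [this, period]()
            {
                std::unique_lock<std::mutex> lock(stateMutex);
                // wait_for returns false when the period elapsed and no stop
                // was requested, which is the moment to trigger.
                while(!stopCondition.wait_for(lock, period, [this] { return stopRequested; }))
                    ++fired;
            });
    }

    // Requests a stop, wakes the thread early and joins it.
    void finish()
    {
        {
            std::lock_guard<std::mutex> lock(stateMutex);
            stopRequested = true;
        }
        stopCondition.notify_all();
        if(worker.joinable())
            worker.join();
    }

    std::atomic<int> fired{0};

private:
    std::mutex stateMutex;
    std::condition_variable stopCondition;
    bool stopRequested = false;
    std::thread worker;
};
```

Calling `finish()` right after `start(std::chrono::hours(1))` returns promptly instead of waiting out the period, which is the behaviour the rename to start/finish is meant to convey.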

@PrometheusPi
Member

@ikbuibui what is the status of this PR?

@ikbuibui
Contributor Author

I still need to address the requested changes, will happen sometime this week.

Specialized the Checkpointing class template for disabled, i.e. false
type
Did some renamings
Unified the enum to string mapping
@ikbuibui force-pushed the disable_checkpointing branch from e9e750f to b2f4e59 on November 21, 2025 08:38
Labels

component: core in PIConGPU (core application)

Development

Successfully merging this pull request may close these issues.

checkpoint flags still exist without openPMD

3 participants