Many HPC applications need to implement a checkpoint/restart capability to address either:
- maximum job time allocations (e.g., no job can run for more than 24 hours)
- fault tolerance (e.g., individual parts of a supercomputer can fail, leading to premature job failure)
To address these concerns, typical HPC applications periodically write their internal state to disk (checkpointing) and can then restart and resume progress from the most recent checkpoint file.
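The checkpoint/restart pattern described above can be sketched in a few lines of standard C++. This is only an illustration under stated assumptions: `SolverState`, `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical names, the state fields are placeholders rather than Hiop's actual internals, and a raw binary file stands in for whatever a real checkpointing library would write.

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical solver state; NOT Hiop's actual internal layout.
struct SolverState {
  std::uint64_t iteration = 0;
  double mu = 1.0;        // placeholder scalar, e.g. a barrier parameter
  std::vector<double> x;  // placeholder for the current iterate
};

// Write the state to a temporary file, then rename it over the target.
// On POSIX systems the rename replaces the old file, so a crash in the
// middle of a write does not destroy the last good checkpoint.
bool save_checkpoint(const SolverState& s, const std::string& path) {
  const std::string tmp = path + ".tmp";
  {
    std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    const std::uint64_t n = s.x.size();
    out.write(reinterpret_cast<const char*>(&s.iteration), sizeof s.iteration);
    out.write(reinterpret_cast<const char*>(&s.mu), sizeof s.mu);
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(s.x.data()),
              static_cast<std::streamsize>(n * sizeof(double)));
    out.flush();
    if (!out) return false;
  }
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Read a checkpoint into `s`. On any failure `s` is left unchanged.
bool load_checkpoint(SolverState& s, const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  if (!in) return false;
  SolverState tmp;
  std::uint64_t n = 0;
  in.read(reinterpret_cast<char*>(&tmp.iteration), sizeof tmp.iteration);
  in.read(reinterpret_cast<char*>(&tmp.mu), sizeof tmp.mu);
  in.read(reinterpret_cast<char*>(&n), sizeof n);
  if (!in) return false;
  tmp.x.resize(n);
  in.read(reinterpret_cast<char*>(tmp.x.data()),
          static_cast<std::streamsize>(n * sizeof(double)));
  if (!in) return false;
  s = std::move(tmp);
  return true;
}

// Iterate until `max_iters`, checkpointing every `interval` iterations.
// If a checkpoint already exists at `path`, resume from it.
SolverState run(std::uint64_t max_iters, std::uint64_t interval,
                const std::string& path) {
  SolverState s;
  if (!load_checkpoint(s, path)) {
    s.x.assign(3, 0.0);  // no checkpoint found: start from scratch
  }
  while (s.iteration < max_iters) {
    for (double& xi : s.x) xi += 1.0;  // stand-in for a real solver step
    s.mu *= 0.5;
    ++s.iteration;
    if (s.iteration % interval == 0) save_checkpoint(s, path);
  }
  return s;
}
```

Calling `run(4, 2, "ckpt.bin")` and then `run(10, 5, "ckpt.bin")` mimics a job that hits its time limit after four iterations and is resubmitted: the second call picks up at iteration 4 instead of 0. The raw binary format here is not portable across architectures, which is one reason to prefer a dedicated checkpointing library in practice.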
Hiop has some ability to warm start through get_starting_point() or get_warm_start(). However, it would be better if Hiop could save more of its internal state so that a restarted run resumes closer to where the previous one stopped.
The ::axom::sidre package provides a flexible checkpoint/restart API and implementation. Multiple LLNL packages use ::axom::sidre. The repository is here. It's buildable with Spack.
cnpetra changed the title from "Add advanced checkpont/restart capabilities to Hiop" to "Add advanced checkpoint/restart capabilities to Hiop" on Jun 3, 2024