Update HPSS archiving so we can restart experiments from archive #3802
Conflicts: parm/archive/enkf_restarta_grp.yaml.j2
Pull Request Overview
This PR enhances the HPSS archiving templates to include additional restart files (wave, ocean, ice) so experiments can be resumed directly from archive. Key changes:
- Adds the wave-restart include in master_gdas.yaml.j2 and relocates the ocean-restart block under its own DO_OCN check.
- Annotates the wave restart template with clarifying comments.
- Extends the EnKF restart groups to archive ocean and ice restart files and adjusts the ocean pattern to a wildcard match.
Comments suppressed due to low confidence (3)
parm/archive/enkf_restartb_grp.yaml.j2:46
- New conditional branch for ocean/ice restarts should be covered by CI tests to catch any templating or path errors early.
{% if DOHYBVAR_OCN %}
parm/archive/master_gdas.yaml.j2:36
- This include is inside a filter indent block but lacks leading spaces; indent it to match the other includes so the generated YAML stays valid.
{% include "gdaswave_restart.yaml.j2" %}
parm/archive/gdaswave_restart.yaml.j2:5
- The TODO remains unaddressed; consider explicitly listing the two wave restart filenames now to avoid archiving unintended files.
# TODO explicitly name the wave restart files to archive
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Just thought I'd ping on this PR to see if there is anything anyone is waiting on before this PR is reviewed, tested, etc. Perhaps it's just waiting in a queue?
Launching CI on C6 to test archiving.
Experiment C96C48mx500_S2SW_cyc_gfs FAILED on Gaeac6 in Build# 1 with error logs:
Follow link here to view the contents of the above file(s): (gfs_wavepostbndpntbll.log)
Experiment C96C48mx500_S2SW_cyc_gfs FAILED on Gaeac6 in Build# 1 in
@JessicaMeixner-NOAA The gfs_wavepostbndpntbll job failed due to a wallclock timeout. Rebooting the job allowed it to run to completion after ~8 minutes. Should the wallclock be increased to 15 minutes? I am manually running the remaining tests on C6 now.
@DavidHuber-NOAA - Yes, that sounds very reasonable. Would you like me to push that update to this PR?
Yes, please do.
Done.
All tests completed successfully on C6.
Description
This is the first step toward ensuring we can restart retros from the HPSS archive: this PR makes sure an experiment can be restarted from what is in its archive. The next step is making sure we can replicate an experiment (NOAA-EMC/GFS#3, which this PR may or may not already address). Lastly, over the long term, we are working on reducing the set of archived files to the minimum necessary. We will start that by creating a list of files and what we think should be in HPSS (NOAA-EMC/GFS#4).
Fixes #3764
Fixes #3758
Also makes sure we save all of the restart files for MOM6, not just the .res. file.
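As a rough illustration of why a wildcard match can pick up more MOM6 restart files than a single exact name, here is a minimal Python sketch. The file names, the timestamp prefix, and both patterns are assumptions made up for this example; they are not taken from the archive templates in this PR, and the templates' real patterns may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical restart file names, for illustration only; the names
# actually produced depend on the experiment and model configuration.
restart_files = [
    "20210323.120000.MOM.res.nc",
    "20210323.120000.MOM.res_1.nc",
    "20210323.120000.MOM.res_2.nc",
    "20210323.120000.MOM.res_3.nc",
]


def select(files, pattern):
    """Return the files whose names match a shell-style pattern."""
    return [name for name in files if fnmatchcase(name, pattern)]


# An exact-suffix pattern (assumed) keeps only the first restart file.
exact_only = select(restart_files, "*.MOM.res.nc")

# A wildcard after "res" (assumed) also keeps the numbered files.
wildcard = select(restart_files, "*.MOM.res*.nc")

print(len(exact_only), len(wildcard))
```

With these assumed names, the exact pattern selects one file and the wildcard pattern selects all four, which is the kind of gap that would leave an archive unable to restart the ocean model.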
Type of change
Change characteristics
How has this been tested?
Restarted both C96C48mx500_S2SW_cyc_gfs.yaml and gfsv17/C1152mx025_S2SW_rdhpcs.yaml from their own archives. So far, only the first half-cycle of gfsv17/C1152mx025_S2SW_rdhpcs.yaml has started successfully; the experiment is still running.
Checklist