Update HPSS archiving so we can restart experiments from archive #3802

Merged

JessicaMeixner-NOAA
Contributor

Description

This is the first step toward ensuring we can restart retros from the HPSS archive. This PR makes sure an experiment can be restarted from what is in the archive. The next step is making sure we can replicate an experiment (NOAA-EMC/GFS#3, which this PR may or may not already do). Lastly, in the longer term, we are working on trimming the archived files down to the minimum needed. We will start that by creating a list of the archived files and of what we think should be in HPSS (NOAA-EMC/GFS#4).

Fixes #3764
Fixes #3758
Also makes sure we save all of the MOM6 restart files, not just the .res. file.
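
As a rough illustration of why an exact file name misses part of a multi-file MOM6 restart while a wildcard catches all of it, here is a minimal Python sketch. The file names and glob patterns below are hypothetical, chosen only for illustration; they are not taken from the archive templates changed in this PR.

```python
from fnmatch import fnmatch

# Hypothetical restart file names (assumed for illustration only).
restart_files = [
    "20211221.060000.MOM.res.nc",
    "20211221.060000.MOM.res_1.nc",
    "20211221.060000.MOM.res_2.nc",
]


def select(files, pattern):
    """Return the files whose names match a shell-style glob pattern."""
    return [name for name in files if fnmatch(name, pattern)]


# A pattern that only matches the primary restart file.
exact_only = select(restart_files, "*.MOM.res.nc")

# A wildcard pattern that also matches the numbered companion files.
all_restarts = select(restart_files, "*.MOM.res*.nc")

print(exact_only)
print(all_restarts)
```

With these assumed names, the first pattern selects one file and the second selects all three, which is the kind of gap the wildcard change is meant to close.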

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? NO

How has this been tested?

I have restarted both C96C48mx500_S2SW_cyc_gfs.yaml and gfsv17/C1152mx025_S2SW_rdhpcs.yaml from their own archives. So far only the first half-cycle of gfsv17/C1152mx025_S2SW_rdhpcs.yaml has started successfully; the experiment is still running.

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

@JessicaMeixner-NOAA JessicaMeixner-NOAA self-assigned this Jun 13, 2025
@DavidHuber-NOAA DavidHuber-NOAA requested a review from Copilot June 18, 2025 13:08
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR enhances the HPSS archiving templates to include additional restart files (wave, ocean, ice) so experiments can be resumed directly from archive. Key changes:

  • Adds the wave-restart include in master_gdas.yaml.j2 and relocates the ocean-restart block under its own DO_OCN check.
  • Annotates the wave restart template with clarifying comments.
  • Extends the EnKF restart groups to archive ocean and ice restart files and adjusts the ocean pattern to a wildcard match.

Comments suppressed due to low confidence (3)

parm/archive/enkf_restartb_grp.yaml.j2:46

  • New conditional branch for ocean/ice restarts should be covered by CI tests to catch any templating or path errors early.
        {% if DOHYBVAR_OCN %}

parm/archive/master_gdas.yaml.j2:36

  • This include is inside a filter indent block but lacks leading spaces; indent it to match the other includes so the generated YAML stays valid.
{% include "gdaswave_restart.yaml.j2" %}

parm/archive/gdaswave_restart.yaml.j2:5

  • The TODO remains unaddressed; consider explicitly listing the two wave restart filenames now to avoid archiving unintended files.
        # TODO explicitly name the wave restart files to archive

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@JessicaMeixner-NOAA
Contributor Author

Just thought I'd ping on this PR to see if there are things anyone is waiting on before this PR is reviewed, tested, etc. Perhaps it's just waiting in a queue?

@DavidHuber-NOAA
Contributor

Launching CI on C6 to test archiving.

@DavidHuber-NOAA DavidHuber-NOAA added the CI-Gaeac6-Ready **CM use only** PR is ready for CI testing on Gaea C6 label Jun 23, 2025
@emcbot emcbot added CI-Gaeac6-Building **Bot use only** CI testing is cloning/building on Gaea C6 CI-Gaeac6-Running and removed CI-Gaeac6-Ready **CM use only** PR is ready for CI testing on Gaea C6 CI-Gaeac6-Building **Bot use only** CI testing is cloning/building on Gaea C6 labels Jun 23, 2025
@emcbot
emcbot commented Jun 24, 2025

Experiment C96C48mx500_S2SW_cyc_gfs FAILED on Gaeac6 in Build# 1 with error logs:

/gpfs/f6/drsa-precip3/world-shared/global/CI/3802/RUNTESTS/COMROOT/C96C48mx500_S2SW_cyc_gfs_0c3bf5f8/logs/2021122100/gfs_wavepostbndpntbll.log

Follow link here to view the contents of the above file(s): (gfs_wavepostbndpntbll.log)

@emcbot emcbot added CI-Gaeac6-Failed **Bot use only** CI testing on Gaea C6 for this PR has failed and removed CI-Gaeac6-Running labels Jun 24, 2025
@emcbot
emcbot commented Jun 24, 2025

Experiment C96C48mx500_S2SW_cyc_gfs FAILED on Gaeac6 in Build# 1 in
/gpfs/f6/drsa-precip3/world-shared/global/CI/3802/RUNTESTS/EXPDIR/C96C48mx500_S2SW_cyc_gfs_0c3bf5f8

@emcbot emcbot added CI-Gaeac6-Failed **Bot use only** CI testing on Gaea C6 for this PR has failed and removed CI-Gaeac6-Failed **Bot use only** CI testing on Gaea C6 for this PR has failed labels Jun 24, 2025
@DavidHuber-NOAA
Contributor

@JessicaMeixner-NOAA The gfs_wavepostbndpntbll job failed due to a wallclock timeout. Rebooting the job allowed it to run to completion after ~8 minutes. Should the wallclock be increased to 15 minutes?

I am manually running the remaining tests on C6 now.

@JessicaMeixner-NOAA
Contributor Author

@DavidHuber-NOAA - Yes, that sounds very reasonable. Would you like me to push that update to this PR?

@DavidHuber-NOAA
Contributor

> @DavidHuber-NOAA - Yes, that sounds very reasonable. Would you like me to push that update to this PR?

Yes, please do.

@JessicaMeixner-NOAA
Contributor Author

> > @DavidHuber-NOAA - Yes, that sounds very reasonable. Would you like me to push that update to this PR?
>
> Yes, please do.

Done.

@DavidHuber-NOAA
Contributor

All tests completed successfully on C6.

@DavidHuber-NOAA DavidHuber-NOAA added CI-Gaeac6-Passed **Bot use only** CI testing on Gaea C6 for this PR has completed successfully and removed CI-Gaeac6-Failed **Bot use only** CI testing on Gaea C6 for this PR has failed labels Jun 24, 2025
@DavidHuber-NOAA DavidHuber-NOAA merged commit 3981e40 into NOAA-EMC:develop Jun 24, 2025
5 checks passed
@JessicaMeixner-NOAA JessicaMeixner-NOAA deleted the updatearchivehpss branch June 26, 2025 13:54
Labels
CI-Gaeac6-Passed **Bot use only** CI testing on Gaea C6 for this PR has completed successfully
Development

Successfully merging this pull request may close these issues.

  • Updates needed for archive gdaswave_restart
  • Marine ensemble restarts need to be added to restartb tarball on HPSS
3 participants