Conversation

@ftgoktas
Contributor

@ftgoktas ftgoktas commented Aug 28, 2025

This PR migrates SWELL from r2d2 v1 to r2d2 v3 API for centralized metadata management.
Updates GetObservations, GetBackground, and SaveObsDiags tasks to use new API syntax.
Addresses #318

Key Changes:

  • Updated module loading to use r2d2-client/sles15_0604 instead of r2d2/sles15_spack19
  • Modified r2d2 configuration to use v3 API parameters (data_hub, data_store, compute_host)
  • Updated SWELL tasks to use new r2d2 v3 API syntax:
    • GetObservations: Updated fetch parameters (item='observation', window_start, etc.)
    • GetBackground: Updated for forecast item fetching
    • SaveObsDiags: Updated to store as item='feedback' with proper parameters
  • Added environment variable handling for r2d2 v3 authentication (R2D2_USER, R2D2_API_KEY)
  • Fixed module loading conflicts between JEDI bundles and r2d2 v3
  • Created a script for registering observation, background, and bias correction files with r2d2 v3 schema
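
A minimal sketch of the fail-fast check implied by the authentication item above: confirm that `R2D2_USER` and `R2D2_API_KEY` are set before any r2d2 v3 task runs. Only the two variable names come from this PR; the function name `missing_r2d2_variables` is hypothetical and is not part of SWELL or r2d2.

```python
import os
from typing import List, Mapping

# Variable names taken from the PR description.
REQUIRED_R2D2_VARIABLES = ("R2D2_USER", "R2D2_API_KEY")


def missing_r2d2_variables(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required r2d2 v3 variables that are unset or empty."""
    return [name for name in REQUIRED_R2D2_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_r2d2_variables()
    if missing:
        print("Missing r2d2 v3 credentials: " + ", ".join(missing))
```

Reporting every missing name at once, rather than failing on the first, would let a user fix the environment in one pass.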

Tested with 3dvar and 3dvar_atmos suites - successfully fetches observations/backgrounds and stores feedback files.

Dependencies

  • r2d2 v3 infrastructure (nccs-gmao data hub, r2d2-experiments-nccs-gmao data store)
  • r2d2-client/sles15_0604 module availability on target platform

@ftgoktas ftgoktas changed the title R2D2 v3 migration #318 R2D2 v3 migration Aug 28, 2025
@ftgoktas ftgoktas changed the title R2D2 v3 migration R2D2 v3 Adaptation Aug 28, 2025
@ftgoktas ftgoktas requested review from Dooruk and mranst August 28, 2025 22:22
@ftgoktas ftgoktas marked this pull request as draft August 28, 2025 22:22
@ftgoktas ftgoktas requested a review from ashiklom August 28, 2025 22:22
@ftgoktas ftgoktas self-assigned this Aug 28, 2025
@ftgoktas ftgoktas added the enhancement New feature or request label Aug 28, 2025
@ftgoktas ftgoktas linked an issue Aug 28, 2025 that may be closed by this pull request
6 tasks
mranst previously requested changes Aug 29, 2025
Collaborator

@mranst mranst left a comment

Nice work. It seems like r2d2 v3 requires a lot of environment variables to be set in order to run. @ashiklom or @Dooruk may have a better idea, but one way to handle this would be a config file called ~/.swell/r2d2_credentials.yaml where the user stores this information. A function within TaskBase could then parse the file and set these variables using os.environ (there is a somewhat similar example in slurm.py, where we set slurm defaults in ~/.swell/swell-slurm.yaml). We're also going to need some documentation explaining this; I'm having trouble getting it running with all the necessary environment additions.
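
The suggestion above could be sketched roughly as follows. This assumes the credentials file holds flat `NAME: value` lines; `read_flat_yaml` is a small stand-in so the example has no dependencies, and a real implementation would more likely use a YAML library. Both function names are hypothetical and do not exist in SWELL.

```python
import os
from pathlib import Path
from typing import Dict, MutableMapping

# Path suggested in the review comment; the file layout is an assumption.
CREDENTIALS_PATH = Path.home() / ".swell" / "r2d2_credentials.yaml"


def read_flat_yaml(path: Path) -> Dict[str, str]:
    """Parse flat 'NAME: value' lines (stand-in for a real YAML loader)."""
    values: Dict[str, str] = {}
    for raw_line in path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comment lines, and anything without a key.
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        values[name.strip()] = value.strip().strip("'\"")
    return values


def export_r2d2_credentials(
    path: Path = CREDENTIALS_PATH,
    environ: MutableMapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Copy credentials from the file into the environment.

    Variables that are already set are left untouched, so a value exported
    in the shell wins over the file. Returns the entries that were applied.
    """
    if not path.is_file():
        return {}
    applied: Dict[str, str] = {}
    for name, value in read_flat_yaml(path).items():
        if name not in environ:
            environ[name] = value
            applied[name] = value
    return applied
```

Letting the shell environment take precedence is a design choice in this sketch, not something decided in the thread; it would allow CI to override a shared file without editing it.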

Collaborator

@Dooruk Dooruk left a comment

@ftgoktas great job! Finally v3 is becoming a reality.

There are some Swell-specific peculiarities, and @mranst pointed out most of them. There are also built-in utilities like logger that you could use. We should store R2D2-related config and environment variables inside a file like ~/.swell/r2d2-config.yaml.

A few additional notes and Q.s:

  • There is a little bit of separation happening here, with the atmosphere side accessing obs files outside of the R2D2 context, which is fine during development.

  • Once we have something like ~/.swell/r2d2-config.yaml we need to have a dedicated section in the documentation, maybe pointing at JCSDA-internal/R2D2 documentation and with useful commands.

  • Is it possible to request API keys for @mranst and @ashiklom? Or is it possible to request a generic "gmao-user" API key and have special ones for sudo-type users? That way we can create a generic ~/.swell/r2d2-config.yaml if none exists; otherwise I imagine 98% of users will be flabbergasted and message one of us without looking at the documentation.

  • I probably asked this before, but is linking rather than copying an option with fetch?

  • (Long term) Let's keep in mind that we would like to use R2D2 on AWS or NAS and be able to access cloud datasets (e.g., NNJA) with it.
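
The "create a generic ~/.swell/r2d2-config.yaml if none exists" idea from the list above might look roughly like this. It is only a sketch: the file contents are placeholders, the key names inside the file are assumed, and `ensure_r2d2_config` is a hypothetical helper rather than anything in SWELL.

```python
import os
from pathlib import Path

# File name from the review comment; the contents below are placeholders.
CONFIG_PATH = Path.home() / ".swell" / "r2d2-config.yaml"
DEFAULT_CONFIG = (
    "# Generic r2d2 settings; replace the placeholders with real values.\n"
    "R2D2_USER: <shared-user-name>\n"
    "R2D2_API_KEY: <shared-api-key>\n"
)


def ensure_r2d2_config(path: Path = CONFIG_PATH, contents: str = DEFAULT_CONFIG) -> bool:
    """Write a default config if none exists; return True if a file was created."""
    if path.exists():
        return False  # never overwrite a user's own settings
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with owner-only permissions, since it holds an API key.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(contents)
    return True
```

Restricting the file to the owner is an assumption about how a shared key should be handled, not a requirement stated in this thread.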

Collaborator

@Dooruk Dooruk left a comment

I was able to run three suites successfully!

Most of my comments are about taking out redundant logger output, as r2d2 already prints that information. For me this is really close, with three things holding it back, other than ingesting more obs, of course:

  1. There is an addition to the taskBase for r2d2, but could that be put in utilities/r2d2.py and only employed during R2D2-relevant tasks?
  2. Model types should support more for marine (I briefly explained this in my comment)
  3. In terms of storage, what "lifetime" are we choosing? Are people storing their experiments temporarily? Is everyone storing their experiments in the general gmao database?

Also, can you bump up the SWELL version? I would like to make a release before we merge this.

I also noticed a slight increase in the time it takes to complete getObservations. Not a huge deal as we can trigger this task at a cycle before any other dependencies but just got me thinking.

@Dooruk
Collaborator

Dooruk commented Oct 28, 2025

I probably forgot to mention this one, but in cycling tasks there is a need to save backgrounds and/or restarts. saveObsDiags.py saves observation outputs, but we would also like to save model fields, which could be forecasts, analyses, or restarts. Currently saveRestart.py does this in a very manual way, and it would be nice for R2D2 v3 to handle this for us. This is a bit more involved with model runs, so I can look into that with the new approach, but I need to grasp a few more things in terms of "lifetime".

I am not super clear on where we will end up in terms of experiment_id (manual setup vs. randomly assigned) and how long the items are stored within R2D2, but I see there are lifetime name values (debug, science, publication, release). I would say only certain users would have access to permanent storage, correct?
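
As a small illustration of the four lifetime names mentioned above, the sketch below validates a user-supplied value before it would be passed on. The retention period behind each name and who may use it are not specified in this thread, so nothing here encodes that; `Lifetime` and `parse_lifetime` are hypothetical names, not r2d2 API.

```python
from enum import Enum


class Lifetime(str, Enum):
    """Lifetime names listed in the comment above."""
    DEBUG = "debug"
    SCIENCE = "science"
    PUBLICATION = "publication"
    RELEASE = "release"


def parse_lifetime(name: str) -> Lifetime:
    """Return the matching Lifetime, or raise ValueError listing the valid names."""
    try:
        return Lifetime(name.strip().lower())
    except ValueError:
        valid = ", ".join(item.value for item in Lifetime)
        raise ValueError(f"Unknown lifetime {name!r}; expected one of: {valid}") from None
```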

@mer-a-o do you store all hourly model state files for the backgrounds? Also do you store ensemble files? (you being in Skylab)

@Dooruk
Collaborator

Dooruk commented Nov 4, 2025

So this fails the save_restart.py task, which is basically just a copying operation meant to be used with R2D2 v1. However, that led me to think further about a potential issue, which I described here:

https://github.com/JCSDA-internal/r2d2/issues/839

For now I can deactivate the save_restart and move forward with this PR but I would like to ingest more obs before merging.

@mranst would you mind testing this on gmao_ci when you get a chance since this will break tests there unless the API key is setup?

@mranst
Collaborator

mranst commented Nov 4, 2025

@mranst would you mind testing this on gmao_ci when you get a chance since this will break tests there unless the API key is setup?

https://github.com/GEOS-ESM/swell/actions/runs/19082156562

I added the gmao_user credential file to gmao_ci. The r2d2 tasks are failing with this error; it looks like something is expecting the string "true" or "false" and getting a bool type at some point. It's happening for me locally as well.

image

@ftgoktas
Contributor Author

ftgoktas commented Nov 6, 2025

The issue was caused by a bug in the r2d2 codebase. The r2d2-client module on Discover has been updated with the fix, and tasks should now run without errors. @mranst could you please pull the latest changes and re-run the tests?

@mranst
Collaborator

mranst commented Nov 6, 2025

The issue was caused by a bug in the r2d2 codebase. The r2d2-client module on Discover has been updated with the fix, and tasks should now run without errors. @mranst could you please pull the latest changes and re-run the tests?

https://github.com/GEOS-ESM/swell/actions/runs/19150912180/job/54740655821

SaveRestart is failing for 3dvar_cycle and 3dfgat_cycle, but otherwise looks good

@ftgoktas
Contributor Author

ftgoktas commented Nov 6, 2025

This will be addressed in #652

@Dooruk
Collaborator

Dooruk commented Nov 7, 2025

Working towards merging this, some final things:

  • I need to turn off saveRestart for this PR. I will create a branch off of yours and make a PR for that.
  • storebackground is not used currently, but we need to revive it (and rename it to saveBackground) after this goes in. Restart and background are different things in DA and modeling contexts.

Q.s:

  • I see saveobsdiags are in feedback. Is feedback a specific item, like an IODA output? I couldn't find much information on this.
  • I made a tiny edit to your ingest script, and in testing it seems to be working. However, there seems to be no protection in terms of who can write into the nccs-gmao data_hub currently or who can remove files from there. Is that correct?

@ftgoktas
Contributor Author

ftgoktas commented Nov 7, 2025

Q.s:

  • I see saveobsdiags are in feedback. Is feedback a specific item, like an IODA output? I couldn't find much information on this.

Yes, feedback is for IODA observation diagnostic files produced by DA systems; these contain the original observations plus analysis results. I confirmed with the R2D2 team that JEDI 3dvar output files (observations + analysis feedback) should use item='feedback', which applies to SaveObsDiags since that task stores those files.

  • I made a tiny edit to your ingest script, and in testing it seems to be working. However, there seems to be no protection in terms of who can write into the nccs-gmao data_hub currently or who can remove files from there. Is that correct?

The ingest script doesn't enforce permissions; it relies on the R2D2 v3 API server to enforce authentication/authorization, so unauthorized users will get permission errors from the API if they don't have the API key set up in their environment.

@Dooruk
Collaborator

Dooruk commented Nov 7, 2025

Ok, this is ready to go in after #653 merges. There are still things to be addressed but no need to hold this one back.

I made a release right before this so we should be good.

Collaborator

@Dooruk Dooruk left a comment

Excellent job! Long time coming.

@Dooruk Dooruk requested a review from mranst November 7, 2025 20:06
@Dooruk Dooruk merged commit 36454d3 into develop Nov 7, 2025
2 checks passed
@Dooruk Dooruk deleted the feature/r2d2_v3 branch November 7, 2025 21:21
4 participants