Skip to content

Conversation

@rmarow
Copy link
Contributor

@rmarow rmarow commented Jan 15, 2026

Related PR's ogdc-helm and ogdc-recipe

)

# For now, use the first data object
# TODO: Allow user to specify which object or handle multiple
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if there is a better way to handle this...

Copy link
Member

@trey-stafford trey-stafford Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So right now things are setup at the dataset level, right? My thinking is that the user would specify both a dataset identifier and a file/object name. We would filter for the desired name and fetch that.

In other words, for seal tags it might look like:

input:
  params:
    - type: "dataone"
      dataset_id:  "resource_map_urn:uuid:cfe3fbb2-0585-40b5-8243-8fa47fcfeb9b"
      filename: "ct71_ODV.csv "

And another recipe with multiple inputs might look like:

input:
  params:
    - type: "dataone"
      dataset_id:  "foo"
      filename: "foo-file1"
    - type: "dataone"
      dataset_id:  "foo"
      filename: "foo-file2"
    - type: "dataone"
      dataset_id:  "bar"
      filename: "bar-file1"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also think about situations where a dataset has duplicate filenames, but in different subdirectories. E.g.,

subdir_a/0.tif
subdir_b/0.tif
subdir_c/0.tif

In this case, we might want to get all of the 0.tif and maintain the directory structure.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The duplicate filenames but in different subdirectories happens a lot, especially with standard tile directory layouts like WMTS. Here's a real example from PDG data:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks ! opened #149 to capture this case so we do not forget about it.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for DataONE dataset identifiers as a new input type for OGDC recipes. The implementation allows users to specify DataONE package identifiers which are automatically resolved to downloadable data object URLs during recipe configuration validation.

Changes:

  • Added new "dataone" input type with member node configuration and resolution logic
  • Created DataONEResolver class to query DataONE Solr API and resolve package identifiers to data objects
  • Integrated DataONE input handling into the existing fetch workflow

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
src/ogdc_runner/dataone/resolver.py New module implementing DataONEResolver class to query Solr API and resolve dataset identifiers to downloadable URLs
src/ogdc_runner/dataone/init.py Package initialization exposing resolve_dataone_input function
src/ogdc_runner/models/recipe_config.py Extended InputParam model with dataone type and optional fields; added validator to resolve DataONE inputs during config initialization
src/ogdc_runner/inputs.py Added dataone case to fetch input logic using resolved URLs
pyproject.toml Added dataone.libclient dependency; updated mypy configuration for d1_client modules; added ruff exception for recipe_config.py

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

rmarow and others added 4 commits January 15, 2026 13:21
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@rmarow rmarow marked this pull request as ready for review January 15, 2026 21:18
Copy link
Member

@trey-stafford trey-stafford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Very pleased to see us beginning to interact directly with the dataone API/Solar interface.

I have some feedback that we should consider addressing before merging. In particular, we should think about the input params a recipe provides. I think it would be nice to have each input param specify both the dataset identifier and the desired data object/filename.

Also agreed with some of the copilot feedback - some tests would be nice! I would probably start with something that asserts a URL can be resolved given a dataset identifier and filename.

Maybe adding something to the docs (here: https://ogdc-runner.readthedocs.io/en/latest/recipes.html#input) would be valuable.

And, make sure to update the CHANGELOG!

)

# For now, use the first data object
# TODO: Allow user to specify which object or handle multiple
Copy link
Member

@trey-stafford trey-stafford Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So right now things are setup at the dataset level, right? My thinking is that the user would specify both a dataset identifier and a file/object name. We would filter for the desired name and fetch that.

In other words, for seal tags it might look like:

input:
  params:
    - type: "dataone"
      dataset_id:  "resource_map_urn:uuid:cfe3fbb2-0585-40b5-8243-8fa47fcfeb9b"
      filename: "ct71_ODV.csv "

And another recipe with multiple inputs might look like:

input:
  params:
    - type: "dataone"
      dataset_id:  "foo"
      filename: "foo-file1"
    - type: "dataone"
      dataset_id:  "foo"
      filename: "foo-file2"
    - type: "dataone"
      dataset_id:  "bar"
      filename: "bar-file1"

@trey-stafford
Copy link
Member

In particular, we should think about the input params a recipe provides. I think it would be nice to have each input param specify both the dataset identifier and the desired data object/filename.

As soon as I submitted this review, it occurred to me that we might actually have cases where we want to fetch everything a dataset contains (e.g., there might be a dataset with 100 geotiffs that the user wants to mosaic into a single image). In this case, we wouldn't necessarily want to force the user to specify all of the files in the dataset. The default could be "get all the files", but if a filename or (object_id if that makes more sense?), then we filter just for that.

In another comment above I suggest this format for dataone input params:

input:
  params:
    - type: "dataone"
      dataset_id:  "resource_map_urn:uuid:cfe3fbb2-0585-40b5-8243-8fa47fcfeb9b"
      filename: "ct71_ODV.csv "

Under this new scenario the user would leave the "filename" unset (behind the scenes filename could be None, which would mean "don't filter - give me everything!").

Since this is close to what you've already implemented, my idea for allowing filtering to the filename level could be a follow-up feature that we can capture in a new issue.

@rmarow
Copy link
Contributor Author

rmarow commented Jan 16, 2026

Note: running in the api did not work when ogdc-helm values had the test value (QGreenland-Net/ogdc-helm@1436229) but did when it had this value (QGreenland-Net/ogdc-helm@ffb22e8)

self.member_node = DATAONE_NODE_URL
self.client = MemberNodeClient_2_0(base_url=DATAONE_NODE_URL)

def resolve_dataset(self, dataset_identifier: str) -> list[dict[str, Any]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good for the package level fetch, but may be we can also add support data object level fetch?
i.e. going by the seal_tags example, say if I already have an id for the data object that I'm interested in (ct71_ODV.csv), we could directly use it to fetch the object using its identifier urn:uuid:31162eb9-7e3b-4b88-948f-f4c99f13a83f (cn_resolve_ct71_ODV)

@rushirajnenuji
Copy link
Member

Thank you @rmarow, fantastic progress on adding this support. And, appreciate thorough review and comments from @trey-stafford above. I've added some comments, please let me know if you have questions. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants