Create dataone input type #147
base: main
Conversation
```python
# For now, use the first data object
# TODO: Allow user to specify which object or handle multiple
```
not sure if there is a better way to handle this...
So right now things are set up at the dataset level, right? My thinking is that the user would specify both a dataset identifier and a file/object name. We would filter for the desired name and fetch that.
In other words, for seal tags it might look like:
```yaml
input:
  params:
    - type: "dataone"
      dataset_id: "resource_map_urn:uuid:cfe3fbb2-0585-40b5-8243-8fa47fcfeb9b"
      filename: "ct71_ODV.csv"
```
And another recipe with multiple inputs might look like:
```yaml
input:
  params:
    - type: "dataone"
      dataset_id: "foo"
      filename: "foo-file1"
    - type: "dataone"
      dataset_id: "foo"
      filename: "foo-file2"
    - type: "dataone"
      dataset_id: "bar"
      filename: "bar-file1"
```
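The filtering step this implies could be sketched as a small helper over the object list the resolver returns. This is a hypothetical sketch, not code from the PR: the `select_object` name and the assumption that each resolved object carries a `fileName` key (as Solr results typically do) are mine.

```python
from typing import Any


def select_object(objects: list[dict[str, Any]], filename: str) -> dict[str, Any]:
    """Pick the single data object whose fileName matches (hypothetical helper)."""
    matches = [obj for obj in objects if obj.get("fileName") == filename]
    if not matches:
        raise ValueError(f"No object named {filename!r} in dataset")
    if len(matches) > 1:
        # Duplicate names (e.g. tiles in subdirectories) need extra handling
        raise ValueError(f"Multiple objects named {filename!r}; ambiguous")
    return matches[0]
```

Raising on duplicates keeps the behavior predictable until a subdirectory-aware scheme is decided on.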
We should also think about situations where a dataset has duplicate filenames, but in different subdirectories. E.g.,
```
subdir_a/0.tif
subdir_b/0.tif
subdir_c/0.tif
```
In this case, we might want to get all of the 0.tif and maintain the directory structure.
The duplicate filenames but in different subdirectories happens a lot, especially with standard tile directory layouts like WMTS. Here's a real example from PDG data:
Thanks! Opened #149 to capture this case so we don't forget about it.
Pull request overview
This PR adds support for DataONE dataset identifiers as a new input type for OGDC recipes. The implementation allows users to specify DataONE package identifiers which are automatically resolved to downloadable data object URLs during recipe configuration validation.
Changes:
- Added new "dataone" input type with member node configuration and resolution logic
- Created DataONEResolver class to query DataONE Solr API and resolve package identifiers to data objects
- Integrated DataONE input handling into the existing fetch workflow
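The Solr lookup at the heart of the resolver might be sketched like this. Everything here is an assumption for illustration: the member-node base URL, the `/v2/query/solr/` path, and the field list are not taken from the PR's code, which isn't shown in this thread.

```python
from urllib.parse import urlencode

# Assumed member node; the PR configures this via DATAONE_NODE_URL
DATAONE_NODE_URL = "https://arcticdata.io/metacat/d1/mn"


def build_solr_query_url(dataset_identifier: str) -> str:
    """Build a Solr query URL listing the data objects in a resource map."""
    params = urlencode(
        {
            "q": f'resourceMap:"{dataset_identifier}"',
            "fl": "identifier,fileName,formatType,size",
            "wt": "json",
        }
    )
    return f"{DATAONE_NODE_URL}/v2/query/solr/?{params}"
```

The resolver would issue a GET against this URL and turn each returned document into a downloadable object URL.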
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ogdc_runner/dataone/resolver.py | New module implementing DataONEResolver class to query Solr API and resolve dataset identifiers to downloadable URLs |
| src/ogdc_runner/dataone/__init__.py | Package initialization exposing resolve_dataone_input function |
| src/ogdc_runner/models/recipe_config.py | Extended InputParam model with dataone type and optional fields; added validator to resolve DataONE inputs during config initialization |
| src/ogdc_runner/inputs.py | Added dataone case to fetch input logic using resolved URLs |
| pyproject.toml | Added dataone.libclient dependency; updated mypy configuration for d1_client modules; added ruff exception for recipe_config.py |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
trey-stafford
left a comment
Great work! Very pleased to see us beginning to interact directly with the DataONE API/Solr interface.
I have some feedback that we should consider addressing before merging. In particular, we should think about the input params a recipe provides. I think it would be nice to have each input param specify both the dataset identifier and the desired data object/filename.
Also agreed with some of the Copilot feedback: some tests would be nice! I would probably start with something that asserts a URL can be resolved given a dataset identifier and filename.
Maybe adding something to the docs (here: https://ogdc-runner.readthedocs.io/en/latest/recipes.html#input) would be valuable.
And, make sure to update the CHANGELOG!
As soon as I submitted this review, it occurred to me that we might actually have cases where we want to fetch everything a dataset contains (e.g., there might be a dataset with 100 geotiffs that the user wants to mosaic into a single image). In this case, we wouldn't necessarily want to force the user to specify all of the files in the dataset. The default could be "get all the files", with the filename filter as an opt-in. In another comment above I suggest a format for dataone input params; under this new scenario the user would simply leave `filename` unset. Since this is close to what you've already implemented, my idea for allowing filtering to the filename level could be a follow-up feature that we can capture in a new issue.
Note: running in the API did not work when ogdc-helm values had the test value (QGreenland-Net/ogdc-helm@1436229), but did work when it had this value (QGreenland-Net/ogdc-helm@ffb22e8).
```python
        self.member_node = DATAONE_NODE_URL
        self.client = MemberNodeClient_2_0(base_url=DATAONE_NODE_URL)

    def resolve_dataset(self, dataset_identifier: str) -> list[dict[str, Any]]:
```
This looks good for the package-level fetch, but maybe we can also add support for a data-object-level fetch?
I.e., going by the seal_tags example: say I already have an id for the data object that I'm interested in (ct71_ODV.csv); we could directly use it to fetch the object via its identifier urn:uuid:31162eb9-7e3b-4b88-948f-f4c99f13a83f (cn_resolve_ct71_ODV)
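Object-level fetch would bypass Solr entirely and hit the coordinating node's resolve endpoint for the PID. A sketch of building that URL; the CN base URL and endpoint path are assumptions for illustration, not taken from the PR:

```python
from urllib.parse import quote

# Assumed DataONE coordinating node base URL
CN_BASE_URL = "https://cn.dataone.org/cn"


def object_resolve_url(pid: str) -> str:
    """URL of the CN resolve endpoint for a single data object PID."""
    # PIDs like urn:uuid:... contain colons, so percent-encode everything
    return f"{CN_BASE_URL}/v2/resolve/{quote(pid, safe='')}"
```

Given a known PID, the fetch step could GET this URL directly instead of resolving the whole package.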
Thank you @rmarow, fantastic progress on adding this support. And I appreciate the thorough review and comments from @trey-stafford above. I've added some comments; please let me know if you have questions. Thanks again!
Updates from Rushiraj's comments
Related PRs: ogdc-helm and ogdc-recipe