Create dataone input type #147
base: main
Conversation
```python
# For now, use the first data object
# TODO: Allow user to specify which object or handle multiple
```
not sure if there is a better way to handle this...
So right now things are set up at the dataset level, right? My thinking is that the user would specify both a dataset identifier and a file/object name. We would filter for the desired name and fetch that.
In other words, for seal tags it might look like:
```yaml
input:
  params:
    - type: "dataone"
      dataset_id: "resource_map_urn:uuid:cfe3fbb2-0585-40b5-8243-8fa47fcfeb9b"
      filename: "ct71_ODV.csv"
```
And another recipe with multiple inputs might look like:
```yaml
input:
  params:
    - type: "dataone"
      dataset_id: "foo"
      filename: "foo-file1"
    - type: "dataone"
      dataset_id: "foo"
      filename: "foo-file2"
    - type: "dataone"
      dataset_id: "bar"
      filename: "bar-file1"
```
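The filtering step this implies could be sketched as a small helper over the object list the resolver returns. This is a hypothetical sketch, not code from the PR: the `select_object` name and the assumption that each resolved object carries a `fileName` key (as Solr results typically do) are mine.

```python
from typing import Any


def select_object(objects: list[dict[str, Any]], filename: str) -> dict[str, Any]:
    """Pick the single data object whose fileName matches (hypothetical helper)."""
    matches = [obj for obj in objects if obj.get("fileName") == filename]
    if not matches:
        raise ValueError(f"No object named {filename!r} in dataset")
    if len(matches) > 1:
        # Duplicate names (e.g. tiles in subdirectories) need extra handling
        raise ValueError(f"Multiple objects named {filename!r}; ambiguous")
    return matches[0]
```

Raising on duplicates keeps the behavior predictable until a subdirectory-aware scheme is decided on.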
We should also think about situations where a dataset has duplicate filenames, but in different subdirectories. E.g.,
```
subdir_a/0.tif
subdir_b/0.tif
subdir_c/0.tif
```
In this case, we might want to get all of the 0.tif and maintain the directory structure.
The duplicate filenames but in different subdirectories happens a lot, especially with standard tile directory layouts like WMTS. Here's a real example from PDG data:
Thanks! Opened #149 to capture this case so we don't forget about it.
Pull request overview
This PR adds support for DataONE dataset identifiers as a new input type for OGDC recipes. The implementation allows users to specify DataONE package identifiers which are automatically resolved to downloadable data object URLs during recipe configuration validation.
Changes:
- Added new "dataone" input type with member node configuration and resolution logic
- Created DataONEResolver class to query DataONE Solr API and resolve package identifiers to data objects
- Integrated DataONE input handling into the existing fetch workflow
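The Solr lookup at the heart of the resolver might be sketched like this. Everything here is an assumption for illustration: the member-node base URL, the `/v2/query/solr/` path, and the field list are not taken from the PR's code, which isn't shown in this thread.

```python
from urllib.parse import urlencode

# Assumed member node; the PR configures this via DATAONE_NODE_URL
DATAONE_NODE_URL = "https://arcticdata.io/metacat/d1/mn"


def build_solr_query_url(dataset_identifier: str) -> str:
    """Build a Solr query URL listing the data objects in a resource map."""
    params = urlencode(
        {
            "q": f'resourceMap:"{dataset_identifier}"',
            "fl": "identifier,fileName,formatType,size",
            "wt": "json",
        }
    )
    return f"{DATAONE_NODE_URL}/v2/query/solr/?{params}"
```

The resolver would issue a GET against this URL and turn each returned document into a downloadable object URL.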
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ogdc_runner/dataone/resolver.py | New module implementing DataONEResolver class to query Solr API and resolve dataset identifiers to downloadable URLs |
| src/ogdc_runner/dataone/__init__.py | Package initialization exposing resolve_dataone_input function |
| src/ogdc_runner/models/recipe_config.py | Extended InputParam model with dataone type and optional fields; added validator to resolve DataONE inputs during config initialization |
| src/ogdc_runner/inputs.py | Added dataone case to fetch input logic using resolved URLs |
| pyproject.toml | Added dataone.libclient dependency; updated mypy configuration for d1_client modules; added ruff exception for recipe_config.py |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
trey-stafford
left a comment
Great work! Very pleased to see us beginning to interact directly with the DataONE API/Solr interface.
I have some feedback that we should consider addressing before merging. In particular, we should think about the input params a recipe provides. I think it would be nice to have each input param specify both the dataset identifier and the desired data object/filename.
Also agreed with some of the Copilot feedback: some tests would be nice! I would probably start with something that asserts a URL can be resolved given a dataset identifier and filename.
Maybe adding something to the docs (here: https://ogdc-runner.readthedocs.io/en/latest/recipes.html#input) would be valuable.
And, make sure to update the CHANGELOG!
As soon as I submitted this review, it occurred to me that we might actually have cases where we want to fetch everything a dataset contains (e.g., there might be a dataset with 100 geotiffs that the user wants to mosaic into a single image). In this case, we wouldn't necessarily want to force the user to specify all of the files in the dataset. The default could be "get all the files", with the filename filter as an opt-in. In another comment above I suggest a format for dataone input params; under this new scenario the user would simply leave `filename` unset. Since this is close to what you've already implemented, my idea for allowing filtering to the filename level could be a follow-up feature that we can capture in a new issue.
Note: running in the API did not work when ogdc-helm values had the test value (QGreenland-Net/ogdc-helm@1436229), but did work when it had this value (QGreenland-Net/ogdc-helm@ffb22e8).
```python
        self.member_node = DATAONE_NODE_URL
        self.client = MemberNodeClient_2_0(base_url=DATAONE_NODE_URL)

    def resolve_dataset(self, dataset_identifier: str) -> list[dict[str, Any]]:
```
This looks good for the package-level fetch, but maybe we can also add support for a data-object-level fetch?
I.e., going by the seal_tags example: say I already have an id for the data object that I'm interested in (ct71_ODV.csv); we could directly use it to fetch the object via its identifier urn:uuid:31162eb9-7e3b-4b88-948f-f4c99f13a83f (cn_resolve_ct71_ODV)
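Object-level fetch would bypass Solr entirely and hit the coordinating node's resolve endpoint for the PID. A sketch of building that URL; the CN base URL and endpoint path are assumptions for illustration, not taken from the PR:

```python
from urllib.parse import quote

# Assumed DataONE coordinating node base URL
CN_BASE_URL = "https://cn.dataone.org/cn"


def object_resolve_url(pid: str) -> str:
    """URL of the CN resolve endpoint for a single data object PID."""
    # PIDs like urn:uuid:... contain colons, so percent-encode everything
    return f"{CN_BASE_URL}/v2/resolve/{quote(pid, safe='')}"
```

Given a known PID, the fetch step could GET this URL directly instead of resolving the whole package.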
Thank you @rmarow, fantastic progress on adding this support. And I appreciate the thorough review and comments from @trey-stafford above. I've added some comments; please let me know if you have questions. Thanks again!
Updates from Rushiraj's comments
Related PRs: ogdc-helm and ogdc-recipe