10909 Support for OAI-PMH harvesting from DataCite #11011

Open
wants to merge 34 commits into
base: develop

landreev
Contributor

@landreev landreev commented Nov 8, 2024

What this PR does / why we need it:

The underlying goal was to be able to harvest metadata directly from DataCite, which makes it possible for a Dataverse instance to harvest datasets from institutions and schools that don't maintain their own OAI servers, as long as they register their DOIs with DataCite.

Two major features needed to be added to accommodate this, as described in the linked issue (although one has since been added as a standalone PR #11049 and is already in 6.5). On top of that, a few other fixes and improvements have been added. For example, it is now possible to schedule harvests via the API; this has been a GUI-only feature until now.

Note that everything in this PR is already in production use at IQSS via a deployment of a custom experimental patch of v6.5. This had to be done in the context of ongoing collaborations to accommodate the relevant deadlines. The two production collections involved are:

https://dataverse.harvard.edu/dataverse/bertarelli (note that in this instance the harvested content is included in their subcollections alongside "real", locally-deposited datasets)
https://dataverse.harvard.edu/dataverse/designsafe

Which issue(s) this PR closes:

Special notes for your reviewer:

I'm about to mark this PR "ready for review". This is true as far as the underlying Dataverse code is concerned; however, at the moment the branch is built with a local copy of the customized xoai jars. This is temporary, pending the needed changes being incorporated into a gdcc-released version, which needs to happen before this PR is merged.

Suggestions on how to test this:

See the release note and the API guide.

The following is an example harvesting client configuration that will harvest a set made from a single dataset, Gary King's doi:10.7910/DVN/9L6A8X:

{
    "useOaiIdentifiersAsPids": true,
    "useListRecords": true,
    "allowHarvestingMissingCVV": false,
    "set": "~ZG9pOjEwLjc5MTAvRFZOLzlMNkE4WAo=",
    "nickName": "harvest9L6A8X",
    "dataverseAlias": "INSERTYOURCOLLECTION",
    "type": "oai",
    "style": "default",
    "harvestUrl": "https://oai.datacite.org/oai",
    "archiveUrl": "https://oai.datacite.org",
    "archiveDescription": "The metadata for this Dataset was harvested from DataCite. Clicking the dataset link will take you directly to the original archival location, as registered with DataCite.",
    "metadataFormat": "oai_dc"
}
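
For reviewers who want to script the test, here is a minimal sketch of how the configuration above could be submitted. It only builds the HTTP request and does not send it. The endpoint path `/api/harvest/clients/{nickName}` and the `X-Dataverse-key` header are assumptions taken from my reading of the Dataverse API guide and should be checked against the guide for the version under test. The server URL and API token are placeholders.

```python
import json
import urllib.request


def build_create_client_request(server_url, api_token, config):
    """Build (but do not send) a POST request that registers a harvesting client.

    Assumption: endpoint path and auth header follow the Dataverse API guide.
    """
    url = f"{server_url.rstrip('/')}/api/harvest/clients/{config['nickName']}"
    return urllib.request.Request(
        url,
        data=json.dumps(config).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Dataverse-key": api_token,
        },
    )


# Configuration copied from the example above; dataverseAlias is still a placeholder.
client_config = {
    "useOaiIdentifiersAsPids": True,
    "useListRecords": True,
    "allowHarvestingMissingCVV": False,
    "set": "~ZG9pOjEwLjc5MTAvRFZOLzlMNkE4WAo=",
    "nickName": "harvest9L6A8X",
    "dataverseAlias": "INSERTYOURCOLLECTION",
    "type": "oai",
    "style": "default",
    "harvestUrl": "https://oai.datacite.org/oai",
    "archiveUrl": "https://oai.datacite.org",
    "archiveDescription": (
        "The metadata for this Dataset was harvested from DataCite. "
        "Clicking the dataset link will take you directly to the original "
        "archival location, as registered with DataCite."
    ),
    "metadataFormat": "oai_dc",
}

# Hypothetical server URL and token, for illustration only.
request = build_create_client_request(
    "http://localhost:8080", "REPLACE-WITH-API-TOKEN", client_config
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would actually send it.
```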

The magic behind the set name in the configuration above, which makes it possible to harvest just this specific dataset:
The native DataCite API query:
https://api.datacite.org/dois?query=doi:10.7910/DVN/9L6A8X
Encoding the query definition in base64 (the set name is the result, prefixed with "~"):

echo "doi:10.7910/DVN/9L6A8X" | base64
ZG9pOjEwLjc5MTAvRFZOLzlMNkE4WAo=
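
The same derivation can be sketched in Python, which may be handy when generating set names for many queries. Two things here are inferred rather than stated in this PR: the "~" prefix is read off the `set` value in the configuration above, and the helper name `datacite_set_name` is made up for illustration. Note that `echo` appends a newline before the text reaches `base64`, which is why the encoded value ends in `Ao=`; the `trailing_newline` flag reproduces that. Whether DataCite treats the two variants identically is not something this sketch verifies.

```python
import base64


def datacite_set_name(query, trailing_newline=True):
    """Encode a DataCite search query as a dynamic OAI set name.

    trailing_newline=True mimics the shell `echo`, which appends a newline
    to the query before it is base64-encoded.
    """
    raw = query + ("\n" if trailing_newline else "")
    encoded = base64.b64encode(raw.encode("utf-8")).decode("ascii")
    # Assumption: the "~" prefix marks a base64-encoded query, as in the config above.
    return "~" + encoded


print(datacite_set_name("doi:10.7910/DVN/9L6A8X"))
# ~ZG9pOjEwLjc5MTAvRFZOLzlMNkE4WAo=
```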

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@coveralls

coveralls commented Nov 8, 2024

Coverage Status

coverage: 22.712% (-0.02%) from 22.736%
when pulling f50378a on 10909-datacite-oai-harvesting
into 2210d16 on develop.

@landreev landreev modified the milestone: 6.5 Nov 21, 2024
Resolved conflicts:
	src/main/java/edu/harvard/iq/dataverse/api/imports/ImportGenericServiceBean.java
	src/main/java/edu/harvard/iq/dataverse/api/imports/ImportServiceBean.java
	src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvestingClient.java
	src/main/java/edu/harvard/iq/dataverse/util/json/JsonParser.java
	src/main/java/edu/harvard/iq/dataverse/util/json/JsonPrinter.java
	src/main/resources/db/migration/V6.4.0.3.sql

@landreev landreev self-assigned this Feb 12, 2025
@landreev landreev added this to the 6.6 milestone Feb 12, 2025

@landreev landreev marked this pull request as ready for review February 18, 2025 16:45

@landreev
Contributor Author

The last Jenkins failure was from my new test added as part of this PR. It was the result of a conflict with something recently merged into develop (the last Jenkins run was triggered by syncing the branch with develop). Fixing now.

Member

@qqmyers qqmyers left a comment

Looks OK to me. I ignored all of the local library changes though, and did not test. Should we add "Waiting" to this so it doesn't go through QA with the local stuff?

assertNotNull(clientStatus);

if ("inProgress".equals(clientStatus) || "IN PROGRESS".equals(responseJsonPath.getString("data.lastResult"))) {
// we'll sleep for another second
Member

2 seconds, as you corrected in the other test


DataCite maintains an OAI server (https://oai.datacite.org/oai) that serves records for every DOI they have registered. There's been a lot of interest in the community in being able to harvest from them. This way, it will be possible to harvest metadata from institution X even if institution X does not maintain an OAI server of its own, as long as it registers its DOIs with DataCite. One extra element of this harvesting model that makes it especially powerful and flexible is DataCite's concept of a "dynamic OAI set": a harvester is not limited to harvesting the pre-defined set of ALL the records registered by institution X, but can instead harvest virtually any arbitrary subset thereof; any query that the DataCite search API understands can be used as an OAI set (!).
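
To make the "dynamic OAI set" idea concrete, here is a rough sketch of the kind of OAI-PMH request a harvester would issue for such a set. `verb`, `metadataPrefix` and `set` are the standard OAI-PMH arguments for `ListRecords`. The set value is the base64-encoded single-dataset query from the PR description, and the helper name `list_records_url` is hypothetical. This is not a description of how Dataverse or xoai builds its requests internally.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def list_records_url(base_url, set_name, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL restricted to one (dynamic) set."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_name,
    }
    # urlencode percent-encodes characters such as "=" in the base64 padding.
    return base_url + "?" + urlencode(params)


url = list_records_url(
    "https://oai.datacite.org/oai",
    "~ZG9pOjEwLjc5MTAvRFZOLzlMNkE4WAo=",
)
print(url)

# Round-trip check: the server side would see the original set name again.
decoded = parse_qs(urlsplit(url).query)
print(decoded["set"][0])
```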

A few technical issues had to be resolved in the process of adding this functionality, so as of this release it is being offered as somewhat experimental. Its beta version is nevertheless already in use at IQSS with seemingly satisfactory results.
Member

Aside re: our "experimental" label. There are groups who avoid these features assuming they might go away/who are surprised to hear that Harvard is using them and is unlikely to drop them completely. We might want to think about removing the 'experimental' language after one version, or changing to 'cutting edge' or something.

Contributor Author

Agree. Also, the label may be unnecessary in this case. Not that I'm 100% positive it's not going to fail for anyone; but I don't think it can cause any real damage either.

Contributor Author

I got rid of the "experimental" part and otherwise slightly refined the docs.

@landreev
Contributor Author

Looks OK to me. I ignored all of the local library changes though, and did not test. Should we add "Waiting" to this so it doesn't go through QA with the local stuff?

Good question. The xoai-side changes are simple, even if Oliver makes me re-implement them from scratch, as I expect... OK, let me think about it. I'll probably add a "waiting" label, and then let Omer know that it could make sense to get a head start on testing, if he has spare cycles.

@landreev landreev added the Status: Waiting for Related Issues/PRs This issue depends upon the completion of one or more issues/PRs label Feb 20, 2025

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10909-datacite-oai-harvesting
ghcr.io/gdcc/configbaker:10909-datacite-oai-harvesting

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@cmbz cmbz added FY25 Sprint 17 FY25 Sprint 17 (2025-02-12 - 2025-02-26) FY25 Sprint 18 FY25 Sprint 18 (2025-02-26 - 2025-03-12) labels Feb 26, 2025
@ofahimIQSS ofahimIQSS self-assigned this Feb 27, 2025
Labels
FY25 Sprint 17 FY25 Sprint 17 (2025-02-12 - 2025-02-26) FY25 Sprint 18 FY25 Sprint 18 (2025-02-26 - 2025-03-12) Status: Waiting for Related Issues/PRs This issue depends upon the completion of one or more issues/PRs
Projects
Status: QA ✅
Development

Successfully merging this pull request may close these issues.

Add support for OAI-harvesting from DataCite
5 participants