10909 Support for OAI-PMH harvesting from DataCite #11011
base: develop
…arvested datasets. #10909. (that whole block of extra checks on the harvest "style" may be redundant by now - I'll think about it)
Resolved conflicts: src/main/resources/db/migration/V6.4.0.1.sql
…9-datacite-oai-harvesting
…p, since it's already got a script with .2 in the name. #10909
Resolved conflicts: src/main/java/edu/harvard/iq/dataverse/api/imports/ImportGenericServiceBean.java src/main/java/edu/harvard/iq/dataverse/api/imports/ImportServiceBean.java src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvestingClient.java src/main/java/edu/harvard/iq/dataverse/util/json/JsonParser.java src/main/java/edu/harvard/iq/dataverse/util/json/JsonPrinter.java src/main/resources/db/migration/V6.4.0.3.sql
… to build with a custom version of xoai. this pr will not be merged until the extra features are added to a gdcc-supplied version of the library (snapshot or otherwise), and these local jars will be removed. #10909
The last Jenkins failure was from my new test added as part of this PR. It was the result of a conflict with something recently merged into develop (the last Jenkins run was triggered by syncing the branch with develop). Fixing now.
Looks OK to me. I ignored all of the local library changes though, and did not test. Should we add "Waiting" to this so it doesn't go through QA with the local stuff?
assertNotNull(clientStatus);

if ("inProgress".equals(clientStatus) || "IN PROGRESS".equals(responseJsonPath.getString("data.lastResult"))) {
    // we'll sleep for another second
2 seconds, as you corrected in the other test
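For context, the wait-and-recheck pattern under review can be sketched as a small helper. This is only an illustration, not the test code from this PR: the class name, the method name, and the idea of passing the status in through a `Supplier` are all hypothetical, and the two status strings from the snippet above are collapsed into a single value for brevity.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the Dataverse test suite.
final class HarvestPollingSketch {

    /**
     * Polls statusSupplier until it reports something other than an
     * "in progress" value, or until maxAttempts polls have been made.
     * Returns the last status seen.
     */
    static String waitForHarvest(Supplier<String> statusSupplier, int maxAttempts, long sleepMillis) {
        String status = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            status = statusSupplier.get();
            // Both spellings appear in the snippet above; either one means "keep waiting".
            boolean running = "inProgress".equals(status) || "IN PROGRESS".equals(status);
            if (!running) {
                return status;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for the harvest", e);
            }
        }
        return status;
    }
}
```

With a 2-second interval, as suggested above, a call would look like `waitForHarvest(supplier, 10, 2000L)`, where `supplier` stands in for whatever reads the client status from the API response.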
DataCite maintains an OAI server (https://oai.datacite.org/oai) that serves records for every DOI it has registered. There has been a lot of interest in the community in being able to harvest from it. This makes it possible to harvest metadata from institution X even if that institution does not maintain an OAI server of its own, as long as it registers its DOIs with DataCite. One extra element of this harvesting model that makes it especially powerful and flexible is DataCite's concept of a "dynamic OAI set": a harvester is not limited to the pre-defined set of all the records registered by institution X, but can instead harvest virtually any arbitrary subset of them; any query that the DataCite search API understands can be used as an OAI set.

A few technical issues had to be resolved in the process of adding this functionality, so as of this release it is being offered as somewhat experimental. Its beta version is nevertheless already in use at IQSS, with seemingly satisfactory results.
Aside re: our "experimental" label. There are groups who avoid these features assuming they might go away/who are surprised to hear that Harvard is using them and is unlikely to drop them completely. We might want to think about removing the 'experimental' language after one version, or changing to 'cutting edge' or something.
Agree. Also, the label may be unnecessary in this case. Not that I'm 100% positive it's not going to fail for anyone; but I don't think it can cause any real damage either.
I got rid of the "experimental" part and otherwise slightly refined the docs.
Good question. The xoai-side changes are simple; even if Oliver makes me re-implement them from scratch, as I expect... Ok, let me think about it, but I'll probably add a "waiting" label, but then communicate to Omer that it could make sense to get a head start on testing, if he has spare cycles.
What this PR does / why we need it:
The underlying goal was to be able to harvest metadata directly from DataCite. This makes it possible for a Dataverse instance to harvest datasets from institutions and schools that don't maintain their own OAI servers, as long as they register their DOIs with DataCite.
Two major features needed to be added to accommodate this, as described in the linked issue (although one has since been added as a standalone PR #11049 and is already in 6.5). On top of that, a few other fixes and improvements have been added; for example, it is now possible to schedule harvests via the API, which has been a GUI-only feature until now.
Note that everything in this PR is already in production use at IQSS via a deployment of a custom experimental patch of v6.5. This had to be done in the context of ongoing collaborations, to accommodate the relevant deadlines. The two production collections involved are:
https://dataverse.harvard.edu/dataverse/bertarelli (note that in this instance the harvested content is included in their subcollections alongside "real", locally-deposited datasets)
https://dataverse.harvard.edu/dataverse/designsafe
Which issue(s) this PR closes:
Special notes for your reviewer:
I'm about to mark this PR "ready for review". This is true as far as the underlying Dataverse code is concerned; however, at the moment the branch is built with a local copy of the customized xoai jars. This is temporary, pending the needed changes being incorporated into a gdcc-released version, which needs to happen before this PR is merged.
Suggestions on how to test this:
See the release note and the API guide.
The following is an example harvesting client configuration that will harvest a set made from a single dataset, Gary King's doi:10.7910/DVN/9L6A8X:
The magic behind the set name in the configuration above, which makes it possible to harvest just this specific dataset:
The native DataCite API query:
https://api.datacite.org/dois?query=doi:10.7910/DVN/9L6A8X
Encoding the query definition in base64:
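A minimal Java sketch of the encoding step, for anyone who wants to reproduce it while testing. Several things here are assumptions rather than facts taken from this PR: that the string being encoded is the bare query `doi:10.7910/DVN/9L6A8X`, that the standard (not URL-safe) base64 alphabet is used, and the class and method names, which are hypothetical. How the encoded value is embedded in the final set name is not covered, so the URL helper simply takes whatever set name the caller supplies and builds a plain OAI-PMH `ListRecords` request against the DataCite endpoint.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class, for illustration only.
final class DynamicSetSketch {

    // OAI endpoint quoted in the release note.
    static final String DATACITE_OAI = "https://oai.datacite.org/oai";

    /** Base64-encodes a query definition (UTF-8 bytes, standard alphabet: an assumption). */
    static String encodeQuery(String query) {
        return Base64.getEncoder().encodeToString(query.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encodeQuery; handy for checking what an existing encoded value contains. */
    static String decodeQuery(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    /** Builds a standard OAI-PMH ListRecords request for a caller-supplied set name. */
    static String listRecordsUrl(String setName) {
        return DATACITE_OAI + "?verb=ListRecords&metadataPrefix=oai_dc&set="
                + URLEncoder.encode(setName, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: the bare query string is what gets encoded.
        String query = "doi:10.7910/DVN/9L6A8X";
        String encoded = encodeQuery(query);
        System.out.println(encoded);
        System.out.println(decodeQuery(encoded));
        // Placeholder set name; the real one comes from the client configuration.
        System.out.println(listRecordsUrl("HYPOTHETICAL_SET"));
    }
}
```

Decoding the set name used in a harvesting client configuration with `decodeQuery` is a quick way to confirm which DataCite query it actually represents.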
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: