Feature/dspace harvest #63

Open: ivanmrsulja wants to merge 21 commits into main

@ivanmrsulja ivanmrsulja commented Dec 12, 2024

Added an example of fetching publication metadata from DSpace, based on oaifetch.

Steps to run:

  • Open the run-dspace-oaifetch.sh (or .bat) script and set the location of the VIVO Harvester installation directory.
  • Optionally, uncomment the score and match functions to enable deduplication (it is turned off by default).
  • Modify the vivo.model.xml file to provide the parameters for accessing your VIVO web application.
  • Modify dspace-oaifetch.conf.xml to point to the endpoint of the instance you want to harvest, and set the other OAI properties (only the Dublin Core metadata format is currently supported).
  • Shut down your VIVO instance to release the TDB lock.
  • Run run-dspace-oaifetch.sh (or .bat on Windows).
  • Restart your VIVO instance and rebuild the search index.
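
Before running the harvest, it can help to confirm that the configured endpoint answers a Dublin Core request. The sketch below only builds the request URL; it is not part of this PR. The `verb=ListRecords` and `metadataPrefix=oai_dc` parameters come from the OAI-PMH protocol, while the class name, the method name, and the `https://dspace.example.org/oai/request` base URL are assumptions made for illustration, so substitute the endpoint of your own instance.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of VIVO Harvester.
public final class OaiRequestSketch {
    private OaiRequestSketch() {}

    /**
     * Builds an OAI-PMH ListRecords request for Dublin Core (oai_dc) records.
     * setSpec is optional; pass null to request records from all sets.
     */
    public static URI listRecordsUri(String baseUrl, String setSpec) {
        StringBuilder query = new StringBuilder("verb=ListRecords&metadataPrefix=oai_dc");
        if (setSpec != null && !setSpec.isEmpty()) {
            query.append("&set=").append(URLEncoder.encode(setSpec, StandardCharsets.UTF_8));
        }
        return URI.create(baseUrl + "?" + query);
    }

    public static void main(String[] args) {
        // The base URL is an assumption; replace it with your instance's OAI endpoint.
        System.out.println(listRecordsUri("https://dspace.example.org/oai/request", null));
    }
}
```

Opening the printed URL in a browser should return an XML response containing `oai_dc` records if the endpoint is reachable and exposes that format.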

Closes #4021

@ivanmrsulja ivanmrsulja requested a review from wwtamu December 12, 2024 15:50

@chenejac chenejac left a comment

@ivanmrsulja thanks for this nice contribution. Please check my two comments. Also, in the PR description, please correct the name of the sh/bat file to be run (run-dspace-oaifetch.bat).

@ivanmrsulja ivanmrsulja requested a review from chenejac December 20, 2024 08:12
@chenejac chenejac requested a review from bkampe December 23, 2024 09:54

@chenejac chenejac left a comment

@ivanmrsulja thanks for this contribution.

I have tested the PR on Windows 10. Initially, there was an encoding issue, which Ivan fixed. It now works very well with both ingestion approaches: TDB-based and SPARQL API-based.

…ship and type bugs when performing TDB import. Fixed SPARQL update encoding issue.

@chenejac chenejac left a comment

@ivanmrsulja well done

@@ -193,7 +193,9 @@ public void execute() throws IOException {

    if (! StringUtils.equalsIgnoreCase(strArray[1], "deleted")) {
        log.trace("Adding record: " + strArray[0]);
        this.rhOutput.addRecord(strArray[0], strArray[1], this.getClass());
        String charReferenceRegex = "(?<=^|[^&])(&#(?:[0-9]+|x[0-9a-fA-F]+);)";
        String fullyEscapedData = strArray[1].replaceAll(charReferenceRegex, "&amp;$1").replace("&amp;&#", "&amp;#");

@wwtamu wwtamu Jan 9, 2025

Not sure what is going on with the formatting here, but I think the regex should be a constant Pattern, to avoid recompiling it on every iteration.

@ivanmrsulja (Author)

You are right, I will fix this ASAP! The regex matches every numeric character reference (for example &#233; or &#xE9;) that is not already preceded by an ampersand, so that I can escape it properly and avoid encoding issues in later ETL stages.
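
A minimal sketch of the suggested change, assuming the escaping is moved into a small helper: the regex from the diff is compiled once into a `static final` `Pattern` and reused for every record. The class name `CharRefEscaper` and the method name `escape` are hypothetical and are not taken from the PR; only the regex and the two replacement steps come from the diff above.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the actual fix in the PR may be structured differently.
public final class CharRefEscaper {
    // Compiled once when the class is loaded, instead of once per record
    // as happens with String.replaceAll(String, String).
    private static final Pattern CHAR_REFERENCE =
        Pattern.compile("(?<=^|[^&])(&#(?:[0-9]+|x[0-9a-fA-F]+);)");

    private CharRefEscaper() {}

    /** Escapes the ampersand of each numeric character reference, e.g. &#233; becomes &amp;#233;. */
    public static String escape(String data) {
        return CHAR_REFERENCE.matcher(data)
            .replaceAll("&amp;$1")          // prefix each match with an escaped ampersand
            .replace("&amp;&#", "&amp;#");  // drop the now-redundant raw ampersand
    }

    public static void main(String[] args) {
        System.out.println(escape("Caf&#233; and &#xE9;t&#233;"));
    }
}
```

Inside the loop, the call would then read something like `String fullyEscapedData = CharRefEscaper.escape(strArray[1]);`. Input that is already escaped (for example `&amp;#233;`) passes through unchanged, because it contains no `&#` sequence for the pattern to match.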

Development

Successfully merging this pull request may close these issues.

Add Harvesting metadata from DSpace
3 participants