

Processing pipeline - current and future #16

Closed
bootsa opened this issue May 24, 2022 · 0 comments
bootsa (Collaborator) commented May 24, 2022

The current processing pipeline:

  • A SPARQL query, fetched through an HTTP request, returns a JSON listing of all Wikidata entities that are within the scope of the WikiProject Invasion Biology and have been tagged with an open license. The result is saved to file (see the fetch sketch after this list).
  • Post-processing script (Deno), roughly sketched after this list:
    • reads the JSON file of entries
    • loops through each entry
      • pulls the Wikidata entity through CitationJS
      • checks whether the entry has a DOI; if yes:
        • retrieves the Crossref item
      • processes the Wikidata (and, if present, Crossref) entity into XML and writes it to the file system
    • a webhook on the Toolforge server (which hosts the OAI-PMH endpoint) is called, triggering a git pull of the updates onto the server
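
For illustration, a minimal Deno sketch of the first step - fetching the SPARQL results over HTTP and saving them to file. The query and output filename here are placeholders, not the project's actual values:

```ts
// fetch-entities.ts - minimal sketch; QUERY and the output path are placeholders.
const QUERY = `SELECT ?item WHERE { ?item wdt:P31 wd:Q13442814 } LIMIT 10`;

const url = "https://query.wikidata.org/sparql?format=json&query=" +
  encodeURIComponent(QUERY);

// The Wikidata Query Service returns SPARQL JSON results for this Accept header.
const res = await fetch(url, {
  headers: { Accept: "application/sparql-results+json" },
});
const results = await res.json();

// Persist the listing so the post-processing script can work from a file.
await Deno.writeTextFile("entities.json", JSON.stringify(results, null, 2));
```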
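And a hedged sketch of the post-processing loop. The npm specifiers, file names, and the toRecordXml serializer are assumptions for illustration; the real script writes fuller OAI-PMH-ready records:

```ts
// process-entries.ts - a sketch of the per-entry loop, not the repository's code.
import { Cite } from "npm:@citation-js/core";
import "npm:@citation-js/plugin-wikidata"; // resolves Q-ids via the Wikidata API
import "npm:@citation-js/plugin-doi";      // resolves DOIs (Crossref and friends)

// Hypothetical minimal serializer; the real pipeline emits richer XML.
function toRecordXml(item: { title?: string }): string {
  return `<record><title>${item.title ?? ""}</title></record>`;
}

// Assumed input: a flat JSON array of Q-ids extracted from the SPARQL results.
const qids: string[] = JSON.parse(await Deno.readTextFile("qids.json"));

for (const qid of qids) {
  const cite = await Cite.async(qid);      // pull the Wikidata entity
  const item = cite.data[0] ?? {};
  if (item.DOI) {
    // DOI present: also retrieve the corresponding Crossref record
    const crossref = await Cite.async(item.DOI);
    Object.assign(item, crossref.data[0]); // naive merge, for illustration only
  }
  await Deno.writeTextFile(`records/${qid}.xml`, toRecordXml(item));
}
```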

There's a GitHub Action set up to run this regularly, though it has an issue that is stopping it from running successfully (#14). A schedule is probably not an effective way of running it either; it would be better run on demand, for instance when entries are updated or a versioned dump is created.

It's a pretty rudimentary approach that has allowed for quick(ish) prototyping, but it is not a very satisfactory solution on a number of counts:

  • a simple looping system was used to prevent overwhelming the Wikidata API endpoint (called through CitationJS) - batch processing should be possible but requires a fairly in-depth refactor (see the batching sketch after this list)
  • all entries are processed in order, whether or not they have changed - very inefficient in both time and processing. Some simple checks could alleviate this, though they might miss updates in linked entries where the main entry hasn't changed.
  • it's rather ugly - it would be far nicer and more maintainable to implement much of the system as a CitationJS plugin, using the current Wikidata plugin as a base
  • it works on live data rather than a defined changeset - one idea would be to process from specific data dumps, or to pull RDF (Turtle) files of the SPARQL entries and work directly from these (which would also alleviate the Wikidata API bottleneck issue)
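
On the first point, a sketch of what batching could look like, assuming the CitationJS Wikidata plugin accepts an array of Q-ids per call (wbgetentities itself allows up to 50 ids per request - worth verifying against the plugin before relying on it):

```ts
// batch-fetch.ts - illustrative only; the Q-ids and batch size are placeholders.
import { Cite } from "npm:@citation-js/core";
import "npm:@citation-js/plugin-wikidata";

// Split a list into fixed-size chunks.
function chunk<T>(xs: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < xs.length; i += size) out.push(xs.slice(i, i + size));
  return out;
}

const qids = ["Q21972834", "Q27795847"]; // stand-ins for the SPARQL results

for (const batch of chunk(qids, 50)) {
  const cite = await Cite.async(batch);   // ideally one API round-trip per batch
  console.log(`fetched ${cite.data.length} items`);
}
```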
@bootsa bootsa self-assigned this Aug 10, 2022
@InvasionBiologyHypotheses InvasionBiologyHypotheses locked and limited conversation to collaborators Sep 19, 2022
@bootsa bootsa converted this issue into discussion #21 Sep 19, 2022

This issue was moved to discussion #21. You can continue the conversation there.
