

Processing pipeline - current and future #16

Closed
bootsa opened this issue May 24, 2022 · 0 comments
bootsa (Collaborator) commented May 24, 2022

The current processing pipeline:

  • A SPARQL query, fetched through an HTTP request, returns a JSON listing of all Wikidata entities that are within the scope of the WikiProject Invasion Biology and have been tagged with an open license. The result is saved to file (see the fetch sketch after this list).
  • Post-processing script (Deno), roughly sketched after this list:
    • reads the JSON file of entries
    • loops through each entry
      • pulls the Wikidata entity through CitationJS
      • checks whether the entry has a DOI; if yes:
        • retrieves the Crossref item
      • processes the Wikidata (and, if present, Crossref) entity into XML and writes it to the file system
    • a webhook on the Toolforge server (which hosts the OAI-PMH endpoint) is called, triggering a git pull of the updates onto the server
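
For illustration, a minimal Deno sketch of the first step - fetching the SPARQL results over HTTP and saving them to file. The query and output filename here are placeholders, not the project's actual values:

```ts
// fetch-entities.ts - minimal sketch; QUERY and the output path are placeholders.
const QUERY = `SELECT ?item WHERE { ?item wdt:P31 wd:Q13442814 } LIMIT 10`;

const url = "https://query.wikidata.org/sparql?format=json&query=" +
  encodeURIComponent(QUERY);

// The Wikidata Query Service returns SPARQL JSON results for this Accept header.
const res = await fetch(url, {
  headers: { Accept: "application/sparql-results+json" },
});
const results = await res.json();

// Persist the listing so the post-processing script can work from a file.
await Deno.writeTextFile("entities.json", JSON.stringify(results, null, 2));
```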
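And a hedged sketch of the post-processing loop. The npm specifiers, file names, and the toRecordXml serializer are assumptions for illustration; the real script writes fuller OAI-PMH-ready records:

```ts
// process-entries.ts - a sketch of the per-entry loop, not the repository's code.
import { Cite } from "npm:@citation-js/core";
import "npm:@citation-js/plugin-wikidata"; // resolves Q-ids via the Wikidata API
import "npm:@citation-js/plugin-doi";      // resolves DOIs (Crossref and friends)

// Hypothetical minimal serializer; the real pipeline emits richer XML.
function toRecordXml(item: { title?: string }): string {
  return `<record><title>${item.title ?? ""}</title></record>`;
}

// Assumed input: a flat JSON array of Q-ids extracted from the SPARQL results.
const qids: string[] = JSON.parse(await Deno.readTextFile("qids.json"));

for (const qid of qids) {
  const cite = await Cite.async(qid);      // pull the Wikidata entity
  const item = cite.data[0] ?? {};
  if (item.DOI) {
    // DOI present: also retrieve the corresponding Crossref record
    const crossref = await Cite.async(item.DOI);
    Object.assign(item, crossref.data[0]); // naive merge, for illustration only
  }
  await Deno.writeTextFile(`records/${qid}.xml`, toRecordXml(item));
}
```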

There's a GitHub Action set up to run this regularly, though it has an issue that is stopping it from running successfully (#14). A schedule is probably not an effective way of running it either; it would be better run on demand, for instance when entries are updated or a versioned dump is created.

It's a pretty rudimentary approach that has allowed for quick(ish) prototyping, but it is not a very satisfactory solution on a number of counts:

  • a simple looping system was used to prevent overwhelming the Wikidata API endpoint (called through CitationJS) - batch processing should be possible but requires a fairly in-depth refactor (see the batching sketch after this list)
  • all entries are processed in order, whether or not they have changed - very inefficient in both time and processing. Some simple checks could alleviate this, though they might miss updates in linked entries where the main entry hasn't changed.
  • it's rather ugly - it would be far nicer and more maintainable to implement much of the system as a CitationJS plugin, using the current Wikidata plugin as a base
  • it works on live data rather than a defined changeset - one idea would be to process from specific data dumps, or to pull RDF (Turtle) files of the SPARQL entries and work directly from these (which would also alleviate the Wikidata API bottleneck issue)
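
On the first point, a sketch of what batching could look like, assuming the CitationJS Wikidata plugin accepts an array of Q-ids per call (wbgetentities itself allows up to 50 ids per request - worth verifying against the plugin before relying on it):

```ts
// batch-fetch.ts - illustrative only; the Q-ids and batch size are placeholders.
import { Cite } from "npm:@citation-js/core";
import "npm:@citation-js/plugin-wikidata";

// Split a list into fixed-size chunks.
function chunk<T>(xs: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < xs.length; i += size) out.push(xs.slice(i, i + size));
  return out;
}

const qids = ["Q21972834", "Q27795847"]; // stand-ins for the SPARQL results

for (const batch of chunk(qids, 50)) {
  const cite = await Cite.async(batch);   // ideally one API round-trip per batch
  console.log(`fetched ${cite.data.length} items`);
}
```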
@bootsa bootsa self-assigned this Aug 10, 2022
@InvasionBiologyHypotheses InvasionBiologyHypotheses locked and limited conversation to collaborators Sep 19, 2022
@bootsa bootsa converted this issue into discussion #21 Sep 19, 2022

This issue was moved to discussion #21. You can continue the conversation there.
