Current

The current processing pipeline:
1. A SPARQL query, fetched via an HTTP request, returns a JSON listing of all Wikidata entities that are within the scope of WikiProject Invasion Biology and have been tagged with an open license. The result is saved to a file.
2. A post-processing script (Deno):
   - reads the JSON file of entries
   - loops through each entry
   - pulls the Wikidata entity through CitationJS
   - checks whether there is a DOI and, if so, retrieves the corresponding Crossref item
   - processes the Wikidata entity (and the Crossref item, if present) into XML and writes it to the file system
3. A webhook on the Toolforge server (which hosts the OAI-PMH endpoint) is called, which git-pulls the updates onto the server.
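For illustration, the core of the Deno script is a loop along these lines (a minimal sketch: the file names, `entry.qid`, and the `toXml` helper are stand-ins, while `Cite.async` is the real CitationJS entry point):

```ts
// post-process.ts — minimal sketch of the post-processing step (names are
// illustrative; "entries.json" is the saved SPARQL result).
import Cite from "npm:citation-js";

// Hypothetical serialiser: the real script maps the item to OAI-PMH XML.
const toXml = (item: Record<string, unknown>): string =>
  `<record><title>${item.title ?? ""}</title></record>`;

const entries = JSON.parse(await Deno.readTextFile("entries.json"));

for (const entry of entries) {
  // Pull the Wikidata entity through CitationJS (hits the Wikidata API);
  // entry.qid is assumed to hold the entity ID, e.g. "Q21972834".
  const cite = await Cite.async(entry.qid);
  const [item] = cite.data;

  // If the item carries a DOI, retrieve the Crossref record as well.
  if (item.DOI) {
    const crossref = await Cite.async(item.DOI);
    // ...merge Crossref fields into the item as needed...
  }

  // Write one XML file per entry to the file system.
  await Deno.writeTextFile(`xml/${entry.qid}.xml`, toXml(item));
}
```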
There's a GitHub Action set up to run this regularly, though an issue is stopping it from running successfully (#14), and it's probably not an effective way of running the pipeline anyway; it would be better run on demand, for instance when entries are updated or when a versioned dump is created.
It's a fairly rudimentary approach that has allowed for quick(ish) prototyping, but it is not a satisfactory solution on a number of counts:

- A simple looping system was used to avoid overwhelming the Wikidata API endpoint (called through CitationJS); batch processing should be possible but would require a fairly in-depth refactor.
- All entries are processed in order, whether or not they have changed, which is very inefficient in time and processing. Some simple checks could alleviate this (see the sketch after this list), though they might miss updates in linked entries where the main entry hasn't changed.
- It's rather ugly; it would be far nicer and more maintainable to implement much of the system as a CitationJS plugin, using the current Wikidata plugin as a base.
- It works on live data rather than a defined changeset. One idea would be to process specific data dumps, or to pull RDF (Turtle) files for the SPARQL entries and work directly from those (which would also alleviate the Wikidata API bottleneck).
- It's not easily expandable and is prone to breaking if any part of the chain is disrupted.
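As an example of the kind of simple check mentioned above, the entity's `modified` timestamp (exposed in Wikidata's Special:EntityData JSON) could be compared against the one recorded on the previous run. A sketch, and one that would indeed miss changes in linked entries:

```ts
// Return true if the entity has been edited since we last processed it.
// lastSeen maps QIDs to the "modified" timestamp recorded on the previous run.
async function hasChanged(
  qid: string,
  lastSeen: Record<string, string>,
): Promise<boolean> {
  const url = `https://www.wikidata.org/wiki/Special:EntityData/${qid}.json`;
  const data = await (await fetch(url)).json();
  return data.entities[qid].modified !== lastSeen[qid];
}
```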
Future
Ideally this would be a modular system where new functionality can be added and each module can be changed without needing to refactor the others. This would entail creating a processing pipeline that uses a common context to store reusable state that is accessible and updatable from each module. Each module should be immutable, with specific customisation declared through configuration files.
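To make that concrete, here's one hypothetical shape such a pipeline could take (none of these names exist in the codebase yet; this is a sketch of the idea, not a design):

```ts
// Hypothetical pipeline skeleton: each module reads from and writes to a
// shared context, and is customised through external configuration rather
// than by editing the module itself.
interface Context {
  items: unknown[];                           // items flowing through the pipeline
  errors: { module: string; error: Error }[]; // issues collected along the way
  [key: string]: unknown;                     // state shared between modules
}

interface Module {
  name: string;
  run(ctx: Context, config: Record<string, unknown>): Promise<Context>;
}

// Run modules in sequence over a common context, logging failures
// without aborting the whole pipeline.
async function runPipeline(
  modules: Module[],
  configs: Record<string, Record<string, unknown>>,
): Promise<Context> {
  let ctx: Context = { items: [], errors: [] };
  for (const mod of modules) {
    try {
      ctx = await mod.run(ctx, configs[mod.name] ?? {});
    } catch (error) {
      ctx.errors.push({ module: mod.name, error: error as Error });
    }
  }
  return ctx;
}
```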
Here's a rather simplified initial structure that I'll expand over time:
```mermaid
flowchart TD
    subgraph "Trigger Action"
        TA1{Cron Timer}
        TA2{"HTTP Call<br/>(Button Push)"}
        TA3{Change Listener}
    end
    subgraph "Get Items"
        A["SPARQL Query (streaming)"]
    end
    subgraph sources
        B["Query sources"]
        C("Collate data")
    end
    subgraph SO1["save output"]
        D("Save source metadata")
    end
    subgraph process
        E("Transform to standardised data tree")
    end
    subgraph SO2["save output"]
        F("Save transformed metadata")
    end
    subgraph oaipmh["oai-pmh"]
        G("Generate XML format")
    end
    subgraph SO3["save output"]
        H("Save XML output")
    end
    subgraph errors["log errors"]
        Log("Log issues")
    end
    TA1 & TA2 & TA3 --> A
    A -- for each item --> B --> C --> D --> E --> F --> G --> H
    E --> Log
```
There are a number of enhancements that this will bring, or that we can take the opportunity to implement, including:

- a more generalised approach that can be used by other communities (for instance, a customisable OAI-PMH endpoint for Wikibase)
- separating data from code
- removing the reliance on the proprietary GitHub Actions system and moving to a service that can be run in many different environments (dedicated server, GitHub Action, locally, etc.)
Obstacles
Currently, all of an item's metadata is retrieved as unstructured JSON objects (Wikidata items are retrieved using CitationJS; other sources are JSON-producing REST APIs). To increase malleability and future usage, it would be preferable to work with RDF / structured data using suitable ontologies. To do this we would need to:
- Wikidata: remove the dependency on CitationJS or alter its functioning (e.g. by creating a new plugin that works directly on RDF data rather than the Wikidata API).
  - The CitationJS Wikidata plugin provides some useful processing (for instance, transforming author names to some extent) which might need to be reproduced if CitationJS is removed completely.
  - Directly fetching the Wikidata entity's RDF representation (the .ttl, .json-ld, etc. URLs) could be useful if it provides sufficient data (see the sketch after this list); alternatively, a specific SPARQL query could be used.
- Other REST APIs: inject context into / transform the various REST API responses (we could log any instances where the returned response does not conform to the expected structure, say if the API's response format changes).
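On the RDF point above, the entity's RDF representation is available directly from Wikidata's Special:EntityData endpoint, so fetching it needs no API client. A minimal sketch:

```ts
// Fetch the Turtle representation of a Wikidata entity directly from the
// Special:EntityData endpoint (other serialisations are available via
// different extensions, e.g. .json, .rdf, .nt).
async function fetchEntityTurtle(qid: string): Promise<string> {
  const response = await fetch(
    `https://www.wikidata.org/wiki/Special:EntityData/${qid}.ttl`,
  );
  if (!response.ok) {
    throw new Error(`Failed to fetch ${qid}: ${response.status}`);
  }
  return await response.text();
}

// e.g. const ttl = await fetchEntityTurtle("Q42");
```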
Transition / next steps