Current

The current processing pipeline:
1. A SPARQL query, fetched via an HTTP request, returns a JSON listing of all Wikidata entities that are within the scope of WikiProject Invasion Biology and have been tagged with an open license. The result is saved to a file.
2. A post-processing script (Deno):
   - reads the JSON file of entries
   - loops through each entry
   - pulls the Wikidata entity through CitationJS
   - checks whether there is a DOI and, if so, retrieves the corresponding Crossref item
   - processes the Wikidata entity (and the Crossref item, if present) into XML and writes it to the file system
3. A webhook on the Toolforge server (which hosts the OAI-PMH endpoint) is called, which git-pulls the updates onto the server.
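For illustration, the core of the Deno script is a loop along these lines (a minimal sketch: the file names, `entry.qid`, and the `toXml` helper are stand-ins, while `Cite.async` is the real CitationJS entry point):

```ts
// post-process.ts — minimal sketch of the post-processing step (names are
// illustrative; "entries.json" is the saved SPARQL result).
import Cite from "npm:citation-js";

// Hypothetical serialiser: the real script maps the item to OAI-PMH XML.
const toXml = (item: Record<string, unknown>): string =>
  `<record><title>${item.title ?? ""}</title></record>`;

const entries = JSON.parse(await Deno.readTextFile("entries.json"));

for (const entry of entries) {
  // Pull the Wikidata entity through CitationJS (hits the Wikidata API);
  // entry.qid is assumed to hold the entity ID, e.g. "Q21972834".
  const cite = await Cite.async(entry.qid);
  const [item] = cite.data;

  // If the item carries a DOI, retrieve the Crossref record as well.
  if (item.DOI) {
    const crossref = await Cite.async(item.DOI);
    // ...merge Crossref fields into the item as needed...
  }

  // Write one XML file per entry to the file system.
  await Deno.writeTextFile(`xml/${entry.qid}.xml`, toXml(item));
}
```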
There's a GitHub Action set up to run this regularly, though an issue is stopping it from running successfully (#14), and it's probably not an effective way of running the pipeline anyway; it would be better run on demand, for instance when entries are updated or when a versioned dump is created.
It's a fairly rudimentary approach that has allowed for quick(ish) prototyping, but it is not a satisfactory solution on a number of counts:

- A simple looping system was used to avoid overwhelming the Wikidata API endpoint (called through CitationJS); batch processing should be possible but would require a fairly in-depth refactor.
- All entries are processed in order, whether or not they have changed, which is very inefficient in time and processing. Some simple checks could alleviate this (see the sketch after this list), though they might miss updates in linked entries where the main entry hasn't changed.
- It's rather ugly; it would be far nicer and more maintainable to implement much of the system as a CitationJS plugin, using the current Wikidata plugin as a base.
- It works on live data rather than a defined changeset. One idea would be to process specific data dumps, or to pull RDF (Turtle) files for the SPARQL entries and work directly from those (which would also alleviate the Wikidata API bottleneck).
- It's not easily expandable and is prone to breaking if any part of the chain is disrupted.
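As an example of the kind of simple check mentioned above, the entity's `modified` timestamp (exposed in Wikidata's Special:EntityData JSON) could be compared against the one recorded on the previous run. A sketch, and one that would indeed miss changes in linked entries:

```ts
// Return true if the entity has been edited since we last processed it.
// lastSeen maps QIDs to the "modified" timestamp recorded on the previous run.
async function hasChanged(
  qid: string,
  lastSeen: Record<string, string>,
): Promise<boolean> {
  const url = `https://www.wikidata.org/wiki/Special:EntityData/${qid}.json`;
  const data = await (await fetch(url)).json();
  return data.entities[qid].modified !== lastSeen[qid];
}
```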
Future
Ideally this would be a modular system where new functionality can be added and each module can be changed without needing to refactor the others. This would entail creating a processing pipeline that uses a common context to store reusable state that is accessible and updatable from each module. Each module should be immutable, with specific customisation declared through configuration files.
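To make that concrete, here's one hypothetical shape such a pipeline could take (none of these names exist in the codebase yet; this is a sketch of the idea, not a design):

```ts
// Hypothetical pipeline skeleton: each module reads from and writes to a
// shared context, and is customised through external configuration rather
// than by editing the module itself.
interface Context {
  items: unknown[];                           // items flowing through the pipeline
  errors: { module: string; error: Error }[]; // issues collected along the way
  [key: string]: unknown;                     // state shared between modules
}

interface Module {
  name: string;
  run(ctx: Context, config: Record<string, unknown>): Promise<Context>;
}

// Run modules in sequence over a common context, logging failures
// without aborting the whole pipeline.
async function runPipeline(
  modules: Module[],
  configs: Record<string, Record<string, unknown>>,
): Promise<Context> {
  let ctx: Context = { items: [], errors: [] };
  for (const mod of modules) {
    try {
      ctx = await mod.run(ctx, configs[mod.name] ?? {});
    } catch (error) {
      ctx.errors.push({ module: mod.name, error: error as Error });
    }
  }
  return ctx;
}
```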
Here's a rather simplified initial structure that I'll expand over time:
```mermaid
flowchart TD
    subgraph "Trigger Action"
        TA1{Cron Timer}
        TA2{"HTTP Call<br/>(Button Push)"}
        TA3{Change Listener}
    end
    subgraph "Get Items"
        A["SPARQL Query (streaming)"]
    end
    subgraph sources
        B["Query sources"]
        C("Collate data")
    end
    subgraph SO1["save output"]
        D("Save source metadata")
    end
    subgraph process
        E("Transform to standardised data tree")
    end
    subgraph SO2["save output"]
        F("Save transformed metadata")
    end
    subgraph oaipmh["oai-pmh"]
        G("Generate XML format")
    end
    subgraph SO3["save output"]
        H("Save XML output")
    end
    subgraph errors["log errors"]
        Log("Log issues")
    end
    TA1 & TA2 & TA3 --> A
    A -- for each item --> B --> C --> D --> E --> F --> G --> H
    E --> Log
```
There are a number of enhancements that this will bring, or that we can take the opportunity to implement, including:

- a more generalised approach that can be used by other communities (for instance, a customisable OAI-PMH endpoint for Wikibase)
- separating data from code
- removing the reliance on the proprietary GitHub Actions system and moving to a service that can be run in many different environments (dedicated server, GitHub Action, locally, etc.)
Obstacles
Currently, all of an item's metadata is retrieved as unstructured JSON objects (Wikidata items are retrieved using CitationJS; other sources are JSON-producing REST APIs). To increase malleability and future usage, it would be preferable to work with RDF / structured data using suitable ontologies. To do this we would need to:
- Wikidata: remove the dependency on CitationJS or alter its functioning (e.g. by creating a new plugin that works directly on RDF data rather than the Wikidata API).
  - The CitationJS Wikidata plugin provides some useful processing (for instance, transforming author names to some extent) which might need to be reproduced if CitationJS is removed completely.
  - Directly fetching the Wikidata entity's RDF representation (the .ttl, .json-ld, etc. URLs) could be useful if it provides sufficient data (see the sketch after this list); alternatively, a specific SPARQL query could be used.
- Other REST APIs: inject context into / transform the various REST API responses (we could log any instances where the returned response does not conform to the expected structure, say if the API's response format changes).
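On the RDF point above, the entity's RDF representation is available directly from Wikidata's Special:EntityData endpoint, so fetching it needs no API client. A minimal sketch:

```ts
// Fetch the Turtle representation of a Wikidata entity directly from the
// Special:EntityData endpoint (other serialisations are available via
// different extensions, e.g. .json, .rdf, .nt).
async function fetchEntityTurtle(qid: string): Promise<string> {
  const response = await fetch(
    `https://www.wikidata.org/wiki/Special:EntityData/${qid}.ttl`,
  );
  if (!response.ok) {
    throw new Error(`Failed to fetch ${qid}: ${response.status}`);
  }
  return await response.text();
}

// e.g. const ttl = await fetchEntityTurtle("Q42");
```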
Transition / next steps