169 changes: 82 additions & 87 deletions ARCHITECTURE.md
@@ -1,112 +1,107 @@
# Architecture of SHARE/Trove
> NOTE: this document requires update (big ol' TODO)

# Architecture of SHARE/trove

This document is a starting point and reference to familiarize yourself with this codebase.

## Bird's eye view
In short, SHARE/Trove takes metadata records (in any supported input format),
ingests them, and makes them available in any supported output format.
```
┌───────────────────────────────────────────┐
│ Ingest │
│ ┌──────┐ │
│ ┌─────────────────────────┐ ┌──►Format├─┼────┐
│ │ Normalize │ │ └──────┘ │ │
│ │ │ │ │ ▼
┌───────┐ │ │ ┌─────────┐ ┌────────┐ │ │ ┌──────┐ │ save as
│Harvest├─┬─┼─┼─►Transform├──►Regulate├─┼─┬─┼──►Format├─┼─┬─►FormattedMetadataRecord
└───────┘ │ │ │ └─────────┘ └────────┘ │ │ │ └──────┘ │ │
│ │ │ │ │ . │ │ ┌───────┐
│ │ └─────────────────────────┘ │ . │ └──►Indexer│
│ │ │ . │ └───────┘
│ └─────────────────────────────┼─────────────┘ some formats also
│ │ indexed separately
▼ ▼
save as save as
RawDatum NormalizedData
In short, SHARE/trove holds metadata records that describe things and makes those records available for searching, browsing, and subscribing.

![overview of shtrove: metadata records in, search/browse/subscribe out](./project/static/img/shtroverview.png)


## Parts
a look at the tangles of communication between different parts of the system:

```mermaid
graph LR;
subgraph shtrove;
subgraph web[api/web server];
ingest;
search;
browse;
rss;
atom;
oaipmh;
end;
worker["background worker (celery)"];
indexer["indexer daemon"];
rabbitmq["task queue (rabbitmq)"];
postgres["database (postgres)"];
elasticsearch;
web---rabbitmq;
web---postgres;
web---elasticsearch;
worker---rabbitmq;
worker---postgres;
worker---elasticsearch;
indexer---rabbitmq;
indexer---postgres;
indexer---elasticsearch;
end;
source["metadata source (e.g. osf.io backend)"];
user["web user, either by browsing directly or via web app (like osf.io)"];
subscribers["feed subscription tools"];
source-->ingest;
user-->search;
user-->browse;
subscribers-->rss;
subscribers-->atom;
subscribers-->oaipmh;
```

## Code map

A brief look at important areas of code as they happen to exist now.

### Static configuration

`share/schema/` describes the "normalized" metadata schema/format that all
metadata records are converted into when ingested.

`share/sources/` describes a starting set of metadata sources that the system
could harvest metadata from -- these will be put in the database and can be
updated or added to over time.

`project/settings.py` describes system-level settings -- both those that can
be set by environment variables (with their default values) and those that
cannot.

`share/models/` describes the data layer using the [Django](https://www.djangoproject.com/) ORM.

`share/subjects.yaml` describes the "central taxonomy" of subjects allowed
in `Subject.name` fields of `NormalizedData`.

### Harvest and ingest

`share/harvest/` and `share/harvesters/` describe how metadata records
are pulled from other metadata repositories.

`share/transform/` and `share/transformers/` describe how raw data (possibly
in any format) are transformed to the "normalized" schema.
- `trove`: django app for rdf-based apis
- `trove.digestive_tract`: most of what happens after ingestion
- stores records and identifiers in the database
- initiates indexing
- `trove.extract`: parsing ingested metadata records into resource descriptions
- `trove.derive`: from a given resource description, create special non-rdf serializations
- `trove.render`: from an api response modeled as rdf graph, render the requested mediatype
- `trove.models`: database models for identifiers and resource descriptions
- `trove.trovesearch`: builds rdf-graph responses for trove search apis (using `IndexStrategy` implementations from `share.search`)
- `trove.vocab`: identifies and describes concepts used elsewhere
- `trove.vocab.trove`: describes types, properties, and api paths in the trove api
- `trove.vocab.osfmap`: describes metadata from osf.io (currently the only metadata ingested)
- `trove.openapi`: generate openapi json for the trove api from thesaurus in `trove.vocab.trove`
- `share`: django app with search indexes and remnants of sharev2
- `share.models`: database models for external sources, users, and other system book-keeping
- `share.oaipmh`: provide data via [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html)
- `share.search`: all interaction with elasticsearch
- `share.search.index_strategy`: abstract base class `IndexStrategy` with multiple implementations, for different approaches to indexing the same data
- `share.search.daemon`: the "indexer daemon", an optimized background worker for batch-processing updates and sending to all active index strategies
- `share.search.index_messenger`: for sending messages to the indexer daemon
- `api`: django app with remnants of the legacy sharev2 api
- `api.views.feeds`: allows custom RSS and Atom feeds
- otherwise, subject to possible deprecation
- `osf_oauth2_adapter`: django app for login via osf.io
- `project`: the actual django project
- default settings at `project.settings`
- pulls together code from other directories implemented as django apps (`share`, `trove`, `api`, and `osf_oauth2_adapter`)
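To illustrate the `IndexStrategy` idea named above -- multiple implementations of one abstract base class, each a different approach to indexing the same data -- here is a hedged sketch. The real class in `share.search.index_strategy` has a different interface; the method names and the toy in-memory implementation below are hypothetical.

```python
from abc import ABC, abstractmethod


class IndexStrategy(ABC):
    """One approach to indexing the same underlying data (illustrative only)."""

    @abstractmethod
    def index_record(self, record_id: str, record: dict) -> None:
        ...

    @abstractmethod
    def search(self, query: str) -> list[str]:
        ...


class InMemoryIndexStrategy(IndexStrategy):
    """Toy implementation: naive keyword lookup over stored records."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def index_record(self, record_id: str, record: dict) -> None:
        self._records[record_id] = record

    def search(self, query: str) -> list[str]:
        # match any record whose content contains the query, case-insensitively
        return [
            _id for _id, _record in self._records.items()
            if query.lower() in str(_record).lower()
        ]
```

New strategies plug in by subclassing, so the same data can be indexed several ways side by side (which is roughly what the indexer daemon does when it sends updates to all active strategies).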

`share/regulate/` describes rules which are applied to every normalized datum,
regardless of where it came from or what format it was originally in.

`share/metadata_formats/` describes how a normalized datum can be formatted
into any supported output format.

`share/tasks/` runs the harvest/ingest pipeline and stores each task's status
(including debugging info, if errored) as a `HarvestJob` or `IngestJob`.
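The transform/regulate/format stages described above can be sketched as composed functions. This is a hypothetical toy, not the real pipeline (which lives in `share/tasks/` and the modules above); the raw-data format and field names here are made up.

```python
def transform(raw_datum: str) -> dict:
    # parse source-specific raw data into the "normalized" schema
    _title, _, _year = raw_datum.partition("|")
    return {"title": _title.strip(), "date": _year.strip()}


def regulate(normalized: dict) -> dict:
    # apply source-agnostic cleanup rules to every normalized datum
    return {_key: _value for _key, _value in normalized.items() if _value}


def format_record(normalized: dict, output_format: str) -> str:
    # render the normalized datum in a supported output format
    if output_format == "oai_dc":
        return f"<dc:title>{normalized['title']}</dc:title>"
    raise ValueError(f"unsupported format: {output_format}")


def ingest(raw_datum: str) -> str:
    # the stages compose: transform, then regulate, then format
    return format_record(regulate(transform(raw_datum)), "oai_dc")
```

The point of the shape: each stage has one job, so a new source only needs a new `transform` and a new output format only needs a new `format_record` branch.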

### Outward-facing views

`share/search/` describes how the search indexes are structured, managed, and
updated when new metadata records are introduced -- this provides a view for
discovering items based on whatever search criteria.

`share/oaipmh/` describes the [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html)
view for harvesting metadata from SHARE/Trove in bulk.
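OAI-PMH is a plain HTTP protocol: each request is a base URL plus a `verb` and verb-specific arguments. A minimal sketch of building such a request (the verb and argument names come from the OAI-PMH 2.0 spec; the base URL is hypothetical):

```python
from urllib.parse import urlencode


def oaipmh_url(base_url: str, verb: str, **kwargs: str) -> str:
    """Build an OAI-PMH request URL from a verb and its arguments."""
    return f"{base_url}?{urlencode({'verb': verb, **kwargs})}"


# e.g. harvest records in the oai_dc metadata format:
list_records = oaipmh_url(
    "https://shtrove.example/oai-pmh",
    "ListRecords",
    metadataPrefix="oai_dc",
)
```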

`api/` describes a mostly REST-ful API that's useful for inspecting records for
a specific item of interest.

### Internals

`share/admin/` is a Django app for administrative access to the SHARE database
and pipeline logs.

`osf_oauth2_adapter/` is a Django app to support logging in to SHARE via OSF
## Cross-cutting concerns

### Testing
### Resource descriptions

`tests/` are tests.
Uses the [resource description framework](https://www.w3.org/TR/rdf11-primer/#section-Introduction):
- the content of each ingested metadata record is an rdf graph focused on a specific resource
- all api responses from `trove` views are (experimentally) modeled as rdf graphs, which may be rendered a variety of ways
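The "rdf graph focused on a specific resource" idea can be sketched with plain triples -- stdlib only, no rdf library, and the URIs below are hypothetical:

```python
# a resource description: a set of (subject, predicate, object) triples,
# all focused on one resource
FOCUS = "https://osf.example/abcde"
DCTERMS = "http://purl.org/dc/terms/"

description: set[tuple[str, str, str]] = {
    (FOCUS, DCTERMS + "title", "An Example Project"),
    (FOCUS, DCTERMS + "created", "2023-01-01"),
}


def triples_about(graph: set[tuple[str, str, str]], subject: str) -> set:
    # in a focused description, every triple shares the focus as subject
    return {_triple for _triple in graph if _triple[0] == subject}
```

Because the model is "a graph about a focus resource" rather than "a blob in one format", the same description can be rendered as turtle, json-ld, or any other supported mediatype (that's what `trove.render` does).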

## Cross-cutting concerns
### Identifiers

### Immutable metadata
Whenever feasible, use full URI strings to identify resources, concepts, types, and properties that may be exposed outwardly.

Metadata records at all stages of the pipeline (`RawDatum`, `NormalizedData`,
`FormattedMetadataRecord`) should be considered immutable -- any updates
result in a new record being created, not an old record being altered.
Prefer using open, standard, well-defined namespaces wherever possible ([DCAT](https://www.w3.org/TR/vocab-dcat-3/) is a good place to start; see `trove.vocab.namespaces` for others already in use). When app-specific concepts must be defined, use the `TROVE` namespace (`https://share.osf.io/vocab/2023/trove/`).

Multiple records which describe the same item/object are grouped by a
"source-unique identifier" or "suid" -- essentially a two-tuple
`(source, identifier)` that uniquely and persistently identifies an item in
the source repository. In most outward-facing views, default to showing only
the most recent record for each suid.
A notable exception (non-URI identifier) is the "source-unique identifier" or "suid" -- essentially a two-tuple `(source, identifier)` that uniquely and persistently identifies a metadata record in a source repository. This `identifier` may be any string value, provided by the external source.
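A hedged sketch of the "most recent record per suid" behavior described above (field names and the sorting key are hypothetical, for illustration only):

```python
Suid = tuple[str, str]  # (source, identifier)


def latest_by_suid(records: list[dict]) -> dict[Suid, dict]:
    """Group records by suid, keeping only the most recent for each."""
    _latest: dict[Suid, dict] = {}
    for _record in sorted(records, key=lambda _r: _r["received_at"]):
        _suid = (_record["source"], _record["identifier"])
        _latest[_suid] = _record  # later records overwrite earlier ones
    return _latest
```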

### Conventions
(an incomplete list)

- functions prefixed `pls_` ("please") are a request for something to happen
- local variables prefixed with underscore (to consistently distinguish between internal-only names and those imported/built-in)
- prefer full type annotations in python code, wherever reasonably feasible
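Putting the conventions above together in one (entirely hypothetical) function:

```python
def pls_send_greeting(recipient_name: str) -> str:
    """`pls_` ("please") prefix: this function is a request for something to happen."""
    # local names start with an underscore, distinguishing them at a glance
    # from imported and built-in names; annotations are as full as feasible
    _greeting = f"hello, {recipient_name}!"
    return _greeting
```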

## Why this?
inspired by [this writeup](https://matklad.github.io/2021/02/06/ARCHITECTURE.md.html)
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
# Change Log

# [25.5.1] - 2025-08-21
- improve error handling in celery task-result backend
- use logging config in celery worker
- improve code docs (README.md et al.)

# [25.5.0] - 2025-07-15
- use python 3.13
- use `poetry` to manage dependencies
15 changes: 13 additions & 2 deletions CONTRIBUTING.md
@@ -1,7 +1,18 @@
# CONTRIBUTING

TODO: how do we want to guide community contributors?
> note: this codebase is currently (and historically) rather entangled with [osf.io](https://osf.io), which has its shtrove at https://share.osf.io -- stay tuned for more-reusable open-source libraries and tools that should be more accessible to community contribution

For now, if you're interested in contributing to SHARE/Trove, feel free to
For now, if you're interested in contributing to SHARE/trove, feel free to
[open an issue on github](https://github.com/CenterForOpenScience/SHARE/issues)
and start a conversation.

## Required checks

All changes must pass the following checks with no errors:
- linting: `python -m flake8`
- static type-checking (on `trove/` code only, for now): `python -m mypy trove`
- tests: `python -m pytest -x tests/`
- note: some tests require other services running -- if [using the provided docker-compose.yml](./how-to/run-locally.md), it's recommended to bring the services up in the background (`docker compose up -d worker` ups the worker and everything it depends on) and run the tests from within one of the python containers (`indexer`, `worker`, or `web`):
`docker compose exec indexer python -m pytest -x tests/`

All new changes should also avoid decreasing test coverage, when reasonably possible (currently checked on github pull requests).
34 changes: 9 additions & 25 deletions README.md
@@ -1,33 +1,17 @@
# SHARE/Trove
# SHARE/trove (aka SHARtrove, shtrove)

SHARE is creating a free, open dataset of research (meta)data.
> share (verb): to have or use in common.

> **Note**: SHARE’s open API tools and services help bring together scholarship distributed across research ecosystems for the purpose of greater discoverability. However, SHARE does not guarantee a complete aggregation of searched outputs. For this reason, SHARE results should not be used for methodological analyses, such as systematic reviews.
> trove (noun): a store of valuable or delightful things.

[![Coverage Status](https://coveralls.io/repos/github/CenterForOpenScience/SHARE/badge.svg?branch=develop)](https://coveralls.io/github/CenterForOpenScience/SHARE?branch=develop)
SHARE/trove (aka SHARtrove, shtrove) is a service meant to store (meta)data you wish to keep and offer openly.

## Documentation
note: this codebase is currently (and historically) rather entangled with [osf.io](https://osf.io), which has its shtrove at https://share.osf.io -- stay tuned for more-reusable open-source libraries and tools for working with (meta)data

### What is this?
see [WHAT-IS-THIS-EVEN.md](./WHAT-IS-THIS-EVEN.md)
see [ARCHITECTURE.md](./ARCHITECTURE.md) for help navigating this codebase

### How can I use it?
see [how-to/use-the-api.md](./how-to/use-the-api.md)
see [CONTRIBUTING.md](./CONTRIBUTING.md) for info about contributing changes

### How do I navigate this codebase?
see [ARCHITECTURE.md](./ARCHITECTURE.md)

### How do I run a copy locally?
see [how-to/run-locally.md](./how-to/run-locally.md)


## Running Tests

### Unit test suite

py.test

### BDD Suite

behave
see [how-to/use-the-api.md](./how-to/use-the-api.md) for help using the api to add and access (meta)data

see [how-to/run-locally.md](./how-to/run-locally.md) for help running a shtrove instance for local development
86 changes: 86 additions & 0 deletions TODO.md
@@ -0,0 +1,86 @@
# TODO:
ways to better this mess

## better shtrove api experience

- better web-browsing experience
- when `Accept` header accepts html, use html regardless of query-params
- when query param `acceptMediatype` requests another mediatype, display on page in copy/pastable way
- exception: when given `withFileName`, download without html wrapping
- exception: `/trove/browse` should still give hypertext with clickable links
- include more explanatory docs (and better fill out those explanations)
- more helpful (less erratic) visual design
- in each html rendering of an api response, include a `<form>` for adding/editing/viewing query params
- better tsv/csv experience
- set default columns for `index-value-search` (and/or broadly improve `fields` handling)
- better turtle experience
- quoted literal graphs also turtle
- omit unnecessary `^^rdf:string`
- better jsonld experience
- provide `@context` (via header, at least)
- accept jsonld at `/trove/ingest` (or at each `ldp:inbox`...)
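The content-negotiation precedence proposed in the list above might be sketched like this -- note these are the *proposed* rules, not current behavior, and all parameter names beyond those in the list are assumptions:

```python
def choose_rendering(accept_header: str, query_params: dict[str, str]) -> str:
    """Pick a response mediatype per the proposed rules (sketch only)."""
    if "withFileName" in query_params:
        # exception: download the raw mediatype, without html wrapping
        return query_params.get("acceptMediatype", "text/turtle")
    if "text/html" in accept_header:
        # html wins regardless of query params (requested mediatype
        # would be displayed on the page in a copy/pastable way)
        return "text/html"
    return query_params.get("acceptMediatype", "application/ld+json")
```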


## modular packaging
move actually-helpful logic into separate packages that can be used and maintained independently of
any particular web app/api/framework (and then use those packages in shtrove and osf)

- `osfmap`: standalone OSFMAP definition
- define osfmap properties and shapes (following DCTAP) in static tsv files
- use `tapshoes` (below) to generate docs and helpful utility functions
- may replace/simplify:
- `osf.metadata.osf_gathering.OSFMAP` (and related constants)
- `trove.vocab.osfmap`
- `trove.derive.osfmap_json`
- `tapshoes`: for using and packaging [tabular application profiles](https://dcmi.github.io/dctap/) in python
- take a set of tsv/csv files as input
- should support any valid DCTAP (aim to be worth community interest)
- initial/immediate use case `osfmap`
- generate more human-readable docs of properties and shapes/types
- validate a given record (rdf graph) against a profile
- serialize a valid record in a consistent/stable way (according to the profile)
- enable publishing "official" application profiles as installable python packages
- learn from and consider using prior dctap work:
- dctap-python: https://pypi.org/project/dctap/
- loads tabular files into more immediately usable form
- tap2shacl: https://pypi.org/project/tap2shacl/
- builds shacl constraints from application profile
- could then validate a given graph with pyshacl: https://pypi.org/project/pyshacl/
- metadata record crosswalk/serialization
- given a record (as rdf graph) and application profile to which it conforms (like OSFMAP), offer:
- crosswalking to a standard vocab (DCAT, schema.org, ...)
- stable rdf serialization (json-ld, turtle, xml, ...)
- special bespoke serialization (datacite xml/json, oai_dc, ...)
- may replace/simplify:
- `osf.metadata.serializers`
- `trove.derive`
- `shtrove`: reusable package with the good parts of share/trove
- python api and command-line tools
- given application profile
- digestive tract with pluggable storage/indexing interfaces
- methods for ingest, search, browse, subscribe
- `django-shtrove`: django wrapper for `shtrove` functionality
- set application profile via django setting
- django models for storage, elasticsearch for indexing
- django views for ingest, search, browse, subscribe


## open web standards
- data catalog vocabulary (DCAT) https://www.w3.org/TR/vocab-dcat-3/
- an appropriate (and better thought-thru) vocab for a lot of what shtrove does
- already used in some ways, but would benefit from adopting more thoroughly
- replace bespoke types (like `trove:Indexcard`) with better-defined dcat equivalents (like `dcat:CatalogRecord`)
- rename various properties/types/variables similarly
- "catalog" vs "index"
- "record" vs "card"
- replace checksum-iris with `spdx:checksum` (added in dcat 3)
- linked data notifications (LDN) https://www.w3.org/TR/ldn/
- shtrove incidentally (partially) aligns with linked-data principles -- could lean into that
- replace `/trove/ingest` with one or more `ldp:inbox` urls
- trove index-card like an inbox containing current/past resource descriptions
```
<://osf.example/blarg> ldp:inbox <://shtrove.example/index-card/0000-00...> .
<://shtrove.example/index-card/0000-00...> ldp:contains <://shtrove.example/description/0000-00...> .
<://shtrove.example/description/0000-00...> foaf:primaryTopic <://osf.example/blarg>
```
(might consider renaming "index-card" for consistency/clarity)