169 changes: 82 additions & 87 deletions ARCHITECTURE.md
@@ -1,112 +1,107 @@
# Architecture of SHARE/Trove
> NOTE: this document requires update (big ol' TODO)

# Architecture of SHARE/trove

This document is a starting point and reference to familiarize yourself with this codebase.

## Bird's eye view
In short, SHARE/Trove takes metadata records (in any supported input format),
ingests them, and makes them available in any supported output format.
```
┌───────────────────────────────────────────┐
│ Ingest │
│ ┌──────┐ │
│ ┌─────────────────────────┐ ┌──►Format├─┼────┐
│ │ Normalize │ │ └──────┘ │ │
│ │ │ │ │ ▼
┌───────┐ │ │ ┌─────────┐ ┌────────┐ │ │ ┌──────┐ │ save as
│Harvest├─┬─┼─┼─►Transform├──►Regulate├─┼─┬─┼──►Format├─┼─┬─►FormattedMetadataRecord
└───────┘ │ │ │ └─────────┘ └────────┘ │ │ │ └──────┘ │ │
│ │ │ │ │ . │ │ ┌───────┐
│ │ └─────────────────────────┘ │ . │ └──►Indexer│
│ │ │ . │ └───────┘
│ └─────────────────────────────┼─────────────┘ some formats also
│ │ indexed separately
▼ ▼
save as save as
RawDatum NormalizedData
In short, SHARE/trove holds metadata records that describe things and makes those records available for searching, browsing, and subscribing.

![overview of shtrove: metadata records in, search/browse/subscribe out](./project/static/img/shtroverview.png)


## Parts
a look at the tangles of communication between different parts of the system:

```mermaid
graph LR;
subgraph shtrove;
subgraph web[api/web server];
ingest;
search;
browse;
rss;
atom;
oaipmh;
end;
worker["background worker (celery)"];
indexer["indexer daemon"];
rabbitmq["task queue (rabbitmq)"];
postgres["database (postgres)"];
elasticsearch;
web---rabbitmq;
web---postgres;
web---elasticsearch;
worker---rabbitmq;
worker---postgres;
worker---elasticsearch;
indexer---rabbitmq;
indexer---postgres;
indexer---elasticsearch;
end;
source["metadata source (e.g. osf.io backend)"];
user["web user, either by browsing directly or via web app (like osf.io)"];
subscribers["feed subscription tools"];
source-->ingest;
user-->search;
user-->browse;
subscribers-->rss;
subscribers-->atom;
subscribers-->oaipmh;
```

## Code map

A brief look at important areas of code as they happen to exist now.

### Static configuration

`share/schema/` describes the "normalized" metadata schema/format that all
metadata records are converted into when ingested.

`share/sources/` describes a starting set of metadata sources that the system
could harvest metadata from -- these will be put in the database and can be
updated or added to over time.

`project/settings.py` describes system-level settings -- both those that can
be set by environment variables (with their default values) and those that
cannot.

`share/models/` describes the data layer using the [Django](https://www.djangoproject.com/) ORM.

`share/subjects.yaml` describes the "central taxonomy" of subjects allowed
in `Subject.name` fields of `NormalizedData`.

### Harvest and ingest

`share/harvest/` and `share/harvesters/` describe how metadata records
are pulled from other metadata repositories.

`share/transform/` and `share/transformers/` describe how raw data (possibly
in any format) are transformed to the "normalized" schema.
- `trove`: django app for rdf-based apis
- `trove.digestive_tract`: most of what happens after ingestion
- stores records and identifiers in the database
- initiates indexing
- `trove.extract`: parsing ingested metadata records into resource descriptions
- `trove.derive`: from a given resource description, create special non-rdf serializations
- `trove.render`: from an api response modeled as rdf graph, render the requested mediatype
- `trove.models`: database models for identifiers and resource descriptions
- `trove.trovesearch`: builds rdf-graph responses for trove search apis (using `IndexStrategy` implementations from `share.search`)
- `trove.vocab`: identifies and describes concepts used elsewhere
- `trove.vocab.trove`: describes types, properties, and api paths in the trove api
- `trove.vocab.osfmap`: describes metadata from osf.io (currently the only metadata ingested)
- `trove.openapi`: generate openapi json for the trove api from thesaurus in `trove.vocab.trove`
- `share`: django app with search indexes and remnants of sharev2
- `share.models`: database models for external sources, users, and other system book-keeping
- `share.oaipmh`: provide data via [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html)
- `share.search`: all interaction with elasticsearch
- `share.search.index_strategy`: abstract base class `IndexStrategy` with multiple implementations, for different approaches to indexing the same data
- `share.search.daemon`: the "indexer daemon", an optimized background worker for batch-processing updates and sending to all active index strategies
- `share.search.index_messenger`: for sending messages to the indexer daemon
- `api`: django app with remnants of the legacy sharev2 api
- `api.views.feeds`: allows custom RSS and Atom feeds
- otherwise, subject to possible deprecation
- `osf_oauth2_adapter`: django app for login via osf.io
- `project`: the actual django project
- default settings at `project.settings`
- pulls together code from other directories implemented as django apps (`share`, `trove`, `api`, and `osf_oauth2_adapter`)
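To illustrate the `IndexStrategy` idea named above -- multiple implementations of one abstract base class, each a different approach to indexing the same data -- here is a hedged sketch. The real class in `share.search.index_strategy` has a different interface; the method names and the toy in-memory implementation below are hypothetical.

```python
from abc import ABC, abstractmethod


class IndexStrategy(ABC):
    """One approach to indexing the same underlying data (illustrative only)."""

    @abstractmethod
    def index_record(self, record_id: str, record: dict) -> None:
        ...

    @abstractmethod
    def search(self, query: str) -> list[str]:
        ...


class InMemoryIndexStrategy(IndexStrategy):
    """Toy implementation: naive keyword lookup over stored records."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def index_record(self, record_id: str, record: dict) -> None:
        self._records[record_id] = record

    def search(self, query: str) -> list[str]:
        # match any record whose content contains the query, case-insensitively
        return [
            _id for _id, _record in self._records.items()
            if query.lower() in str(_record).lower()
        ]
```

New strategies plug in by subclassing, so the same data can be indexed several ways side by side (which is roughly what the indexer daemon does when it sends updates to all active strategies).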

`share/regulate/` describes rules which are applied to every normalized datum,
regardless of where it came from or what format it was originally in.

`share/metadata_formats/` describes how a normalized datum can be formatted
into any supported output format.

`share/tasks/` runs the harvest/ingest pipeline and stores each task's status
(including debugging info, if errored) as a `HarvestJob` or `IngestJob`.
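The transform/regulate/format stages described above can be sketched as composed functions. This is a hypothetical toy, not the real pipeline (which lives in `share/tasks/` and the modules above); the raw-data format and field names here are made up.

```python
def transform(raw_datum: str) -> dict:
    # parse source-specific raw data into the "normalized" schema
    _title, _, _year = raw_datum.partition("|")
    return {"title": _title.strip(), "date": _year.strip()}


def regulate(normalized: dict) -> dict:
    # apply source-agnostic cleanup rules to every normalized datum
    return {_key: _value for _key, _value in normalized.items() if _value}


def format_record(normalized: dict, output_format: str) -> str:
    # render the normalized datum in a supported output format
    if output_format == "oai_dc":
        return f"<dc:title>{normalized['title']}</dc:title>"
    raise ValueError(f"unsupported format: {output_format}")


def ingest(raw_datum: str) -> str:
    # the stages compose: transform, then regulate, then format
    return format_record(regulate(transform(raw_datum)), "oai_dc")
```

The point of the shape: each stage has one job, so a new source only needs a new `transform` and a new output format only needs a new `format_record` branch.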

### Outward-facing views

`share/search/` describes how the search indexes are structured, managed, and
updated when new metadata records are introduced -- this provides a view for
discovering items based on whatever search criteria.

`share/oaipmh/` describes the [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html)
view for harvesting metadata from SHARE/Trove in bulk.
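OAI-PMH is a plain HTTP protocol: each request is a base URL plus a `verb` and verb-specific arguments. A minimal sketch of building such a request (the verb and argument names come from the OAI-PMH 2.0 spec; the base URL is hypothetical):

```python
from urllib.parse import urlencode


def oaipmh_url(base_url: str, verb: str, **kwargs: str) -> str:
    """Build an OAI-PMH request URL from a verb and its arguments."""
    return f"{base_url}?{urlencode({'verb': verb, **kwargs})}"


# e.g. harvest records in the oai_dc metadata format:
list_records = oaipmh_url(
    "https://shtrove.example/oai-pmh",
    "ListRecords",
    metadataPrefix="oai_dc",
)
```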

`api/` describes a mostly REST-ful API that's useful for inspecting records for
a specific item of interest.

### Internals

`share/admin/` is a Django app for administrative access to the SHARE database
and pipeline logs.

`osf_oauth2_adapter/` is a Django app to support logging in to SHARE via OSF
## Cross-cutting concerns

### Testing
### Resource descriptions

`tests/` are tests.
Uses the [resource description framework](https://www.w3.org/TR/rdf11-primer/#section-Introduction):
- the content of each ingested metadata record is an rdf graph focused on a specific resource
- all api responses from `trove` views are (experimentally) modeled as rdf graphs, which may be rendered a variety of ways
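The "rdf graph focused on a specific resource" idea can be sketched with plain triples -- stdlib only, no rdf library, and the URIs below are hypothetical:

```python
# a resource description: a set of (subject, predicate, object) triples,
# all focused on one resource
FOCUS = "https://osf.example/abcde"
DCTERMS = "http://purl.org/dc/terms/"

description: set[tuple[str, str, str]] = {
    (FOCUS, DCTERMS + "title", "An Example Project"),
    (FOCUS, DCTERMS + "created", "2023-01-01"),
}


def triples_about(graph: set[tuple[str, str, str]], subject: str) -> set:
    # in a focused description, every triple shares the focus as subject
    return {_triple for _triple in graph if _triple[0] == subject}
```

Because the model is "a graph about a focus resource" rather than "a blob in one format", the same description can be rendered as turtle, json-ld, or any other supported mediatype (that's what `trove.render` does).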

## Cross-cutting concerns
### Identifiers

### Immutable metadata
Whenever feasible, use full URI strings to identify resources, concepts, types, and properties that may be exposed outwardly.

Metadata records at all stages of the pipeline (`RawDatum`, `NormalizedData`,
`FormattedMetadataRecord`) should be considered immutable -- any updates
result in a new record being created, not an old record being altered.
Prefer using open, standard, well-defined namespaces wherever possible ([DCAT](https://www.w3.org/TR/vocab-dcat-3/) is a good place to start; see `trove.vocab.namespaces` for others already in use). When app-specific concepts must be defined, use the `TROVE` namespace (`https://share.osf.io/vocab/2023/trove/`).

Multiple records which describe the same item/object are grouped by a
"source-unique identifier" or "suid" -- essentially a two-tuple
`(source, identifier)` that uniquely and persistently identifies an item in
the source repository. In most outward-facing views, default to showing only
the most recent record for each suid.
A notable exception (non-URI identifier) is the "source-unique identifier" or "suid" -- essentially a two-tuple `(source, identifier)` that uniquely and persistently identifies a metadata record in a source repository. This `identifier` may be any string value, provided by the external source.
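A hedged sketch of the "most recent record per suid" behavior described above (field names and the sorting key are hypothetical, for illustration only):

```python
Suid = tuple[str, str]  # (source, identifier)


def latest_by_suid(records: list[dict]) -> dict[Suid, dict]:
    """Group records by suid, keeping only the most recent for each."""
    _latest: dict[Suid, dict] = {}
    for _record in sorted(records, key=lambda _r: _r["received_at"]):
        _suid = (_record["source"], _record["identifier"])
        _latest[_suid] = _record  # later records overwrite earlier ones
    return _latest
```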

### Conventions
(an incomplete list)

- functions prefixed `pls_` ("please") are a request for something to happen
- local variables prefixed with underscore (to consistently distinguish between internal-only names and those imported/built-in)
- prefer full type annotations in python code, wherever reasonably feasible
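Putting the conventions above together in one (entirely hypothetical) function:

```python
def pls_send_greeting(recipient_name: str) -> str:
    """`pls_` ("please") prefix: this function is a request for something to happen."""
    # local names start with an underscore, distinguishing them at a glance
    # from imported and built-in names; annotations are as full as feasible
    _greeting = f"hello, {recipient_name}!"
    return _greeting
```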

## Why this?
inspired by [this writeup](https://matklad.github.io/2021/02/06/ARCHITECTURE.md.html)
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
# Change Log

# [25.5.1] - 2025-08-21
- improve error handling in celery task-result backend
- use logging config in celery worker
- improve code docs (README.md et al.)

# [25.5.0] - 2025-07-15
- use python 3.13
- use `poetry` to manage dependencies
15 changes: 13 additions & 2 deletions CONTRIBUTING.md
@@ -1,7 +1,18 @@
# CONTRIBUTING

TODO: how do we want to guide community contributors?
> note: this codebase is currently (and historically) rather entangled with [osf.io](https://osf.io), which has its shtrove at https://share.osf.io -- stay tuned for more-reusable open-source libraries and tools that should be more accessible to community contribution

For now, if you're interested in contributing to SHARE/Trove, feel free to
For now, if you're interested in contributing to SHARE/trove, feel free to
[open an issue on github](https://github.com/CenterForOpenScience/SHARE/issues)
and start a conversation.

## Required checks

All changes must pass the following checks with no errors:
- linting: `python -m flake8`
- static type-checking (on `trove/` code only, for now): `python -m mypy trove`
- tests: `python -m pytest -x tests/`
- note: some tests require other services running -- if [using the provided docker-compose.yml](./how-to/run-locally.md), it's recommended to bring the services up in the background (`docker compose up -d worker` ups the worker and everything it depends on) and run the tests from within one of the python containers (`indexer`, `worker`, or `web`):
`docker compose exec indexer python -m pytest -x tests/`

All new changes should also avoid decreasing test coverage, when reasonably possible (currently checked on github pull requests).
34 changes: 9 additions & 25 deletions README.md
@@ -1,33 +1,17 @@
# SHARE/Trove
# SHARE/trove (aka SHARtrove, shtrove)

SHARE is creating a free, open dataset of research (meta)data.
> share (verb): to have or use in common.

> **Note**: SHARE’s open API tools and services help bring together scholarship distributed across research ecosystems for the purpose of greater discoverability. However, SHARE does not guarantee a complete aggregation of searched outputs. For this reason, SHARE results should not be used for methodological analyses, such as systematic reviews.
> trove (noun): a store of valuable or delightful things.

[![Coverage Status](https://coveralls.io/repos/github/CenterForOpenScience/SHARE/badge.svg?branch=develop)](https://coveralls.io/github/CenterForOpenScience/SHARE?branch=develop)
SHARE/trove (aka SHARtrove, shtrove) is a service meant to store (meta)data you wish to keep and offer openly.

## Documentation
note: this codebase is currently (and historically) rather entangled with [osf.io](https://osf.io), which has its shtrove at https://share.osf.io -- stay tuned for more-reusable open-source libraries and tools for working with (meta)data

### What is this?
see [WHAT-IS-THIS-EVEN.md](./WHAT-IS-THIS-EVEN.md)
see [ARCHITECTURE.md](./ARCHITECTURE.md) for help navigating this codebase

### How can I use it?
see [how-to/use-the-api.md](./how-to/use-the-api.md)
see [CONTRIBUTING.md](./CONTRIBUTING.md) for info about contributing changes

### How do I navigate this codebase?
see [ARCHITECTURE.md](./ARCHITECTURE.md)

### How do I run a copy locally?
see [how-to/run-locally.md](./how-to/run-locally.md)


## Running Tests

### Unit test suite

py.test

### BDD Suite

behave
see [how-to/use-the-api.md](./how-to/use-the-api.md) for help using the api to add and access (meta)data

see [how-to/run-locally.md](./how-to/run-locally.md) for help running a shtrove instance for local development
86 changes: 86 additions & 0 deletions TODO.md
@@ -0,0 +1,86 @@
# TODO:
ways to better this mess

## better shtrove api experience

- better web-browsing experience
- when `Accept` header accepts html, use html regardless of query-params
- when query param `acceptMediatype` requests another mediatype, display on page in copy/pastable way
- exception: when given `withFileName`, download without html wrapping
- exception: `/trove/browse` should still give hypertext with clickable links
- include more explanatory docs (and better fill out those explanations)
- more helpful (less erratic) visual design
- in each html rendering of an api response, include a `<form>` for adding/editing/viewing query params
- better tsv/csv experience
- set default columns for `index-value-search` (and/or broadly improve `fields` handling)
- better turtle experience
- quoted literal graphs also turtle
- omit unnecessary `^^rdf:string`
- better jsonld experience
- provide `@context` (via header, at least)
- accept jsonld at `/trove/ingest` (or at each `ldp:inbox`...)
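The content-negotiation precedence proposed in the list above might be sketched like this -- note these are the *proposed* rules, not current behavior, and all parameter names beyond those in the list are assumptions:

```python
def choose_rendering(accept_header: str, query_params: dict[str, str]) -> str:
    """Pick a response mediatype per the proposed rules (sketch only)."""
    if "withFileName" in query_params:
        # exception: download the raw mediatype, without html wrapping
        return query_params.get("acceptMediatype", "text/turtle")
    if "text/html" in accept_header:
        # html wins regardless of query params (requested mediatype
        # would be displayed on the page in a copy/pastable way)
        return "text/html"
    return query_params.get("acceptMediatype", "application/ld+json")
```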


## modular packaging
move actually-helpful logic into separate packages that can be used and maintained independently of
any particular web app/api/framework (and then use those packages in shtrove and osf)

- `osfmap`: standalone OSFMAP definition
- define osfmap properties and shapes (following DCTAP) in static tsv files
- use `tapshoes` (below) to generate docs and helpful utility functions
- may replace/simplify:
- `osf.metadata.osf_gathering.OSFMAP` (and related constants)
- `trove.vocab.osfmap`
- `trove.derive.osfmap_json`
- `tapshoes`: for using and packaging [tabular application profiles](https://dcmi.github.io/dctap/) in python
- take a set of tsv/csv files as input
- should support any valid DCTAP (aim to be worth community interest)
- initial/immediate use case `osfmap`
- generate more human-readable docs of properties and shapes/types
- validate a given record (rdf graph) against a profile
- serialize a valid record in a consistent/stable way (according to the profile)
- enable publishing "official" application profiles as installable python packages
- learn from and consider using prior dctap work:
- dctap-python: https://pypi.org/project/dctap/
- loads tabular files into more immediately usable form
- tap2shacl: https://pypi.org/project/tap2shacl/
- builds shacl constraints from application profile
- could then validate a given graph with pyshacl: https://pypi.org/project/pyshacl/
- metadata record crosswalk/serialization
- given a record (as rdf graph) and application profile to which it conforms (like OSFMAP), offer:
- crosswalking to a standard vocab (DCAT, schema.org, ...)
- stable rdf serialization (json-ld, turtle, xml, ...)
- special bespoke serialization (datacite xml/json, oai_dc, ...)
- may replace/simplify:
- `osf.metadata.serializers`
- `trove.derive`
- `shtrove`: reusable package with the good parts of share/trove
- python api and command-line tools
- given application profile
- digestive tract with pluggable storage/indexing interfaces
- methods for ingest, search, browse, subscribe
- `django-shtrove`: django wrapper for `shtrove` functionality
- set application profile via django setting
- django models for storage, elasticsearch for indexing
- django views for ingest, search, browse, subscribe


## open web standards
- data catalog vocabulary (DCAT) https://www.w3.org/TR/vocab-dcat-3/
- an appropriate (and better thought-thru) vocab for a lot of what shtrove does
- already used in some ways, but would benefit from adopting more thoroughly
- replace bespoke types (like `trove:Indexcard`) with better-defined dcat equivalents (like `dcat:CatalogRecord`)
- rename various properties/types/variables similarly
- "catalog" vs "index"
- "record" vs "card"
- replace checksum-iris with `spdx:checksum` (added in dcat 3)
- linked data notifications (LDN) https://www.w3.org/TR/ldn/
- shtrove incidentally (partially) aligns with linked-data principles -- could lean into that
- replace `/trove/ingest` with one or more `ldp:inbox` urls
- trove index-card like an inbox containing current/past resource descriptions
```
<://osf.example/blarg> ldp:inbox <://shtrove.example/index-card/0000-00...> .
<://shtrove.example/index-card/0000-00...> ldp:contains <://shtrove.example/description/0000-00...> .
<://shtrove.example/description/0000-00...> foaf:primaryTopic <://osf.example/blarg>
```
(might consider renaming "index-card" for consistency/clarity)