Dataflow
Merritt's dataflow (see the diagram below) is identical whether content arrives through the UI (1) or via the SWORD endpoint (1a). The Ingest service contacts the Local ID service (2) to match up any local IDs submitted with the content against existing ARKs, and/or create new ARK-to-local-ID mappings; these mappings are stored in the Inventory database (3).
For a high-level overview of the various services, see the Architecture page. For the complete ingest, storage, and replication process, see the Ingest Process page.
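To picture the Local ID step (2)-(3), here is a minimal sketch of a lookup-or-mint routine. The table name `local_id_map`, the helper `mint_ark`, and the ARK shoulder shown are all invented for illustration; Merritt's actual Local ID schema and minting service are not shown here.

```python
import sqlite3
import uuid

def mint_ark() -> str:
    # Stand-in for a real ARK minter; the shoulder and format are illustrative.
    return f"ark:/99999/fk4{uuid.uuid4().hex[:8]}"

def resolve_local_id(db: sqlite3.Connection, owner: str, local_id: str) -> str:
    """Return the ARK mapped to (owner, local_id), minting a new mapping if absent."""
    row = db.execute(
        "SELECT ark FROM local_id_map WHERE owner = ? AND local_id = ?",
        (owner, local_id),
    ).fetchone()
    if row:
        return row[0]              # (2) an existing ARK-to-local-ID mapping
    ark = mint_ark()
    db.execute(                    # (3) store the new mapping in the database
        "INSERT INTO local_id_map (owner, local_id, ark) VALUES (?, ?, ?)",
        (owner, local_id, ark),
    )
    db.commit()
    return ark
```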
The Ingest service then pushes a manifest (4) of staged content to the Storage service, which pulls the staged content (5) to its own local storage and pushes it (6) to its primary storage node. When this process completes, the Ingest service pushes the storage URL for the object manifest to the Inventory queue (7), creating an inventory job.
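A rough sketch of the Ingest side of steps (4)-(6) follows. The endpoint path, form parameter, and function name are assumptions made for illustration, not Merritt's actual Storage API.

```python
import urllib.parse
import urllib.request

def submit_manifest(storage_base: str, node: int, ark: str, manifest_url: str) -> None:
    """(4) Hand the Storage service a manifest of staged content (hypothetical API)."""
    body = urllib.parse.urlencode({"manifest": manifest_url}).encode()
    req = urllib.request.Request(
        f"{storage_base}/add/{node}/{urllib.parse.quote(ark, safe='')}",
        data=body,
        method="POST",
    )
    # On receipt, Storage pulls the staged files (5) and writes the object
    # to its primary storage node (6).
    with urllib.request.urlopen(req) as resp:
        resp.read()
```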
To process the job, the Inventory service pulls the manifest storage URL from the queue (8) and uses it to pull first the manifest (9), and then (based on the manifest) the object's system metadata (the files in the object's system directory, as well as various files in the producer directory with the mrt- prefix, if present). The information in these files is used to populate the Inventory database (10).
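A condensed sketch of steps (8)-(10), assuming a generic queue client and invented helpers; Merritt's actual queueing, manifest parsing, and database code will differ.

```python
import urllib.request

def system_metadata_paths(manifest: str) -> list[str]:
    # Toy line-based filter; real manifest parsing is structured, not shown here.
    return [
        line.split()[0]
        for line in manifest.splitlines()
        if "/system/" in line or "/producer/mrt-" in line
    ]

def record_metadata(db, url: str, content: bytes) -> None:
    ...  # stand-in: parse the file and write rows to the Inventory database

def process_inventory_job(queue, db) -> None:
    manifest_url = queue.pop()                        # (8) manifest storage URL
    with urllib.request.urlopen(manifest_url) as r:   # (9) pull the manifest
        manifest = r.read().decode()
    for url in system_metadata_paths(manifest):       # system/ files plus any
        with urllib.request.urlopen(url) as r:        # producer/mrt-* files
            record_metadata(db, url, r.read())        # (10) populate the database
```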
The Replication service scans the Inventory database (11) for objects that need to be replicated, pulls the content from each object's primary storage node (11a), pushes it to the object's secondary storage node, and updates the object's replication status in the database.
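Step (11) can be pictured as a periodic database scan. The schema, column names, and the `copy_object` helper below are invented for illustration, not Merritt's real code.

```python
import sqlite3

def copy_object(ark: str, primary: str, secondary: str) -> None:
    ...  # stand-in: stream each file from the primary node to the secondary node

def replication_pass(db: sqlite3.Connection) -> None:
    rows = db.execute(
        "SELECT ark, primary_node, secondary_node FROM objects "
        "WHERE replicated_at IS NULL"
    ).fetchall()                              # (11) find objects needing replication
    for ark, primary, secondary in rows:
        copy_object(ark, primary, secondary)  # (11a) pull from primary, push to secondary
        db.execute(
            "UPDATE objects SET replicated_at = CURRENT_TIMESTAMP WHERE ark = ?",
            (ark,),
        )
    db.commit()
```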
The Audit service, similarly, continually scans the Inventory database (12) for files that have never been audited, or for the least-recently-audited files, which it pulls (12a) from their storage nodes (both primary and secondary). After recalculating the hash of each file, it writes the updated audit status to the database.
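The per-file fixity check in step (12a) amounts to streaming the file and recomputing its digest. A minimal sketch follows; the choice of SHA-256 and the table and status values are assumptions for illustration (Merritt records its own digest types and statuses).

```python
import hashlib
import urllib.request

def audit_file(db, file_url: str, expected_digest: str) -> bool:
    h = hashlib.sha256()
    with urllib.request.urlopen(file_url) as resp:   # (12a) pull from a storage node
        for chunk in iter(lambda: resp.read(65536), b""):
            h.update(chunk)
    ok = h.hexdigest() == expected_digest            # recomputed hash vs. recorded hash
    db.execute(
        "UPDATE files SET audit_status = ?, audited_at = CURRENT_TIMESTAMP "
        "WHERE url = ?",
        ("verified" if ok else "digest-mismatch", file_url),
    )
    db.commit()
    return ok
```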