forked from starlinglab/integrity-v2
-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
f169f70
commit abcf3aa
Showing
1 changed file
with
88 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
# Folder Preprocessor | ||
|
||
The Folder Preprocessor recursively monitors file changes within a directory and submits the file and its metadata to a webhook if the file meets specific criteria. | ||
|
||
This preprocessor must be run with a PostgreSQL database. Users are expected to set up Project Metadata in the admin database before starting to sync the folders. | ||
|
||
## Setup | ||
|
||
### Config | ||
|
||
The following is the required configuration in TOML. | ||
|
||
#### FolderPreprocessor | ||
|
||
`SyncFolderRoot` must be set to define which directory should be watched for file changes. | ||
|
||
#### FolderDatabase | ||
|
||
Credentials to connect to a PostgreSQL Database. The `Database` must be created beforehand. | ||
|
||
### Webhook | ||
|
||
`Host` should be set to where a webhook service is hosted. `Jwt` should be a pre-shared secret for authentication to the webhook. | ||
|
||
### Database | ||
|
||
#### Project Metadata (`project_metadata`) | ||
|
||
This table allows users to define `project_id` and corresponding `project_path` file path (relative to the sync root directory) to watch for new files. `file_extensions` also allow users to only watch for specific file extensions. | ||
|
||
If a file does not belong to any project or is not in `file_extensions` (if set), it is not handled by the folder preprocessor. | ||
|
||
| Column | Description | | ||
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------- | | ||
| project_id | A unique identifier of the project, preferably human-readable. | | ||
| project_path | The path where files are located relative to the sync folder root, e.g., `/project-id`. | | ||
| author_type | (optional) Author type according to [Schema.org/author](https://schema.org/author). | | ||
| author_name | (optional) Author name according to [Schema.org/author](https://schema.org/author). | | ||
| author_identifier | (optional) Author identifier according to [Schema.org/author](https://schema.org/author). | | ||
| file_extensions | (optional) A list of allowed file extensions to watch for. If not set, all file extensions are watched. e.g., `.gif,.jpg,.jpeg`. | | ||
|
||
#### File Status | ||
|
||
This table stores all files that are found and match the criteria. | ||
|
||
| Column | Description | | ||
| ---------- | ----------------------------------------------------------------------------------------------------------------------------- | | ||
| id | Auto-increment unique ID, not used by humans. | | ||
| file_path | The file path where this file is found, unique key. | | ||
| status | Status of the file, can be one of `Found`, `Uploading`, `Success`, `Error`. | | ||
| error | If the status is `Error`, the error message is stored here. | | ||
| cid | The resulting content ID if the status is `Success`. Only the first CID is recorded if the file induces more than one upload. | | ||
| created_at | The time the file is first found. | | ||
| updated_at | The time of the last status update. | | ||
|
||
## Usage | ||
|
||
When the preprocessor is first launched, it will scan all `project_paths` under the sync root folder. Any files found and matching the handling criteria will be processed. | ||
|
||
After that, the preprocessor recursively watches for changes under the same paths and handles any new or renamed files according to the criteria. | ||
|
||
The preprocessor keeps track of discovered file statuses in the database. Any file (identified by file path) that has a status other than `Found` will not be re-processed. The CID is set for the `Success` status and the `error` is set for the `Error` status. | ||
|
||
## Ignored Files (Criteria for file to be processed) | ||
|
||
For a file to be processed by this preprocessor, it must fit certain criteria: | ||
|
||
- It must be under the `project_path` of a project, i.e. it must belong to a project. | ||
- If `file_extensions` is set in the project, it must have an allowed extension. | ||
- The file name must not begin with `.` (i.e. no hidden files). | ||
- The file name must not end with `.partial`. | ||
- The file status must not be `Error` or `Complete`. | ||
|
||
## Special File Types | ||
|
||
### ProofMode | ||
|
||
This preprocessor parses `ProofMode` bundles. If a `.zip` file contains a `HowToVerifyProofData.txt` at the root, it is treated as a `ProofMode` bundle. All signatures and hashes will be verified. | ||
|
||
If a `ProofMode` bundle contains more than one asset, all assets will be verified and individually uploaded to the webhook. Additionally, a special attribute `proofmode` will be set to contain signature-related information. | ||
|
||
### WACZ | ||
|
||
This preprocessor parses `wacz` bundles. If a zip file with the `.wacz` extension contains a `datapackage.json` at the root, it is treated as a `wacz` bundle. | ||
|
||
This preprocessor supports verifying both anonymously signed and domain-signed `wacz` bundles. For anonymous bundles, we don't have a trusted list of keys, but a public key whitelist will be introduced soon. For domain signed bundle, a hardcode list of Let's Encrypt root cert, and freetsa root cert for RFC3161 verification is whitelisted. | ||
|
||
Also a special attribute wacz would be set to contain signature related information. |