# Catalog Database

The Catalog database is a SQL database of information describing data: its
name, metadata, structure, format, and location.

## Nodes

The `nodes` table is the _logical_ view of the data, the way that Tiled
presents the data to clients. Each row represents one node in the logical
"tree" of data represented by Tiled.

- `metadata` --- user-controlled JSON object, with arbitrary metadata
- `ancestors` and `key` --- together specify the unique path of the data
- `structure_family` --- enum of structure types (`"container"`, `"array"`, `"table"`, ...)
- `specs` --- user-controlled JSON list of specs, such as `[{"name": "XDI", "version": "1"}]`
- `id` --- an internal integer primary key, not exposed by the API
- `time_created` and `time_updated` --- for forensics, not exposed by the API
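
As a rough orientation, the table might be declared like this. This is an
illustrative sketch only; the column types and constraints are assumptions,
not the schema Tiled actually ships.

```sql
-- Illustrative sketch only; types and constraints are assumptions.
CREATE TABLE nodes (
    id INTEGER PRIMARY KEY,          -- internal, not exposed by the API
    key TEXT NOT NULL,               -- final segment of the node's path
    ancestors JSON,                  -- preceding path segments
    structure_family TEXT NOT NULL,  -- "container", "array", "table", ...
    metadata JSON,                   -- user-controlled, arbitrary
    specs JSON,                      -- e.g. [{"name": "XDI", "version": "1"}]
    time_created TIMESTAMP,
    time_updated TIMESTAMP
);
```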

## Data Source

Each Data Source is associated with one Node. Together, `data_sources`,
`structures`, and `assets` describe the format, structure, and location of
the data.

- `mimetype` --- MIME type string describing the format, such as `"text/csv"`
  (This is used by Tiled to identify a suitable Adapter to read this data.)
- `parameters` --- JSON object with additional parameters that will be passed
  to the Adapter
- `management` --- enum indicating whether the data is registered `"external"` data
  or `"writable"` data managed by Tiled
- `structure_family` --- enum of structure types (`"container"`, `"array"`, `"table"`, ...)
- `structure_id` --- a foreign key to the `structures` table
- `node_id` --- foreign key to `nodes`
- `id` --- integer primary key
- `time_created` and `time_updated` --- for forensics, not exposed by the API
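
A minimal sketch of this table, again with assumed types and constraints:

```sql
-- Illustrative sketch only; types and constraints are assumptions.
CREATE TABLE data_sources (
    id INTEGER PRIMARY KEY,
    node_id INTEGER REFERENCES nodes(id),
    structure_id TEXT REFERENCES structures(id),
    mimetype TEXT NOT NULL,          -- selects the Adapter, e.g. "text/csv"
    parameters JSON,                 -- extra parameters passed to the Adapter
    management TEXT NOT NULL,        -- "external" or "writable"
    structure_family TEXT NOT NULL,  -- "container", "array", "table", ...
    time_created TIMESTAMP,
    time_updated TIMESTAMP
);
```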

## Structure

Each Data Source references exactly one Structure.

- `structure` --- JSON object describing the structure
- `id` --- MD5 hash of the
  [RFC 8785](https://www.rfc-editor.org/rfc/rfc8785) canonical JSON of the structure
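
Because `id` is a hash of the content, identical structures map to the same
row, so many Data Sources can share one Structure. A minimal sketch, with
assumed types:

```sql
-- Illustrative sketch only; types are assumptions.
CREATE TABLE structures (
    id TEXT PRIMARY KEY,     -- MD5 of RFC 8785 canonical JSON of `structure`
    structure JSON NOT NULL
);
```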

## Asset

- `data_uri` --- location of data, given as `file://localhost/PATH`
  (It is planned to extend to schemes other than `file`, such as `s3`, in the
  future.)
- `is_directory` --- boolean
- `hash_type` and `hash_content` --- not yet implemented (i.e. always NULL) but
  intended for content verification
- `size` --- not yet implemented (i.e. always NULL) but intended to support
  fast queries for data size estimation
- `id` --- integer primary key
- `time_created` and `time_updated` --- for forensics, not exposed by the API
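
A minimal sketch of this table, with assumed types and constraints:

```sql
-- Illustrative sketch only; types and constraints are assumptions.
CREATE TABLE assets (
    id INTEGER PRIMARY KEY,
    data_uri TEXT NOT NULL,        -- e.g. "file://localhost/path/to/data.h5"
    is_directory BOOLEAN NOT NULL,
    hash_type TEXT,                -- reserved for content verification
    hash_content TEXT,             -- reserved for content verification
    size INTEGER,                  -- reserved for fast size estimation
    time_created TIMESTAMP,
    time_updated TIMESTAMP
);
```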

## Data Source Asset Relation

Assets and Data Sources have a many-to-many relation. The
`data_source_asset_association` table is best described by the examples below.

- `data_source_id`, `asset_id` --- foreign keys
- `parameter` --- the name of the parameter this Asset should be passed to
- `num` --- the position of this item in a list

If `parameter` is NULL, the Asset is a supporting file, not passed directly to
the Adapter.

If `num` is NULL, the Adapter will be passed a scalar value. If `num` is an
integer, the Adapter will be passed a list sorted by `num`.
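
A minimal sketch of the association table itself (types and constraints are
illustrative assumptions):

```sql
-- Illustrative sketch only; types and constraints are assumptions.
CREATE TABLE data_source_asset_association (
    data_source_id INTEGER REFERENCES data_sources(id),
    asset_id INTEGER REFERENCES assets(id),
    parameter TEXT,  -- NULL marks a supporting file
    num INTEGER      -- NULL for a scalar argument; list position otherwise
);
```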

Database triggers are used to ensure self-consistency.

Note that `time_created` and `time_updated` refer to the corresponding
database entry (Node, Data Source, Asset), not to the underlying data files.

### Single HDF5 file

This is a simple example: one Data Source and one associated Asset.

```sql
select id, mimetype, parameters from data_sources;
```

id | mimetype | parameters
-- | -- | --
1 | "application/x-hdf5" | {"swmr": true}

```sql
select id, data_uri, is_directory from assets;
```

id | data_uri | is_directory
-- | -- | --
1 | "file://localhost/path/to/data.h5" | false

The HDF5 Adapter takes one HDF5 file passed to the argument
named `data_uri`, so the Asset is given parameter `"data_uri"`
and num `NULL`.

```sql
select * from data_source_asset_association;
```

data_source_id | asset_id | parameter | num
-- | -- | -- | --
1 | 1 | "data_uri" | NULL

### Single Zarr directory

This is similar. A single Zarr dataset is backed by a directory, not a
file. The internal structure of the directory is managed by Zarr, not by the
user, so Tiled can simply track the whole directory as a unit, not each
individual file.

```sql
select id, mimetype, parameters from data_sources;
```

id | mimetype | parameters
-- | -- | --
1 | "application/x-zarr" | {}

```sql
select id, data_uri, is_directory from assets;
```

id | data_uri | is_directory
-- | -- | --
1 | "file://localhost/path/to/data.zarr" | true

(Notice `is_directory` is `true`.)

```sql
select * from data_source_asset_association;
```

data_source_id | asset_id | parameter | num
-- | -- | -- | --
1 | 1 | "data_uri" | NULL

### Single TIFF image

This is another simple example, very much like the HDF5 example.

```sql
select id, mimetype, parameters from data_sources;
```

id | mimetype | parameters
-- | -- | --
1 | "image/tiff" | {}

```sql
select id, data_uri, is_directory from assets;
```

id | data_uri | is_directory
-- | -- | --
1 | "file://localhost/path/to/image.tiff" | false

```sql
select * from data_source_asset_association;
```

data_source_id | asset_id | parameter | num
-- | -- | -- | --
1 | 1 | "data_uri" | NULL

### TIFF sequence

Now we have a sequence of separate TIFF files (`image00001.tiff`,
`image00002.tiff`, ...) that we want to treat as a single Data Source.

```sql
select id, mimetype, parameters from data_sources;
```

id | mimetype | parameters
-- | -- | --
1 | "multipart/related;type=image/tiff" | {}

The MIME type `multipart/related;type=image/tiff` is registered to an Adapter
that expects a _sequence_ of TIFF files, e.g. `TiffSequenceAdapter`.

```sql
select id, data_uri, is_directory from assets;
```

id | data_uri | is_directory
-- | -- | --
1 | "file://localhost/path/to/image00001.tiff" | false
2 | "file://localhost/path/to/image00002.tiff" | false
3 | "file://localhost/path/to/image00003.tiff" | false

```sql
select * from data_source_asset_association;
```

data_source_id | asset_id | parameter | num
-- | -- | -- | --
1 | 1 | "data_uris" | 0
1 | 2 | "data_uris" | 1
1 | 3 | "data_uris" | 2

### Single CSV file

The CSV Adapter is designed to accept multiple CSV files representing
batches (a.k.a. partitions) of rows.

```sql
select id, mimetype, parameters from data_sources;
```

id | mimetype | parameters
-- | -- | --
1 | "text/csv" | {}

```sql
select id, data_uri, is_directory from assets;
```

id | data_uri | is_directory
-- | -- | --
1 | "file://localhost/path/to/table.csv" | false

The CSV Adapter takes one or more CSV files passed as a list to the
argument named `data_uris`, so the Asset is given parameter
`"data_uris"` and num `0`.

```sql
select * from data_source_asset_association;
```

data_source_id | asset_id | parameter | num
-- | -- | -- | --
1 | 1 | "data_uris" | 0

### HDF5 file with virtual datasets

Here is an example where `parameter` is set to NULL.

```sql
select id, mimetype, parameters from data_sources;
```

id | mimetype | parameters
-- | -- | --
1 | "application/x-hdf5" | {}

```sql
select id, data_uri, is_directory from assets;
```

id | data_uri | is_directory
-- | -- | --
1 | "file://localhost/path/to/master.h5" | false
2 | "file://localhost/path/to/data00001.h5" | false
3 | "file://localhost/path/to/data00002.h5" | false
4 | "file://localhost/path/to/data00003.h5" | false

The HDF5 Adapter takes only the "master" file, passed to the argument named
`data_uri`. The other files, which back its virtual datasets, are supporting
Assets with parameter `NULL`: Tiled tracks them but does not pass them
directly to the Adapter.

```sql
select * from data_source_asset_association;
```

data_source_id | asset_id | parameter | num
-- | -- | -- | --
1 | 1 | "data_uri" | NULL
1 | 2 | NULL | NULL
1 | 3 | NULL | NULL
1 | 4 | NULL | NULL
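
Again as an illustration (not necessarily how Tiled queries this), the
supporting files of a Data Source are the associated Assets whose
`parameter` is NULL:

```sql
-- Illustrative query: supporting files tracked for this Data Source
-- but not passed to the Adapter.
SELECT a.data_uri
FROM data_source_asset_association AS dsa
JOIN assets AS a ON a.id = dsa.asset_id
WHERE dsa.data_source_id = 1 AND dsa.parameter IS NULL;
```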

## Revisions

The `revisions` table stores snapshots of Node `metadata` and `specs`. When an
update is made, the row in the `nodes` table is updated and a _copy_ of the
original content is inserted in the `revisions` table, as sketched below.

- `node_id` --- foreign key to the node
- `revision_number` --- integer counting revisions of this node from 1
- `metadata` --- snapshot of node metadata
- `specs` --- snapshot of node specs
- `id` --- an internal integer primary key, not exposed by the API
- `time_created` and `time_updated` --- for forensics, not exposed by the API
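
The update pattern can be sketched in plain SQL. This illustrates the
bookkeeping only, not the statements Tiled actually issues; the node `id`
and the new metadata value here are hypothetical.

```sql
-- Illustrative sketch: snapshot the current content, then update in place.
INSERT INTO revisions (node_id, revision_number, metadata, specs)
SELECT id,
       1 + COALESCE((SELECT MAX(revision_number)
                     FROM revisions
                     WHERE revisions.node_id = nodes.id), 0),
       metadata,
       specs
FROM nodes
WHERE id = 42;  -- hypothetical node id

-- Then overwrite the live row in `nodes`.
UPDATE nodes
SET metadata = '{"color": "red"}'  -- hypothetical new metadata
WHERE id = 42;
```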