Streams and their files

John Mark Ockerbloom edited this page Mar 4, 2022 · 12 revisions

Streams and their files (DRAFT)

Streams represent a set of records associated with an organization. For the POD-Reshare project, the default stream is assumed to represent the full holdings an organization intends to expose to Reshare and other peers. The full set of holdings is represented by the files in the stream that contain new records, changed records, and deleted records, processed in the order in which the files were uploaded. (Times of upload are available in the stream listings.)

The order in which multiple files embedded in the same uploaded file (such as a tar or zip file) are processed is not well-defined. We therefore recommend that each uploaded package contain only a single file, unless the uploader can be sure that no two files in the package share a record ID. For processing efficiency, we also recommend packaging formats that can be processed as a stream (gzipped files can; tar files might not), so that a package does not need to be completely unpacked or parsed before processing can begin.
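To illustrate what "processable as a stream" means for a gzipped file, here is a minimal Python sketch. It assumes, purely for illustration, a line-delimited payload; the function name `iter_lines_from_gzip` and the sample records are hypothetical and are not part of POD.

```python
import gzip
import io


def iter_lines_from_gzip(fileobj):
    """Yield decoded lines one at a time from a gzip-compressed binary stream.

    Decompression happens incrementally as lines are requested, so the
    whole file never has to be unpacked before the first line is seen.
    """
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as text:
        for line in text:
            yield line.rstrip("\n")


# Hypothetical payload standing in for an uploaded gzipped file.
payload = gzip.compress(b"record-1\nrecord-2\nrecord-3\n")
lines = list(iter_lines_from_gzip(io.BytesIO(payload)))
print(lines)
```

The same generator works on a file opened in binary mode or on any other binary file-like object, which is what lets a consumer start work before the upload has been fully read.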

Full dumps and incremental updates

Normally, the oldest files in a stream represent a full dump, and later files indicate incremental additions, changes, and deletions. There is, however, no technical difference between files uploaded as part of a full dump and files uploaded as incremental updates. Accepted formats for uploaded files can be found under Data requirements.

A default stream should not contain only an incremental update. If it does, the incremental update will be interpreted as a full dump. Incremental updates to a full dump should be placed in the same stream as the full dump to be properly understood.

Names of files in streams are considered reusable labels. If a file is uploaded with the same name as a file already in the stream, it will be considered a new file unrelated to the old one, and processed separately based on its arrival date. However, for clarity we recommend that new files be given unique names in the stream if feasible.
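As a rough illustration of name reuse, the Python sketch below orders a stream listing by upload time only and ignores file names. The tuple layout, the file name `holdings-update`, and the record contents are hypothetical, and "later file wins" is assumed from the description of processing order above.

```python
from datetime import datetime, timezone

# Hypothetical stream listing: (file name, upload time, {record ID: content}).
# Both entries share a name but are treated as unrelated files.
listing = [
    ("holdings-update", datetime(2022, 3, 2, tzinfo=timezone.utc),
     {"7": "revised"}),
    ("holdings-update", datetime(2022, 3, 1, tzinfo=timezone.utc),
     {"7": "original", "8": "new"}),
]


def processing_order(entries):
    """Sort files by upload time; the file name plays no part in ordering."""
    return sorted(entries, key=lambda entry: entry[1])


records = {}
for name, uploaded_at, changes in processing_order(listing):
    records.update(changes)  # a later upload overrides earlier content

print(records)
```

Neither file replaces the other: record 8 from the earlier upload survives, and record 7 takes its content from the later upload.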

A second full dump placed into a stream may not be interpreted as expected unless records that appear in earlier files but not in the new full dump are explicitly deleted. For example, a full dump with IDs 1, 2, 3, and 4, followed by a full dump with IDs 1, 2, and 5, will be understood as a set of records with IDs 1, 2, 3, 4, and 5 (where the contents of the records with IDs 1, 2, and 5 match their contents in the second full dump, and the contents of the records with IDs 3 and 4 match their contents in the first full dump). Uploading a file that deletes IDs 3 and 4 will make the stream be understood as exactly the set of records in the second full dump. Alternatively, a new full dump may be uploaded to a new stream, and that stream then made the default, without requiring any deletions.
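The example above can be replayed with a small Python sketch. It models each uploaded file as a set of added or changed records plus a set of deleted IDs, applied in upload order. The `UploadedFile` class, its field names, and the choice to apply a file's deletions after its additions are assumptions made for illustration, not a description of POD's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class UploadedFile:
    """Hypothetical model of one file in a stream."""
    name: str
    uploaded_at: str  # ISO 8601 UTC timestamp; sorts chronologically as text
    upserts: dict = field(default_factory=dict)  # record ID -> content
    deletes: set = field(default_factory=set)    # record IDs to remove


def replay(files):
    """Apply files in upload order and return the resulting record set."""
    holdings = {}
    for f in sorted(files, key=lambda f: f.uploaded_at):
        holdings.update(f.upserts)         # new and changed records
        for record_id in f.deletes:
            holdings.pop(record_id, None)  # deleting a missing ID is a no-op
    return holdings


first = UploadedFile("full-1", "2022-01-01T00:00:00Z",
                     upserts={1: "a1", 2: "a2", 3: "a3", 4: "a4"})
second = UploadedFile("full-2", "2022-02-01T00:00:00Z",
                      upserts={1: "b1", 2: "b2", 5: "b5"})
cleanup = UploadedFile("deletes", "2022-02-02T00:00:00Z", deletes={3, 4})

merged = replay([first, second])
final = replay([first, second, cleanup])
print(sorted(merged))  # [1, 2, 3, 4, 5]
print(sorted(final))   # [1, 2, 5]
```

In `merged`, records 3 and 4 keep their contents from the first dump because nothing removed them; only the explicit deletion file brings the stream in line with the second dump.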

Errors

There is currently no guarantee on how files with errors (that is, with data that cannot be properly processed) will be interpreted. They may be entirely ignored; some of the data but not all of it may be incorporated into normalized dumps or other exports; some or all of the data may be incorporated but interpreted differently than the uploader intends; or some or all of the data might be processed later than the data in previously uploaded files. The POD project may need to clarify the semantics of processing files with errors in the future.

For now, uploaders with files that they are not sure will be processed as intended may want to test them by uploading them to a non-default stream and seeing how they are processed. If files with errors are uploaded to a default stream, uploaders may want to see how they have been interpreted in normalized files, and then upload further updates to correct any missed or misinterpreted changes. If all else fails, a new full dump can be made to a new stream and that stream then made the default.