Derivative Rodeo Handling Pre-existing Files that are not normally generated via Derivative Rodeo generators #251
Comments
jeremyf added a commit to notch8/derivative_rodeo that referenced this issue on Jun 6, 2023
Yes, it would be nice to have sub-directories for pages ripped from a PDF. However, that aspirational state creates complications on the implementation details of the DerivativeRodeo; by moving from a tail glob to a regular expression, we create a more powerful mechanism for finding files. These changes also highlighted a few implementation bugs (namely ensuring the correct expected return value of the newly re-named function). In IIIF Print we still need to consider how to find the child page's derivatives of an original PDF. That is, however, not a problem for this repository. Related to: - notch8/iiif_print#251 Co-authored-by: Kirk Wang <kirk.wang@scientist.com>
jeremyf added a commit that referenced this issue on Jun 6, 2023
This commit leverages the conventions established in the DerivativeRodeo around where we're writing split pages and their derivatives. The inline comments that are written/amended with this PR should be read closely for clarity and intention. Related to: - #251 - notch8/derivative_rodeo#48 Co-authored-by: Kirk Wang <kirk.wang@scientist.com>
jeremyf added a commit that referenced this issue on Jun 7, 2023
This commit leverages the conventions established in the DerivativeRodeo around where we're writing split pages and their derivatives. The inline comments that are written/amended with this PR should be read closely for clarity and intention. Related to: - #251 - notch8/derivative_rodeo#48 Co-authored-by: Kirk Wang <kirk.wang@scientist.com>
jeremyf added a commit that referenced this issue on Jul 7, 2023
This commit leverages the conventions established in the DerivativeRodeo around where we're writing split pages and their derivatives. The inline comments that are written/amended with this PR should be read closely for clarity and intention. Related to: - #251 - notch8/derivative_rodeo#48 Co-authored-by: Kirk Wang <kirk.wang@scientist.com>
Overview:
Existing files are Set A. The generated files are Set B. We want to account for A ∪ B.

Given a one page PDF (named basename.pdf) with the parent work's identifier of work_id.
The basename.pdf has the following existing derivatives:
- basename.txt
- basename.reader.pdf
- basename.jpeg
When we pre-process the files
Then we will have the following files in the given locations:
- s3://work_id/basename.pdf
- s3://work_id/basename.thumbnail.jpeg (we copied basename.jpeg to this location)
- s3://work_id/basename.txt
- s3://work_id/basename.reader.pdf
- s3://work_id/basename--page-1.tiff
- s3://work_id/basename--page-1.thumbnail.jpeg
- s3://work_id/basename--page-1.txt
- s3://work_id/basename--page-1.alto.xml
- s3://work_id/basename--page-1.coordinates.json

Given the above pre-processed files
When we ingest the basename.pdf (via the OAI feed)
Then we will have the following resulting data:
work_id
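
To make the naming convention in the scenario above concrete, here is a minimal Ruby sketch that builds the expected pre-processed locations for a PDF and its split pages. The method name `expected_locations`, the constant `PAGE_DERIVATIVE_SUFFIXES`, and the keyword arguments are hypothetical and are not part of DerivativeRodeo or IIIF Print; only the path patterns come from the list above.

```ruby
# Hypothetical helper, NOT part of DerivativeRodeo or IIIF Print.
# The suffixes mirror the per-page files listed in the scenario.
PAGE_DERIVATIVE_SUFFIXES = %w[tiff thumbnail.jpeg txt alto.xml coordinates.json].freeze

# Returns the locations we expect to exist after pre-processing.
def expected_locations(work_id:, basename:, page_count:)
  prefix = "s3://#{work_id}/#{basename}"

  # Files that sit alongside the original PDF. The pre-existing
  # basename.jpeg is assumed to be copied to basename.thumbnail.jpeg.
  originals = [
    "#{prefix}.pdf",
    "#{prefix}.thumbnail.jpeg",
    "#{prefix}.txt",
    "#{prefix}.reader.pdf"
  ]

  # One set of derivatives for each page split from the PDF.
  pages = (1..page_count).flat_map do |page|
    PAGE_DERIVATIVE_SUFFIXES.map { |suffix| "#{prefix}--page-#{page}.#{suffix}" }
  end

  originals + pages
end

puts expected_locations(work_id: "work_id", basename: "basename", page_count: 1)
```

For the one page PDF in the scenario this yields the nine locations listed above: four for the original and five for the single page.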
As of <2023-06-06 Tue>, the IIIF Print Derivative Rodeo service does not account for additional derivatives that might have been "pre-processed". That is to say, the derivative service does not know how to generate the PDF Plain Text, nor does it know to go looking for it.
In the above scenarios, when we're processing the derivatives for the Work FileSet we need to look for additional files that were not generated via generators but instead brought over from pre-existing storage. We would use the copy generator and some kind of glob selector on the files.
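
As a rough illustration of the "look for additional files" step, the sketch below selects pre-existing files from a directory listing with a regular expression and then takes the union with the generated files (A ∪ B from the overview). It is written in plain Ruby and does not use the DerivativeRodeo API; `pre_existing_matches` and `files_to_account_for` are hypothetical names, and the real copy generator and selector may look quite different.

```ruby
# Hypothetical sketch; none of these names come from DerivativeRodeo.

# Select files named "<basename>.<anything>" from a flat listing.
# Page-level files such as "basename--page-1.txt" do not match because
# the character following the basename must be a literal dot.
def pre_existing_matches(listing, basename:)
  pattern = /\A#{Regexp.escape(basename)}\.[^\/]+\z/
  listing.select { |name| pattern.match?(name) }
end

# Set A (existing) united with Set B (generated). Array#| keeps the
# first occurrence of each entry and drops duplicates.
def files_to_account_for(existing:, generated:)
  existing | generated
end

listing = %w[
  basename.pdf basename.txt basename.reader.pdf basename.jpeg
  basename--page-1.txt other.txt
]
existing = pre_existing_matches(listing, basename: "basename")
generated = %w[basename--page-1.tiff basename--page-1.txt]
puts files_to_account_for(existing: existing, generated: generated)
```

Note that this pattern also matches the original basename.pdf; a real selector would need to decide whether the original counts as a "pre-existing derivative" or should be excluded.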