Derivative Rodeo Handling Pre-existing Files that are not normally generated via Derivative Rodeo generators #251

Open
jeremyf opened this issue Jun 6, 2023 · 0 comments

jeremyf commented Jun 6, 2023

Overview:

Existing files are Set A. The generated files are Set B. We want to account for A ∪ B.

Given a one-page PDF (named basename.pdf) whose parent work has the identifier work_id
And basename.pdf has the following existing derivatives:

  • Plain text file (basename.txt)
  • Reader file (basename.reader.pdf)
  • Thumbnail (basename.jpeg)

When we pre-process the files
Then we will have the following files in the given locations:

  • PDF Original: s3://work_id/basename.pdf
  • PDF Thumbnail: s3://work_id/basename.thumbnail.jpeg (we copied basename.jpeg to this location)
  • PDF Plain Text: s3://work_id/basename.txt
  • PDF Reader: s3://work_id/basename.reader.pdf
  • Page Image: s3://work_id/basename--page-1.tiff
  • Page's thumbnail: s3://work_id/basename--page-1.thumbnail.jpeg
  • Page's plain text: s3://work_id/basename--page-1.txt
  • Page's alto xml: s3://work_id/basename--page-1.alto.xml
  • Page's word coordinates: s3://work_id/basename--page-1.coordinates.json
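
The naming convention in the list above can be sketched as a small function. This is only an illustration of the convention, not code from DerivativeRodeo or IIIF Print; the function name `expected_locations` and the dictionary keys are hypothetical.

```python
from typing import Dict


def expected_locations(work_id: str, basename: str, page_count: int) -> Dict[str, str]:
    """Build the pre-processed locations for one PDF and its split pages.

    Hypothetical helper: it only mirrors the naming shown in the list above.
    """
    root = f"s3://{work_id}/{basename}"
    locations = {
        "pdf_original": f"{root}.pdf",
        "pdf_thumbnail": f"{root}.thumbnail.jpeg",
        "pdf_plain_text": f"{root}.txt",
        "pdf_reader": f"{root}.reader.pdf",
    }
    # Pages are numbered from 1 and use the "--page-N" infix.
    for page in range(1, page_count + 1):
        page_root = f"{root}--page-{page}"
        locations[f"page_{page}_image"] = f"{page_root}.tiff"
        locations[f"page_{page}_thumbnail"] = f"{page_root}.thumbnail.jpeg"
        locations[f"page_{page}_plain_text"] = f"{page_root}.txt"
        locations[f"page_{page}_alto_xml"] = f"{page_root}.alto.xml"
        locations[f"page_{page}_coordinates"] = f"{page_root}.coordinates.json"
    return locations


locations = expected_locations("work_id", "basename", page_count=1)
```

For the one-page example, `locations` holds the nine entries listed above, four for the PDF and five for its single page.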

Given the above pre-processed files
When we ingest the basename.pdf (via the OAI feed)
Then we will have the following resulting data:

  • Work: A "work" with identifier of work_id
  • Work FileSet: with parent of Work and the PDF Original as the original file, and PDF Thumbnail, PDF Plain Text, and PDF Reader as derived files.
  • Page: A "work" with parent of Work
  • Page FileSet: with parent of Page and the Page Image as the original file, with Page's thumbnail, Page's plain text, Page's alto xml, and Page's word coordinates as derived files.
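
The resulting data described above can be pictured as a small tree. The sketch below uses plain dataclasses rather than the real Hyrax or IIIF Print models; the class names, the `expected_result` function, and the page identifier format are all assumptions made for illustration, since the issue does not say how a page is identified.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileSet:
    """One original file plus the derived files attached to it."""
    original: str
    derived: List[str] = field(default_factory=list)


@dataclass
class Work:
    """A work; a page is modelled as a child work of the PDF's work."""
    identifier: str
    file_set: FileSet
    children: List["Work"] = field(default_factory=list)


def expected_result(work_id: str, basename: str) -> Work:
    """Build the tree for the one-page example (hypothetical helper)."""
    root = f"s3://{work_id}/{basename}"
    page_root = f"{root}--page-1"
    page = Work(
        identifier=f"{work_id}/page-1",  # assumed format, not from the issue
        file_set=FileSet(
            original=f"{page_root}.tiff",
            derived=[
                f"{page_root}.thumbnail.jpeg",
                f"{page_root}.txt",
                f"{page_root}.alto.xml",
                f"{page_root}.coordinates.json",
            ],
        ),
    )
    return Work(
        identifier=work_id,
        file_set=FileSet(
            original=f"{root}.pdf",
            derived=[
                f"{root}.thumbnail.jpeg",
                f"{root}.txt",
                f"{root}.reader.pdf",
            ],
        ),
        children=[page],
    )


work = expected_result("work_id", "basename")
```

Here `work.file_set` carries the PDF original with its three derived files, and `work.children[0].file_set` carries the page image with its four derived files.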

As of <2023-06-06 Tue>, the IIIF Print Derivative Rodeo service does not account for additional derivatives that might have been "pre-processed". That is to say, the derivative service does not know how to generate the PDF Plain Text, nor does it know to go looking for it.

In the above scenarios, when we're processing the derivatives for the Work FileSet, we need to look for additional files that were not generated via generators but were instead brought over from pre-existing storage. We would use the copy generator and some kind of glob selector on the files.
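
A rough sketch of that selection step follows, assuming the flat naming shown earlier in this issue. It is not the DerivativeRodeo implementation: `find_preexisting` and `derivatives_to_attach` are hypothetical names, and the contents of the "generated" set are illustrative only. The sketch also shows why a simple tail glob on the basename is too broad here, since it picks up the split pages as well.

```python
import fnmatch
import re
from typing import Iterable, Set


def find_preexisting(listing: Iterable[str], basename: str,
                     original_extension: str = "pdf") -> Set[str]:
    """Select the PDF-level files that sit next to the original.

    Names such as "basename--page-1.txt" are skipped because the basename
    must be followed directly by a dot. The original itself is skipped too.
    """
    pattern = re.compile(rf"^{re.escape(basename)}\.(?P<extension>.+)$")
    found = set()
    for name in listing:
        match = pattern.match(name)
        if match and match.group("extension") != original_extension:
            found.add(name)
    return found


def derivatives_to_attach(preexisting: Set[str], generated: Set[str]) -> Set[str]:
    """Account for both sets: A ∪ B."""
    return preexisting | generated


listing = [
    "basename.pdf",
    "basename.thumbnail.jpeg",
    "basename.txt",
    "basename.reader.pdf",
    "basename--page-1.tiff",
    "basename--page-1.thumbnail.jpeg",
    "basename--page-1.txt",
    "basename--page-1.alto.xml",
    "basename--page-1.coordinates.json",
]

# A tail glob on the basename matches every file, pages included.
tail_glob_matches = fnmatch.filter(listing, "basename*")

# Set A: files brought over from pre-existing storage.
set_a = find_preexisting(listing, "basename")

# Set B: illustrative only; what the generators produce is an assumption.
set_b = {"basename.thumbnail.jpeg"}

attached = derivatives_to_attach(set_a, set_b)
```

With this listing the tail glob returns all nine names, while `set_a` contains only `basename.txt`, `basename.reader.pdf`, and `basename.thumbnail.jpeg`. A file present in both sets appears once in the union.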

jeremyf added a commit to notch8/derivative_rodeo that referenced this issue Jun 6, 2023
Yes, it would be nice to have sub-directories for pages ripped from a
PDF.  However, that aspirational state creates complications on
the implementation details of the DerivativeRodeo; by moving from a tail
glob to a regular expression, we create a more powerful mechanism for
finding files.

These changes also highlighted a few implementation bugs (namely
ensuring the correct expected return value of the newly re-named
function).

In IIIF Print we still need to consider how to find the child page's
derivatives of an original PDF.  That is, however, not a problem for
this repository.

Related to:

- notch8/iiif_print#251

Co-authored-by: Kirk Wang <kirk.wang@scientist.com>
jeremyf added a commit that referenced this issue Jun 6, 2023
This commit leverages the conventions established in the DerivativeRodeo
around where we're writing split pages and their derivatives.

The inline comments that are written/amended with this PR should be read
closely for clarity and intention.

Related to:

- #251
- notch8/derivative_rodeo#48

Co-authored-by: Kirk Wang <kirk.wang@scientist.com>
jeremyf added a commit that referenced this issue Jun 7, 2023
jeremyf added a commit that referenced this issue Jul 7, 2023