Matching PAGE imageFilename to mets:file when imageFilename is not a URL #771

kba (Maintainer) · 2018-08-30T13:51:17Z

Scenario:

Image files and PAGE referencing those image files by relative filepath:
```
<Page imageFilename="foo.tif"/>
```

Create a METS file and run workspace add:

<mets:file GROUPID="page0001" xlink:href="file://path/to/bla/foo.tif"/>

Now the PAGE imageFilename and xlink:href of the corresponding mets:file do not match anymore.

kba (Maintainer, Author) · 2018-08-31T13:33:22Z

Solution 1) On workspace add, change the imageFilename of the PAGE.

Solution 2) Have a mechanism in place to map from @imageFilename to @xlink:href, like "match if fileName is a suffix of a mets:file@xlink:href" or "match if fileName is the GROUPID of a page and there is a mets:file with mimetype image/* with that GROUPID". This could be automated.

Solution 3) Track the relation externally. A mechanism like this will be necessary anyway because of the problem of local URLs irreversibly replacing remote URLs when downloading files.
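
As a rough illustration of the suffix rule from solution 2, here is a minimal Python sketch. The helper names `path_of` and `match_by_suffix` are made up for this example and are not part of OCR-D core. It assumes plain relative file names in the PAGE and compares whole path segments, so `foo.tif` does not match `xfoo.tif`.

```python
import posixpath
from urllib.parse import unquote, urlparse


def path_of(href):
    """Return the path component of a URL, or the href itself if it has no scheme."""
    parsed = urlparse(href)
    return unquote(parsed.path) if parsed.scheme else href


def match_by_suffix(image_filename, hrefs):
    """Return all hrefs whose trailing path segments equal those of image_filename."""
    wanted = posixpath.normpath(image_filename).split("/")
    matches = []
    for href in hrefs:
        parts = posixpath.normpath(path_of(href)).split("/")
        # Compare segment by segment so that only whole file names match.
        if parts[-len(wanted):] == wanted:
            matches.append(href)
    return matches


hrefs = [
    "file://path/to/bla/foo.tif",
    "file://path/to/bla/xfoo.tif",
    "http://example.org/img/foo.tif",  # made-up URL for illustration
]
print(match_by_suffix("foo.tif", hrefs))
```

The function returns a list rather than a single hit because more than one mets:file can end in the same name, in which case the caller still has to decide which one is meant.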

wrznr · 2018-08-31T13:34:18Z

Without having a full view of the technical consequences: is this only a cosmetic nuisance? I am not sure we ever touch the file refs in PAGE, do we?

kba (Maintainer, Author) · 2018-08-31T13:38:46Z

See https://ocr-d.github.io/page#url-for-imagefilename--filename

The imageFilename is necessary to get from page to the mets:file that represents the image.

wrznr · 2018-08-31T13:40:53Z

Ah, okay. In this case, 👍 for solution 3.

VolkerHartmann · 2018-09-04T07:55:38Z

Solution 1 only works if workspace add is used which may be a drawback.
Solution 2 sounds complex. There may be several images for the same page (orig, binarized, cropped, deskewed,...)
Solution 3 works out of the box analyzing whole METS and referenced PAGEs in one step. This may be done each time an export/import is planned.
My vote for solution 3.

kba (Maintainer, Author) · 2018-09-04T08:24:13Z

> Solution 1 only works if workspace add is used which may be a drawback.

> Solution 3 works out of the box analyzing whole METS and referenced PAGEs in one step.

Solution 3 reasonably also only works on workspace add, since this has to be an external file in the workspace (currently I'm using url-aliases.csv). It could be populated by hand or by an external mechanism, but then again, you could just as well change the PAGE by hand (or with sed).
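
To make the idea of an external mapping file more concrete, here is a small Python sketch. The thread does not say what url-aliases.csv actually looks like, so the two-column layout (local path, then URL), the function name `load_aliases` and the sample URL are all assumptions made for illustration only.

```python
import csv
import io


def load_aliases(lines):
    """Read (local_path, url) rows and return lookup tables for both directions."""
    local_to_url, url_to_local = {}, {}
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        local, url = (cell.strip() for cell in row)
        local_to_url[local] = url
        url_to_local[url] = local
    return local_to_url, url_to_local


# Hypothetical file content; a real workspace would open the CSV file instead.
sample = io.StringIO("OCR-D-IMG/foo.tif,http://example.org/images/foo.tif\n")
local_to_url, url_to_local = load_aliases(sample)
print(local_to_url["OCR-D-IMG/foo.tif"])
```

Keeping both directions makes it possible to restore the remote URL after a download has replaced it with a local path.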

bertsky (Collaborator) · 2018-09-08T20:28:58Z

Pardon my being slow-witted, but what was the reason for https://ocr-d.github.io/page#url-for-imagefilename--filename (always requiring a URL, even if local) in the first place? Why not use relative paths (without file:// scheme)?

I thought the workspace metaphor would work like a DVCS repository. But if we require URLs everywhere, I cannot move my workspaces around in the filesystem. Am I supposed to clone -l instead?

(BTW, pack / unpack should also beware of file URLs.)

kba (Maintainer, Author) · 2018-09-10T12:08:19Z

> what was the reason for ocr-d.github.io/page#url-for-imagefilename--filename (always requiring a URL, even if local) in the first place? Why not use relative paths (without file:// scheme)?

The original plan was to completely forgo the filesystem and use a repository for all intermediate results, not just of workflow runs but of individual processors (hence the file resolver and cache etc.). Processors were to download the data by URL, do their thing, upload the data and set the URL. file:// URLs or relative paths should be avoided because having them manifest in the data makes it error-prone when tasks are to be distributed, parallelized etc., in a workflow.

The workspace is the place where processors "do their thing", a mere implementation-specific helper for a processor. We considered the mets.xml to be the single source of truth for all data and metadata; it should always be enough to have that mets.xml and access all files via their persistent HTTP URL.

Nowadays, full provenance and reproducibility of every single step is not our top priority anymore. This allows us to make that workspace/Git-like approach a first-class concept. We should adapt the specs to reflect this.

bertsky (Collaborator) · 2018-09-10T13:18:38Z

Thanks, it makes sense to me now. But what still escapes me is the logic of:

> file:// URLs or relative paths should be avoided because having them manifest in the data makes it error-prone when tasks are to be distributed, parallelized etc.

I completely agree as far as file:// URLs are concerned, but relative paths? Isn't that manifestation the best way to make a distributed system thrive (as DVCS success shows) and scale? Requiring all computation to do I/O via URLs incurs a huge bottleneck and hinders parallelization (due to synchronization effort). Even with a distributed file system (which is an alternative to URLs with client-server transfer protocols) I would recommend allowing intermediate I/O to be local (temporary).

Anyway, if I understand you correctly, you will move towards allowing local intermediate steps and workspaces as true DVCS. Can I conclude from this that relative file names will be your preferred solution for this issue, too? (Or am I misreading your explanation?)

kba (Maintainer, Author) · 2018-09-10T15:41:11Z

> Isn't that manifestation the best way to make a distributed system thrive (as DVCS success shows) and scale?

In mass digitisation we cannot assume that mets.xml and referenced data are on the same FS (workspace/DVCS metaphor), so the mets.xml acts more as a manifest.

> Requiring all computation to do I/O via URLs incurs a huge bottleneck and hinders parallelization

Of course you need some form of caching on a local filesystem. Hence the workspace: Create a local folder with all required files for a processor to work on. In fact that was why originally those were created in /tmp because it was mounted in RAM and hence fast.

But once the local processes are complete, ensure that all data is stored persistently and no references to local files remain. You need to do that I/O at some point: download all the files, keep track of which local file represents which file URL, and in the end store it somewhere persistently.

> Can I conclude from this that relative file names will be your preferred solution for this issue, too?

No, I would still prefer URL to be used in the data. The best way to avoid having references to local-only data is not to persist it. Instead, I'd be for a mechanism to map local filenames to opaque identifiers, such as a URL or whatever string is in the imageFilename of a PAGE-XML etc.

bertsky (Collaborator) · 2018-09-10T17:07:00Z

> In mass digitisation we cannot assume that mets.xml and referenced data are on the same FS

Sorry, I somehow forgot about that (it seems strange to me now, too). Then the DVCS metaphor is perhaps misleading.

So how about this new scheme: A workspace is nothing but an identical copy of the remote mets.xml (using only public URLs) plus the files in relative paths of the local FS – by the same path name convention as ocrd-zip (or something that does not require changing the imageFilename and filename of PAGE-XML to mets:fileGrp USE directory and mets:file ID filename).

> But once the local processes are complete, ensure that all data is stored persistently and no references to local files remain. You need to do that I/O at some point: download all the files, keep track of which local file represents which file URL, and in the end store it somewhere persistently.

Yes, understood. In the above scheme, there would be no local reference (to keep track of) any more. So as soon as a new file gets added, one would be required to provide a URL for it. Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.

kba (Maintainer, Author) · 2018-09-11T12:33:02Z

> Then the DVCS metaphor is perhaps misleading.

I think it is useful for individual processes, as an abstraction for implementers and for workflow/archiving purposes. But from the perspective of a digitisation engineer/sysadmin, it's best to assume:

* Data is not available locally
* Files cannot be changed, only new files added
* Only mets.xml, command line parameters and terminal input/output determine the results
* Don't expect (legacy) workflows to produce input data that adheres to every convention

> the same path name convention as ocrd-zip (or something that does not require changing the imageFilename and filename of PAGE-XML

This is tricky. It's what I meant by solution 2) above. If we assume that input page files use random strings as filename how would you map that back to the mets:file? We tried to require a convention and it failed - reasonably - before even being tested on real-world data (which is even messier, with NFS file paths used as xlink:href or invalid characters for IDs etc).

> Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.

This is what I'd consider the repository approach: A service that accepts PUT/POST requests and GET download requests. I would have preferred that six months ago, but as you said earlier, it's much more effort for the processors to fetch & upload. It's reasonably easy to integrate into core (we experimented with that early on), but not all contributors build on it, and it makes testing much harder, requires a repository server etc.

IIUC (@VolkerHartmann @wrznr?) we won't have a repository on the task level, and we cannot enforce naming conventions in input data. That leaves us with the option to have external mappings between identifiers, pure file-system access to files and OCRD-ZIP with the planned BagIt+Git extensions (OCR-D/spec#70 and OCR-D/spec#73) as the exchange format.

bertsky (Collaborator) · 2018-09-12T10:06:42Z

Ok, I got it now, that was really asking for your solution 2.

> If we assume that input page files use random strings as filename how would you map that back to the mets:file?

Well I did not say random, I rather meant a convention that fits most existing naming schemes. But I can imagine how that quickly collapses with real-world data.

So (to be more precise) why not just use

mets:fileGrp USE directory and mets:file ID filename

exclusively, with solution 1 (and not changing URLs in the METS at all)? And invalid characters for ID would be a problem in any case, wouldn't they?

> Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.

> This is what I'd consider the repository approach: A service that accepts PUT/POST requests and GET download requests. I would have preferred that six months ago, but as you said earlier, it's much more effort for the processors to fetch & upload. It's reasonably easy to integrate into core (we experimented with that early on), but not all contributors build on it, and it makes testing much harder, requires a repository server etc.

I am not sure I understand that, yet. Making persistent in my sense happens at the end of the workflow pipeline. Everything in between can happen locally, and processors could be allowed to create "temporary" annotations (marked by, say, bogus URLs) in between. As you said earlier, at some point that initial/final I/O needs to happen anyway.

I would also favour established standards like BagIt over a self-baked OCRD-ZIP here. But if I am not mistaken, then an external mapping would not be necessary: all the URLs stay in the mets.xml, all directory and file names in the archive (in this case, data/) or filesystem derive from its USE and ID attributes. (And that of course does not rule out OCRD-GITZIP either.)
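
A minimal Python sketch of deriving the local path purely from the USE and ID attributes, as proposed above. The function name `local_path` and the small extension table are assumptions for this example and not OCR-D core API. Rejecting slashes is one possible way to surface the invalid-character problem mentioned earlier.

```python
import posixpath

# Hypothetical media type to extension table; a real tool would need a fuller list.
EXTENSIONS = {
    "image/tiff": ".tif",
    "image/png": ".png",
    "application/vnd.prima.page+xml": ".xml",
}


def local_path(file_grp_use, file_id, mimetype):
    """Build '<USE>/<ID><ext>' relative to the directory holding mets.xml."""
    for value in (file_grp_use, file_id):
        if not value or "/" in value:
            raise ValueError("cannot use %r as a path segment" % value)
    return posixpath.join(file_grp_use, file_id + EXTENSIONS.get(mimetype, ""))


print(local_path("OCR-D-IMG", "OCR-D-IMG_0001", "image/tiff"))
# prints OCR-D-IMG/OCR-D-IMG_0001.tif
```

With such a rule the URLs in the METS never have to change, because the location on disk is computed rather than stored.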

VolkerHartmann · 2018-10-23T13:24:58Z

Can't we reference via USE and ID? The modules should already know these values, as they have to address the file via these values, right?
mets://OCR-D-IMG/OCR-D-IMG_0001

This is also the way we download and rename external files referenced in METS.
OK, mets may be an invalid protocol.

bertsky (Collaborator) · 2018-10-23T14:05:58Z

> Can't we reference via USE and ID?

If by reference you mean store to the filesystem (at workspace add or workspace clone time) and retrieve from the filesystem (within processors), then this is exactly what I was proposing. (I still do not see the necessity of external file-URL bookkeeping.) After all, the workspace is the filesystem "cache" of a document repository (mets.xml + annotations). Why should it even bother with the filename part of its persistent URLs?

bertsky (Collaborator) · 2019-03-04T12:01:25Z

This is still true with 1.0.0b5. I believe this also affects workspace clone and zip bag besides workspace add.

bertsky (Collaborator) · 2019-07-19T18:44:46Z

I don't think this can wait until the dev workshop.

kba (Maintainer, Author) · 2019-09-05T09:16:42Z

Revisiting this with @tboenig:

* `imageFilename` in PAGE must always be a relative file path relative to that PAGE file, otherwise tools like Aletheia or PAGEViewer won't work
* `mets:FLocat` is ideally a relative path from the `mets.xml`

So we need logic to determine the relative path from mets.xml to image by resolving imageFilename of a PAGE against the relative path to that PAGE.

mets.xml: OCR-D-PAGE/foo.xml
OCR-D-PAGE/foo.xml: ../OCR-D-IMG/foo.tif
=> OCR-D-IMG/foo.tif <- mets:FLocat of that image in mets.xml
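
The resolution step in this example can be sketched in a few lines of Python. The function names are made up for illustration and are not part of OCR-D core. Both paths are assumed to be relative, POSIX-style paths, with the PAGE path given relative to mets.xml.

```python
import posixpath


def image_path_from_mets(page_path, image_filename):
    """Resolve a PAGE-relative imageFilename to a path relative to mets.xml."""
    page_dir = posixpath.dirname(page_path)
    return posixpath.normpath(posixpath.join(page_dir, image_filename))


def image_filename_for_page(page_path, image_path):
    """Inverse direction: express a METS-relative image path relative to the PAGE file."""
    page_dir = posixpath.dirname(page_path) or "."
    return posixpath.relpath(image_path, start=page_dir)


print(image_path_from_mets("OCR-D-PAGE/foo.xml", "../OCR-D-IMG/foo.tif"))
# prints OCR-D-IMG/foo.tif
print(image_filename_for_page("OCR-D-PAGE/foo.xml", "OCR-D-IMG/foo.tif"))
# prints ../OCR-D-IMG/foo.tif
```

The second function shows that the conversion is reversible, so either convention could be stored and the other one computed on demand.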

mikegerber · 2019-09-24T13:20:22Z

> * `imageFilename` in PAGE must always be a relative file path relative to that PAGE file, otherwise tools like Aletheia or PAGEViewer won't work
> * `mets:FLocat` is ideally a relative path from the `mets.xml`

Is this the consensus now? Because a. I want/need to use the PAGE Viewer and b. it also seems correct.

bertsky (Collaborator) · 2019-09-24T13:50:48Z

I think so. But this will have repercussions all over our implementations: until now, everything was relative to METS. And we have an additional interdependency between tools and data (GT bags) here. So it might take some time until this is available. Until then we all have to live with the hassle of pointing PageViewer to the image every time.

mikegerber · 2019-09-26T14:30:13Z

I used to automatically correct the imageFilename for easy viewing in PAGE Viewer. But with the latest ocrd 1.0.0b19, the situation is worse because ocrd workspace validate now seems to check for the (in my opinion) incorrect METS-relative filenames.

16:24:54.211 INFO ocrd.resolver.download_to_directory - directory=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524| url=|../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png| basename=|OCR-D-IMG-BIN_0001.png| if_exists=|skip| subdir=|TEMP|
16:24:54.211 INFO ocrd.resolver.download_to_directory - directory=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524| url=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524/../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png| basename=|OCR-D-IMG-BIN_0001.png| if_exists=|skip| subdir=|TEMP|
Traceback (most recent call last):
  File "/home/mike/.virtualenvs/ocrd/lib/python3.7/site-packages/ocrd/workspace.py", line 100, in download_file
    f.url = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
  File "/home/mike/.virtualenvs/ocrd/lib/python3.7/site-packages/ocrd/resolver.py", line 77, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: ../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png

It also tries to "download" the local file to TEMP and so this seems to be connected to issue #324.

bertsky (Collaborator) · 2019-09-26T14:40:19Z

I am sure the new validation was added in preparation for fixing this within the new logic.

But there is a simple remedy: just --skip=imageFilename

mikegerber · 2019-09-26T15:14:31Z

Not remedied using the latest master which has this skip option:

% ocrd workspace validate --skip pixel_density --skip imagefilename mets.xml
Traceback (most recent call last):
  File "/home/mike/devel/OCR-D/core/ocrd/ocrd/workspace.py", line 100, in download_file
    f.url = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
  File "/home/mike/devel/OCR-D/core/ocrd/ocrd/resolver.py", line 77, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: ../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png

kba (Maintainer, Author) · 2019-10-16T15:33:38Z

PAGE filenames will have to be relative to the METS. PAGE Viewer and Aletheia will have options to change the base for relative filenames. Since #333 PAGE filenames in OCRD-ZIP will be updated, but this has not yet been implemented for general workspace methods.

bertsky (Collaborator) · 2020-01-10T15:13:41Z

So all that remains to do here is fixing workspace add, right?

It should be simple to implement something along the lines of https://github.com/OCR-D/docs/blob/master/fix-gt.sh in core Python...

cneud (Maintainer) · 2020-01-10T15:30:47Z

I admit I am slightly puzzled what still needs fixing here...IIUC, there must not/cannot be a case where the PAGE imageFilename IS NOT relative to the mets.xml - either a PAGE file has been created by some ocrd-* process and thus should always be relative to the mets.xml or the PAGE file is ground truth in which case we also (need to) ensure this is the case. Or am I missing sth? Do you have an example @bertsky?

kba (Maintainer, Author) · 2020-01-10T15:48:02Z

> Until then we all have to live with the hassle of pointing PageViewer to the image every time.

PAGE Viewer has --resolve-dir now: PRImA-Research-Lab/prima-page-viewer#6

> Do you have an example @bertsky?

I would also find that helpful. I'm having a hard time thinking of a case where we add to a workflow PAGE-XML that does not already adhere to the imageFilename-relative-to-mets / imageFilename-must-be-in-METS patterns. In most cases, workflows will start with images from which we derive PAGE-XML with correct imageFilename, won't they?

bertsky (Collaborator) · 2020-01-12T00:17:29Z

> IIUC, there must not/cannot be a case where the PAGE imageFilename IS NOT relative to the mets.xml - either a PAGE file has been created by some ocrd-* process and thus should always be relative to the mets.xml or the PAGE file is ground truth in which case we also (need to) ensure this is the case. Or am I missing sth?

Neither of these cases is what ocrd workspace add is typically used for. You need this for GT files from other sources (or OCR-D GT releases before BagIt/METS, which even now are the only GT with text content). These have varying @imageFilename conventions, depending on their directory structure. Now when ocrd workspace add reads a PAGE-XML file, it can still resolve the original image in the filesystem, and try to rebase to the workspace.

One obvious use-case would be ocrd-import. (But in that repo, you can still work around the problem by doing ocrd-make repair afterwards, at least sometimes)

But maybe, you'd say, this is too difficult to get right in ocrd workspace add, please use ocrd zip bag for that! But how will this work, if the old URL did not work to begin with?

kba (Maintainer, Author) · 2020-01-13T16:56:38Z

> when ocrd workspace add reads a PAGE-XML file, it can still resolve the original image in the filesystem, and try to rebase to the workspace.
> [...]
> But maybe, you'd say, this is too difficult to get right in ocrd workspace add

It's a simple enough feature. Questions:

* How to determine file metadata for the `imageFilename`? Media Type can be guessed but what `mets:fileGrp` to add the images to? Maybe the filegroup used as the input plus suffix `-IMG`?
* Moving images and PAGE to the workspace will require changing the input PAGE. Not really a question, just a statement
* Any issues that arise from necessary conventions for this are the user's responsibility, i.e. if they want to set a different name or different media type for an image, they either need to post-process the XML themselves or not use this feature and do the image adding themselves as before
* Also do this for AlternativeImage? Does anyone besides us even use them? I suppose yes and no.

Let's make it toggleable with a --include-page-images/--no-include-page-images or similar flag.

Let's default NOT to do this because it really only makes sense when importing data, not, e.g., every time a bashlib processor wants to add an image.

bertsky (Collaborator) · 2020-01-16T10:56:38Z

> * Moving images and PAGE to the workspace will require changing the input PAGE. Not really a question, just a statement

Yes, that's crucial. If we take this seriously, ocrd workspace add on PAGE-XML files will either take control of that file or make a copy of it (under the "right" path).

> * Also do this for AlternativeImage? Does anyone besides us even use them? I suppose yes and no.

I guess we have to consider the possibility. If we solve this conceptually for Page/@imageFilename, it should work the same for AlternativeImage/@filename though.

> * How to determine file metadata for the `imageFilename`? Media Type can be guessed but what `mets:fileGrp` to add the images to? Maybe the filegroup used as the input plus suffix `-IMG`?

IIUC you assume here that ocrd workspace add will be responsible for adding the image file along with the PAGE-XML file passed to it. We could have other provisions (like assuming the image file must already have been added by then), but let's follow this logic for now:

Yes, the image could be placed under a fileGrp implicitly derived from the fileGrp for the PAGE-XML, or even the same fileGrp (just with a different MIME type and not appearing in the structMap).

> Let's make it toggleable with a --include-page-images/--no-include-page-images or similar flag.

If we add an option, why not just the name of the image file group (or none for "ignore images")?

> * Any issues that arise from necessary conventions for this are the user's responsibility, i.e. if they want to set a different name or different media type for an image, they either need to post-process the XML themselves or not use this feature and do the image adding themselves as before

Right. And let's think about the second use-case (adding PAGE-XML after image) more thoroughly: Now ocrd workspace add can go looking for the (basename of the) filename in the (image) flocat URLs of the METS, and calculate the new relative path for the PAGE-XML under its destination directory. If it does not find an image with that filename, it can still go looking for an image with the same pageId. And then it can fail loudly.

Personally, I think this is a more sensible interface than add-image-via-PAGE.
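
The lookup order described for this second use-case (file name first, then pageId, then fail loudly) might look roughly like the following Python sketch. The dictionaries are hypothetical stand-ins for mets:file entries and `find_image` is a made-up name; none of this is OCR-D core API. Narrowing several same-named hits by pageId is an extra assumption of this sketch.

```python
import posixpath


def find_image(image_filename, page_id, files):
    """Pick the image entry for a PAGE file: match by basename, then by pageId."""
    images = [f for f in files if f["mimetype"].startswith("image/")]
    basename = posixpath.basename(image_filename)
    by_name = [f for f in images if posixpath.basename(f["href"]) == basename]
    if len(by_name) == 1:
        return by_name[0]
    # No hit by name: search all images by pageId.
    # Several hits by name: narrow those hits down by pageId.
    by_page = [f for f in (by_name or images) if f["pageId"] == page_id]
    if len(by_page) == 1:
        return by_page[0]
    raise LookupError(
        "no unique image for imageFilename=%r, pageId=%r" % (image_filename, page_id)
    )


# Hypothetical METS content for illustration.
files = [
    {"href": "OCR-D-IMG/foo.tif", "mimetype": "image/tiff", "pageId": "p1"},
    {"href": "OCR-D-IMG/bar.tif", "mimetype": "image/tiff", "pageId": "p2"},
    {"href": "OCR-D-PAGE/foo.xml", "mimetype": "application/vnd.prima.page+xml", "pageId": "p1"},
]
print(find_image("../elsewhere/foo.tif", "p1", files)["href"])
```

Raising an exception instead of guessing keeps the "fail loudly" behaviour: an ambiguous or missing image stops the import rather than silently producing a wrong reference.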

> Let's default NOT to do this because it really only makes sense when importing data, not, e.g., every time a bashlib processor wants to add an image.

This got me confused: I thought we were talking about adding PAGE-XML files here?
