Replies: 34 comments
-
Solution 1) On
Solution 2) Have a in place to map from
Solution 3) Track the relation externally. A mechanism like this will be necessary anyway because of the problem of local URLs irreversibly replacing remote URLs when downloading files.
-
Without having a full view of the technical consequences: is this only a cosmetic nuisance? I am not sure we ever touch the file refs in PAGE, do we?
-
See https://ocr-d.github.io/page#url-for-imagefilename--filename The
-
Ah, okay. In this case, 👍 for solution 3.
-
Solution 1 only works if workspace add is used, which may be a drawback.
-
Solution 3 reasonably also only works on
-
Pardon my being slow-witted, but what was the reason for https://ocr-d.github.io/page#url-for-imagefilename--filename (always requiring a URL, even if local) in the first place? Why not use relative paths (without I thought the workspace metaphor would work like a DVCS repository. But if we require URLs everywhere, I cannot move my workspaces around in the filesystem. Am I supposed to (BTW, |
-
The original plan was to completely forgo the filesystem and use a repository for all intermediate results, not just of workflow runs but of individual processors. The workspace is the place where processors "do their thing", a mere implementation-specific helper for a processor. We considered the mets.xml to be the single Nowadays, full provenance and reproducibility of every single step is not our top priority anymore. This allows us to make that workspace/Git-like approach a first-
-
Thanks, it makes sense to me now. But what still escapes me is the logic of:
I completely agree as far as Anyway, if I understand you correctly, you will move towards allowing local intermediate steps and workspaces as true DVCS. Can I conclude from that that relative file names will be your preferred solution for this issue, too? (Or am I misreading your explanation?)
-
In mass digitisation we cannot assume that mets.xml and referenced data are on the same FS (workspace/DVCS metaphor), so the mets.xml acts more as a manifest.
Of course you need some form of caching on a local filesystem. Hence the workspace: Create a local folder with all required files for a processor to work on. In fact that was why originally those were created in But once the local processes are complete, ensure that all data is stored persistently and no references to local files remain. You need to do that I/O at some point, download all the files, keep track of which local file represents which file URL, and in the end store it somewhere persistently.
No, I would still prefer URLs to be used in the data. The best way to avoid having references to local-only data is not to persist it. Instead, I'd be for a mechanism to map local filenames to opaque identifiers, such as a URL or whatever string is in the
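To make the idea of keeping track of which local file represents which URL a bit more concrete, here is a minimal sketch of an external map. Everything in it is an assumption for illustration: the class name UrlMap, its methods, and the example URL and paths are hypothetical and not part of OCR-D core.

```python
from typing import Dict, Optional


class UrlMap:
    """Hypothetical external map from local file paths to their original URLs."""

    def __init__(self) -> None:
        self._url_by_path: Dict[str, str] = {}

    def record(self, url: str, local_path: str) -> None:
        """Remember that local_path is the downloaded copy of url."""
        self._url_by_path[local_path] = url

    def url_for(self, local_path: str) -> Optional[str]:
        """Return the original URL for a local path, or None if unknown."""
        return self._url_by_path.get(local_path)

    def restore(self, href: str) -> str:
        """Return the original URL for a local href, or href unchanged."""
        return self._url_by_path.get(href, href)


# Example with made-up values: record a download, then restore the URL
# before the data is persisted.
url_map = UrlMap()
url_map.record("https://example.org/images/0001.tif", "OCR-D-IMG/0001.tif")
```

With such a map, the local path never has to overwrite the remote URL in the data itself; the URL can be restored right before persisting.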
-
Sorry, I somehow forgot about that (it seems strange to me now, too). Then the DVCS metaphor is perhaps misleading. So how about this new scheme: A workspace is nothing but an identical copy of the remote mets.xml (using only public URLs) plus the files in relative paths of the local FS – by the same path name convention as ocrd-zip (or something that does not require changing the
Yes, understood. In the above scheme, there would be no local reference (to keep track of) any more. So as soon as a new file gets added, one would be required to provide a URL for it. Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.
-
I think it is useful for individual processes, as an abstraction for implementers and for workflow/archiving purposes. But from the perspective of a digitisation engineer/sysadmin, it's best to assume:
This is tricky. It's what I meant by solution 2) above. If we assume that input page files use random strings as
This is what I'd consider the repository approach: A service that accepts PUT/POST requests and GET download requests. I would have preferred that six months ago, but as you said earlier, it's much more effort for the processors to fetch and upload. It's reasonably easy to integrate into IIUC (@VolkerHartmann @wrznr?) we won't have a repository on the task level, and we cannot enforce naming conventions in input data. That leaves us with the option to have external mappings between identifiers, pure file-system access to files and OCRD-ZIP with the planned BagIt+Git extensions (OCR-D/spec#70 and OCR-D/spec#73) as the exchange format.
-
Ok, I got it now, that was really asking for your solution 2.
Well I did not say random, I rather meant a convention that fits most existing naming schemes. But I can imagine how that quickly collapses with real-world data. So (to be more precise) why not just use
exclusively, with solution 1 (and not changing URLs in the METS at all)? And invalid characters for ID would be a problem in any case, wouldn't they?
I am not sure I understand that, yet. Making persistent in my sense happens at the end of the workflow pipeline. Everything in between can happen locally, and processors could be allowed to create "temporary" annotations (marked by, say, bogus URLs) in between. As you said earlier, at some point that initial/final I/O needs to happen anyway. I would also favour established standards like BagIt over a self-baked OCRD-ZIP here. But if I am not mistaken, then an external mapping would not be necessary: all the URLs stay in the mets.xml, all directory and file names in the archive (in this case,
-
Can't we reference via USE and ID? The modules should already know these values, as they have to address the file via these values, right? This is also the way we download and rename external files referenced in METS.
-
If by reference you mean store to the filesystem (at |
-
This is still true with 1.0.0b5. I believe this also affects |
-
I don't think this can wait until the dev workshop.
-
Revisiting this with @tboenig:
So we need logic to determine the relative path from mets.xml to image by resolving imageFilename of a PAGE against the relative path to that PAGE.
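A minimal sketch of that resolution logic, assuming both paths use forward slashes and that the PAGE file's location is already known relative to the directory containing mets.xml. The function name and the example paths are made up for illustration and do not come from OCR-D core.

```python
import posixpath


def image_path_relative_to_mets(page_href: str, image_filename: str) -> str:
    """Resolve a PAGE imageFilename against the location of the PAGE file.

    page_href: path of the PAGE file, relative to the directory of mets.xml
    image_filename: imageFilename as written in PAGE, relative to that PAGE file
    Returns the image path relative to the directory of mets.xml.
    """
    # Leave absolute URLs untouched; only relative paths need resolving.
    if "://" in image_filename:
        return image_filename
    page_dir = posixpath.dirname(page_href)
    # Join onto the PAGE file's directory, then collapse any "../" segments.
    return posixpath.normpath(posixpath.join(page_dir, image_filename))


# Example with hypothetical file group names:
resolved = image_path_relative_to_mets(
    "OCR-D-GT-PAGE/page_0001.xml", "../OCR-D-IMG/page_0001.tif"
)
print(resolved)  # OCR-D-IMG/page_0001.tif
```

The result can then be compared with the xlink:href of the image's mets:file to find the matching entry.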
-
Is this the consensus now? Because a. I want/need to use the PAGE Viewer and b. it also seems correct.
-
I think so. But this will have repercussions all over our implementations: until now, everything was relative to METS. And we have an additional interdependency between tools and data (GT bags) here. So it might take some time until this is available. Until then we all have to live with the hassle of pointing PageViewer to the image every time.
-
I used to automatically correct the
It also tries to "download" the local file to |
-
I am sure the new validation was added in preparation of fixing this within the new logic. But there is a simple remedy: just |
-
Not remedied using the latest master which has this skip option:
-
PAGE filenames will have to be relative to the METS. PAGE Viewer and Aletheia will have options to change the base for relative filenames. Since #333 PAGE filenames in OCRD-ZIP will be updated, but this has not yet been implemented for general workspace methods.
-
So all that remains to do here is fixing It should be simple to implement something along the lines of https://github.com/OCR-D/docs/blob/master/fix-gt.sh in core Python...
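As a rough idea of what the Python side could look like, here is a sketch that rewrites the imageFilename attribute of a PAGE document. It is not a port of fix-gt.sh; the function name is hypothetical, and matching the Page element by local name (to stay independent of the PAGE namespace version) is my own assumption.

```python
import xml.etree.ElementTree as ET


def set_image_filename(page_xml: str, new_filename: str) -> str:
    """Return page_xml with imageFilename of every Page element replaced."""
    root = ET.fromstring(page_xml)
    for elem in root.iter():
        # Match by local name so any PAGE namespace version is accepted.
        if elem.tag == "Page" or elem.tag.endswith("}Page"):
            elem.set("imageFilename", new_filename)
    return ET.tostring(root, encoding="unicode")


# Example with a made-up minimal PAGE document:
sample = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">'
    '<Page imageFilename="../img/0001.tif" imageWidth="1" imageHeight="1"/>'
    "</PcGts>"
)
fixed = set_image_filename(sample, "OCR-D-IMG/0001.tif")
```

Note that ElementTree re-serialises the namespace with a generated prefix, so a real implementation would probably want to register the PAGE namespace first or use the existing PAGE bindings.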
-
I admit I am slightly puzzled about what still needs fixing here... IIUC, there must not/cannot be a case where the PAGE
-
PAGE Viewer has
I would also find that helpful. I'm having a hard time thinking of a case where we add to a workflow PAGE-XML that does not already adhere to the |
-
Neither of these cases is what One obvious use-case would be ocrd-import. (But in that repo, you can still work around the problem by doing But maybe, you'd say, this is too difficult to get right in |
-
It's a simple enough feature, questions:
Let's make it toggleable with a Let's default NOT to do this because it really only makes sense when importing data, not, e.g., every time a bashlib processor wants to add an image.
-
Yes, that's crucial. If we take this seriously,
I guess we have to consider the possibility. If we solve this conceptually for
IIUC you assume here that Yes, the image could be placed under a fileGrp implicitly derived from the fileGrp for the PAGE-XML, or even the same fileGrp (just with a different MIME type and not appearing in the structMap).
If we add an option, why not just the name of the image file group (or none for "ignore images")?
Right. And let's think about the second use-case (adding PAGE-XML after image) more thoroughly: Now Personally, I think this is a more sensible interface than add-image-via-PAGE.
This got me confused: I thought we were talking about adding PAGE-XML files here?
-
Scenario:
Image files and PAGE referencing those image files by relative filepath:
Create a METS file and run workspace add:
Now the PAGE imageFilename and xlink:href of the corresponding mets:file do not match anymore.