port processor to core v3 #130
base: machine_based_reading_order_integration
# Conflicts: # qurator/eynollah/processor.py
# Conflicts: # qurator/eynollah/processor.py
# Conflicts: # setup.py
# Conflicts: # qurator/eynollah/processor.py
Thanks – LGTM!
Have not tested yet, though.
Current main also looks very promising. I will give it a try myself.
qurator/eynollah/processor.py
Outdated
    image_filename=page.imageFilename,
    image_pil=page_image
Note: that filename might not be where the image returned by workspace.image_from_page came from. It could well be a derived image generated by some previous processor (just not a cropped, deskewed or binarized image, because that would have changed its coordinate system).
It's still a bit hazy for me when image_filename is actually used. Ideally, image_pil should take precedence and image_filename is only for the plotter/writer, at least in the "single image mode" we're using.
One of the aspects I hope I'll be able to improve a bit with https://github.com/qurator-spk/eynollah/tree/refactoring-2024-08/
Perhaps we can also re-use session across Eynollah invocations in addition to models?
In theory, yes, but with standalone eynollah being focused on batch processing now, I am honestly not sure how/where sessions are defined for the non-dir_in option. @vahidrezanezhad can you tell us?
Just to clarify (now with deeper understanding of the codebase):
- Indeed, image_pil does take precedence over image_filename, as intended. Nevertheless, in the writer, we must not use an image file as new @imageFilename that does not actually correspond to the image_pil which we were passing in. So the initial critique stands: we should try to get image_pil.filename, and if that is not available (because the image was just cropped or deskewed according to the annotation), then we must save it to a new file and annotate that along with the PAGE in the output fileGrp. This is the most flexible approach: you can run Eynollah on a pure image fileGrp, and it will use that as @imageFilename, or (OCRD-style) on some earlier processing result (possibly including cropping or deskewing or even binarization steps), and then Eynollah would respect this input, but (being monolithic) redefine @imageFilename to be from those derived images.
- session is not a thing outside of dir_in mode, keeping models is sufficient
- the overhead of re-instantiating Eynollah for each page is not a problem IMHO.
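The fallback described in the first point could be sketched roughly as follows. This is only an illustration: resolve_image_filename and the save_derived callback are hypothetical names, not part of Eynollah or OCR-D, and the sketch merely assumes that an image loaded from disk carries a non-empty filename attribute while a derived (cropped or deskewed) one does not.

```python
from types import SimpleNamespace


def resolve_image_filename(image, save_derived):
    """Return a filename that really corresponds to `image`.

    Hypothetical helper: `save_derived` is a callback (an assumption of this
    sketch) that writes the in-memory image to a new file and returns its path.
    """
    filename = getattr(image, "filename", "")
    if filename:
        # Image was loaded straight from a file: reuse that path.
        return filename
    # Derived image (e.g. cropped or deskewed): persist it under a new name.
    return save_derived(image)


# Stand-ins for image objects, so the sketch runs without any imaging library.
original = SimpleNamespace(filename="OCR-D-IMG/page_0001.tif")
derived = SimpleNamespace()  # no `filename` attribute

saved = []


def fake_save(image):
    # Pretend to write the image and report the (made-up) new path.
    saved.append(image)
    return "OCR-D-SEG/page_0001.IMG.png"


print(resolve_image_filename(original, fake_save))
print(resolve_image_filename(derived, fake_save))
```

The point of routing the derived case through a callback is that the caller decides where the new file lives (for example, the output fileGrp), while the helper only decides whether a new file is needed at all.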
qurator/eynollah/processor.py
Outdated
# if not('://' in page.imageFilename):
#     image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
# else:
#     # could be a URL with file:// or truly remote
#     image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename
Suggested change:
-# if not('://' in page.imageFilename):
-#     image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
-# else:
-#     # could be a URL with file:// or truly remote
-#     image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename
This whole effort was to ensure we can pass a working local filename, as (was) needed by Eynollah. The approach by OCR-D is Workspace.image_from_page / Workspace.image_from_segment, which will search for the right original or derived image, download it if necessary and load it into memory.
I don't recall what the new behaviour of Eynollah is. If both an image filename and an image object are passed, who wins?
Assuming it's the memory object: this can be removed. (But then I wonder why we still pass the image filename at all...)
Well, currently we have
    if image_pil:
        self._imgs = self._cache_images(image_pil=image_pil)
    else:
        self._imgs = self._cache_images(image_filename=image_filename)
    [...]
    def _cache_images(self, image_filename=None, image_pil=None):
        ret = {}
        if image_filename:
            ret['img'] = cv2.imread(image_filename)
            self.dpi = check_dpi(image_filename)
        else:
            ret['img'] = pil2cv(image_pil)
            self.dpi = check_dpi(image_pil)
image_filename is (or should be) then only used passively, to generate filenames of plotted debug images as well as for PAGE serialization.
So I think image_pil should win, but for now we need both. As I said above, this is one of those things I would love to untangle in the refactoring.
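The intended precedence ("image_pil should win") can be stated as a tiny dependency-free sketch. pick_image_source is a hypothetical helper written for illustration only; it is not Eynollah code and makes no claim about how the refactoring will look.

```python
def pick_image_source(image_filename=None, image_pil=None):
    """Decide where pixel data should come from.

    Hypothetical helper, not Eynollah code: the in-memory image wins whenever
    it is given; the filename is then only kept as a label for output.
    """
    if image_pil is not None:
        return "memory", image_pil
    if image_filename:
        return "file", image_filename
    raise ValueError("need image_pil or image_filename")


# Both given: the in-memory object is used, the filename is ignored for loading.
source, payload = pick_image_source(image_filename="page.tif", image_pil=object())
print(source)  # memory

# Only a filename given: fall back to reading from disk.
source, payload = pick_image_source(image_filename="page.tif")
print(source)  # file
```

Checking `image_pil is not None` rather than its truthiness keeps the decision independent of how the image object happens to define truth testing.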
See above. I'll make a suggestion for getting image_filename from the image_pil in a fresh review.
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
OCR-D v3 API: fixes
BTW, I just tested under (METS Server and) I'm not sure if this warrants adding |
# Conflicts: # pyproject.toml # src/eynollah/cli.py
note: I have not tested this, @kba!
@@ -1,5 +1,5 @@
 # ocrd includes opencv, numpy, shapely, click
-ocrd >= 2.23.3
+ocrd >= 3.0.0b4
Suggested change:
-ocrd >= 3.0.0b4
+ocrd >= 3.0.2
# if not('://' in page.imageFilename):
#     image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
# else:
#     # could be a URL with file:// or truly remote
#     image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename
Suggested change:
-# if not('://' in page.imageFilename):
-#     image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
-# else:
-#     # could be a URL with file:// or truly remote
-#     image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename
        assert input_pcgts
        assert input_pcgts[0]
        assert self.parameter
        pcgts = input_pcgts[0]
Suggested change:
-        pcgts = input_pcgts[0]
+        pcgts = input_pcgts[0]
+        result = OcrdPageResult(pcgts)
        eynollah.models = self.models
        eynollah.run()
        self.models = eynollah.models
        return OcrdPageResult(pcgts)
Suggested change:
-        return OcrdPageResult(pcgts)
+        return result
            # (the PAGE builder merely adds regions, so afterwards we would not know which to transform)
            # also avoid binarization as models usually fare better on grayscale/RGB
            feature_filter='cropped,deskewed,binarized')
        eynollah = Eynollah(
Suggested change:
-        eynollah = Eynollah(
+        if hasattr(page_image, 'filename'):
+            image_filename = page_image.filename
+        else:
+            image_filename = "dummy" # will be replaced by ocrd.Processor.process_page_file
+            result.images.append(OcrdPageResultImage(page_image, '.IMG', page)) # mark as new original
+        eynollah = Eynollah(
            tables=self.parameter['tables'],
            override_dpi=self.parameter['dpi'],
            pcgts=pcgts,
            image_filename=page.imageFilename,
Suggested change:
-            image_filename=page.imageFilename,
+            image_filename=image_filename,
"size": 1894627041, | ||
"type": "archive", | ||
"path_in_archive": "models_eynollah" | ||
}
]
}
}
Suggested change:
-}
+},
+"dockerhub": "ocrd/eynollah"
from .eynollah import Eynollah
from .utils.pil_cv2 import pil2cv

OCRD_TOOL = loads(resource_string(__name__, 'ocrd-tool.json').decode('utf8'))

class EynollahProcessor(Processor):
Suggested change:
+    # already employs background CPU multiprocessing per page
+    # already employs GPU (without singleton process atm)
+    max_workers = 1
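What such a class-level cap achieves can be illustrated with a small stand-in. This is not OCR-D core's implementation: BaseProcessor, SerialProcessor and effective_workers are hypothetical names, and treating 0 as "no cap" is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


class BaseProcessor:
    # Hypothetical stand-in for a framework base class; 0 means "no cap" here.
    max_workers = 0

    def effective_workers(self, requested):
        """Combine the user's requested parallelism with the class-level cap."""
        if self.max_workers > 0:
            return max(1, min(requested, self.max_workers))
        return max(1, requested)

    def process_page(self, page_id):
        return f"processed {page_id}"

    def process_all(self, page_ids, requested=4):
        workers = self.effective_workers(requested)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() preserves input order regardless of the worker count
            return list(pool.map(self.process_page, page_ids))


class SerialProcessor(BaseProcessor):
    # The processor parallelizes internally, so pages are handled one at a time.
    max_workers = 1


proc = SerialProcessor()
print(proc.effective_workers(8))
print(proc.process_all(["p1", "p2"]))
```

Declaring the cap on the class rather than in configuration keeps the constraint next to the code that motivates it: a user asking for more page-level parallelism cannot accidentally oversubscribe a processor that already saturates the CPU or GPU per page.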
With this PR, eynollah supports OCR-D/core#1240. It simplifies it a lot too.
I'll update the ocrd-tool.json with the changed/added flags here as well.
Draft, please don't merge until v3 stable is released.