Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

port processor to core v3 #130

Draft
wants to merge 28 commits into
base: machine_based_reading_order_integration
Choose a base branch
from
Draft

Conversation

kba
Copy link
Contributor

@kba kba commented Aug 23, 2024

With this PR, eynollah supports OCR-D/core#1240. It simplifies it a lot too.

I'll update the ocrd-tool.json with the changed/added flags here as well.

Draft, please don't merge until v3 stable is released

Copy link
Contributor

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks – LGTM!

Have not tested yet, though.

Current main also looks very promising – will give it a try myself

Comment on lines 53 to 54
image_filename=page.imageFilename,
image_pil=page_image
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: that filename might not be where that image came from in workspace.image_from_page. It could well be a derived image generated by some previous processor (just not a cropped, deskewed or binarized image, because that would have changed its coordinate system).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still a bit hazy for me when image_filename is actually used. Ideally, image_pil should take preference and image_filename is only for the plotter/writer, at least in the "single image mode" we're using.

One of the aspects I hope I'll be able to improve a bit with https://github.com/qurator-spk/eynollah/tree/refactoring-2024-08/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can also re-use session across Eynollah invokations in addition to models?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, yes, but with standalone eynollah being focused on batch processing now, I am honestly not sure how/where sessions are defined for the non-dir_in option - @vahidrezanezhad can you tell us?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify (now with deeper understanding of the codebase):

  1. Indeed, image_pil does take precedence over image_filename, as intended. Nevertheless, in the writer, we must not use an image file as new @imageFilename that does not actually correspond to the image_pil which we were passing in. So the initial critique stands: we should try to get image_pil.filename, and if that is not available (because the image was just cropped or deskewed according to the annotation), then we must save it to a new file and annotate that along with the PAGE in the output fileGrp. This is the most flexible approach: you can run Eynollah on a pure image fileGrp, and it will use that as @imageFilename, or (OCRD-style) on some earlier processing result (possibly including cropping or deskewing or even binarization steps), and then Eynollah would respect this input, but (being monolithic) redefine @imageFilename to be from those derived images.
  2. session is not a thing outside of dir_in mode, keeping models is sufficient
  3. the overhead of re-instantiating Eynollah for each page is not a problem IMHO.

Comment on lines 26 to 30
# if not('://' in page.imageFilename):
# image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
# else:
# # could be a URL with file:// or truly remote
# image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# if not('://' in page.imageFilename):
# image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
# else:
# # could be a URL with file:// or truly remote
# image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename

This whole effort was to ensure we can pass a working local filename, as (was) needed by Eynollah. The approach by OCR-D is Workspace.image_from_page / Workspace.image_from_segment which will search for the right original or derived image, download it if necessary and load it into memory.

I don't recall what the new behaviour of Eynollah is. If both an image filename and an image object are passed, who wins?

Assuming it's the memory object: this can be removed. (But then I wonder why we still pass the image filename at all...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, currently we have

            if image_pil:
                self._imgs = self._cache_images(image_pil=image_pil)
            else:
                self._imgs = self._cache_images(image_filename=image_filename)
[...]
     def _cache_images(self, image_filename=None, image_pil=None):
         ret = {}
         if image_filename:
             ret['img'] = cv2.imread(image_filename)
             self.dpi = check_dpi(image_filename)
         else:
             ret['img'] = pil2cv(image_pil)
             self.dpi = check_dpi(image_pil)

image_filename is (should) then only used passively, to generate filenames of plotted debug images as well as for PAGE serialization.

So I think image_pil should win but for now we need both. But as I said above, one of those things I would love to untangle in the refactoring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above – I'll make a suggestion for getting image_filename from the image_pil in a fresh review.

@bertsky
Copy link
Contributor

bertsky commented Sep 2, 2024

BTW, I just tested under (METS Server and) OCRD_MAX_PARALLEL_PAGES=2 – it works, but you need lots of GPU memory, otherwise GPU OOM happens. (It does work with CUDA_VISIBLE_DEVICES=, but of course the CPU utilization grows, so that might stall the system.)

I'm not sure if this warrants adding max_workers = 1 to EynollahProcessor ...

# Conflicts:
#	pyproject.toml
#	src/eynollah/cli.py
Copy link
Contributor

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I have not tested this, @kba!

@@ -1,5 +1,5 @@
# ocrd includes opencv, numpy, shapely, click
ocrd >= 2.23.3
ocrd >= 3.0.0b4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ocrd >= 3.0.0b4
ocrd >= 3.0.2

Comment on lines +21 to +25
# if not('://' in page.imageFilename):
# image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
# else:
# # could be a URL with file:// or truly remote
# image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# if not('://' in page.imageFilename):
# image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
# else:
# # could be a URL with file:// or truly remote
# image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename

assert input_pcgts
assert input_pcgts[0]
assert self.parameter
pcgts = input_pcgts[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pcgts = input_pcgts[0]
pcgts = input_pcgts[0]
result = OcrdPageResult(pcgts)

eynollah.models = self.models
eynollah.run()
self.models = eynollah.models
return OcrdPageResult(pcgts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return OcrdPageResult(pcgts)
return result

# (the PAGE builder merely adds regions, so afterwards we would not know which to transform)
# also avoid binarization as models usually fare better on grayscale/RGB
feature_filter='cropped,deskewed,binarized')
eynollah = Eynollah(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
eynollah = Eynollah(
if hasattr(page_image, 'filename'):
image_filename = page_image.filename
else:
image_filename = "dummy" # will be replaced by ocrd.Processor.process_page_file
result.images.append(OcrdPageResultImage(page_image, '.IMG', page)) # mark as new original
eynollah = Eynollah(

tables=self.parameter['tables'],
override_dpi=self.parameter['dpi'],
pcgts=pcgts,
image_filename=page.imageFilename,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
image_filename=page.imageFilename,
image_filename=image_filename,

"size": 1894627041,
"type": "archive",
"path_in_archive": "models_eynollah"
}
]
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
},
"dockerhub": "ocrd/eynollah"


from .eynollah import Eynollah
from .utils.pil_cv2 import pil2cv

OCRD_TOOL = loads(resource_string(__name__, 'ocrd-tool.json').decode('utf8'))

class EynollahProcessor(Processor):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# already employs background CPU multiprocessing per page
# already employs GPU (without singleton process atm)
max_workers = 1

@kba kba changed the base branch from main to machine_based_reading_order_integration March 6, 2025 14:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants