port processor to core v3 #130

kba · 2024-08-23T16:22:34Z

With this PR, eynollah supports OCR-D/core#1240. It also simplifies the processor considerably.

I'll update the ocrd-tool.json with the changed/added flags here as well.

Draft: please don't merge until v3 stable is released.


bertsky

Thanks – LGTM!

Have not tested yet, though.

Current main also looks very promising – will give it a try myself


bertsky · 2024-08-24T23:05:29Z

qurator/eynollah/processor.py

+            image_filename=page.imageFilename,
+            image_pil=page_image


Note: that filename might not be where that image came from in workspace.image_from_page. It could well be a derived image generated by some previous processor (just not a cropped, deskewed or binarized image, because that would have changed its coordinate system).
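To illustrate why excluding cropped, deskewed and binarized images still leaves room for other derived images, here is a minimal sketch of feature-based image selection. It is not OCR-D core's implementation: `DerivedImage`, `pick_image`, the file paths and the `despeckled` feature are hypothetical stand-ins, and the rule "take the latest derived image that carries none of the filtered features" is an assumption about the general idea only.

```python
from dataclasses import dataclass, field


@dataclass
class DerivedImage:
    """Hypothetical stand-in for a derived image annotated on a page."""
    filename: str
    features: set = field(default_factory=set)


def pick_image(original, derived, feature_filter=""):
    """Return the latest derived image with none of the filtered features.

    Falls back to the original image if every derived image is excluded.
    """
    banned = {f for f in feature_filter.split(",") if f}
    for candidate in reversed(derived):
        if not (candidate.features & banned):
            return candidate.filename
    return original


derived = [
    DerivedImage("OCR-D-IMG-ENH/p1.png", {"despeckled"}),
    DerivedImage("OCR-D-IMG-BIN/p1.png", {"despeckled", "binarized"}),
]
chosen = pick_image("OCR-D-IMG/p1.png", derived, "cropped,deskewed,binarized")
```

In this sketch the binarized image is skipped, but the despeckled one is still preferred over the original, so the selected file can differ from `page.imageFilename` without any change of coordinate system.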

It's still a bit hazy for me when image_filename is actually used. Ideally, image_pil should take precedence and image_filename is only for the plotter/writer, at least in the "single image mode" we're using.

One of the aspects I hope I'll be able to improve a bit with https://github.com/qurator-spk/eynollah/tree/refactoring-2024-08/

Perhaps we can also re-use session across Eynollah invocations in addition to models?

In theory, yes, but with standalone eynollah being focused on batch processing now, I am honestly not sure how/where sessions are defined for the non-dir_in option - @vahidrezanezhad can you tell us?

Just to clarify (now with deeper understanding of the codebase):

Indeed, image_pil does take precedence over image_filename, as intended. Nevertheless, in the writer, we must not use an image file as new @imageFilename that does not actually correspond to the image_pil which we were passing in. So the initial critique stands: we should try to get image_pil.filename, and if that is not available (because the image was just cropped or deskewed according to the annotation), then we must save it to a new file and annotate that along with the PAGE in the output fileGrp. This is the most flexible approach: you can run Eynollah on a pure image fileGrp, and it will use that as @imageFilename, or (OCRD-style) on some earlier processing result (possibly including cropping or deskewing or even binarization steps), and then Eynollah would respect this input, but (being monolithic) redefine @imageFilename to be from those derived images.
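The rule described above can be sketched as follows. This is a minimal illustration only: `resolve_image_filename` and the `save_derived` callback are hypothetical names, and the image objects are plain stand-ins that merely mimic the presence or absence of a `filename` attribute.

```python
from types import SimpleNamespace


def resolve_image_filename(image, save_derived):
    """Return (filename, is_new_file) for the image that was actually processed.

    If the image object still knows which file it was loaded from, reuse that
    path. Otherwise it was derived in memory, so it has to be written out and
    the new path annotated instead.
    """
    filename = getattr(image, "filename", None)
    if filename:
        return filename, False
    return save_derived(image), True


# stand-ins: one image loaded from disk, one produced in memory
loaded = SimpleNamespace(filename="OCR-D-IMG/p1.tif")
derived = SimpleNamespace()

from_file = resolve_image_filename(loaded, lambda img: "unused")
from_memory = resolve_image_filename(derived, lambda img: "OCR-D-SEG/p1.IMG.png")
```

The second return value tells the caller whether a new file has to be registered in the output fileGrp alongside the PAGE result.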

session is not a thing outside of dir_in mode; keeping models is sufficient.

the overhead of re-instantiating Eynollah for each page is not a problem IMHO.
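A minimal sketch of that pattern, assuming a fresh worker object per page and a model dictionary that is handed over and taken back each time. `PageWorker`, `SketchProcessor`, the `loader` callback and the model names are hypothetical and stand in for Eynollah and its processor only loosely.

```python
class PageWorker:
    """Hypothetical per-page worker; loads only the models it does not have yet."""

    def __init__(self):
        self.models = {}

    def run(self, loader):
        for name in ("region", "textline"):  # illustrative model names
            if name not in self.models:
                self.models[name] = loader(name)


class SketchProcessor:
    """Keeps loaded models between pages while re-creating the worker each time."""

    def __init__(self, loader):
        self.loader = loader
        self.models = None

    def process_page(self):
        worker = PageWorker()
        if self.models is not None:
            worker.models = self.models  # hand over models from earlier pages
        worker.run(self.loader)
        self.models = worker.models  # keep them for the next page


loads = []


def fake_loader(name):
    """Records every load so the reuse can be observed."""
    loads.append(name)
    return object()


processor = SketchProcessor(fake_loader)
processor.process_page()
processor.process_page()
```

After two pages each model has been loaded once, so the per-page cost is limited to constructing the lightweight worker object.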


bertsky · 2024-08-24T23:12:25Z

qurator/eynollah/processor.py

+        # if not('://' in page.imageFilename):
+        #     image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
+        # else:
+        #     # could be a URL with file:// or truly remote
+        #     image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename


Suggested change

-        # if not('://' in page.imageFilename):
-        #     image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
-        # else:
-        #     # could be a URL with file:// or truly remote
-        #     image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename

This whole effort was to ensure we can pass a working local filename, as (was) needed by Eynollah. The approach by OCR-D is Workspace.image_from_page / Workspace.image_from_segment which will search for the right original or derived image, download it if necessary and load it into memory.

I don't recall what the new behaviour of Eynollah is. If both an image filename and an image object are passed, who wins?

Assuming it's the memory object: this can be removed. (But then I wonder why we still pass the image filename at all...)

Well, currently we have

    if image_pil:
        self._imgs = self._cache_images(image_pil=image_pil)
    else:
        self._imgs = self._cache_images(image_filename=image_filename)
    [...]
    def _cache_images(self, image_filename=None, image_pil=None):
        ret = {}
        if image_filename:
            ret['img'] = cv2.imread(image_filename)
            self.dpi = check_dpi(image_filename)
        else:
            ret['img'] = pil2cv(image_pil)
            self.dpi = check_dpi(image_pil)

image_filename is then (or at least should be) only used passively, to generate filenames of plotted debug images as well as for PAGE serialization.

So I think image_pil should win but for now we need both. But as I said above, one of those things I would love to untangle in the refactoring.
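A minimal sketch of that precedence, assuming the in-memory image wins whenever it is given and the filename is only a fallback source. `select_image_source` is a hypothetical helper, not part of Eynollah; it only mirrors the caller-side branch, not the image loading itself.

```python
def select_image_source(image_filename=None, image_pil=None):
    """Decide where the pixels come from: memory first, file as fallback."""
    if image_pil is not None:
        return "memory", image_pil
    if image_filename:
        return "file", image_filename
    raise ValueError("need either image_pil or image_filename")


both = select_image_source(image_filename="p1.tif", image_pil="<in-memory image>")
only_file = select_image_source(image_filename="p1.tif")
```

With both arguments passed, the filename is left free to serve purely as a label for debug plots and for PAGE serialization.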

See above – I'll make a suggestion for getting image_filename from the image_pil in a fresh review.


bertsky · 2024-09-02T11:13:53Z

BTW, I just tested under (METS Server and) OCRD_MAX_PARALLEL_PAGES=2 – it works, but you need lots of GPU memory, otherwise GPU OOM happens. (It does work with CUDA_VISIBLE_DEVICES=, but of course the CPU utilization grows, so that might stall the system.)

I'm not sure if this warrants adding max_workers = 1 to EynollahProcessor ...
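To make the trade-off concrete, here is a sketch of how a per-processor cap could interact with the environment setting. It assumes that the cap simply limits whatever `OCRD_MAX_PARALLEL_PAGES` requests and that a negative cap means "no limit"; `effective_workers` and both classes are hypothetical, and the default of 1 for an unset variable is also an assumption rather than a statement about core.

```python
import os


class SketchProcessor:
    """Hypothetical base class: a negative value means no processor-specific cap."""
    max_workers = -1


class GpuBoundProcessor(SketchProcessor):
    """Hypothetical processor that should never run pages in parallel."""
    max_workers = 1


def effective_workers(processor_cls, environ=os.environ):
    """Combine the requested page parallelism with the processor's own cap."""
    requested = int(environ.get("OCRD_MAX_PARALLEL_PAGES", "1"))
    cap = processor_cls.max_workers
    if cap > 0:
        return min(requested, cap)
    return requested


env = {"OCRD_MAX_PARALLEL_PAGES": "2"}
uncapped = effective_workers(SketchProcessor, env)
capped = effective_workers(GpuBoundProcessor, env)
```

Under these assumptions, setting the class attribute would prevent the GPU OOM scenario regardless of what the user configures, at the price of ruling out page parallelism even on machines with enough GPU memory.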


bertsky

note: I have not tested this, @kba!

bertsky · 2025-02-12T16:00:32Z

requirements.txt

@@ -1,5 +1,5 @@
 # ocrd includes opencv, numpy, shapely, click
-ocrd >= 2.23.3
+ocrd >= 3.0.0b4


Suggested change

-ocrd >= 3.0.0b4
+ocrd >= 3.0.2

bertsky · 2025-02-12T16:03:13Z

src/eynollah/processor.py

+        # if not('://' in page.imageFilename):
+        #     image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
+        # else:
+        #     # could be a URL with file:// or truly remote
+        #     image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename


Suggested change

-        # if not('://' in page.imageFilename):
-        #     image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
-        # else:
-        #     # could be a URL with file:// or truly remote
-        #     image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename

bertsky · 2025-02-12T16:10:21Z

src/eynollah/processor.py

+        assert input_pcgts
+        assert input_pcgts[0]
+        assert self.parameter
+        pcgts = input_pcgts[0]


Suggested change

         pcgts = input_pcgts[0]
+        result = OcrdPageResult(pcgts)

bertsky · 2025-02-12T16:10:35Z

src/eynollah/processor.py

+            eynollah.models = self.models
+        eynollah.run()
+        self.models = eynollah.models
+        return OcrdPageResult(pcgts)


Suggested change

-        return OcrdPageResult(pcgts)
+        return result

bertsky · 2025-02-12T16:12:42Z

src/eynollah/processor.py

+            # (the PAGE builder merely adds regions, so afterwards we would not know which to transform)
+            # also avoid binarization as models usually fare better on grayscale/RGB
+            feature_filter='cropped,deskewed,binarized')
+        eynollah = Eynollah(


Suggested change

-        eynollah = Eynollah(
+        if hasattr(page_image, 'filename'):
+            image_filename = page_image.filename
+        else:
+            image_filename = "dummy" # will be replaced by ocrd.Processor.process_page_file
+            result.images.append(OcrdPageResultImage(page_image, '.IMG', page)) # mark as new original
+        eynollah = Eynollah(

bertsky · 2025-02-12T16:13:11Z

src/eynollah/processor.py

+            tables=self.parameter['tables'],
+            override_dpi=self.parameter['dpi'],
+            pcgts=pcgts,
+            image_filename=page.imageFilename,


Suggested change

-            image_filename=page.imageFilename,
+            image_filename=image_filename,

bertsky · 2025-02-12T16:14:43Z

src/eynollah/ocrd-tool.json

+          "size": 1894627041,
+          "type": "archive",
+          "path_in_archive": "models_eynollah"
+        }
      ]
    }
  }


Suggested change

-  }
+  },
+  "dockerhub": "ocrd/eynollah"

bertsky · 2025-02-12T16:16:07Z

src/eynollah/processor.py


 from .eynollah import Eynollah
-from .utils.pil_cv2 import pil2cv
-
-OCRD_TOOL = loads(resource_string(__name__, 'ocrd-tool.json').decode('utf8'))

 class EynollahProcessor(Processor):



Suggested change

+    # already employs background CPU multiprocessing per page
+    # already employs GPU (without singleton process atm)
+    max_workers = 1

kba added 6 commits August 23, 2024 18:22

port processor to core v3

0a3f525

class Eynollah: add typing, consistent interface in CLI and OCR-D CLI

4a13781

ocrd-tool: add "allow_enhancement" parameter

9ce02a5

update processor to the latest change in bertsky/core#14

0d83db7

ocrd interface: add light_mode parameter

87adc4b

ocrd interface: add textline_light

39b16e5

kba force-pushed the v3-api branch from cc0e8e3 to 39b16e5 Compare August 24, 2024 16:04

kba and others added 8 commits August 24, 2024 18:05

ocrd interface: add right_to_left

ddcc019

ocrd interface: add ignore_page_extraction

d7caeb2

adapt to ocrd>=2.54 url vs local_filename

8dfecb7

# Conflicts: # qurator/eynollah/processor.py

adapt to OcrdFile.local_filename now :Path

3381e5a

# Conflicts: # qurator/eynollah/processor.py

fix namespace pkg setup

49c1a8f

non-legacy namespace package

c37d95d

# Conflicts: # setup.py

processor: reuse loaded models across pages, use derived images

61bcb43

# Conflicts: # qurator/eynollah/processor.py

check_dpi: fix Pillow type detection

d98fa2a

kba mentioned this pull request Aug 24, 2024

Revert "Merge pull request #97 from qurator-spk/420-namespace-package" #108

Draft

kba force-pushed the v3-api branch from cddbce2 to d98fa2a Compare August 24, 2024 17:19

bertsky approved these changes Aug 24, 2024


kba and others added 10 commits August 26, 2024 10:39

processor.py: Simplify import

ecd202e

Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>

procesor.py: simplify imports further

d26079d

processor: no more DPI info lost

7b92620

Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>

require ocrd >= 3.0.0b1

aef46a4

setuptools: fix (packages.find.where prevented finding namespace qura…

dfc4ac2

…tor)

undo customizing metadata_filename (not correct with namespace pkg su…

1e90257

…pport in core)

adapt tool json to v3

17eafc1

Merge pull request #134 from bertsky/v3-api

9b274dc

OCR-D v3 API: fixes

Merge branch 'main' into v3-api

f9c2d85

require ocrd>=3.0.0b4

fdedae2

Merge branch 'main' into v3-api

c6e0e05

# Conflicts: # pyproject.toml # src/eynollah/cli.py

cneud added 2 commits October 16, 2024 14:20

relax tf2 requirement to < 2.13

2189391

Update README.md

bc9dddd

bertsky mentioned this pull request Dec 18, 2024

switch to core API 3.0 branches OCR-D/ocrd_all#454

Draft

merge main

869110f

bertsky reviewed Feb 12, 2025


kba changed the base branch from main to machine_based_reading_order_integration March 6, 2025 14:51


