Port to ocrd core version 3.0.0 #5

Open
wants to merge 102 commits into base: fix-alpha-shape

Commits (changes shown from 22 of 102 commits)
2ed2c4f
add executable property
MehmedGIT Aug 13, 2024
61e6caf
add setup method if missing
MehmedGIT Aug 13, 2024
a0965c2
add self.logger wherever missing
MehmedGIT Aug 13, 2024
dbccae5
require core >= 3.0.0a1
kba Aug 13, 2024
8557a26
port part of binarize to core v3
kba Aug 13, 2024
911a4c1
Merge pull request #1 from kba/port-to-v3
MehmedGIT Aug 13, 2024
278b706
move: determine_zoom to common.py
MehmedGIT Aug 13, 2024
6beec17
move: logger init to setup()
MehmedGIT Aug 13, 2024
1b2fea3
refactor: log -> logger
MehmedGIT Aug 13, 2024
fe33494
remove: unused imports
MehmedGIT Aug 13, 2024
3368a53
remove: file grp cardinality checks inside process()
MehmedGIT Aug 13, 2024
ae97768
remove: constructors, adapt setup()
MehmedGIT Aug 13, 2024
60d02d2
completed: OcropyBinarize
MehmedGIT Aug 13, 2024
dcaccd4
remove file grp cardinality asserts
MehmedGIT Aug 13, 2024
b178227
Update ocrd_cis/ocropy/binarize.py
MehmedGIT Aug 14, 2024
67b6107
Update ocrd_cis/ocropy/binarize.py
MehmedGIT Aug 14, 2024
06a98b1
Update ocrd_cis/ocropy/binarize.py
MehmedGIT Aug 14, 2024
1e6cd7b
Update ocrd_cis/ocropy/binarize.py
MehmedGIT Aug 14, 2024
71bb26d
fix: potentially wrong dpi in logs
MehmedGIT Aug 14, 2024
64f02a3
binarize: don't conflate region/lines seg, pass output_file_id
kba Aug 14, 2024
d7c15c7
Update binarize.py
MehmedGIT Aug 14, 2024
156d79f
Merge pull request #2 from kba/fix-binarize-v3
MehmedGIT Aug 14, 2024
19566c0
try to migrate recognize
MehmedGIT Aug 14, 2024
5f60976
fix: migrate recognize
MehmedGIT Aug 14, 2024
e8b2603
fix: detect_zoom logging
MehmedGIT Aug 14, 2024
7dfd496
update: test_lib base url
MehmedGIT Aug 14, 2024
033c38a
logging exception -> error
MehmedGIT Aug 14, 2024
46d84d5
refactor: logger as a first positional argument
MehmedGIT Aug 14, 2024
f6fe4cf
fix: test_lib.bash data url
MehmedGIT Aug 14, 2024
aed0f95
fix: recognize OcrdPage import
MehmedGIT Aug 14, 2024
804f031
try to migrate clip
MehmedGIT Aug 14, 2024
7bdff31
remove: process() methods
MehmedGIT Aug 15, 2024
03c2f15
adapt: docstring of process_page_pcgts
MehmedGIT Aug 15, 2024
90ac28e
refactor: other small things
MehmedGIT Aug 15, 2024
f24f86b
fix: determine_zoom
MehmedGIT Aug 15, 2024
5f8e1df
add missing Levenshtein req in setup
MehmedGIT Aug 15, 2024
9a14e1d
fix: remove version req for Levenshtein
MehmedGIT Aug 15, 2024
4ca4d14
fix: Levenshtein import
MehmedGIT Aug 15, 2024
fbaafcb
update ocrd-cis-binarize to be compatible with bertsky/core#8
kba Aug 15, 2024
516ce4b
binarize: use final v3 API
bertsky Aug 15, 2024
2e4f26f
binarize: use correct types
bertsky Aug 15, 2024
21be941
clip: use final v3 API
bertsky Aug 15, 2024
9539ac9
clip: use correct types
bertsky Aug 15, 2024
734b5eb
recognize: use final v3 API
bertsky Aug 15, 2024
28ad585
recognize: fix typing import
bertsky Aug 16, 2024
9a7c10a
denoise: adapt to final v3 API
bertsky Aug 16, 2024
7c9f39f
deskew: adapt to final v3 API
bertsky Aug 16, 2024
6698668
dewarp: adapt to final v3 API
bertsky Aug 16, 2024
48a3146
resegment: adapt to final v3 API
bertsky Aug 16, 2024
0dd6fba
ocropy_segment: implement process_page_pcgts
MehmedGIT Aug 16, 2024
ad5ac7c
ocropy_segment: remove process
MehmedGIT Aug 16, 2024
5d4007b
segment: adapt to final v3 API
bertsky Aug 16, 2024
df1c35c
train: adapt to final v3 API
bertsky Aug 16, 2024
c08b623
ocrd-tool.json: add v3 cardinalities
bertsky Aug 16, 2024
a18307d
fix: ocropy train errors
MehmedGIT Aug 16, 2024
0ba6839
remove: unused imports
MehmedGIT Aug 16, 2024
7b4ebc6
Merge branch 'port-to-v3' into port-to-v3-return-object
MehmedGIT Aug 16, 2024
6b06e88
Update binarize.py
MehmedGIT Aug 16, 2024
6b19f35
Merge pull request #3 from kba/port-to-v3-return-object
MehmedGIT Aug 16, 2024
d1a14b7
refactor: python strings v3
MehmedGIT Aug 16, 2024
d8542c2
spacing: train
MehmedGIT Aug 16, 2024
d785971
spacing: segment
MehmedGIT Aug 16, 2024
7ca78a9
spacing: resegment
MehmedGIT Aug 16, 2024
1004b43
spacing: rest
MehmedGIT Aug 16, 2024
c5498a0
spacing: dewarp
MehmedGIT Aug 16, 2024
31e1245
fix: dewarp return
MehmedGIT Aug 16, 2024
f86c993
improve str speed: precompute element_name_id
MehmedGIT Aug 16, 2024
b8e3ad6
fix: clip suffix
MehmedGIT Aug 16, 2024
02724f2
fix: denoise return
MehmedGIT Aug 16, 2024
aac6fe0
try to fix: ocropy denoise
MehmedGIT Aug 16, 2024
5548d0e
fix: ocropy denoise
MehmedGIT Aug 16, 2024
c9f0f56
fix: resegment
MehmedGIT Aug 16, 2024
fff9097
optimize segment
MehmedGIT Aug 16, 2024
8b92832
optimize ocropy common
MehmedGIT Aug 17, 2024
fceaffe
optimize ocrolib
MehmedGIT Aug 17, 2024
3de2585
optimize align cli
MehmedGIT Aug 17, 2024
0949277
align: use final v3 API
bertsky Aug 22, 2024
d4f8483
use ocrd_utils instead of pkg_resources
bertsky Aug 22, 2024
ecc44c0
postcorrect: use final v3 API
bertsky Aug 22, 2024
2b310b4
revert: ocropy.ocrolib changes
MehmedGIT Aug 23, 2024
4420c6f
revert: ocropy.common changes
MehmedGIT Aug 23, 2024
2d8650e
remove whitespaces in ocropy.common and ocropy.ocrolib
MehmedGIT Aug 23, 2024
9a153b0
postcorrect: adapt to frozendict Processor.parameter in v3
bertsky Aug 25, 2024
bd0613a
require ocrd>=3.0.0b1
bertsky Aug 26, 2024
f6e437f
add: simple github actions workflow
MehmedGIT Aug 27, 2024
403781a
Update .github/workflow/tests.yml
MehmedGIT Aug 27, 2024
97083bb
Update .github/workflow/tests.yml
MehmedGIT Aug 27, 2024
2b20e0c
fix: checkout ref
MehmedGIT Aug 27, 2024
86a08eb
Create GH Actions workflow: test.yml
MehmedGIT Aug 27, 2024
231edf2
Merge branch 'master' into port-to-v3
MehmedGIT Aug 27, 2024
1d7e9a0
delete: wrong path for workflows
MehmedGIT Aug 27, 2024
224e86f
fix: NaN error for python3.9+
MehmedGIT Aug 27, 2024
a397531
fix: NaN in reading_order in morph.py
MehmedGIT Aug 27, 2024
9cf8305
fix type hints
bertsky Sep 1, 2024
a0c734d
dewarp: make thread-safe
bertsky Sep 1, 2024
66baaf0
recognize: disallow multithreading (impossible with current lstm impl…
bertsky Sep 1, 2024
32ce656
postcorrect: make work under METS Server
bertsky Sep 1, 2024
c4a5999
tests: use METS Server if OCRD_MAX_PARALLEL_PAGES>1
bertsky Sep 1, 2024
ae7dc67
make test: run serially and parallel, show times
bertsky Sep 1, 2024
e540b10
require ocrd>=3.0.0b4
bertsky Sep 2, 2024
99b3489
segment: adapt to numpy deprecation
bertsky Sep 26, 2024
dee1abf
eval/stats: Levenshtein -> rapidfuzz.distance.Levenshtein
kba Oct 11, 2024
279 changes: 118 additions & 161 deletions ocrd_cis/ocropy/binarize.py

Large diffs are not rendered by default.

56 changes: 21 additions & 35 deletions ocrd_cis/ocropy/clip.py
@@ -1,6 +1,7 @@
from __future__ import absolute_import
from logging import Logger

import os.path
from os.path import join
import numpy as np
from PIL import Image, ImageStat, ImageOps
from shapely.geometry import Polygon
@@ -14,7 +15,6 @@
from ocrd_utils import (
getLogger,
make_file_id,
assert_file_grp_cardinality,
coordinates_of_segment,
polygon_from_points,
bbox_from_polygon,
@@ -24,22 +24,20 @@
MIMETYPE_PAGE
)

from .. import get_ocrd_tool
from .ocrolib import midrange, morph
from .common import (
# binarize,
pil2array, array2pil
)

TOOL = 'ocrd-cis-ocropy-clip'
array2pil, determine_zoom, pil2array)

class OcropyClip(Processor):
logger: Logger

def __init__(self, *args, **kwargs):
self.ocrd_tool = get_ocrd_tool()
kwargs['ocrd_tool'] = self.ocrd_tool['tools'][TOOL]
kwargs['version'] = self.ocrd_tool['version']
super(OcropyClip, self).__init__(*args, **kwargs)
@property
def executable(self):
return 'ocrd-cis-ocropy-clip'

def setup(self):
self.logger = getLogger('processor.OcropyClip')

def process(self):
"""Clip text regions / lines of the workspace at intersections with neighbours.
@@ -76,13 +74,10 @@ def process(self):
# too. However, region-level clipping _must_ be run before region-level
# deskewing, because that would make segments incomensurable with their
# neighbours.
LOG = getLogger('processor.OcropyClip')
level = self.parameter['level-of-operation']
assert_file_grp_cardinality(self.input_file_grp, 1)
assert_file_grp_cardinality(self.output_file_grp, 1)

for (n, input_file) in enumerate(self.input_files):
Owner

Shouldn't we migrate to process_page_pcgts here, too?

Author

Yes, everywhere. For now, only binarize is migrated. I am lagging behind with the migration because I have no working tests locally: the server from which the test resources are downloaded is no longer available. Hence, I am only adapting things that I understand and am sure are right.
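
For orientation, the shape of the migration discussed here can be sketched without ocrd installed. Everything below is a stand-in: `FakePage`, `PageResult`, `MiniProcessor` and `MiniClip` are hypothetical names that only mimic the division of labour this PR moves towards (an `executable` property, logger initialisation in `setup()`, and a per-page `process_page_pcgts` that returns a result object while the base class owns the page loop). It is not the actual ocrd core API.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakePage:
    """Hypothetical stand-in for a parsed PAGE document."""
    page_id: str
    regions: List[str] = field(default_factory=list)


@dataclass
class PageResult:
    """Hypothetical stand-in for the per-page result object."""
    page: FakePage


class MiniProcessor:
    """Hypothetical base class: owns the page loop, subclasses handle one page."""

    logger: logging.Logger

    @property
    def executable(self) -> str:
        raise NotImplementedError

    def setup(self) -> None:
        # Logger is created once here instead of in every method.
        self.logger = logging.getLogger(self.executable)

    def process_page_pcgts(self, page: FakePage,
                           page_id: Optional[str] = None) -> PageResult:
        raise NotImplementedError

    def run(self, pages: List[FakePage]) -> Dict[str, PageResult]:
        self.setup()
        results = {}
        for page in pages:
            self.logger.info("processing page %s", page.page_id)
            results[page.page_id] = self.process_page_pcgts(
                page, page_id=page.page_id)
        return results


class MiniClip(MiniProcessor):
    @property
    def executable(self) -> str:
        return "mini-clip"

    def process_page_pcgts(self, page, page_id=None):
        if not page.regions:
            self.logger.warning('Page "%s" contains no text regions', page_id)
        # Placeholder for the real work: mark every region as processed.
        page.regions = [region + ":clipped" for region in page.regions]
        return PageResult(page)


results = MiniClip().run([FakePage("p1", ["r1", "r2"]), FakePage("p2")])
```

The point of the split is that the subclass no longer iterates over input files or writes output itself; it only transforms one page and hands the result back.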

Owner

Oops! Strange that GitHub pulled the older release of our GT. Anyway, with OCR-D/gt_structure_text#2 out of the way, it should now suffice to change the base URL in

https://github.com/MehmedGIT/ocrd_cis/blob/156d79fc051abeecf001cd6973e71c18efc659dd/tests/test_lib.bash#L8

to https://github.com/OCR-D/gt_structure_text/releases/tag/v1.5.0/

Owner

And again, thanks for being so thorough!

Author

That still did not help.

(venv38-core) mm@MM-Notebook:~/repos/ocrd_cis$ make test
bash tests/run_add_zip_test.bash > /dev/null 2>&1
make: *** [Makefile:25: tests/run_add_zip_test.bash] Error 1

To get more detailed errors, I did:

(venv38-core) mm@MM-Notebook:~/repos/ocrd_cis$ bash tests/run_add_zip_test.bash
--2024-08-14 14:47:30--  https://github.com/OCR-D/gt_structure_text/releases/tag/v1.5.0//blumenbach_anatomie_1805.ocrd.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘/home/mm/repos/ocrd_cis/download/blumenbach_anatomie_1805.ocrd.zip’

blumenbach_anatomie_1805.ocrd.zip                   [ <=>                                                                                                   ] 172,14K  --.-KB/s    in 0,08s   

2024-08-14 14:47:30 (2,16 MB/s) - ‘/home/mm/repos/ocrd_cis/download/blumenbach_anatomie_1805.ocrd.zip’ saved [176273]

14:47:31.446 INFO ocrd.workspace_bagger - Spilling /home/mm/repos/ocrd_cis/download/blumenbach_anatomie_1805.ocrd.zip to /tmp/tmp.QcrQegkntf/blumenbach_anatomie_1805
Traceback (most recent call last):
  File "/home/mm/venv38-core/bin/ocrd", line 8, in <module>
...
  File "/home/mm/repos/core/build/__editable__.ocrd-2.67.2-py3-none-any/ocrd_utils/os.py", line 74, in unzip_file_to_dir
    z = ZipFile(path_to_zip, 'r')
  File "/usr/lib/python3.8/zipfile.py", line 1271, in __init__
    self._RealGetContents()
  File "/usr/lib/python3.8/zipfile.py", line 1338, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

It seems the download fails for some reason, since the downloaded zip is much smaller than expected. I manually downloaded the zip and placed it where it is expected. Hooray, the tests started passing, but then:

(venv38-core) mm@MM-Notebook:~/repos/ocrd_cis$ make test
bash tests/run_add_zip_test.bash > /dev/null 2>&1
bash tests/run_alignment_test.bash > /dev/null 2>&1
make: *** [Makefile:25: tests/run_alignment_test.bash] Error 1
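
The `BadZipFile` above is consistent with the saved file being an HTML page rather than an archive (wget reports `[text/html]`). GitHub normally serves release assets under `/releases/download/<tag>/<asset>`, while `/releases/tag/<tag>/` is the HTML release page, which may be what happened here. A minimal sketch of failing early on such a download, using only the standard library; `looks_like_zip` and `make_samples` are hypothetical helpers and the file contents are made up for illustration:

```python
import os
import tempfile
import zipfile


def looks_like_zip(path: str) -> bool:
    """Return True if the file at `path` has a valid zip structure."""
    return zipfile.is_zipfile(path)


def make_samples(directory: str):
    """Create one real zip and one HTML file that merely has a .zip name."""
    good = os.path.join(directory, "good.ocrd.zip")
    with zipfile.ZipFile(good, "w") as archive:
        archive.writestr("mets.xml", "<mets/>")
    bad = os.path.join(directory, "bad.ocrd.zip")
    with open(bad, "w", encoding="utf-8") as handle:
        handle.write("<!DOCTYPE html><html><body>Release page</body></html>")
    return good, bad


with tempfile.TemporaryDirectory() as tmp:
    good_path, bad_path = make_samples(tmp)
    good_ok = looks_like_zip(good_path)
    bad_ok = looks_like_zip(bad_path)
```

A check like this right after the download would turn the traceback into a clear "not a zip file" message before the workspace is unpacked.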

Author

@MehmedGIT, Aug 15, 2024

> And can you try in a new venv, with just core and ocrd_cis installed?

Yes, I have already done that. There was another missing import: Levenshtein.
Should I add 'python-Levenshtein>=0.25.1' to setup.py, or was that library supposed to come from core?

After installing it manually, I am stuck on the same error again.

(venv38-core-v3) mm@MM-Notebook:~/repos/ocrd_cis$ python --version
Python 3.8.16
(venv38-core-v3) mm@MM-Notebook:~/repos/ocrd_cis$ pip freeze
...
Levenshtein==0.25.1
...
numpy==1.24.4
-e git+ssh://git@github.com/bertsky/core.git@228272b6a4ee94795e8266af4182eacae38e713c#egg=ocrd
ocrd-cis @ file:///home/mm/repos/ocrd_cis
...
(venv38-core-v3) mm@MM-Notebook:~/repos/ocrd_cis$ ocrd --version
ocrd, version 3.0.0a1
(venv38-core-v3) mm@MM-Notebook:~/repos/ocrd_cis$ which ocrd
/home/mm/venv38-core-v3/bin/ocrd
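
Whether a dependency such as Levenshtein has to be declared in setup.py or arrives transitively can be checked in a fresh venv by asking Python which top-level modules are importable. A minimal sketch, assuming only the standard library; `missing_modules` is a hypothetical helper and the module names in the call are purely illustrative:

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return those top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# "json" ships with Python; the second name is deliberately nonexistent.
missing = missing_modules(["json", "module_that_should_not_exist_xyz"])
```

Running this with the project's real import names in a venv that has only core and ocrd_cis installed would list what setup.py still needs to declare.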

Author

It is strange, because all the other tests pass normally although they invoke the same method and should therefore fail as well. o.0

Commenter (name not shown)

I also noticed the missing Levenshtein and the broken get_ocrd_tool. I'll send a PR for that after finishing bertsky/core#8, and then I'll try to reproduce.

Author

Just do the broken get_ocrd_tool fix in the PR. The missing Levenshtein was fixed by Robert; only the align import was not replaced.

Owner

Oops! Indeed, I forgot to update the non-ocropy processors in that regard.

LOG.info("INPUT FILE %i / %s", n, input_file.pageId or input_file.ID)
self.logger.info("INPUT FILE %i / %s", n, input_file.pageId or input_file.ID)
file_id = make_file_id(input_file, self.output_file_grp)

pcgts = page_from_file(self.workspace.download_file(input_file))
@@ -92,16 +87,8 @@ def process(self):

page_image, page_coords, page_image_info = self.workspace.image_from_page(
page, page_id, feature_selector='binarized')
if self.parameter['dpi'] > 0:
zoom = 300.0/self.parameter['dpi']
elif page_image_info.resolution != 1:
dpi = page_image_info.resolution
if page_image_info.resolutionUnit == 'cm':
dpi *= 2.54
LOG.info('Page "%s" uses %f DPI', page_id, dpi)
zoom = 300.0/dpi
else:
zoom = 1
zoom, dpi = determine_zoom(self.parameter['dpi'], page_image_info)
self.logger.info(f"Page '{page_id}' uses {dpi} DPI. Determined zoom={zoom}")
bertsky marked this conversation as resolved.

# FIXME: what about text regions inside table regions?
regions = list(page.get_TextRegion())
@@ -120,7 +107,7 @@ def process(self):
page.get_TableRegion() +
page.get_UnknownRegion())
if not num_texts:
LOG.warning('Page "%s" contains no text regions', page_id)
self.logger.warning('Page "%s" contains no text regions', page_id)
background = ImageStat.Stat(page_image)
# workaround for Pillow#4925
if len(background.bands) > 1:
@@ -151,7 +138,7 @@ def process(self):
if level == 'region':
if region.get_AlternativeImage():
# FIXME: This should probably be an exception (bad workflow configuration).
LOG.warning('Page "%s" region "%s" already contains image data: skipping',
self.logger.warning('Page "%s" region "%s" already contains image data: skipping',
page_id, region.id)
continue
shape = prep(shapes[i])
@@ -169,7 +156,7 @@ def process(self):
# level == 'line':
lines = region.get_TextLine()
if not lines:
LOG.warning('Page "%s" region "%s" contains no text lines', page_id, region.id)
self.logger.warning('Page "%s" region "%s" contains no text lines', page_id, region.id)
continue
region_image, region_coords = self.workspace.image_from_segment(
region, page_image, page_coords, feature_selector='binarized')
@@ -187,7 +174,7 @@ def process(self):
for j, line in enumerate(lines):
if line.get_AlternativeImage():
# FIXME: This should probably be an exception (bad workflow configuration).
LOG.warning('Page "%s" region "%s" line "%s" already contains image data: skipping',
self.logger.warning('Page "%s" region "%s" line "%s" already contains image data: skipping',
page_id, region.id, line.id)
continue
shape = prep(shapes[j])
@@ -203,7 +190,7 @@ def process(self):
input_file.pageId, file_id + '_' + region.id + '_' + line.id)

# update METS (add the PAGE file):
file_path = os.path.join(self.output_file_grp, file_id + '.xml')
file_path = join(self.output_file_grp, file_id + '.xml')
pcgts.set_pcGtsId(file_id)
out = self.workspace.add_file(
ID=file_id,
@@ -212,13 +199,12 @@ def process(self):
local_filename=file_path,
mimetype=MIMETYPE_PAGE,
content=to_xml(pcgts))
LOG.info('created file ID: %s, file_grp: %s, path: %s',
self.logger.info('created file ID: %s, file_grp: %s, path: %s',
file_id, self.output_file_grp, out.local_filename)

def process_segment(self, segment, segment_mask, segment_polygon, neighbours,
background_image, parent_image, parent_coords, parent_bin,
page_id, file_id):
LOG = getLogger('processor.OcropyClip')
# initialize AlternativeImage@comments classes from parent, except
# for those operations that can apply on multiple hierarchy levels:
features = ','.join(
@@ -230,7 +216,7 @@ def process_segment(self, segment, segment_mask, segment_polygon, neighbours,
segment_bbox = bbox_from_polygon(segment_polygon)
for neighbour, neighbour_mask in neighbours:
if not np.any(segment_mask > neighbour_mask):
LOG.info('Ignoring enclosing neighbour "%s" of segment "%s" on page "%s"',
self.logger.info('Ignoring enclosing neighbour "%s" of segment "%s" on page "%s"',
neighbour.id, segment.id, page_id)
continue
# find connected components that (only) belong to the neighbour:
@@ -240,7 +226,7 @@ def process_segment(self, segment, segment_mask, segment_polygon, neighbours,
num_foreground = np.count_nonzero(segment_mask * parent_bin)
if not num_intruders:
continue
LOG.debug('segment "%s" vs neighbour "%s": suppressing %d of %d pixels on page "%s"',
self.logger.debug('segment "%s" vs neighbour "%s": suppressing %d of %d pixels on page "%s"',
segment.id, neighbour.id, num_intruders, num_foreground, page_id)
# suppress in segment_mask so these intruders can stay in the neighbours
# (are not removed from both sides)
14 changes: 13 additions & 1 deletion ocrd_cis/ocropy/common.py
@@ -10,7 +10,7 @@
from skimage.morphology import medial_axis
import networkx as nx
from PIL import Image

from ocrd_models import OcrdExif
from . import ocrolib
from .ocrolib import morph, psegutils, sl
# for decorators (type-checks etc):
@@ -2102,3 +2102,15 @@ def find_topological():
# rlabels[region_hull] = region
# DSAVE('rlabels_closed', rlabels)
return rlabels

def determine_zoom(dpi: float, page_image_info: OcrdExif) -> (float, float):
    if dpi > 0:
        zoom = 300.0/dpi
    elif page_image_info.resolution != 1:
        dpi = page_image_info.resolution
        if page_image_info.resolutionUnit == 'cm':
            dpi *= 2.54
        zoom = 300.0/dpi
    else:
        zoom = 1
    return zoom, dpi
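
As a quick check of the arithmetic in `determine_zoom`, here is a self-contained sketch. `ImageInfo` is a hypothetical stand-in for `OcrdExif` that exposes only the two attributes the function reads; the function body mirrors the one added to common.py in this diff, with the return annotation written as `Tuple[float, float]` for the sketch only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ImageInfo:
    """Hypothetical stand-in for OcrdExif (only the attributes used here)."""
    resolution: float
    resolutionUnit: str


def determine_zoom(dpi: float, page_image_info: ImageInfo) -> Tuple[float, float]:
    if dpi > 0:
        # An explicit parameter wins over image metadata.
        zoom = 300.0 / dpi
    elif page_image_info.resolution != 1:
        dpi = page_image_info.resolution
        if page_image_info.resolutionUnit == 'cm':
            dpi *= 2.54  # pixels per cm -> pixels per inch
        zoom = 300.0 / dpi
    else:
        # No usable information: keep the image scale as it is.
        zoom = 1
    return zoom, dpi


explicit = determine_zoom(600, ImageInfo(1, 'inches'))
from_meta = determine_zoom(0, ImageInfo(150, 'inches'))
metric = determine_zoom(-1, ImageInfo(100, 'cm'))
fallback = determine_zoom(0, ImageInfo(1, 'inches'))
```

Note that in the fallback branch the returned `dpi` is the unmodified parameter (0 in the last call), so a caller that logs the DPI would report that value rather than a measured resolution.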
53 changes: 20 additions & 33 deletions ocrd_cis/ocropy/denoise.py
@@ -1,11 +1,10 @@
from __future__ import absolute_import

import os.path
from logging import Logger
from os.path import join

from ocrd_utils import (
getLogger,
make_file_id,
assert_file_grp_cardinality,
MIMETYPE_PAGE
)
from ocrd_modelfactory import page_from_file
@@ -14,20 +13,19 @@
)
from ocrd import Processor

from .. import get_ocrd_tool
from .common import (
# binarize,
remove_noise)

TOOL = 'ocrd-cis-ocropy-denoise'
determine_zoom, remove_noise)

class OcropyDenoise(Processor):
logger: Logger

@property
def executable(self):
return 'ocrd-cis-ocropy-denoise'

def __init__(self, *args, **kwargs):
self.ocrd_tool = get_ocrd_tool()
kwargs['ocrd_tool'] = self.ocrd_tool['tools'][TOOL]
kwargs['version'] = self.ocrd_tool['version']
super(OcropyDenoise, self).__init__(*args, **kwargs)
def setup(self):
self.logger = getLogger('processor.OcropyDenoise')

def process(self):
"""Despeckle the pages / regions / lines of the workspace.
@@ -50,13 +48,10 @@ def process(self):

Produce a new output file by serialising the resulting hierarchy.
"""
LOG = getLogger('processor.OcropyDenoise')
level = self.parameter['level-of-operation']
assert_file_grp_cardinality(self.input_file_grp, 1)
assert_file_grp_cardinality(self.output_file_grp, 1)

for (n, input_file) in enumerate(self.input_files):
MehmedGIT marked this conversation as resolved.
LOG.info("INPUT FILE %i / %s", n, input_file.pageId or input_file.ID)
self.logger.info("INPUT FILE %i / %s", n, input_file.pageId or input_file.ID)
file_id = make_file_id(input_file, self.output_file_grp)

pcgts = page_from_file(self.workspace.download_file(input_file))
@@ -67,24 +62,17 @@ def process(self):
page_image, page_xywh, page_image_info = self.workspace.image_from_page(
page, page_id,
feature_selector='binarized' if level == 'page' else '')
if self.parameter['dpi'] > 0:
zoom = 300.0/self.parameter['dpi']
elif page_image_info.resolution != 1:
dpi = page_image_info.resolution
if page_image_info.resolutionUnit == 'cm':
dpi *= 2.54
LOG.info('Page "%s" uses %f DPI', page_id, dpi)
zoom = 300.0/dpi
else:
zoom = 1

zoom, dpi = determine_zoom(self.parameter['dpi'], page_image_info)
self.logger.info(f"Page '{page_id}' uses {dpi} DPI. Determined zoom={zoom}")
bertsky marked this conversation as resolved.

if level == 'page':
self.process_segment(page, page_image, page_xywh, zoom,
input_file.pageId, file_id)
else:
regions = page.get_AllRegions(classes=['Text'], order='reading-order')
if not regions:
LOG.warning('Page "%s" contains no text regions', page_id)
self.logger.warning('Page "%s" contains no text regions', page_id)
for region in regions:
region_image, region_xywh = self.workspace.image_from_segment(
region, page_image, page_xywh,
@@ -95,7 +83,7 @@ def process(self):
continue
lines = region.get_TextLine()
if not lines:
LOG.warning('Page "%s" region "%s" contains no text lines', page_id, region.id)
self.logger.warning('Page "%s" region "%s" contains no text lines', page_id, region.id)
for line in lines:
line_image, line_xywh = self.workspace.image_from_segment(
line, region_image, region_xywh,
@@ -105,7 +93,7 @@ def process(self):
file_id + '_' + region.id + '_' + line.id)

# update METS (add the PAGE file):
file_path = os.path.join(self.output_file_grp, file_id + '.xml')
file_path = join(self.output_file_grp, file_id + '.xml')
pcgts.set_pcGtsId(file_id)
out = self.workspace.add_file(
ID=file_id,
@@ -114,15 +102,14 @@ def process(self):
local_filename=file_path,
mimetype=MIMETYPE_PAGE,
content=to_xml(pcgts))
LOG.info('created file ID: %s, file_grp: %s, path: %s',
self.logger.info('created file ID: %s, file_grp: %s, path: %s',
file_id, self.output_file_grp, out.local_filename)

def process_segment(self, segment, segment_image, segment_xywh, zoom, page_id, file_id):
LOG = getLogger('processor.OcropyDenoise')
if not segment_image.width or not segment_image.height:
LOG.warning("Skipping '%s' with zero size", file_id)
self.logger.warning("Skipping '%s' with zero size", file_id)
return
LOG.info("About to despeckle '%s'", file_id)
self.logger.info("About to despeckle '%s'", file_id)
bin_image = remove_noise(segment_image,
maxsize=self.parameter['noise_maxsize']/zoom*300/72) # in pt
# update METS (add the image file):
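
The `maxsize` expression passed to `remove_noise` can be checked by hand. Assuming `zoom = 300/dpi` as in `determine_zoom` and `noise_maxsize` given in points (1/72 inch), the expression amounts to pt × dpi / 72, presumably a size in pixels at the image's own resolution. A minimal sketch; `maxsize_in_pixels` is a hypothetical helper that only repeats the arithmetic and the sample values are made up:

```python
def maxsize_in_pixels(noise_maxsize_pt: float, zoom: float) -> float:
    """Same arithmetic as in the diff: pt / zoom * 300 / 72."""
    return noise_maxsize_pt / zoom * 300 / 72


at_300_dpi = maxsize_in_pixels(3, zoom=1.0)   # 300 DPI image
at_600_dpi = maxsize_in_pixels(3, zoom=0.5)   # 600 DPI image
```

So a 3 pt threshold corresponds to 12.5 pixels at 300 DPI and 25 pixels at 600 DPI, which is why the zoom has to be determined before despeckling.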