Port to ocrd core version 3.0.0 #5
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,16 +31,23 @@ | |
pil2array, array2pil | ||
) | ||
|
||
TOOL = 'ocrd-cis-ocropy-clip' | ||
|
||
class OcropyClip(Processor): | ||
|
||
def __init__(self, *args, **kwargs): | ||
self.logger = getLogger('processor.OcropyClip') | ||
self.ocrd_tool = get_ocrd_tool() | ||
kwargs['ocrd_tool'] = self.ocrd_tool['tools'][TOOL] | ||
kwargs['ocrd_tool'] = self.ocrd_tool['tools'][self.executable] | ||
kwargs['version'] = self.ocrd_tool['version'] | ||
super(OcropyClip, self).__init__(*args, **kwargs) | ||
|
||
@property | ||
def executable(self): | ||
return 'ocrd-cis-ocropy-clip' | ||
|
||
def setup(self): | ||
assert_file_grp_cardinality(self.input_file_grp, 1) | ||
assert_file_grp_cardinality(self.output_file_grp, 1) | ||
|
||
def process(self): | ||
"""Clip text regions / lines of the workspace at intersections with neighbours. | ||
|
||
|
@@ -76,13 +83,12 @@ def process(self): | |
# too. However, region-level clipping _must_ be run before region-level | ||
# deskewing, because that would make segments incomensurable with their | ||
# neighbours. | ||
LOG = getLogger('processor.OcropyClip') | ||
level = self.parameter['level-of-operation'] | ||
assert_file_grp_cardinality(self.input_file_grp, 1) | ||
assert_file_grp_cardinality(self.output_file_grp, 1) | ||
|
||
for (n, input_file) in enumerate(self.input_files): | ||
Shouldn't we migrate to

Yes, everywhere. For now only binarize is migrated. I am lagging with the migration because I have no working tests locally: the server from which the resources are downloaded is no longer available. Hence I adapt only the things I understand and am sure are right.

Oops! Strange that GitHub pulls the older release of our GT. Anyway, with OCR-D/gt_structure_text#2 out of the way it should now suffice to change the base URL in to

And again, thanks for being so thorough!

That still did not help.
(venv38-core) mm@MM-Notebook:~/repos/ocrd_cis$ make test
bash tests/run_add_zip_test.bash > /dev/null 2>&1
make: *** [Makefile:25: tests/run_add_zip_test.bash] Error 1

To get more detailed errors, I ran:
(venv38-core) mm@MM-Notebook:~/repos/ocrd_cis$ bash tests/run_add_zip_test.bash
--2024-08-14 14:47:30-- https://github.com/OCR-D/gt_structure_text/releases/tag/v1.5.0//blumenbach_anatomie_1805.ocrd.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘/home/mm/repos/ocrd_cis/download/blumenbach_anatomie_1805.ocrd.zip’
blumenbach_anatomie_1805.ocrd.zip [ <=> ] 172,14K --.-KB/s in 0,08s
2024-08-14 14:47:30 (2,16 MB/s) - ‘/home/mm/repos/ocrd_cis/download/blumenbach_anatomie_1805.ocrd.zip’ saved [176273]
14:47:31.446 INFO ocrd.workspace_bagger - Spilling /home/mm/repos/ocrd_cis/download/blumenbach_anatomie_1805.ocrd.zip to /tmp/tmp.QcrQegkntf/blumenbach_anatomie_1805
Traceback (most recent call last):
File "/home/mm/venv38-core/bin/ocrd", line 8, in <module>
...
File "/home/mm/repos/core/build/__editable__.ocrd-2.67.2-py3-none-any/ocrd_utils/os.py", line 74, in unzip_file_to_dir
z = ZipFile(path_to_zip, 'r')
File "/usr/lib/python3.8/zipfile.py", line 1271, in __init__
self._RealGetContents()
File "/usr/lib/python3.8/zipfile.py", line 1338, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

It seems the download fails for some reason, since the saved zip is much smaller than expected. I manually downloaded the zip and placed it where it is expected. Hooray, the tests started passing, but then:
(venv38-core) mm@MM-Notebook:~/repos/ocrd_cis$ make test
bash tests/run_add_zip_test.bash > /dev/null 2>&1
bash tests/run_alignment_test.bash > /dev/null 2>&1
make: *** [Makefile:25: tests/run_alignment_test.bash] Error 1

Yes, I have already done that. There was another missing import: Levenshtein. After installing it manually, I am stuck on the same error again.
(venv38-core-v3) mm@MM-Notebook:~/repos/ocrd_cis$ python --version
Python 3.8.16
(venv38-core-v3) mm@MM-Notebook:~/repos/ocrd_cis$ pip freeze
...
Levenshtein==0.25.1
...
numpy==1.24.4
-e git+ssh://git@github.com/bertsky/core.git@228272b6a4ee94795e8266af4182eacae38e713c#egg=ocrd
ocrd-cis @ file:///home/mm/repos/ocrd_cis
...
(venv38-core-v3) mm@MM-Notebook:~/repos/ocrd_cis$ ocrd --version
ocrd, version 3.0.0a1
(venv38-core-v3) mm@MM-Notebook:~/repos/ocrd_cis$ which ocrd
/home/mm/venv38-core-v3/bin/ocrd

It is strange, because all the other tests pass normally although they invoke the same method and should fail as well. o.0

I also noticed missing Levenshtein and also broken

Just do the broken

Oops! Indeed, I forgot to update the non-ocropy processors in that regard.
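The wget log in the thread above reports `Length: unspecified [text/html]` and a saved file of 176273 bytes, which suggests that the requested URL (a release tag page) returned an HTML page rather than the archive. That would explain the later `BadZipFile`. A minimal sketch of catching this right after the download, using only Python's standard library; the helper name `ensure_zip` is hypothetical and is not part of ocrd_cis, its test scripts, or OCR-D core:

```python
import zipfile
from pathlib import Path


def ensure_zip(path):
    """Raise ValueError early if `path` is not a readable zip archive.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if not path.is_file():
        raise ValueError(f"{path} does not exist")
    if not zipfile.is_zipfile(path):
        # Peek at the first bytes to give a more useful hint than BadZipFile
        head = path.read_bytes()[:256].lstrip().lower()
        looks_like_html = head.startswith((b"<!doctype", b"<html"))
        hint = " (looks like an HTML page)" if looks_like_html else ""
        raise ValueError(f"{path} is not a zip file{hint}")
    return path
```

Calling such a check before unpacking would turn the traceback into a single message that points at the download step rather than at the unzip step.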
||
LOG.info("INPUT FILE %i / %s", n, input_file.pageId or input_file.ID) | ||
self.logger.info("INPUT FILE %i / %s", n, input_file.pageId or input_file.ID) | ||
file_id = make_file_id(input_file, self.output_file_grp) | ||
|
||
pcgts = page_from_file(self.workspace.download_file(input_file)) | ||
|
@@ -98,7 +104,7 @@ def process(self): | |
dpi = page_image_info.resolution | ||
if page_image_info.resolutionUnit == 'cm': | ||
dpi *= 2.54 | ||
LOG.info('Page "%s" uses %f DPI', page_id, dpi) | ||
self.logger.info('Page "%s" uses %f DPI', page_id, dpi) | ||
zoom = 300.0/dpi | ||
else: | ||
zoom = 1 | ||
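The context lines of this hunk convert a resolution given per centimetre to DPI and derive a zoom factor relative to 300 DPI. A standalone sketch of that arithmetic follows. The function name `zoom_for` and the rule that a missing resolution yields a zoom of 1 are assumptions, because the condition guarding the `else` branch is not visible in the hunk:

```python
def zoom_for(resolution, unit="inches", target_dpi=300.0):
    """Return the factor that scales an image of the given resolution to target_dpi.

    Assumption: a falsy resolution means "unknown", in which case no scaling
    is applied. The real condition is outside the hunk shown here.
    """
    if not resolution:
        return 1
    dpi = float(resolution)
    if unit == "cm":
        # pixels per centimetre -> pixels per inch
        dpi *= 2.54
    return target_dpi / dpi
```

For instance, a 150 DPI page gets a zoom of 2.0, and a page at about 118.11 pixels per centimetre (roughly 300 DPI) gets a zoom close to 1.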
|
@@ -120,7 +126,7 @@ def process(self): | |
page.get_TableRegion() + | ||
page.get_UnknownRegion()) | ||
if not num_texts: | ||
LOG.warning('Page "%s" contains no text regions', page_id) | ||
self.logger.warning('Page "%s" contains no text regions', page_id) | ||
background = ImageStat.Stat(page_image) | ||
# workaround for Pillow#4925 | ||
if len(background.bands) > 1: | ||
|
@@ -151,7 +157,7 @@ def process(self): | |
if level == 'region': | ||
if region.get_AlternativeImage(): | ||
# FIXME: This should probably be an exception (bad workflow configuration). | ||
LOG.warning('Page "%s" region "%s" already contains image data: skipping', | ||
self.logger.warning('Page "%s" region "%s" already contains image data: skipping', | ||
page_id, region.id) | ||
continue | ||
shape = prep(shapes[i]) | ||
|
@@ -169,7 +175,7 @@ def process(self): | |
# level == 'line': | ||
lines = region.get_TextLine() | ||
if not lines: | ||
LOG.warning('Page "%s" region "%s" contains no text lines', page_id, region.id) | ||
self.logger.warning('Page "%s" region "%s" contains no text lines', page_id, region.id) | ||
continue | ||
region_image, region_coords = self.workspace.image_from_segment( | ||
region, page_image, page_coords, feature_selector='binarized') | ||
|
@@ -187,7 +193,7 @@ def process(self): | |
for j, line in enumerate(lines): | ||
if line.get_AlternativeImage(): | ||
# FIXME: This should probably be an exception (bad workflow configuration). | ||
LOG.warning('Page "%s" region "%s" line "%s" already contains image data: skipping', | ||
self.logger.warning('Page "%s" region "%s" line "%s" already contains image data: skipping', | ||
page_id, region.id, line.id) | ||
continue | ||
shape = prep(shapes[j]) | ||
|
@@ -212,13 +218,12 @@ def process(self): | |
local_filename=file_path, | ||
mimetype=MIMETYPE_PAGE, | ||
content=to_xml(pcgts)) | ||
LOG.info('created file ID: %s, file_grp: %s, path: %s', | ||
self.logger.info('created file ID: %s, file_grp: %s, path: %s', | ||
file_id, self.output_file_grp, out.local_filename) | ||
|
||
def process_segment(self, segment, segment_mask, segment_polygon, neighbours, | ||
background_image, parent_image, parent_coords, parent_bin, | ||
page_id, file_id): | ||
LOG = getLogger('processor.OcropyClip') | ||
# initialize AlternativeImage@comments classes from parent, except | ||
# for those operations that can apply on multiple hierarchy levels: | ||
features = ','.join( | ||
|
@@ -230,7 +235,7 @@ def process_segment(self, segment, segment_mask, segment_polygon, neighbours, | |
segment_bbox = bbox_from_polygon(segment_polygon) | ||
for neighbour, neighbour_mask in neighbours: | ||
if not np.any(segment_mask > neighbour_mask): | ||
LOG.info('Ignoring enclosing neighbour "%s" of segment "%s" on page "%s"', | ||
self.logger.info('Ignoring enclosing neighbour "%s" of segment "%s" on page "%s"', | ||
neighbour.id, segment.id, page_id) | ||
continue | ||
# find connected components that (only) belong to the neighbour: | ||
|
@@ -240,7 +245,7 @@ def process_segment(self, segment, segment_mask, segment_polygon, neighbours, | |
num_foreground = np.count_nonzero(segment_mask * parent_bin) | ||
if not num_intruders: | ||
continue | ||
LOG.debug('segment "%s" vs neighbour "%s": suppressing %d of %d pixels on page "%s"', | ||
self.logger.debug('segment "%s" vs neighbour "%s": suppressing %d of %d pixels on page "%s"', | ||
segment.id, neighbour.id, num_intruders, num_foreground, page_id) | ||
# suppress in segment_mask so these intruders can stay in the neighbours | ||
# (are not removed from both sides) | ||
|
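The pattern applied throughout this diff can be sketched without OCR-D installed. It replaces a `LOG = getLogger(...)` lookup in every method with a single `self.logger` attribute, exposes the tool name through an `executable` property, and moves the file group cardinality checks from `process` into `setup`. The following uses only Python's standard `logging` module; `BaseProcessor` and `ClipSketch` are hypothetical stand-ins and do not reproduce the real `ocrd.Processor` API:

```python
import logging


class BaseProcessor:
    """Hypothetical stand-in for a processor base class (not ocrd.Processor)."""

    def __init__(self, input_file_grp, output_file_grp):
        self.input_file_grp = input_file_grp
        self.output_file_grp = output_file_grp
        self.setup()

    def setup(self):
        pass


class ClipSketch(BaseProcessor):
    def __init__(self, *args, **kwargs):
        # One logger per instance, created once instead of in every method
        self.logger = logging.getLogger('processor.ClipSketch')
        super().__init__(*args, **kwargs)

    @property
    def executable(self):
        return 'ocrd-cis-ocropy-clip'

    def setup(self):
        # Validate once, before any page is processed
        for grp in (self.input_file_grp, self.output_file_grp):
            if len(grp.split(',')) != 1:
                raise ValueError(f'expected exactly one file group, got "{grp}"')

    def process(self, page_ids):
        for n, page_id in enumerate(page_ids):
            self.logger.info("INPUT FILE %i / %s", n, page_id)
        return len(page_ids)
```

In this sketch the logger is assigned before `super().__init__()` runs, so that `setup` and any other hook called by the base constructor can already use `self.logger`.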