Machine based reading order integration #140

Open
wants to merge 110 commits into
base: main
110 commits
5fdc6d4
integration of machine based reading order detection
vahidrezanezhad Oct 14, 2023
49c9314
machine based reading order inference with a variable batch size
vahidrezanezhad Oct 20, 2023
59c0d90
machine based reading order inference & optimized algorithm
vahidrezanezhad Oct 20, 2023
941d873
machine based reading order & works for not full layout case
vahidrezanezhad Oct 20, 2023
eac18c5
machine based reading order as an argument
vahidrezanezhad Dec 13, 2023
5144668
ocr engine first integration
vahidrezanezhad Jul 17, 2024
a62ae37
new full layout model and early layout for 1&2 column images are inte…
vahidrezanezhad Aug 7, 2024
be144db
updating 1&2 columns images + full layout
vahidrezanezhad Aug 7, 2024
00bf2b6
1&2 column images only printspace
vahidrezanezhad Aug 7, 2024
e976778
testing pyproject.toml
vahidrezanezhad Aug 14, 2024
53fd5fb
resolving #106 for pyproject.toml test
vahidrezanezhad Aug 14, 2024
4c50479
pyproject.toml may work for ocrd
vahidrezanezhad Aug 14, 2024
74eac4d
dtype = object in the case of length 1 arise error
vahidrezanezhad Aug 15, 2024
6f4205b
update pyproject.toml
vahidrezanezhad Aug 15, 2024
4f8210d
update Makefile model location
cneud Aug 15, 2024
c10a525
inference with batch size bigger than 1
vahidrezanezhad Aug 23, 2024
04e7900
making light version faster for 1 and 2 columns images
vahidrezanezhad Aug 24, 2024
7ae6a87
ignoring dpi check by light version
vahidrezanezhad Aug 26, 2024
9300595
inference batch size debugged
vahidrezanezhad Aug 27, 2024
0f87974
writing drop capitals in xml output + and may resolve issue #110
vahidrezanezhad Sep 2, 2024
c3a4a1b
resolving issue #110 in a better way
vahidrezanezhad Sep 3, 2024
f0b4907
adding option for textline detection in printspace
vahidrezanezhad Sep 3, 2024
2c93904
avoiding double binarization
vahidrezanezhad Sep 12, 2024
1b18ae8
passing number of columns as an argument
vahidrezanezhad Sep 12, 2024
21380fc
scaling contours without dilation
vahidrezanezhad Sep 17, 2024
a1f1f98
updating scaling contours
vahidrezanezhad Sep 17, 2024
5a07cd9
the most effective version of contours dilation without opencv and al…
vahidrezanezhad Sep 19, 2024
2d18739
postprocessing of textline contour dilation + skip layout and reading…
vahidrezanezhad Sep 20, 2024
b9e8959
update of light versions
vahidrezanezhad Sep 20, 2024
5d68013
updating light version
vahidrezanezhad Sep 20, 2024
7f08458
dilation of text regions without opencv
vahidrezanezhad Sep 21, 2024
62f8ae4
updating dilation of textlines and text regions
vahidrezanezhad Sep 23, 2024
6626dc6
updating textline dilation parameters
vahidrezanezhad Sep 23, 2024
b33739a
parametrization in the case of textline contours dilation is accompli…
vahidrezanezhad Sep 24, 2024
95effe5
updating textregions dilation
vahidrezanezhad Sep 25, 2024
1330911
dilation of textregions and marginals are accomplished
vahidrezanezhad Sep 27, 2024
ad32316
updating light version
vahidrezanezhad Sep 27, 2024
1774076
updating light version. Remove textlines or textregion contours insid…
vahidrezanezhad Sep 30, 2024
ab63d5b
updating light version features
vahidrezanezhad Sep 30, 2024
543ed4b
-light version need -tll to be enabled otherwise the process will be …
vahidrezanezhad Oct 2, 2024
1da4b7f
updating light version
vahidrezanezhad Oct 7, 2024
3ef4eac
textlines of textregions are extracted in a faster way + early layout…
vahidrezanezhad Oct 17, 2024
f93fa12
doing more multiprocessing in order to make the process faster
vahidrezanezhad Oct 18, 2024
70772d4
binarization as a standalone command
vahidrezanezhad Oct 21, 2024
328d33e
Temporary commit – textline prediction without patches
vahidrezanezhad Oct 23, 2024
82281bd
fixing a bug occurring with reading order + Slro option with no patch …
vahidrezanezhad Oct 25, 2024
5037e98
Merge branch 'machine_based_reading_order_integration' of https://git…
vahidrezanezhad Oct 25, 2024
90ee2d6
textline segmentation is masked with drop capitals
vahidrezanezhad Oct 28, 2024
438df52
updating
vahidrezanezhad Oct 29, 2024
e796a99
updating inference for early layout in the case of documents with num…
vahidrezanezhad Oct 30, 2024
751b010
updating early layout inference for light version
vahidrezanezhad Nov 5, 2024
f7e5fb9
resolving merge conflict of machine based reading order and extractin…
vahidrezanezhad Nov 5, 2024
bceeeb5
Merge pull request #138 from qurator-spk/extracting_images_only
vahidrezanezhad Nov 5, 2024
6aee70d
Resolve merge conflict of main and machine based reading order branch
vahidrezanezhad Nov 5, 2024
0914b5f
resolve merge conflict of main branch with machine based reading ord…
vahidrezanezhad Nov 5, 2024
8409de0
sbb_binarization is integrated into eynollah works in framework of oc…
vahidrezanezhad Nov 10, 2024
1ae77e6
Update requirements.txt
cneud Nov 11, 2024
22b0b07
drop capital and marginals extraction is updated
vahidrezanezhad Nov 11, 2024
f43c49c
textlines of drop capitals are connected to corresponding textline if…
vahidrezanezhad Nov 13, 2024
ce5b611
tests are passed - new models by the way should be uploaded
vahidrezanezhad Nov 14, 2024
5fa8ca4
updating requirements
vahidrezanezhad Nov 14, 2024
d9f79c3
fixing IndexError by reading order detection
vahidrezanezhad Nov 18, 2024
b622494
new table detection model is integrated
vahidrezanezhad Nov 21, 2024
1746920
Update Makefile
vahidrezanezhad Nov 21, 2024
3000255
Update Makefile
vahidrezanezhad Nov 22, 2024
8014a9e
Update Makefile
vahidrezanezhad Nov 22, 2024
1083d1c
gha: try to free disk space
kba Nov 25, 2024
6aad006
filter textregions without textline
vahidrezanezhad Dec 2, 2024
871d7bf
fixed: machine based reading order cause tuple index out of range err…
vahidrezanezhad Dec 4, 2024
f765e26
move Torch to optional dependencies (to avoid clash with TF over CuDNN)
bertsky Dec 4, 2024
7ae64f3
RO model: do not reload when in dir_in mode
bertsky Dec 4, 2024
3b9a29b
simplify dir_in conditionals
bertsky Dec 4, 2024
329fac2
do not reload enhancement model in dir_in mode, simplify
bertsky Dec 4, 2024
14beb46
simplify loading models w/o dir_in mode
bertsky Dec 4, 2024
9f12fa2
log-level: only set 'eynollah' logger level
bertsky Dec 4, 2024
5b82320
avoid indentation
bertsky Dec 4, 2024
cd4e426
avoid indentation (skip_layout_and_reading_order)
bertsky Dec 4, 2024
a520bd1
wrap extremely long lines
bertsky Dec 4, 2024
3d88b20
run: log instead of print
bertsky Dec 5, 2024
aaea2ef
simplify
bertsky Dec 5, 2024
055463d
avoid indentation
bertsky Dec 5, 2024
c3163ca
avoid indentation
bertsky Dec 5, 2024
ad748d0
do_prediction: avoid code duplication
bertsky Dec 9, 2024
d680170
do_prediction: trigger GC to avoid CUDA OOM
bertsky Dec 9, 2024
6fe02df
do_image_rotation: fix f93fa12 (do return results)
bertsky Dec 9, 2024
54cb150
do_image_rotation / return_deskew_slop: avoid code duplication, simpl…
bertsky Dec 9, 2024
5e0c1da
simplify
bertsky Dec 11, 2024
21efea8
no del on function argument
bertsky Dec 11, 2024
25e9673
exit early if no text regions found (to avoid segfault)
bertsky Dec 11, 2024
68456ea
do_work_of_slopes_new*, do_back_rotation_and_get_cnt_back, do_work_of…
bertsky Dec 11, 2024
7e9ee90
switch from (ad-hoc) mp.Pool to (attribute) concurrent.futures.Proces…
bertsky Dec 11, 2024
3b70b11
avoid deskewing patches if binary-empty
bertsky Dec 11, 2024
9270ea4
annotate region angles in PAGE
bertsky Dec 11, 2024
b9ca7a6
log num_cols-dependent resizing
bertsky Dec 11, 2024
b4b0890
add option to overwrite output xml, but skip by default if file exists
bertsky Dec 11, 2024
dcaf796
change polarity of orientation angle (PAGE schema required cw=positive)
bertsky Dec 11, 2024
e9c0d71
CI: install optional dependencies, too
bertsky Dec 11, 2024
0e8c561
debugging issues
vahidrezanezhad Dec 13, 2024
f93c6c2
function of patch-wise inference with scatter_nd is added
vahidrezanezhad Dec 14, 2024
0ae28f7
switch from stdlib to loky.ProcessPoolExecutor, ensure shutdown
bertsky Dec 14, 2024
fbeef79
adding scatter_nd inference
vahidrezanezhad Dec 16, 2024
92bfac4
Provide OCR as an option to process a directory of XML files, incorpo…
vahidrezanezhad Dec 20, 2024
01376af
do_order_of_regions_with_model: simplify
bertsky Dec 22, 2024
cfc6512
reduce redundancy/indentation
bertsky Dec 22, 2024
335aa27
simplify, wrap extremely long lines
bertsky Dec 23, 2024
33fda2f
changing cnn ocr model name
vahidrezanezhad Dec 26, 2024
25116a2
resolved 2 errors
vahidrezanezhad Feb 18, 2025
7110bd9
resolved an error for light version in the case that slope_deskew is …
vahidrezanezhad Feb 27, 2025
54040c1
Merge remote-tracking branch 'bertsky/machine_based_reading_order_int…
kba Mar 6, 2025
a4f1f35
Resolving test failure
vahidrezanezhad Mar 7, 2025
8 changes: 7 additions & 1 deletion .github/workflows/test-eynollah.yml
@@ -14,6 +14,12 @@ jobs:
python-version: ['3.8', '3.9', '3.10', '3.11']

steps:
- name: clean up
run: |
sudo rm -rf /usr/share/dotnet
sudo rm -rf /opt/ghc
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
- uses: actions/checkout@v4
- uses: actions/cache@v4
id: model_cache
@@ -30,7 +36,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install .
pip install .[OCR,plotting]
pip install -r requirements-test.txt
- name: Test with pytest
run: make test
6 changes: 3 additions & 3 deletions Makefile
@@ -32,9 +32,9 @@ models_eynollah: models_eynollah.tar.gz
models_eynollah.tar.gz:
# wget 'https://qurator-data.de/eynollah/2021-04-25/models_eynollah.tar.gz'
# wget 'https://qurator-data.de/eynollah/2022-04-05/models_eynollah_renamed.tar.gz'
# wget 'https://qurator-data.de/eynollah/2022-04-05/models_eynollah_renamed_savedmodel.tar.gz'
wget 'https://qurator-data.de/eynollah/2022-04-05/models_eynollah.tar.gz'
# wget 'https://github.com/qurator-spk/eynollah/releases/download/v0.3.0/models_eynollah.tar.gz'
wget 'https://github.com/qurator-spk/eynollah/releases/download/v0.3.1/models_eynollah.tar.gz'
# wget 'https://github.com/qurator-spk/eynollah/releases/download/v0.3.1/models_eynollah.tar.gz'

# Install with pip
install:
@@ -45,7 +45,7 @@ install-dev:
pip install -e .

smoke-test:
eynollah -i tests/resources/kant_aufklaerung_1784_0020.tif -o . -m $(PWD)/models_eynollah
eynollah layout -i tests/resources/kant_aufklaerung_1784_0020.tif -o . -m $(PWD)/models_eynollah

# Run unit tests
test:
5 changes: 5 additions & 0 deletions pyproject.toml
@@ -25,9 +25,14 @@ classifiers = [
"Topic :: Scientific/Engineering :: Image Processing",
]

[project.optional-dependencies]
OCR = ["torch <= 2.0.1", "transformers <= 4.30.2"]
plotting = ["matplotlib"]

[project.scripts]
eynollah = "eynollah.cli:main"
ocrd-eynollah-segment = "eynollah.ocrd_cli:main"
ocrd-sbb-binarize = "eynollah.ocrd_cli_binarization:cli"

[project.urls]
Homepage = "https://github.com/qurator-spk/eynollah"
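
With `torch` and `transformers` moved into the `OCR` extra, code that needs them presumably has to cope with their absence. A minimal sketch of one way to detect this up front, using only the standard library; `missing_modules` is a hypothetical helper and not part of this PR, and the module names are assumed to match the packages listed under `OCR`:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found.

    Hypothetical helper for illustration; it is not part of eynollah.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names assumed to correspond to the OCR extra in pyproject.toml.
    missing = missing_modules(["torch", "transformers"])
    if missing:
        print("OCR support unavailable, missing: " + ", ".join(missing))
        # Same extras syntax as the CI workflow in this PR.
        print("Try: pip install .[OCR]")
    else:
        print("OCR dependencies found")
```

Checking with `find_spec` avoids importing the heavy packages just to learn whether they are installed.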
6 changes: 3 additions & 3 deletions requirements.txt
@@ -2,7 +2,7 @@
ocrd >= 2.23.3
numpy <1.24.0
scikit-learn >= 0.23.2
tensorflow == 2.12.1
tensorflow < 2.13
imutils >= 0.5.3
matplotlib
setuptools >= 50
numba <= 0.58.1
loky
212 changes: 182 additions & 30 deletions src/eynollah/cli.py
@@ -1,23 +1,108 @@
import sys
import click
from ocrd_utils import initLogging, setOverrideLogLevel
from eynollah.eynollah import Eynollah
from eynollah.eynollah import Eynollah, Eynollah_ocr
from eynollah.sbb_binarize import SbbBinarizer

@click.group()
def main():
pass

@click.command()
@main.command()
@click.option(
"--dir_xml",
"-dx",
help="directory of GT page-xml files",
type=click.Path(exists=True, file_okay=False),
)

@click.option(
"--dir_out_modal_image",
"-domi",
help="directory where ground truth images would be written",
type=click.Path(exists=True, file_okay=False),
)

@click.option(
"--dir_out_classes",
"-docl",
help="directory where ground truth classes would be written",
type=click.Path(exists=True, file_okay=False),
)

@click.option(
"--input_height",
"-ih",
help="input height",
)
@click.option(
"--input_width",
"-iw",
help="input width",
)
@click.option(
"--min_area_size",
"-min",
help="min area size of regions considered for reading order training.",
)

def machine_based_reading_order(dir_xml, dir_out_modal_image, dir_out_classes, input_height, input_width, min_area_size):
xml_files_ind = os.listdir(dir_xml)

@main.command()
@click.option('--patches/--no-patches', default=True, help='by enabling this parameter you let the model to see the image in patches.')

@click.option('--model_dir', '-m', type=click.Path(exists=True, file_okay=False), required=True, help='directory containing models for prediction')

@click.argument('input_image')

@click.argument('output_image')
@click.option(
"--dir_in",
"-di",
help="directory of images",
type=click.Path(exists=True, file_okay=False),
)
@click.option(
"--dir_out",
"-do",
help="directory where the binarized images will be written",
type=click.Path(exists=True, file_okay=False),
)

def binarization(patches, model_dir, input_image, output_image, dir_in, dir_out):
if not dir_out and (dir_in):
print("Error: You used -di but did not set -do")
sys.exit(1)
elif dir_out and not (dir_in):
print("Error: You used -do to write out binarized images but have not set -di")
sys.exit(1)
SbbBinarizer(model_dir).run(image_path=input_image, use_patches=patches, save=output_image, dir_in=dir_in, dir_out=dir_out)
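
The two checks in `binarization` enforce that `-di` and `-do` are only used together. A minimal sketch of the same rule as a pure function, which might be easier to unit-test than `print` plus `sys.exit`; `check_dir_pair` is a hypothetical name and does not exist in this PR:

```python
def check_dir_pair(dir_in, dir_out):
    """Return an error message if exactly one directory is set, else None.

    Hypothetical helper mirroring the -di/-do checks in `binarization`.
    """
    if dir_in and not dir_out:
        return "Error: You used -di but did not set -do"
    if dir_out and not dir_in:
        return "Error: You used -do to write out binarized images but have not set -di"
    return None


if __name__ == "__main__":
    # Both unset (single-image mode) and both set (directory mode) are valid.
    for pair in [(None, None), ("in", "out"), ("in", None), (None, "out")]:
        print(pair, "->", check_dir_pair(*pair))
```

A caller could then print the message and exit with status 1 whenever the function returns something other than `None`.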




@main.command()
@click.option(
"--image",
"-i",
help="image filename",
type=click.Path(exists=True, dir_okay=False),
)

@click.option(
"--out",
"-o",
help="directory to write output xml data",
type=click.Path(exists=True, file_okay=False),
required=True,
)
@click.option(
"--overwrite",
"-O",
help="overwrite (instead of skipping) if output xml exists",
is_flag=True,
)
@click.option(
"--dir_in",
"-di",
@@ -140,39 +225,44 @@
help="if this parameter set to true, this tool would ignore page extraction",
)
@click.option(
"--log-level",
"--reading_order_machine_based/--heuristic_reading_order",
"-romb/-hro",
is_flag=True,
help="if this parameter set to true, this tool would apply machine based reading order detection",
)
@click.option(
"--do_ocr",
"-ocr/-noocr",
is_flag=True,
help="if this parameter set to true, this tool will try to do ocr",
)
@click.option(
"--num_col_upper",
"-ncu",
help="upper limit of columns in document image",
)
@click.option(
"--num_col_lower",
"-ncl",
help="lower limit of columns in document image",
)
@click.option(
"--skip_layout_and_reading_order",
"-slro/-noslro",
is_flag=True,
help="if this parameter set to true, this tool will ignore layout detection and reading order. It means that textline detection will be done within printspace and contours of textline will be written in xml output file.",
)
@click.option(
"--log_level",
"-l",
type=click.Choice(['OFF', 'DEBUG', 'INFO', 'WARN', 'ERROR']),
help="Override log level globally to this",
)
def main(
image,
out,
dir_in,
model,
save_images,
save_layout,
save_deskewed,
save_all,
extract_only_images,
save_page,
enable_plotting,
allow_enhancement,
curved_line,
textline_light,
full_layout,
tables,
right2left,
input_binary,
allow_scaling,
headers_off,
light_version,
ignore_page_extraction,
log_level
):
if log_level:
setOverrideLogLevel(log_level)

def layout(image, out, overwrite, dir_in, model, save_images, save_layout, save_deskewed, save_all, extract_only_images, save_page, enable_plotting, allow_enhancement, curved_line, textline_light, full_layout, tables, right2left, input_binary, allow_scaling, headers_off, light_version, reading_order_machine_based, do_ocr, num_col_upper, num_col_lower, skip_layout_and_reading_order, ignore_page_extraction, log_level):
initLogging()
if log_level:
getLogger('eynollah').setLevel(getLevelName(log_level))
if not enable_plotting and (save_layout or save_deskewed or save_all or save_page or save_images or allow_enhancement):
print("Error: You used one of -sl, -sd, -sa, -sp, -si or -ae but did not enable plotting with -ep")
sys.exit(1)
@@ -182,11 +272,14 @@ def main(
if textline_light and not light_version:
print('Error: You used -tll to enable light textline detection but -light is not enabled')
sys.exit(1)
if light_version and not textline_light:
print('Error: You used -light without -tll. Light version need light textline to be enabled.')
if extract_only_images and (allow_enhancement or allow_scaling or light_version or curved_line or textline_light or full_layout or tables or right2left or headers_off) :
print('Error: You used -eoi which can not be enabled alongside light_version -light or allow_scaling -as or allow_enhancement -ae or curved_line -cl or textline_light -tll or full_layout -fl or tables -tab or right2left -r2l or headers_off -ho')
sys.exit(1)
eynollah = Eynollah(
image_filename=image,
overwrite=overwrite,
dir_out=out,
dir_in=dir_in,
dir_models=model,
@@ -208,12 +301,71 @@ def main(
headers_off=headers_off,
light_version=light_version,
ignore_page_extraction=ignore_page_extraction,
reading_order_machine_based=reading_order_machine_based,
do_ocr=do_ocr,
num_col_upper=num_col_upper,
num_col_lower=num_col_lower,
skip_layout_and_reading_order=skip_layout_and_reading_order,
)
if dir_in:
eynollah.run()
else:
pcgts = eynollah.run()
eynollah.writer.write_pagexml(pcgts)


@main.command()
@click.option(
"--dir_in",
"-di",
help="directory of images",
type=click.Path(exists=True, file_okay=False),
)
@click.option(
"--out",
"-o",
help="directory to write output xml data",
type=click.Path(exists=True, file_okay=False),
required=True,
)
@click.option(
"--dir_xmls",
"-dx",
help="directory of xmls",
type=click.Path(exists=True, file_okay=False),
)
@click.option(
"--model",
"-m",
help="directory of models",
type=click.Path(exists=True, file_okay=False),
required=True,
)
@click.option(
"--tr_ocr",
"-trocr/-notrocr",
is_flag=True,
help="if this parameter set to true, transformer ocr will be applied, otherwise cnn_rnn model.",
)
@click.option(
"--log_level",
"-l",
type=click.Choice(['OFF', 'DEBUG', 'INFO', 'WARN', 'ERROR']),
help="Override log level globally to this",
)

def ocr(dir_in, out, dir_xmls, model, tr_ocr, log_level):
if log_level:
setOverrideLogLevel(log_level)
initLogging()
eynollah_ocr = Eynollah_ocr(
dir_xmls=dir_xmls,
dir_in=dir_in,
dir_out=out,
dir_models=model,
tr_ocr=tr_ocr,
)
eynollah_ocr.run()

if __name__ == "__main__":
main()
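
In `layout`, the `--log_level` value is applied to the `eynollah` logger only (`getLogger('eynollah').setLevel(...)`), whereas `ocr` still calls `setOverrideLogLevel`. A sketch of the per-logger approach using the standard library `logging` module instead of `ocrd_utils`; the `LEVELS` table and `set_package_log_level` are hypothetical, and mapping `OFF` to a level above `CRITICAL` is an assumption of this sketch:

```python
import logging

# Names accepted by the --log_level option in the diff above. Mapping OFF to a
# level above CRITICAL is an assumption here, not documented ocrd_utils behaviour.
LEVELS = {
    "OFF": logging.CRITICAL + 1,
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
}


def set_package_log_level(level_name, logger_name="eynollah"):
    """Set the level of one named logger and leave the root logger alone."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(LEVELS[level_name])
    return logger


if __name__ == "__main__":
    root_before = logging.getLogger().level
    log = set_package_log_level("DEBUG")
    # Only the named logger changes; the root logger keeps its level.
    print(log.level == logging.DEBUG, logging.getLogger().level == root_before)
```

Scoping the level to one logger keeps third-party loggers at their own verbosity, which appears to be the intent of the "log-level: only set 'eynollah' logger level" commit.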