From 48237301d06fa06d474360e8adcd02f86c01ba34 Mon Sep 17 00:00:00 2001
From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com>
Date: Fri, 2 Apr 2021 04:46:22 +0000
Subject: [PATCH 01/30] Bump pillow from 8.1.0 to 8.2.0 in /code-env/python/spec

Bumps [pillow](https://github.com/python-pillow/Pillow) from 8.1.0 to 8.2.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/master/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/8.1.0...8.2.0)

Signed-off-by: dependabot-preview[bot]
---
 code-env/python/spec/requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt
index a065f6b..7162716 100644
--- a/code-env/python/spec/requirements.txt
+++ b/code-env/python/spec/requirements.txt
@@ -1,6 +1,6 @@
 pdf2image==1.14.0
 pytesseract==0.3.7
-Pillow==8.1.0
+Pillow==8.2.0
 matplotlib==3.3.4
 opencv-python==4.5.1.48
 deskew==0.10.3

From b8373b41fbfbadcdfd7b91db94e1be8fb63ff0db Mon Sep 17 00:00:00 2001
From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com>
Date: Mon, 10 May 2021 04:52:35 +0000
Subject: [PATCH 02/30] Bump matplotlib from 3.3.4 to 3.4.2 in /code-env/python/spec

Bumps [matplotlib](https://github.com/matplotlib/matplotlib) from 3.3.4 to 3.4.2.
- [Release notes](https://github.com/matplotlib/matplotlib/releases)
- [Commits](https://github.com/matplotlib/matplotlib/compare/v3.3.4...v3.4.2)

Signed-off-by: dependabot-preview[bot]
---
 code-env/python/spec/requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt
index a065f6b..4fca0ab 100644
--- a/code-env/python/spec/requirements.txt
+++ b/code-env/python/spec/requirements.txt
@@ -1,6 +1,6 @@
 pdf2image==1.14.0
 pytesseract==0.3.7
 Pillow==8.1.0
-matplotlib==3.3.4
+matplotlib==3.4.2
 opencv-python==4.5.1.48
 deskew==0.10.3

From 8271de4cfa4d2617971949519e5632d6f75a8be4 Mon Sep 17 00:00:00 2001
From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com>
Date: Wed, 23 Jun 2021 04:45:07 +0000
Subject: [PATCH 03/30] Bump pdf2image from 1.14.0 to 1.16.0 in /code-env/python/spec

Bumps [pdf2image](https://github.com/Belval/pdf2image) from 1.14.0 to 1.16.0.
- [Release notes](https://github.com/Belval/pdf2image/releases)
- [Commits](https://github.com/Belval/pdf2image/compare/v1.14.0...v1.16.0)

Signed-off-by: dependabot-preview[bot]
---
 code-env/python/spec/requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt
index a065f6b..212bdfd 100644
--- a/code-env/python/spec/requirements.txt
+++ b/code-env/python/spec/requirements.txt
@@ -1,4 +1,4 @@
-pdf2image==1.14.0
+pdf2image==1.16.0
 pytesseract==0.3.7
 Pillow==8.1.0
 matplotlib==3.3.4

From 9608999de6baaf4f628d0ab772fa3eda924d382f Mon Sep 17 00:00:00 2001
From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com>
Date: Tue, 29 Jun 2021 04:44:46 +0000
Subject: [PATCH 04/30] Bump pytesseract from 0.3.7 to 0.3.8 in /code-env/python/spec

Bumps [pytesseract](https://github.com/madmaze/pytesseract) from 0.3.7 to 0.3.8.
- [Release notes](https://github.com/madmaze/pytesseract/releases)
- [Commits](https://github.com/madmaze/pytesseract/compare/v0.3.7...v0.3.8)

Signed-off-by: dependabot-preview[bot]
---
 code-env/python/spec/requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt
index a065f6b..0f4a1a2 100644
--- a/code-env/python/spec/requirements.txt
+++ b/code-env/python/spec/requirements.txt
@@ -1,5 +1,5 @@
 pdf2image==1.14.0
-pytesseract==0.3.7
+pytesseract==0.3.8
 Pillow==8.1.0
 matplotlib==3.3.4
 opencv-python==4.5.1.48

From 74b891cad246aa3da405823e93a293f5f9225579 Mon Sep 17 00:00:00 2001
From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com>
Date: Mon, 12 Jul 2021 04:42:21 +0000
Subject: [PATCH 05/30] Bump opencv-python from 4.5.1.48 to 4.5.3.56 in /code-env/python/spec

Bumps [opencv-python](https://github.com/skvark/opencv-python) from 4.5.1.48 to 4.5.3.56.
- [Release notes](https://github.com/skvark/opencv-python/releases)
- [Commits](https://github.com/skvark/opencv-python/commits)

Signed-off-by: dependabot-preview[bot]
---
 code-env/python/spec/requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt
index a065f6b..900d874 100644
--- a/code-env/python/spec/requirements.txt
+++ b/code-env/python/spec/requirements.txt
@@ -2,5 +2,5 @@ pdf2image==1.14.0
 pytesseract==0.3.7
 Pillow==8.1.0
 matplotlib==3.3.4
-opencv-python==4.5.1.48
+opencv-python==4.5.3.56
 deskew==0.10.3

From ca70323c18c8194c319805d09904e68b3f19d6af Mon Sep 17 00:00:00 2001
From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com>
Date: Mon, 2 Aug 2021 04:43:11 +0000
Subject: [PATCH 06/30] Bump deskew from 0.10.3 to 0.10.33 in /code-env/python/spec

Bumps [deskew](https://github.com/sbrunner/deskew) from 0.10.3 to 0.10.33.
- [Release notes](https://github.com/sbrunner/deskew/releases)
- [Commits](https://github.com/sbrunner/deskew/commits)

Signed-off-by: dependabot-preview[bot]
---
 code-env/python/spec/requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt
index a065f6b..d60820e 100644
--- a/code-env/python/spec/requirements.txt
+++ b/code-env/python/spec/requirements.txt
@@ -3,4 +3,4 @@ pytesseract==0.3.7
 Pillow==8.1.0
 matplotlib==3.3.4
 opencv-python==4.5.1.48
-deskew==0.10.3
+deskew==0.10.33

From 5e7cc1f54c73eaeaf1f9fc8192113272d5aba836 Mon Sep 17 00:00:00 2001
From: StanislasGuinel
Date: Mon, 19 Jun 2023 17:39:08 +0200
Subject: [PATCH 07/30] add easyocr + accept pdf

---
 code-env/python/spec/requirements.txt         |  1 +
 custom-recipes/image-conversion/recipe.json   |  5 +-
 .../image-processing-custom/recipe.json       |  5 +-
 .../ocr-text-extraction-dataset/recipe.json   | 32 +++++++++++--
 .../ocr-text-extraction-dataset/recipe.py     | 46 ++++++++++++++-----
 plugin.json                                   |  8 ++--
 python-lib/constants.py                       |  5 +-
 python-lib/tesseractocr/extract_text.py       | 18 ++++++--
 python-lib/utils.py                           | 15 +++---
 9 files changed, 100 insertions(+), 35 deletions(-)

diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt
index a065f6b..9471b37 100644
--- a/code-env/python/spec/requirements.txt
+++ b/code-env/python/spec/requirements.txt
@@ -4,3 +4,4 @@ Pillow==8.1.0
 matplotlib==3.3.4
 opencv-python==4.5.1.48
 deskew==0.10.3
+easyocr==1.7.0
\ No newline at end of file
diff --git a/custom-recipes/image-conversion/recipe.json b/custom-recipes/image-conversion/recipe.json
index ad5cb69..461c078 100644
--- a/custom-recipes/image-conversion/recipe.json
+++ b/custom-recipes/image-conversion/recipe.json
@@ -1,8 +1,9 @@
 {
     "meta": {
-        "label": "OCR - Image conversion",
+        "label": "Image conversion",
         "description": "Convert PDF/JPG/JPEG/PNG/TIFF files into greyscale JPG images.
Split multipage PDF into multiple images as well.", - "icon": "icon-picture" + "icon": "icon-picture", + "displayOrderRank": 2 }, "kind": "PYTHON", diff --git a/custom-recipes/image-processing-custom/recipe.json b/custom-recipes/image-processing-custom/recipe.json index c82eb55..bdadee9 100644 --- a/custom-recipes/image-processing-custom/recipe.json +++ b/custom-recipes/image-processing-custom/recipe.json @@ -1,8 +1,9 @@ { "meta": { - "label": "OCR - Image processing", + "label": "Image processing", "description": "For advanced users. Use the notebook template to find good image processing functions for your images. All images in the input folder are processed using the image processing functions specified by the user in the recipe parameter's form. Inputs must be greyscale JPG images.", - "icon": "icon-cogs" + "icon": "icon-cogs", + "displayOrderRank": 3 }, "kind": "PYTHON", diff --git a/custom-recipes/ocr-text-extraction-dataset/recipe.json b/custom-recipes/ocr-text-extraction-dataset/recipe.json index f2a193e..277e282 100644 --- a/custom-recipes/ocr-text-extraction-dataset/recipe.json +++ b/custom-recipes/ocr-text-extraction-dataset/recipe.json @@ -2,7 +2,8 @@ "meta": { "label": "OCR - Text extraction", "description": "Extract text from JPG images into a dataset of filename and text.", - "icon": "icon-file-text-alt" + "icon": "icon-file-text-alt", + "displayOrderRank": 1 }, "kind": "PYTHON", @@ -31,7 +32,6 @@ "acceptsDataset": true } ], - "params": [ { "name": "recombine_pdf", @@ -39,6 +39,23 @@ "type": "BOOLEAN", "description": "Text of images that are from the same original multiple-page PDF (images with name pattern _pdf_page_XXXXX.jpg) are concatenated." 
}, + { + "name": "ocr_engine", + "label": "OCR Engine", + "type": "SELECT", + "mandatory": true, + "defaultValue": "tesseract", + "selectChoices": [ + { + "value": "tesseract", + "label": "Tesseract" + }, + { + "value": "easyocr", + "label": "EasyOCR" + } + ] + }, { "name": "advanced_parameters", "label" : "Advanced preprocessing parameters", @@ -48,8 +65,15 @@ "name": "language", "label": "Specify language", "type": "STRING", - "description": "Enter language code found at https://tesseract-ocr.github.io/tessdoc/Data-Files. Languages must be installed beforehand.", - "visibilityCondition" : "model.advanced_parameters" + "description": "Enter language code found at https://tesseract-ocr.github.io/tessdoc/Data-Files. Languages must be installed beforehand", + "visibilityCondition" : "model.advanced_parameters && model.ocr_engine == 'tesseract'" + }, + { + "name": "language_easyocr", + "label": "Specify language", + "type": "STRING", + "description": "Enter language code found at https://www.jaided.ai/easyocr/.", + "visibilityCondition" : "model.advanced_parameters && model.ocr_engine == 'easyocr'" } ], diff --git a/custom-recipes/ocr-text-extraction-dataset/recipe.py b/custom-recipes/ocr-text-extraction-dataset/recipe.py index 2165f19..551d340 100644 --- a/custom-recipes/ocr-text-extraction-dataset/recipe.py +++ b/custom-recipes/ocr-text-extraction-dataset/recipe.py @@ -1,9 +1,16 @@ -from dataiku.customrecipe import get_recipe_config import logging -from utils import get_input_output, text_extraction_parameters -from tesseractocr.extract_text import text_extraction import pandas as pd +from pdf2image import convert_from_bytes +from time import perf_counter +import re + from constants import Constants +from dataiku.customrecipe import get_recipe_config +from tesseractocr.extract_text import text_extraction +from utils import convert_image_to_greyscale_bytes +from utils import get_input_output +from utils import text_extraction_parameters + logger = 
logging.getLogger(__name__) @@ -14,25 +21,42 @@ input_filenames = input_folder.list_paths_in_partition() total_images = len(input_filenames) -df = pd.DataFrame() +rows = [] for i, sample_file in enumerate(input_filenames): - if sample_file.split('.')[-1] != "jpg": - logger.info("OCR - Rejecting {} because it is not a JPG file.".format(sample_file)) + prefix = sample_file.split('.')[0] + suffix = sample_file.split('.')[-1] + + if suffix not in Constants.TYPES: + logger.info("OCR - Rejecting {} because it is not a {} file.".format(sample_file, '/'.join(Constants.TYPES))) logger.info("OCR - Rejected {}/{} images".format(i+1, total_images)) continue with input_folder.get_download_stream(sample_file) as stream: img_bytes = stream.read() - img_text = text_extraction(img_bytes, params) - logger.info("OCR - Extracted text from {}/{} images".format(i+1, total_images)) + start = perf_counter() + + if suffix == "pdf": + pdf_images = convert_from_bytes(img_bytes, fmt='jpg') + for j, img in enumerate(pdf_images): + img_bytes = convert_image_to_greyscale_bytes(img) + img_text = text_extraction(img_bytes, params) + + pdf_image_name = "{}{}{:05d}".format(prefix, Constants.PDF_MULTI_SUFFIX, j+1) + rows.append({'file': pdf_image_name, 'text': img_text}) + else: + img_text = text_extraction(img_bytes, params) + rows.append({'file': prefix, 'text': img_text}) + + logger.info("OCR - Extracted text from {}/{} files (in {:.2f} seconds)".format(i+1, total_images, perf_counter() - start)) - df = df.append({'file': sample_file.split('/')[-1].split('.')[0], 'text': img_text}, ignore_index=True) +df = pd.DataFrame(rows) if params['recombine_pdf']: - df['page_nb'] = df.apply(lambda row: int(row['file'].split(Constants.PDF_MULTI_SUFFIX)[1]) if Constants.PDF_MULTI_SUFFIX in row['file'] else 1, axis=1) - df['file'] = df.apply(lambda row: row['file'].split(Constants.PDF_MULTI_SUFFIX)[0] if Constants.PDF_MULTI_SUFFIX in row['file'] else row['file'], axis=1) + pdf_multi_page_pattern = 
"^.*{}\d{{5}}$".format(Constants.PDF_MULTI_SUFFIX) + df['page_nb'] = df.apply(lambda row: int(row['file'].split(Constants.PDF_MULTI_SUFFIX)[1]) if re.match(pdf_multi_page_pattern, row['file']) else 1, axis=1) + df['file'] = df.apply(lambda row: row['file'].split(Constants.PDF_MULTI_SUFFIX)[0] if re.match(pdf_multi_page_pattern, row['file']) else row['file'], axis=1) df = df.sort_values(['file', 'page_nb'], ascending=True) diff --git a/plugin.json b/plugin.json index e8e4fe8..6991fa8 100644 --- a/plugin.json +++ b/plugin.json @@ -1,10 +1,10 @@ { "id": "tesseract-ocr", - "version": "1.0.2", + "version": "2.0.0", "meta": { - "label": "Tesseract OCR", - "description": "Extract text from images using the Tesseract Optical Character Recognition (OCR) engine", - "author": "Dataiku (Stanislas GUINEL)", + "label": "OCR", + "description": "Extract text from images using OCR engines", + "author": "Dataiku", "icon": "icon-file-text-alt", "tags": [ "NLP", diff --git a/python-lib/constants.py b/python-lib/constants.py index 38ec1d1..2173a83 100644 --- a/python-lib/constants.py +++ b/python-lib/constants.py @@ -7,4 +7,7 @@ class Constants: QUALITY = "quality" RECOMBINE_PDF = "recombine_pdf" LANGUAGE = "language" - DEFAULT_LANGUAGE = "eng" + OCR_ENGINE = "ocr_engine" + EASYOCR = "easyocr" + TESSERACT = "tesseract" + EASYOCR_READER = "easyocr_reader" diff --git a/python-lib/tesseractocr/extract_text.py b/python-lib/tesseractocr/extract_text.py index 35f0b5c..e723f7b 100644 --- a/python-lib/tesseractocr/extract_text.py +++ b/python-lib/tesseractocr/extract_text.py @@ -19,10 +19,18 @@ def text_extraction(img_bytes, params): logger.info("OCR - converting image to greyscale.") img = img.convert('L') - try: - img = np.array(img) - img_text = pytesseract.image_to_string(img, lang=params[Constants.LANGUAGE]) - except Exception as e: - raise Exception("OCR - Error calling pytesseract: {}".format(e)) + if params[Constants.OCR_ENGINE] == Constants.TESSERACT: + try: + img = np.array(img) + 
img_text = pytesseract.image_to_string(img, lang=params[Constants.LANGUAGE]) + except Exception as e: + raise Exception("OCR - Error calling pytesseract: {}".format(e)) + elif params[Constants.OCR_ENGINE] == Constants.EASYOCR: + try: + img = np.array(img) + reader = params[Constants.EASYOCR_READER] + img_text = " ".join(reader.readtext(img_bytes, detail=0)) + except Exception as e: + raise Exception("OCR - Error calling easyocr: {}".format(e)) return img_text diff --git a/python-lib/utils.py b/python-lib/utils.py index 8532ebe..f6c06fd 100644 --- a/python-lib/utils.py +++ b/python-lib/utils.py @@ -22,7 +22,7 @@ def get_input_output(input_type='dataset', output_type='dataset'): return input_obj, output_obj -def convert_image_to_greyscale_bytes(img, quality): +def convert_image_to_greyscale_bytes(img, quality=75): """ convert a PIL image to greyscale with a specified dpi and output image as bytes """ img = img.convert('L') buf = BytesIO() @@ -56,10 +56,13 @@ def text_extraction_parameters(recipe_config): """ retrieve text extraction recipe parameters """ params = {} params[Constants.RECOMBINE_PDF] = recipe_config.get(Constants.RECOMBINE_PDF, False) - params['advanced'] = recipe_config.get('advanced_parameters', False) - if params['advanced']: - params[Constants.LANGUAGE] = recipe_config.get(Constants.LANGUAGE, Constants.DEFAULT_LANGUAGE) - else: - params[Constants.LANGUAGE] = Constants.DEFAULT_LANGUAGE + params[Constants.OCR_ENGINE] = recipe_config.get(Constants.OCR_ENGINE, Constants.TESSERACT) + advanced = recipe_config.get('advanced_parameters', False) + if params[Constants.OCR_ENGINE] == Constants.TESSERACT: + params[Constants.LANGUAGE] = recipe_config.get(Constants.LANGUAGE, "eng") if advanced else "eng" + elif params[Constants.OCR_ENGINE] == Constants.EASYOCR: + import easyocr + language = recipe_config.get(Constants.LANGUAGE, "en") if advanced else "en" + params[Constants.EASYOCR_READER] = easyocr.Reader([language]) return params From 
690e21ef2893af2e4457cbe2f586ce913f23c5a5 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Tue, 20 Jun 2023 17:34:03 +0200 Subject: [PATCH 08/30] add a default engine --- custom-recipes/image-conversion/recipe.py | 4 +-- .../image-processing-custom/recipe.py | 4 +-- .../ocr-text-extraction-dataset/recipe.json | 19 ++++------- .../ocr-text-extraction-dataset/recipe.py | 8 ++--- python-lib/{constants.py => ocr_constants.py} | 3 +- python-lib/{utils.py => ocr_utils.py} | 19 +++++++++-- python-lib/tesseractocr/extract_text.py | 6 ++-- resource/select_ocr_engine.py | 32 +++++++++++++++++++ 8 files changed, 69 insertions(+), 26 deletions(-) rename python-lib/{constants.py => ocr_constants.py} (92%) rename python-lib/{utils.py => ocr_utils.py} (81%) create mode 100644 resource/select_ocr_engine.py diff --git a/custom-recipes/image-conversion/recipe.py b/custom-recipes/image-conversion/recipe.py index ab1cb7f..10448e4 100644 --- a/custom-recipes/image-conversion/recipe.py +++ b/custom-recipes/image-conversion/recipe.py @@ -3,8 +3,8 @@ from PIL import Image from io import BytesIO import logging -from utils import get_input_output, convert_image_to_greyscale_bytes, image_conversion_parameters -from constants import Constants +from ocr_utils import get_input_output, convert_image_to_greyscale_bytes, image_conversion_parameters +from ocr_constants import Constants logger = logging.getLogger(__name__) diff --git a/custom-recipes/image-processing-custom/recipe.py b/custom-recipes/image-processing-custom/recipe.py index d00bba7..347a0ac 100644 --- a/custom-recipes/image-processing-custom/recipe.py +++ b/custom-recipes/image-processing-custom/recipe.py @@ -3,8 +3,8 @@ from io import BytesIO import numpy as np import logging -from utils import get_input_output, image_processing_parameters -from constants import Constants +from ocr_utils import get_input_output, image_processing_parameters +from ocr_constants import Constants logger = logging.getLogger(__name__) diff --git 
a/custom-recipes/ocr-text-extraction-dataset/recipe.json b/custom-recipes/ocr-text-extraction-dataset/recipe.json index 277e282..8081a5d 100644 --- a/custom-recipes/ocr-text-extraction-dataset/recipe.json +++ b/custom-recipes/ocr-text-extraction-dataset/recipe.json @@ -1,7 +1,7 @@ { "meta": { "label": "OCR - Text extraction", - "description": "Extract text from JPG images into a dataset of filename and text.", + "description": "Extract text from PDF/JPG/JPEG/PNG/TIFF files into a dataset of filename and text.", "icon": "icon-file-text-alt", "displayOrderRank": 1 }, @@ -32,29 +32,22 @@ "acceptsDataset": true } ], + "paramsPythonSetup": "select_ocr_engine.py", "params": [ { "name": "recombine_pdf", "label" : "Recombine multiple-page PDF together", "type": "BOOLEAN", - "description": "Text of images that are from the same original multiple-page PDF (images with name pattern _pdf_page_XXXXX.jpg) are concatenated." + "description": "Multiple-page PDFs and images with name pattern $FILENAME_pdf_page_XXXXX.jpg are extracted into a single text." 
}, { "name": "ocr_engine", "label": "OCR Engine", "type": "SELECT", "mandatory": true, - "defaultValue": "tesseract", - "selectChoices": [ - { - "value": "tesseract", - "label": "Tesseract" - }, - { - "value": "easyocr", - "label": "EasyOCR" - } - ] + "description": "", + "defaultValue": "default", + "getChoicesFromPython": true }, { "name": "advanced_parameters", diff --git a/custom-recipes/ocr-text-extraction-dataset/recipe.py b/custom-recipes/ocr-text-extraction-dataset/recipe.py index 551d340..4e620b2 100644 --- a/custom-recipes/ocr-text-extraction-dataset/recipe.py +++ b/custom-recipes/ocr-text-extraction-dataset/recipe.py @@ -4,12 +4,12 @@ from time import perf_counter import re -from constants import Constants +from ocr_constants import Constants from dataiku.customrecipe import get_recipe_config from tesseractocr.extract_text import text_extraction -from utils import convert_image_to_greyscale_bytes -from utils import get_input_output -from utils import text_extraction_parameters +from ocr_utils import convert_image_to_greyscale_bytes +from ocr_utils import get_input_output +from ocr_utils import text_extraction_parameters logger = logging.getLogger(__name__) diff --git a/python-lib/constants.py b/python-lib/ocr_constants.py similarity index 92% rename from python-lib/constants.py rename to python-lib/ocr_constants.py index 2173a83..2aa8d48 100644 --- a/python-lib/constants.py +++ b/python-lib/ocr_constants.py @@ -8,6 +8,7 @@ class Constants: RECOMBINE_PDF = "recombine_pdf" LANGUAGE = "language" OCR_ENGINE = "ocr_engine" - EASYOCR = "easyocr" + DEFAULT_ENGINE = "default" TESSERACT = "tesseract" + EASYOCR = "easyocr" EASYOCR_READER = "easyocr_reader" diff --git a/python-lib/utils.py b/python-lib/ocr_utils.py similarity index 81% rename from python-lib/utils.py rename to python-lib/ocr_utils.py index f6c06fd..7fd3033 100644 --- a/python-lib/utils.py +++ b/python-lib/ocr_utils.py @@ -1,7 +1,8 @@ import dataiku from dataiku.customrecipe import 
get_input_names_for_role, get_output_names_for_role from io import BytesIO -from constants import Constants +from ocr_constants import Constants +from shutil import which def get_input_output(input_type='dataset', output_type='dataset'): @@ -56,7 +57,7 @@ def text_extraction_parameters(recipe_config): """ retrieve text extraction recipe parameters """ params = {} params[Constants.RECOMBINE_PDF] = recipe_config.get(Constants.RECOMBINE_PDF, False) - params[Constants.OCR_ENGINE] = recipe_config.get(Constants.OCR_ENGINE, Constants.TESSERACT) + params[Constants.OCR_ENGINE] = _get_ocr_engine(recipe_config) advanced = recipe_config.get('advanced_parameters', False) if params[Constants.OCR_ENGINE] == Constants.TESSERACT: params[Constants.LANGUAGE] = recipe_config.get(Constants.LANGUAGE, "eng") if advanced else "eng" @@ -66,3 +67,17 @@ def text_extraction_parameters(recipe_config): params[Constants.EASYOCR_READER] = easyocr.Reader([language]) return params + + +def _get_ocr_engine(recipe_config): + selected_ocr_engine = recipe_config.get(Constants.OCR_ENGINE, Constants.DEFAULT_ENGINE) + if selected_ocr_engine == Constants.DEFAULT_ENGINE: + return get_default_ocr_engine() + else: + return selected_ocr_engine + + +def get_default_ocr_engine(): + if which("tesseract") is not None: # check if tesseract is in the path + return Constants.TESSERACT + return Constants.EASYOCR diff --git a/python-lib/tesseractocr/extract_text.py b/python-lib/tesseractocr/extract_text.py index e723f7b..b4d961e 100644 --- a/python-lib/tesseractocr/extract_text.py +++ b/python-lib/tesseractocr/extract_text.py @@ -3,14 +3,14 @@ import numpy as np import pytesseract import logging -from constants import Constants +from ocr_constants import Constants logger = logging.getLogger(__name__) def text_extraction(img_bytes, params): """ - extract text from bytes images using pytesseract (with specified language) + extract text from bytes images using the selected OCR engine (with specified language) """ img = 
Image.open(BytesIO(img_bytes)) @@ -32,5 +32,7 @@ def text_extraction(img_bytes, params): img_text = " ".join(reader.readtext(img_bytes, detail=0)) except Exception as e: raise Exception("OCR - Error calling easyocr: {}".format(e)) + else: + raise NotImplementedError("OCR engine {} not implemented".format(params[Constants.OCR_ENGINE])) return img_text diff --git a/resource/select_ocr_engine.py b/resource/select_ocr_engine.py new file mode 100644 index 0000000..58ee115 --- /dev/null +++ b/resource/select_ocr_engine.py @@ -0,0 +1,32 @@ +from ocr_utils import get_default_ocr_engine +from ocr_constants import Constants + + +OCR_ENGINES = { + Constants.TESSERACT: "Tesseract", + Constants.EASYOCR: "EasyOCR" +} + + +def do(payload, config, plugin_config, inputs): + """ + Retrieve a list of OCR engines including a default engine that points to an available engine. + """ + choices = [] + if payload.get("parameterName") == Constants.OCR_ENGINE: + default_ocr_engine = get_default_ocr_engine() + choices.append({ + "label": "Default ({})".format(OCR_ENGINES[default_ocr_engine]), + "value": "default" + }) + + if default_ocr_engine != Constants.TESSERACT: + OCR_ENGINES[Constants.TESSERACT] += " (not installed)" + + for engine_value, engine_label in OCR_ENGINES.items(): + choices.append({ + "label": engine_label, + "value": engine_value + }) + + return {"choices": choices} From 922846f6031caeceb3b26f860e09b166734d91b3 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Tue, 20 Jun 2023 17:36:02 +0200 Subject: [PATCH 09/30] update version --- plugin.json | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/plugin.json b/plugin.json index 6991fa8..572275d 100644 --- a/plugin.json +++ b/plugin.json @@ -1,6 +1,6 @@ { "id": "tesseract-ocr", - "version": "2.0.0", + "version": "1.1.0", "meta": { "label": "OCR", "description": "Extract text from images using OCR engines", From dee7b0c69c79fd5feed16a5419d1606a12bbfaa2 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Tue, 
20 Jun 2023 17:41:31 +0200 Subject: [PATCH 10/30] edit changelog --- CHANGELOG.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index b8e7605..8a7d570 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,11 @@ # Changelog +## [Version 1.1.0](https://github.com/dataiku/dss-plugin-tesseract-ocr/releases/tag/v1.1.0) - Feature release - 2023-06 + +- Add EasyOCR +- Check if tesseract is installed before running the recipe +- Support of PDFs in the text extraction recipe + ## [Version 1.0.2](https://github.com/dataiku/dss-plugin-tesseract-ocr/releases/tag/v1.0.2) - Initial release - 2021-03 - Fix an error in python 37 in the text extraction recipe From 68421c88ab9ffac1bbff9c75fe42d368a5ac1c20 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Wed, 21 Jun 2023 15:57:14 +0200 Subject: [PATCH 11/30] update version and changelog --- CHANGELOG.md | 3 ++- plugin.json | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 8a7d570..b69a072 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,10 +1,11 @@ # Changelog -## [Version 1.1.0](https://github.com/dataiku/dss-plugin-tesseract-ocr/releases/tag/v1.1.0) - Feature release - 2023-06 +## [Version 2.0.0](https://github.com/dataiku/dss-plugin-tesseract-ocr/releases/tag/v2.0.0) - Feature release - 2023-06 - Add EasyOCR - Check if tesseract is installed before running the recipe - Support of PDFs in the text extraction recipe +- Remove "Tesseract" from the plugin name ## [Version 1.0.2](https://github.com/dataiku/dss-plugin-tesseract-ocr/releases/tag/v1.0.2) - Initial release - 2021-03 diff --git a/plugin.json b/plugin.json index 572275d..6991fa8 100644 --- a/plugin.json +++ b/plugin.json @@ -1,6 +1,6 @@ { "id": "tesseract-ocr", - "version": "1.1.0", + "version": "2.0.0", "meta": { "label": "OCR", "description": "Extract text from images using OCR engines", From b2df11390f99e03c46c62945075a28caca42d070 Mon Sep 17 00:00:00 2001 From: 
StanislasGuinel Date: Wed, 21 Jun 2023 16:07:51 +0200 Subject: [PATCH 12/30] wording and explicit language constants --- .../ocr-text-extraction-dataset/recipe.json | 2 +- custom-recipes/ocr-text-extraction-dataset/recipe.py | 12 ++++++------ python-lib/ocr_constants.py | 3 ++- python-lib/ocr_utils.py | 4 ++-- 4 files changed, 11 insertions(+), 10 deletions(-) diff --git a/custom-recipes/ocr-text-extraction-dataset/recipe.json b/custom-recipes/ocr-text-extraction-dataset/recipe.json index 8081a5d..f29b822 100644 --- a/custom-recipes/ocr-text-extraction-dataset/recipe.json +++ b/custom-recipes/ocr-text-extraction-dataset/recipe.json @@ -38,7 +38,7 @@ "name": "recombine_pdf", "label" : "Recombine multiple-page PDF together", "type": "BOOLEAN", - "description": "Multiple-page PDFs and images with name pattern $FILENAME_pdf_page_XXXXX.jpg are extracted into a single text." + "description": "Multiple-page PDFs and images with name pattern $FILENAME_pdf_page_XXXXX.jpg are extracted into a single row." 
}, { "name": "ocr_engine", diff --git a/custom-recipes/ocr-text-extraction-dataset/recipe.py b/custom-recipes/ocr-text-extraction-dataset/recipe.py index 4e620b2..7b52c32 100644 --- a/custom-recipes/ocr-text-extraction-dataset/recipe.py +++ b/custom-recipes/ocr-text-extraction-dataset/recipe.py @@ -1,15 +1,15 @@ import logging import pandas as pd from pdf2image import convert_from_bytes -from time import perf_counter import re +from time import perf_counter -from ocr_constants import Constants from dataiku.customrecipe import get_recipe_config -from tesseractocr.extract_text import text_extraction +from ocr_constants import Constants from ocr_utils import convert_image_to_greyscale_bytes from ocr_utils import get_input_output from ocr_utils import text_extraction_parameters +from tesseractocr.extract_text import text_extraction logger = logging.getLogger(__name__) @@ -19,7 +19,7 @@ params = text_extraction_parameters(get_recipe_config()) input_filenames = input_folder.list_paths_in_partition() -total_images = len(input_filenames) +total_files = len(input_filenames) rows = [] @@ -29,7 +29,7 @@ if suffix not in Constants.TYPES: logger.info("OCR - Rejecting {} because it is not a {} file.".format(sample_file, '/'.join(Constants.TYPES))) - logger.info("OCR - Rejected {}/{} images".format(i+1, total_images)) + logger.info("OCR - Rejected {}/{} files".format(i+1, total_files)) continue with input_folder.get_download_stream(sample_file) as stream: @@ -49,7 +49,7 @@ img_text = text_extraction(img_bytes, params) rows.append({'file': prefix, 'text': img_text}) - logger.info("OCR - Extracted text from {}/{} files (in {:.2f} seconds)".format(i+1, total_images, perf_counter() - start)) + logger.info("OCR - Extracted text from {}/{} files (in {:.2f} seconds)".format(i+1, total_files, perf_counter() - start)) df = pd.DataFrame(rows) diff --git a/python-lib/ocr_constants.py b/python-lib/ocr_constants.py index 2aa8d48..a4cb3f3 100644 --- a/python-lib/ocr_constants.py +++ 
b/python-lib/ocr_constants.py @@ -6,7 +6,8 @@ class Constants: DPI = "dpi" QUALITY = "quality" RECOMBINE_PDF = "recombine_pdf" - LANGUAGE = "language" + LANGUAGE_TESSERACT = "language" + LANGUAGE_EASYOCR = "language_easyocr" OCR_ENGINE = "ocr_engine" DEFAULT_ENGINE = "default" TESSERACT = "tesseract" diff --git a/python-lib/ocr_utils.py b/python-lib/ocr_utils.py index 7fd3033..4fb78dc 100644 --- a/python-lib/ocr_utils.py +++ b/python-lib/ocr_utils.py @@ -60,10 +60,10 @@ def text_extraction_parameters(recipe_config): params[Constants.OCR_ENGINE] = _get_ocr_engine(recipe_config) advanced = recipe_config.get('advanced_parameters', False) if params[Constants.OCR_ENGINE] == Constants.TESSERACT: - params[Constants.LANGUAGE] = recipe_config.get(Constants.LANGUAGE, "eng") if advanced else "eng" + params[Constants.LANGUAGE_TESSERACT] = recipe_config.get(Constants.LANGUAGE_TESSERACT, "eng") if advanced else "eng" elif params[Constants.OCR_ENGINE] == Constants.EASYOCR: import easyocr - language = recipe_config.get(Constants.LANGUAGE, "en") if advanced else "en" + language = recipe_config.get(Constants.LANGUAGE_EASYOCR, "en") if advanced else "en" params[Constants.EASYOCR_READER] = easyocr.Reader([language]) return params From 10523f45e02022238728220bf5b92237d7e1df90 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Wed, 21 Jun 2023 17:15:11 +0200 Subject: [PATCH 13/30] add comment --- python-lib/ocr_utils.py | 1 + 1 file changed, 1 insertion(+) diff --git a/python-lib/ocr_utils.py b/python-lib/ocr_utils.py index 4fb78dc..f0062d9 100644 --- a/python-lib/ocr_utils.py +++ b/python-lib/ocr_utils.py @@ -64,6 +64,7 @@ def text_extraction_parameters(recipe_config): elif params[Constants.OCR_ENGINE] == Constants.EASYOCR: import easyocr language = recipe_config.get(Constants.LANGUAGE_EASYOCR, "en") if advanced else "en" + # instantiate the easyocr.Reader only once here because it takes some time params[Constants.EASYOCR_READER] = easyocr.Reader([language]) return params From 
42533f42de9df7b2558899884649958f55860cbd Mon Sep 17 00:00:00 2001
From: StanislasGuinel
Date: Wed, 21 Jun 2023 18:08:09 +0200
Subject: [PATCH 14/30] downgrade pytesseract

---
 code-env/python/spec/requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt
index f8836c6..49958af 100644
--- a/code-env/python/spec/requirements.txt
+++ b/code-env/python/spec/requirements.txt
@@ -1,5 +1,5 @@
 pdf2image==1.16.0
-pytesseract==0.3.8
+pytesseract==0.3.7
 Pillow==8.2.0
 matplotlib==3.3.4; python_version <= '3.9'
 matplotlib==3.7.1; python_version >= '3.10'

From 1afd95e6844696a0246efd22d941bdec4c6b4430 Mon Sep 17 00:00:00 2001
From: StanislasGuinel
Date: Fri, 23 Jun 2023 17:02:38 +0200
Subject: [PATCH 15/30] small fix

---
 python-lib/tesseractocr/extract_text.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python-lib/tesseractocr/extract_text.py b/python-lib/tesseractocr/extract_text.py
index b4d961e..3a3e9f0 100644
--- a/python-lib/tesseractocr/extract_text.py
+++ b/python-lib/tesseractocr/extract_text.py
@@ -22,7 +22,7 @@ def text_extraction(img_bytes, params):
     if params[Constants.OCR_ENGINE] == Constants.TESSERACT:
         try:
             img = np.array(img)
-            img_text = pytesseract.image_to_string(img, lang=params[Constants.LANGUAGE])
+            img_text = pytesseract.image_to_string(img, lang=params[Constants.LANGUAGE_TESSERACT])
         except Exception as e:
             raise Exception("OCR - Error calling pytesseract: {}".format(e))
     elif params[Constants.OCR_ENGINE] == Constants.EASYOCR:

From 98b59551a09e69d2047c02b26ce482f4a6bb90fd Mon Sep 17 00:00:00 2001
From: StanislasGuinel
Date: Mon, 26 Jun 2023 14:09:13 +0200
Subject: [PATCH 16/30] use same torch version as DSS

---
 code-env/python/spec/requirements.txt | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt
index 49958af..cddbc38 100644
--- 
a/code-env/python/spec/requirements.txt +++ b/code-env/python/spec/requirements.txt @@ -6,4 +6,6 @@ matplotlib==3.7.1; python_version >= '3.10' opencv-python==4.5.1.48; python_version <= '3.9' opencv-python==4.7.0.72; python_version >= '3.10' deskew==0.10.33 +torch==1.11.0; python_version >= '3.10' +torch==1.9.1; python_version <= '3.9' easyocr==1.7.0 \ No newline at end of file From 2f2a9dac43212ebbbded3fd80647c9aa46a3bcca Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Mon, 26 Jun 2023 18:36:31 +0200 Subject: [PATCH 17/30] use pypdfium2 for pdf --- code-env/python/spec/requirements.txt | 2 +- custom-recipes/image-conversion/recipe.py | 10 +++++----- custom-recipes/ocr-text-extraction-dataset/recipe.py | 5 ++--- 3 files changed, 8 insertions(+), 9 deletions(-) diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt index cddbc38..4781905 100644 --- a/code-env/python/spec/requirements.txt +++ b/code-env/python/spec/requirements.txt @@ -1,4 +1,4 @@ -pdf2image==1.16.0 +pypdfium2==4.16.0 pytesseract==0.3.7 Pillow==8.2.0 matplotlib==3.3.4; python_version <= '3.9' diff --git a/custom-recipes/image-conversion/recipe.py b/custom-recipes/image-conversion/recipe.py index 10448e4..7f47122 100644 --- a/custom-recipes/image-conversion/recipe.py +++ b/custom-recipes/image-conversion/recipe.py @@ -1,9 +1,11 @@ from dataiku.customrecipe import get_recipe_config -from pdf2image import convert_from_bytes from PIL import Image from io import BytesIO import logging -from ocr_utils import get_input_output, convert_image_to_greyscale_bytes, image_conversion_parameters +from ocr_utils import get_input_output +from ocr_utils import convert_image_to_greyscale_bytes +from ocr_utils import image_conversion_parameters +from ocr_utils import pdf_to_pil_images_iterator from ocr_constants import Constants logger = logging.getLogger(__name__) @@ -25,9 +27,7 @@ img_bytes = stream.read() if suffix == "pdf": - pdf_images = convert_from_bytes(img_bytes, 
fmt='jpg', dpi=params[Constants.DPI]) - - for j, img in enumerate(pdf_images): + for j, img in enumerate(pdf_to_pil_images_iterator(img_bytes)): img_bytes = convert_image_to_greyscale_bytes(img, quality=params[Constants.QUALITY]) output_folder.upload_data("{0}/{0}{1}{2:05d}.jpg".format(prefix, Constants.PDF_MULTI_SUFFIX, j+1), img_bytes) diff --git a/custom-recipes/ocr-text-extraction-dataset/recipe.py b/custom-recipes/ocr-text-extraction-dataset/recipe.py index 7b52c32..fcf7bf7 100644 --- a/custom-recipes/ocr-text-extraction-dataset/recipe.py +++ b/custom-recipes/ocr-text-extraction-dataset/recipe.py @@ -1,6 +1,5 @@ import logging import pandas as pd -from pdf2image import convert_from_bytes import re from time import perf_counter @@ -8,6 +7,7 @@ from ocr_constants import Constants from ocr_utils import convert_image_to_greyscale_bytes from ocr_utils import get_input_output +from ocr_utils import pdf_to_pil_images_iterator from ocr_utils import text_extraction_parameters from tesseractocr.extract_text import text_extraction @@ -38,8 +38,7 @@ start = perf_counter() if suffix == "pdf": - pdf_images = convert_from_bytes(img_bytes, fmt='jpg') - for j, img in enumerate(pdf_images): + for j, img in enumerate(pdf_to_pil_images_iterator(img_bytes)): img_bytes = convert_image_to_greyscale_bytes(img) img_text = text_extraction(img_bytes, params) From 5399012d6f0c12c053edc49b5a542a9d15f88a3c Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Tue, 27 Jun 2023 10:26:32 +0200 Subject: [PATCH 18/30] missing method --- python-lib/ocr_utils.py | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/python-lib/ocr_utils.py b/python-lib/ocr_utils.py index f0062d9..9bb8604 100644 --- a/python-lib/ocr_utils.py +++ b/python-lib/ocr_utils.py @@ -2,6 +2,7 @@ from dataiku.customrecipe import get_input_names_for_role, get_output_names_for_role from io import BytesIO from ocr_constants import Constants +import pypdfium2 as pdfium from shutil import which @@ -23,6 +24,15 @@ def 
get_input_output(input_type='dataset', output_type='dataset'): return input_obj, output_obj +def pdf_to_pil_images_iterator(pdf_bytes, dpi=None): + """ iterator over the multiple images of pdf bytes """ + pdf_pages = pdfium.PdfDocument("minimal-document.pdf") + # scale is DPI / 72 according to pypdfium2 doc + scale = dpi / 72 if dpi else 2 + for pdf_page in pdf_pages: + yield pdf_page.render(scale=scale).to_pil() + + def convert_image_to_greyscale_bytes(img, quality=75): """ convert a PIL image to greyscale with a specified dpi and output image as bytes """ img = img.convert('L') From feab0017606fdd9081e3b00c34cd4d9be856c416 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Tue, 27 Jun 2023 10:27:57 +0200 Subject: [PATCH 19/30] typo --- python-lib/ocr_utils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python-lib/ocr_utils.py b/python-lib/ocr_utils.py index 9bb8604..7243aad 100644 --- a/python-lib/ocr_utils.py +++ b/python-lib/ocr_utils.py @@ -26,7 +26,7 @@ def get_input_output(input_type='dataset', output_type='dataset'): def pdf_to_pil_images_iterator(pdf_bytes, dpi=None): """ iterator over the multiple images of pdf bytes """ - pdf_pages = pdfium.PdfDocument("minimal-document.pdf") + pdf_pages = pdfium.PdfDocument(pdf_bytes) # scale is DPI / 72 according to pypdfium2 doc scale = dpi / 72 if dpi else 2 for pdf_page in pdf_pages: From 320296c7be9d688c07b1d8f2edeb22899ab81921 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Tue, 27 Jun 2023 11:23:17 +0200 Subject: [PATCH 20/30] update changelog --- CHANGELOG.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 04c153f..5f79351 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,9 +3,10 @@ ## [Version 2.0.0](https://github.com/dataiku/dss-plugin-tesseract-ocr/releases/tag/v2.0.0) - Feature release - 2023-06 - Add EasyOCR -- Check if tesseract is installed before running the recipe +- Add an option for a default OCR engine that fallbacks 
to EasyOCR if tesseract is not installed on the system - Support of PDFs in the text extraction recipe - Remove "Tesseract" from the plugin name +- Use pypdfium2 instead of pdf2images to not depend on any system packages ## [Version 1.0.3](https://github.com/dataiku/dss-plugin-tesseract-ocr/releases/tag/v1.0.3) - Update release - 2023-04 From 3cdea8ee3a3582009c6f82ba4216d2ee44188bed Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Tue, 27 Jun 2023 11:32:51 +0200 Subject: [PATCH 21/30] use up-to-date pypdfium2 version --- code-env/python/spec/requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt index 4781905..65a7a49 100644 --- a/code-env/python/spec/requirements.txt +++ b/code-env/python/spec/requirements.txt @@ -1,4 +1,4 @@ -pypdfium2==4.16.0 +pypdfium2==4.17.0 pytesseract==0.3.7 Pillow==8.2.0 matplotlib==3.3.4; python_version <= '3.9' From 9f248ec4e82d36980526be8aed2bbd50e548f5b6 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Tue, 27 Jun 2023 12:37:24 +0200 Subject: [PATCH 22/30] no advanced param for default engine --- .../ocr-text-extraction-dataset/recipe.json | 3 ++- python-lib/ocr_utils.py | 17 ++++++++--------- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/custom-recipes/ocr-text-extraction-dataset/recipe.json b/custom-recipes/ocr-text-extraction-dataset/recipe.json index f29b822..f5c280a 100644 --- a/custom-recipes/ocr-text-extraction-dataset/recipe.json +++ b/custom-recipes/ocr-text-extraction-dataset/recipe.json @@ -52,7 +52,8 @@ { "name": "advanced_parameters", "label" : "Advanced preprocessing parameters", - "type": "BOOLEAN" + "type": "BOOLEAN", + "visibilityCondition" : "model.ocr_engine != 'default'" }, { "name": "language", diff --git a/python-lib/ocr_utils.py b/python-lib/ocr_utils.py index 7243aad..c12d1fe 100644 --- a/python-lib/ocr_utils.py +++ b/python-lib/ocr_utils.py @@ -67,8 +67,15 @@ def 
text_extraction_parameters(recipe_config): """ retrieve text extraction recipe parameters """ params = {} params[Constants.RECOMBINE_PDF] = recipe_config.get(Constants.RECOMBINE_PDF, False) - params[Constants.OCR_ENGINE] = _get_ocr_engine(recipe_config) + selected_ocr_engine = recipe_config.get(Constants.OCR_ENGINE, Constants.DEFAULT_ENGINE) advanced = recipe_config.get('advanced_parameters', False) + + if selected_ocr_engine == Constants.DEFAULT_ENGINE: + advanced = False + selected_ocr_engine = get_default_ocr_engine() + + params[Constants.OCR_ENGINE] = selected_ocr_engine + if params[Constants.OCR_ENGINE] == Constants.TESSERACT: params[Constants.LANGUAGE_TESSERACT] = recipe_config.get(Constants.LANGUAGE_TESSERACT, "eng") if advanced else "eng" elif params[Constants.OCR_ENGINE] == Constants.EASYOCR: @@ -80,14 +87,6 @@ def text_extraction_parameters(recipe_config): return params -def _get_ocr_engine(recipe_config): - selected_ocr_engine = recipe_config.get(Constants.OCR_ENGINE, Constants.DEFAULT_ENGINE) - if selected_ocr_engine == Constants.DEFAULT_ENGINE: - return get_default_ocr_engine() - else: - return selected_ocr_engine - - def get_default_ocr_engine(): if which("tesseract") is not None: # check if tesseract is in the path return Constants.TESSERACT From c5b06913841c9af1f69de4776869327fb3fa9a76 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Tue, 27 Jun 2023 13:04:20 +0200 Subject: [PATCH 23/30] remove python 3.11 --- code-env/python/desc.json | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/code-env/python/desc.json b/code-env/python/desc.json index 33fb722..100f24e 100644 --- a/code-env/python/desc.json +++ b/code-env/python/desc.json @@ -4,8 +4,7 @@ "PYTHON37", "PYTHON38", "PYTHON39", - "PYTHON310", - "PYTHON311" + "PYTHON310" ], "corePackagesSet": "AUTO", "forceConda": false, From b9c3744eef8928ecf2b5a4db5ab0311f00b8ab48 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Tue, 27 Jun 2023 13:51:52 +0200 Subject: [PATCH 24/30] use 
custom model path for UIF --- python-lib/ocr_utils.py | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/python-lib/ocr_utils.py b/python-lib/ocr_utils.py index c12d1fe..91bb424 100644 --- a/python-lib/ocr_utils.py +++ b/python-lib/ocr_utils.py @@ -2,6 +2,7 @@ from dataiku.customrecipe import get_input_names_for_role, get_output_names_for_role from io import BytesIO from ocr_constants import Constants +import os import pypdfium2 as pdfium from shutil import which @@ -82,7 +83,15 @@ def text_extraction_parameters(recipe_config): import easyocr language = recipe_config.get(Constants.LANGUAGE_EASYOCR, "en") if advanced else "en" # instantiate the easyocr.Reader only once here because it takes some time - params[Constants.EASYOCR_READER] = easyocr.Reader([language]) + # use tmp folders inside the job temporary folder to store the model and the custom network model (note that this one isn't used) + model_storage_directory = os.path.join(os.getcwd(), "easyocr_model_tmp") + user_network_directory = os.path.join(os.getcwd(), "easyocr_user_network_tmp") + params[Constants.EASYOCR_READER] = easyocr.Reader( + lang_list=[language], gpu=False, + model_storage_directory=model_storage_directory, + user_network_directory=user_network_directory, + verbose=False + ) return params From ab877fc86a85af5a1e2b86590c033bba4c393941 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Wed, 28 Jun 2023 11:13:52 +0200 Subject: [PATCH 25/30] add integration tests --- tests/python/integration/pytest.ini | 2 ++ tests/python/integration/requirements.txt | 4 ++++ tests/python/integration/test_scenario.py | 13 +++++++++++++ 3 files changed, 19 insertions(+) create mode 100644 tests/python/integration/pytest.ini create mode 100644 tests/python/integration/requirements.txt create mode 100644 tests/python/integration/test_scenario.py diff --git a/tests/python/integration/pytest.ini b/tests/python/integration/pytest.ini new file mode 100644 index 0000000..f45b532 --- 
/dev/null +++ b/tests/python/integration/pytest.ini @@ -0,0 +1,2 @@ +[pytest] +usefixtures = plugin dss_target diff --git a/tests/python/integration/requirements.txt b/tests/python/integration/requirements.txt new file mode 100644 index 0000000..9c9c9f7 --- /dev/null +++ b/tests/python/integration/requirements.txt @@ -0,0 +1,4 @@ +pandas>=1.0,<1.1 +pytest==6.2.1 +dataiku-api-client +git+git://github.com/dataiku/dataiku-plugin-tests-utils.git@master#egg=dataiku-plugin-tests-utils \ No newline at end of file diff --git a/tests/python/integration/test_scenario.py b/tests/python/integration/test_scenario.py new file mode 100644 index 0000000..8fdd276 --- /dev/null +++ b/tests/python/integration/test_scenario.py @@ -0,0 +1,13 @@ +# -*- coding: utf-8 -*- +from dku_plugin_test_utils import dss_scenario + + +TEST_PROJECT_KEY = "TESTOCRPLUGIN" + + +def test_run_image_processing(user_dss_clients): + dss_scenario.run(user_dss_clients, project_key=TEST_PROJECT_KEY, scenario_id="IMAGE_PROCESSING") + + +def test_run_text_extraction(user_dss_clients): + dss_scenario.run(user_dss_clients, project_key=TEST_PROJECT_KEY, scenario_id="TEXT_EXTRACTION") From 0b22d84918c46a2e18b6248f504fcb25e718259d Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Wed, 28 Jun 2023 11:27:40 +0200 Subject: [PATCH 26/30] add jenkinsfile --- Jenkinsfile | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+) create mode 100644 Jenkinsfile diff --git a/Jenkinsfile b/Jenkinsfile new file mode 100644 index 0000000..325695d --- /dev/null +++ b/Jenkinsfile @@ -0,0 +1,54 @@ +pipeline { + options { + disableConcurrentBuilds() + } + agent { label 'dss-plugin-tests'} + environment { + PLUGIN_INTEGRATION_TEST_INSTANCE="$HOME/instance_config.json" + UNIT_TEST_FILES_STATUS_CODE = sh(script: 'ls ./tests/*/unit/test*', returnStatus: true) + INTEGRATION_TEST_FILES_STATUS_CODE = sh(script: 'ls ./tests/*/integration/test*', returnStatus: true) + } + stages { + stage('Run Unit Tests') 
{ + when { environment name: 'UNIT_TEST_FILES_STATUS_CODE', value: "0"} + steps { + sh 'echo "Running unit tests"' + catchError(stageResult: 'FAILURE') { + sh """ + make unit-tests + """ + } + sh 'echo "Done with unit tests"' + } + } + stage('Run Integration Tests') { + when { environment name: 'INTEGRATION_TEST_FILES_STATUS_CODE', value: "0"} + steps { + sh 'echo "Running integration tests"' + catchError(stageResult: 'FAILURE') { + sh """ + make integration-tests + """ + } + sh 'echo "Done with integration tests"' + } + } + } + post { + always { + script { + allure([ + includeProperties: false, + jdk: '', + properties: [], + reportBuildPolicy: 'ALWAYS', + results: [[path: 'tests/allure_report']] + ]) + + def status = currentBuild.currentResult + sh "file_name=\$(echo ${env.JOB_NAME} | tr '/' '-').status; touch \$file_name; echo \"${env.BUILD_URL};${env.CHANGE_TITLE};${env.CHANGE_AUTHOR};${env.CHANGE_URL};${env.BRANCH_NAME};${status};\" >> $HOME/daily-statuses/\$file_name" + cleanWs() + } + } + } +} \ No newline at end of file From 737c2ffaec856fb8f58442c9e5cdef46dbbffc47 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Wed, 28 Jun 2023 11:47:18 +0200 Subject: [PATCH 27/30] make integration tests --- Makefile | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/Makefile b/Makefile index fed3978..1cc23c0 100644 --- a/Makefile +++ b/Makefile @@ -18,13 +18,18 @@ plugin: @echo "[SUCCESS] Archiving plugin to dist/ folder: Done!" unit-tests: - @echo "[START] Running unit tests..." - @echo "[SUCCESS] Running unit tests: Done!" + @echo "No unit tests" integration-tests: - @echo "[START] Running integration tests..." - # TODO add integration tests - @echo "[SUCCESS] Running integration tests: Done!" + @echo "Running integration tests..." 
+ @( \ + rm -rf ./env/; \ + python3 -m venv env/; \ + source env/bin/activate; \ + pip3 install --upgrade pip;\ + pip install --no-cache-dir -r tests/python/integration/requirements.txt; \ + pytest tests/python/integration --alluredir=tests/allure_report || ret=$$?; exit $$ret \ + ) tests: unit-tests integration-tests From 6bd2646f11f98fb829fd0352450b254c56963218 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Wed, 28 Jun 2023 12:13:14 +0200 Subject: [PATCH 28/30] code refacto --- custom-recipes/image-conversion/recipe.py | 2 +- .../image-processing-custom/recipe.py | 3 ++- .../ocr-text-extraction-dataset/recipe.py | 7 ++++--- python-lib/ocr_recipes_io_utils.py | 21 +++++++++++++++++++ python-lib/ocr_utils.py | 20 ------------------ 5 files changed, 28 insertions(+), 25 deletions(-) create mode 100644 python-lib/ocr_recipes_io_utils.py diff --git a/custom-recipes/image-conversion/recipe.py b/custom-recipes/image-conversion/recipe.py index 7f47122..35a96ad 100644 --- a/custom-recipes/image-conversion/recipe.py +++ b/custom-recipes/image-conversion/recipe.py @@ -2,7 +2,7 @@ from PIL import Image from io import BytesIO import logging -from ocr_utils import get_input_output +from ocr_recipes_io_utils import get_input_output from ocr_utils import convert_image_to_greyscale_bytes from ocr_utils import image_conversion_parameters from ocr_utils import pdf_to_pil_images_iterator diff --git a/custom-recipes/image-processing-custom/recipe.py b/custom-recipes/image-processing-custom/recipe.py index 347a0ac..df02be2 100644 --- a/custom-recipes/image-processing-custom/recipe.py +++ b/custom-recipes/image-processing-custom/recipe.py @@ -3,7 +3,8 @@ from io import BytesIO import numpy as np import logging -from ocr_utils import get_input_output, image_processing_parameters +from ocr_recipes_io_utils import get_input_output +from ocr_utils import image_processing_parameters from ocr_constants import Constants logger = logging.getLogger(__name__) diff --git 
a/custom-recipes/ocr-text-extraction-dataset/recipe.py b/custom-recipes/ocr-text-extraction-dataset/recipe.py index fcf7bf7..629b870 100644 --- a/custom-recipes/ocr-text-extraction-dataset/recipe.py +++ b/custom-recipes/ocr-text-extraction-dataset/recipe.py @@ -1,12 +1,13 @@ import logging +import os import pandas as pd import re from time import perf_counter from dataiku.customrecipe import get_recipe_config from ocr_constants import Constants +from ocr_recipes_io_utils import get_input_output from ocr_utils import convert_image_to_greyscale_bytes -from ocr_utils import get_input_output from ocr_utils import pdf_to_pil_images_iterator from ocr_utils import text_extraction_parameters from tesseractocr.extract_text import text_extraction @@ -24,8 +25,8 @@ rows = [] for i, sample_file in enumerate(input_filenames): - prefix = sample_file.split('.')[0] - suffix = sample_file.split('.')[-1] + prefix, suffix = os.path.splitext(sample_file) + suffix = suffix[1:] # removing the dot from the extension if suffix not in Constants.TYPES: logger.info("OCR - Rejecting {} because it is not a {} file.".format(sample_file, '/'.join(Constants.TYPES))) diff --git a/python-lib/ocr_recipes_io_utils.py b/python-lib/ocr_recipes_io_utils.py new file mode 100644 index 0000000..0a43ec7 --- /dev/null +++ b/python-lib/ocr_recipes_io_utils.py @@ -0,0 +1,21 @@ +import dataiku +from dataiku.customrecipe import get_input_names_for_role +from dataiku.customrecipe import get_output_names_for_role + + +def get_input_output(input_type='dataset', output_type='dataset'): + if input_type == 'folder': + input_names = get_input_names_for_role('input_folder')[0] + input_obj = dataiku.Folder(input_names) + else: + input_names = get_input_names_for_role('input_dataset')[0] + input_obj = dataiku.Dataset(input_names) + + if output_type == 'folder': + output_names = get_output_names_for_role('output_folder')[0] + output_obj = dataiku.Folder(output_names) + else: + output_names = 
get_output_names_for_role('output_dataset')[0] + output_obj = dataiku.Dataset(output_names) + + return input_obj, output_obj diff --git a/python-lib/ocr_utils.py b/python-lib/ocr_utils.py index 91bb424..33ea0c1 100644 --- a/python-lib/ocr_utils.py +++ b/python-lib/ocr_utils.py @@ -1,5 +1,3 @@ -import dataiku -from dataiku.customrecipe import get_input_names_for_role, get_output_names_for_role from io import BytesIO from ocr_constants import Constants import os @@ -7,24 +5,6 @@ from shutil import which -def get_input_output(input_type='dataset', output_type='dataset'): - if input_type == 'folder': - input_names = get_input_names_for_role('input_folder')[0] - input_obj = dataiku.Folder(input_names) - else: - input_names = get_input_names_for_role('input_dataset')[0] - input_obj = dataiku.Dataset(input_names) - - if output_type == 'folder': - output_names = get_output_names_for_role('output_folder')[0] - output_obj = dataiku.Folder(output_names) - else: - output_names = get_output_names_for_role('output_dataset')[0] - output_obj = dataiku.Dataset(output_names) - - return input_obj, output_obj - - def pdf_to_pil_images_iterator(pdf_bytes, dpi=None): """ iterator over the multiple images of pdf bytes """ pdf_pages = pdfium.PdfDocument(pdf_bytes) From 30620cdd5194d1646bdf942000a8da6302acad66 Mon Sep 17 00:00:00 2001 From: StanislasGuinel Date: Wed, 28 Jun 2023 16:40:59 +0200 Subject: [PATCH 29/30] add required package for easyocr in py36 --- code-env/python/spec/requirements.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/code-env/python/spec/requirements.txt b/code-env/python/spec/requirements.txt index 65a7a49..80bca16 100644 --- a/code-env/python/spec/requirements.txt +++ b/code-env/python/spec/requirements.txt @@ -8,4 +8,5 @@ opencv-python==4.7.0.72; python_version >= '3.10' deskew==0.10.33 torch==1.11.0; python_version >= '3.10' torch==1.9.1; python_version <= '3.9' -easyocr==1.7.0 \ No newline at end of file +easyocr==1.7.0 +packaging==21.3 
\ No newline at end of file

From ee2fd15e1c98077e81d83cfb64155ce12cd95238 Mon Sep 17 00:00:00 2001
From: StanislasGuinel
Date: Thu, 29 Jun 2023 14:04:54 +0200
Subject: [PATCH 30/30] remove pdf2images doc from readme

---
 README.md | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/README.md b/README.md
index d9d9da5..c8c142f 100644
--- a/README.md
+++ b/README.md
@@ -47,19 +47,6 @@ Using macports: `sudo port install tesseract`
 
 For more informations, go to: .
 
-### pdf2image
-
-To be able to use the python package pdf2image:
-
-#### Linux
-Most distros ship with pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install poppler-utils
-
-For more informations, go to: .
-
-#### Mac
-For macOS using brew: `brew install poppler`.
-Mac users will have to install poppler for Mac ().
-
 ### Specific languages
 
 If you want to specify languages in tesseract, you must install them on the machine with your DSS instance, you can find instructions on how to install them and the code for each language here .
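
Reviewer note on PATCH 22/30: the engine resolution that the series settles on can be summarised in a small standalone sketch. This is not the plugin's code. The config keys (`ocr_engine`, `advanced_parameters`, `language`, `language_easyocr`) and the `"default"` and `"tesseract"` values come from the patches above; the `"easyocr"` constant value, the function name `resolve_engine_and_language`, and the injectable `find_executable` argument are assumptions made here for illustration and testing.

```python
from shutil import which

# "default" and "tesseract" appear in ocr_constants.py in this series;
# "easyocr" is an assumed value for Constants.EASYOCR.
DEFAULT_ENGINE = "default"
TESSERACT = "tesseract"
EASYOCR = "easyocr"


def get_default_ocr_engine(find_executable=which):
    """Prefer tesseract when its binary is on the PATH, otherwise fall back to EasyOCR."""
    # find_executable is a hypothetical hook so the PATH lookup can be stubbed out.
    if find_executable("tesseract") is not None:
        return TESSERACT
    return EASYOCR


def resolve_engine_and_language(recipe_config, find_executable=which):
    """Return (engine, language) following the rules introduced in PATCH 22/30."""
    selected = recipe_config.get("ocr_engine", DEFAULT_ENGINE)
    advanced = recipe_config.get("advanced_parameters", False)

    # The default engine never honours advanced parameters.
    if selected == DEFAULT_ENGINE:
        advanced = False
        selected = get_default_ocr_engine(find_executable)

    language = None
    if selected == TESSERACT:
        language = recipe_config.get("language", "eng") if advanced else "eng"
    elif selected == EASYOCR:
        language = recipe_config.get("language_easyocr", "en") if advanced else "en"
    return selected, language
```

With `ocr_engine` left at `default`, the advanced language choice is ignored on purpose, which matches the `visibilityCondition` added to `recipe.json` in the same patch.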