feat: Add timeout limit to document parsing job. #270 #320

ab-shrek · 2024-11-12T22:45:05Z

Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 INFO:docling.document_converter:Going to convert document batch... Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 87584.07it/s] INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 24.12 sec. INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md INFO:docling.cli.main:Processed 1 docs, of which 0 failed INFO:docling.cli.main:All documents were converted in 24.13 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=5 INFO:docling.document_converter:Going to convert document batch... Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 29037.49it/s] INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf WARNING:docling.pipeline.base_pipeline:Document processing time (6 s) exceeded the specified timeout of 5 s INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 10.82 sec. WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpzedg349h/2206.01062v1.pdf failed to convert. INFO:docling.cli.main:Processed 1 docs, of which 1 failed INFO:docling.cli.main:All documents were converted in 10.82 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 INFO:docling.document_converter:Going to convert document batch... Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 88197.98it/s] INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 22.59 sec. INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md INFO:docling.cli.main:Processed 1 docs, of which 0 failed INFO:docling.cli.main:All documents were converted in 22.60 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling

Usage: docling [OPTIONS] source

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from [docx|pptx|html|image|pdf|asciidoc|md] Specify input formats to convert from. Defaults to all formats. [default: None] │
│ --to [md|json|text|doctags] Specify output formats. Defaults to Markdown. [default: None] │
│ --ocr --no-ocr If enabled, the bitmap content will be processed using OCR. [default: ocr] │
│ --force-ocr --no-force-ocr Replace any existing text with OCR generated text over the full content. [default: no-force-ocr] │
│ --ocr-engine [easyocr|tesseract_cli|tesseract] The OCR engine to use. [default: easyocr] │
│ --pdf-backend [pypdfium2|dlparse_v1|dlparse_v2] The PDF backend to use. [default: dlparse_v1] │
│ --table-mode [fast|accurate] The mode to use in the table structure model. [default: fast] │
│ --artifacts-path PATH If provided, the location of the model artifacts. [default: None] │
│ --abort-on-error --no-abort-on-error If enabled, the bitmap content will be processed using OCR. [default: no-abort-on-error] │
│ --output PATH Output directory where results are saved. [default: .] │
│ --version Show version information. │
│ --document-timeout INTEGER The timeout for processing each document, in seconds. [default: None] │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Resolves #270

Checklist:

Commit Message Formatting: Commit titles and messages follow guidelines in the
conventional commits.
Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

nikos-livathinos · 2024-11-17T19:51:28Z

@ab-shrek thank you for your PR. Here are some points:

It would be better to measure the timeout in fractional seconds. Then the type of pipeline_options.document_timeout should be float.
The signature of the _build_document() method must remain as it is. The pipeline_options are already available as a member of the PaginatedPipeline class, no need to pass the timeout as an additional method parameter. You can just use the self._pipeline_options.document_timeout inside the PaginatedPipeline._build_document().
When measuring elapsed time it is better to use time.monotonic() instead of time.time(). It would be nice to improve also the existing code and use the monotonic() for the measurement of the elapsed time of page batch used for the logging.
Another approach is to measure the elapsed time for the page batch only once inside the loop and sum up to have the elapsed time for the document conversion up to the current page batch.

mergify · 2024-11-18T08:31:02Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

nikos-livathinos · 2024-11-24T22:20:06Z

@ab-shrek thank you for taking the time to implement my comments. It would be nice to follow the next steps:

Update your forked repo with the upstream Docling main.
Install and run the pre-commit (https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md#coding-style-guidelines)

MyPy complains about 2 issues:

The total_elapsed_time should be float. @ab-shrek please initialize it as total_elapsed_time = 0.0.
The document_timeout should be defined in the mother class PipelineOptions and not in PdfPipelineOptions. I propose to actually move the document_timeout in the base class PipelineOptions even though only the PaginatedPipeline will currently provide an implementation. @cau-git please comment on this.

nikos-livathinos · 2024-12-02T15:19:10Z

@ab-shrek as we want to merge this PR soon, please feel free to implement the steps from my previous comment.

ab-shrek · 2024-12-02T15:20:39Z

@nikos-livathinos let me get on this asap. Thanks

nikos-livathinos · 2024-12-05T09:46:05Z

@ab-shrek very nice, now all tests pass!
You only need to update again your forked repo with the upstream Docling as there are some conflicts in cli/main.py.
I guess after this we are ready to merge this PR.

nikos-livathinos · 2024-12-06T08:58:24Z

docling/cli/main.py

+        pipeline_options = PdfPipelineOptions(
+            do_ocr=ocr,
+            ocr_options=ocr_options,
+            do_table_structure=True,


@ab-shrek please pass the document_timeout CLI parameter in the initialization of pipeline_options.

PeterStaar-IBM requested review from cau-git, vagenas, PeterStaar-IBM and nikos-livathinos and removed request for cau-git and vagenas November 18, 2024 08:30

PeterStaar-IBM assigned ab-shrek Nov 18, 2024

ab-shrek force-pushed the main branch 2 times, most recently from 0b82dbf to ecdc948 Compare November 23, 2024 06:20

cau-git changed the title ~~enhancement: Add timeout limit to document parsing job. #270~~ feat: Add timeout limit to document parsing job. #270 Nov 27, 2024

ab-shrek closed this Dec 2, 2024

ab-shrek force-pushed the main branch from ecdc948 to 33cff98 Compare December 2, 2024 21:36

ab-shrek reopened this Dec 2, 2024

nikos-livathinos reviewed Dec 6, 2024

View reviewed changes

ab-shrek closed this Dec 8, 2024

ab-shrek force-pushed the main branch from 3d56ac1 to eb7ffcd Compare December 8, 2024 13:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add timeout limit to document parsing job. #270 #320

feat: Add timeout limit to document parsing job. #270 #320

ab-shrek commented Nov 12, 2024

nikos-livathinos commented Nov 17, 2024

mergify bot commented Nov 18, 2024 •

edited

Loading

nikos-livathinos commented Nov 24, 2024

nikos-livathinos commented Dec 2, 2024

ab-shrek commented Dec 2, 2024

nikos-livathinos commented Dec 5, 2024

nikos-livathinos Dec 6, 2024

feat: Add timeout limit to document parsing job. #270 #320

feat: Add timeout limit to document parsing job. #270 #320

Conversation

ab-shrek commented Nov 12, 2024

nikos-livathinos commented Nov 17, 2024

mergify bot commented Nov 18, 2024 • edited Loading

Merge Protections

🟢 Enforce conventional commit

nikos-livathinos commented Nov 24, 2024

nikos-livathinos commented Dec 2, 2024

ab-shrek commented Dec 2, 2024

nikos-livathinos commented Dec 5, 2024

nikos-livathinos Dec 6, 2024

Choose a reason for hiding this comment

mergify bot commented Nov 18, 2024 •

edited

Loading