Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add timeout limit to document parsing job. #270 #320

Closed
wants to merge 0 commits into from

Conversation

ab-shrek
Copy link
Contributor

Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 INFO:docling.document_converter:Going to convert document batch... Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 87584.07it/s] INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 24.12 sec. INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md INFO:docling.cli.main:Processed 1 docs, of which 0 failed INFO:docling.cli.main:All documents were converted in 24.13 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=5 INFO:docling.document_converter:Going to convert document batch... Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 29037.49it/s] INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf WARNING:docling.pipeline.base_pipeline:Document processing time (6 s) exceeded the specified timeout of 5 s INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 10.82 sec. WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpzedg349h/2206.01062v1.pdf failed to convert. INFO:docling.cli.main:Processed 1 docs, of which 1 failed INFO:docling.cli.main:All documents were converted in 10.82 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 INFO:docling.document_converter:Going to convert document batch... Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 88197.98it/s] INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 22.59 sec. INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md INFO:docling.cli.main:Processed 1 docs, of which 0 failed INFO:docling.cli.main:All documents were converted in 22.60 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling

Usage: docling [OPTIONS] source

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from [docx|pptx|html|image|pdf|asciidoc|md] Specify input formats to convert from. Defaults to all formats. [default: None] │
│ --to [md|json|text|doctags] Specify output formats. Defaults to Markdown. [default: None] │
│ --ocr --no-ocr If enabled, the bitmap content will be processed using OCR. [default: ocr] │
│ --force-ocr --no-force-ocr Replace any existing text with OCR generated text over the full content. [default: no-force-ocr] │
│ --ocr-engine [easyocr|tesseract_cli|tesseract] The OCR engine to use. [default: easyocr] │
│ --pdf-backend [pypdfium2|dlparse_v1|dlparse_v2] The PDF backend to use. [default: dlparse_v1] │
│ --table-mode [fast|accurate] The mode to use in the table structure model. [default: fast] │
│ --artifacts-path PATH If provided, the location of the model artifacts. [default: None] │
│ --abort-on-error --no-abort-on-error If enabled, the bitmap content will be processed using OCR. [default: no-abort-on-error] │
│ --output PATH Output directory where results are saved. [default: .] │
│ --version Show version information. │
│ --document-timeout INTEGER The timeout for processing each document, in seconds. [default: None] │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Resolves #270

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@nikos-livathinos
Copy link
Collaborator

@ab-shrek thank you for your PR. Here are some points:

  1. It would be better to measure the timeout in fractional seconds. Then the type of pipeline_options.document_timeout should be float.
  2. The signature of the _build_document() method must remain as it is. The pipeline_options are already available as a member of the PaginatedPipeline class, no need to pass the timeout as an additional method parameter. You can just use the self._pipeline_options.document_timeout inside the PaginatedPipeline._build_document().
  3. When measuring elapsed time it is better to use time.monotonic() instead of time.time(). It would be nice to improve also the existing code and use the monotonic() for the measurement of the elapsed time of page batch used for the logging.
  4. Another approach is to measure the elapsed time for the page batch only once inside the loop and sum up to have the elapsed time for the document conversion up to the current page batch.

Copy link

mergify bot commented Nov 18, 2024

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@ab-shrek ab-shrek force-pushed the main branch 2 times, most recently from 0b82dbf to ecdc948 Compare November 23, 2024 06:20
@nikos-livathinos
Copy link
Collaborator

@ab-shrek thank you for taking the time to implement my comments. It would be nice to follow the next steps:

  1. Update your forked repo with the upstream Docling main.
  2. Install and run the pre-commit (https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md#coding-style-guidelines)

MyPy complains about 2 issues:

  1. The total_elapsed_time should be float. @ab-shrek please initialize it as total_elapsed_time = 0.0.
  2. The document_timeout should be defined in the mother class PipelineOptions and not in PdfPipelineOptions. I propose to actually move the document_timeout in the base class PipelineOptions even though only the PaginatedPipeline will currently provide an implementation. @cau-git please comment on this.

@cau-git cau-git changed the title enhancement: Add timeout limit to document parsing job. #270 feat: Add timeout limit to document parsing job. #270 Nov 27, 2024
@nikos-livathinos
Copy link
Collaborator

@ab-shrek as we want to merge this PR soon, please feel free to implement the steps from my previous comment.

@ab-shrek
Copy link
Contributor Author

ab-shrek commented Dec 2, 2024

@nikos-livathinos let me get on this asap. Thanks

@nikos-livathinos
Copy link
Collaborator

@ab-shrek very nice, now all tests pass!
You only need to update again your forked repo with the upstream Docling as there are some conflicts in cli/main.py.
I guess after this we are ready to merge this PR.

pipeline_options = PdfPipelineOptions(
do_ocr=ocr,
ocr_options=ocr_options,
do_table_structure=True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ab-shrek please pass the document_timeout CLI parameter in the initialization of pipeline_options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add timeout limit to document parsing job.
2 participants