pdf processing with OCR is stuck #1148

Open
pradyrk opened this issue Mar 12, 2025 · 0 comments
Labels
question Further information is requested


pradyrk commented Mar 12, 2025

Question

I am processing a single 4 MB PDF with the OCR options below:

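from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
# Run the Tesseract CLI over each full page rather than only detected bitmap regions
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
pipeline_options.ocr_options = ocr_options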

The process runs indefinitely, for hours. I am not sure whether it has something to do with the file: it is a PPTX file converted to PDF. The use case is to extract complex tables.

I tried the accelerator options with both CPU and GPU, but that doesn't help either.

accelerator_options = AcceleratorOptions(
    num_threads=8, device=AcceleratorDevice.CPU
)

How do I address this issue? Is there a way to process the pages within a PDF in parallel?
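For illustration, here is a rough sketch of what page-level parallelism could look like: split the document into page ranges and convert each range in its own worker process. This is not a confirmed Docling feature or recommendation. It assumes that `DocumentConverter.convert` accepts a `page_range` argument with 1-based inclusive bounds, that the total page count is known in advance, and that each worker can afford to load its own copy of the models. The helper names `page_chunks`, `convert_chunk`, and `convert_in_parallel` are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor


def page_chunks(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (start, end) ranges."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def convert_chunk(pdf_path: str, page_range: tuple[int, int]) -> str:
    """Convert one page range and return it as Markdown (hypothetical helper)."""
    # Imported lazily so that every worker process loads Docling itself.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Same options as in the question above.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    # Assumption: convert() supports page_range in the installed version.
    result = converter.convert(pdf_path, page_range=page_range)
    return result.document.export_to_markdown()


def convert_in_parallel(
    pdf_path: str, total_pages: int, chunk_size: int = 5, workers: int = 4
) -> list[str]:
    """Convert page ranges in separate processes, keeping page order."""
    ranges = page_chunks(total_pages, chunk_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(convert_chunk, pdf_path, r) for r in ranges]
        return [f.result() for f in futures]
```

Calling `convert_in_parallel` on a small range first (for example `total_pages=2`) would also show whether a single page is what stalls the run. On platforms that spawn worker processes, the call needs to sit under an `if __name__ == "__main__":` guard.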
...

pradyrk added the question label Mar 12, 2025