pdf processing with OCR is stuck #1148

pradyrk · 2025-03-12T16:56:27Z

Question

I am processing a single 4 MB PDF with the OCR options below:

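# Import path assumed from the class names (Docling)
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
# OCR every page in full instead of only the regions detected as bitmaps
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
pipeline_options.ocr_options = ocr_options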

The process runs indefinitely, for hours. I am not sure whether it has something to do with the file: it is a PPTX file converted to PDF. The use case is to extract complex tables.

I tried using accelerator options with both CPU and GPU, but that doesn't help either.

accelerator_options = AcceleratorOptions(
    num_threads=8, device=AcceleratorDevice.CPU
)

How do I address this issue? Is there a way to process the pages within a PDF in parallel?
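To make the second question concrete, here is a minimal, library-agnostic sketch of what page-level parallelism could look like: the page numbers are split into fixed-size ranges and each range is handed to a worker. Everything in it is an assumption for illustration. `page_ranges`, `convert_range`, and `convert_in_chunks` are hypothetical names, and `convert_range` is a stub standing in for whatever call would actually convert a subset of pages. It is not a confirmed feature of the library used above. Threads are used only to keep the sketch simple; CPU-bound work in pure Python would likely need a process pool instead.

```python
from concurrent.futures import ThreadPoolExecutor


def page_ranges(total_pages, chunk_size):
    """Split pages 1..total_pages into inclusive (start, end) ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def convert_range(pdf_path, page_range):
    """Hypothetical placeholder: convert only the given pages of pdf_path.

    A real version would call the conversion library here, restricted to
    page_range. This stub only echoes its arguments.
    """
    return {"path": pdf_path, "pages": page_range}


def convert_in_chunks(pdf_path, total_pages, chunk_size=4, workers=4):
    """Run convert_range over each page range concurrently, keeping page order."""
    ranges = page_ranges(total_pages, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of `ranges`, not completion order.
        return list(pool.map(lambda r: convert_range(pdf_path, r), ranges))
```

Running the chunks one at a time with the same helper would also show which page range, if any, is the one that never finishes.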
...

pradyrk added the question Further information is requested label Mar 12, 2025
