Feature Request: Alternative PDF parsing options for faster processing #29

@liaosvcaf

Feature Request

Add alternative PDF parsing options to avoid MinerU OCR hanging issues.

Problem

  • The MinerU OCR stage hangs on large PDFs (>10 pages)
  • Users are left waiting >20 minutes with no progress indication
  • There is no fallback option when OCR fails

Proposed Solutions

Option 1: Add alternative parsers

Support multiple PDF parsing backends:

  • pdfplumber (Python-based, no OCR)
  • PyMuPDF (fast, lightweight)
  • pdf2image + OCR (rasterize pages, then run an alternative OCR engine)
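
A minimal sketch of how selectable backends might look, assuming a simple name-to-function registry. The names `PARSERS`, `parse_pdf`, `parse_with_pdfplumber`, and `parse_with_pymupdf` are hypothetical and not part of this project. Both backends only extract embedded text, so scanned PDFs would still need OCR.

```python
from typing import Callable, Dict


def parse_with_pdfplumber(path: str) -> str:
    """Extract embedded text with pdfplumber (no OCR)."""
    import pdfplumber  # imported lazily so the dependency stays optional

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages without a text layer
        return "\n\n".join(page.extract_text() or "" for page in pdf.pages)


def parse_with_pymupdf(path: str) -> str:
    """Extract embedded text with PyMuPDF (no OCR)."""
    import fitz  # PyMuPDF; imported lazily so the dependency stays optional

    with fitz.open(path) as doc:
        return "\n\n".join(page.get_text() for page in doc)


# Hypothetical registry: backend name -> function returning the text of a PDF.
PARSERS: Dict[str, Callable[[str], str]] = {
    "pdfplumber": parse_with_pdfplumber,
    "pymupdf": parse_with_pymupdf,
}


def parse_pdf(path: str, backend: str = "pymupdf") -> str:
    """Dispatch to the chosen backend, failing clearly on unknown names."""
    try:
        parser = PARSERS[backend]
    except KeyError:
        known = ", ".join(sorted(PARSERS))
        raise ValueError(f"unknown backend {backend!r}; choose from: {known}") from None
    return parser(path)
```

With a registry like this, a pdf2image + OCR backend could be added as one more entry without touching the callers.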

Option 2: Add timeout + retry

  • Add configurable timeout for parsing stage
  • Auto-retry with alternative method on failure
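
One possible shape for this, sketched under two assumptions: the slow parser can be run as an external command that prints its result to stdout, and the fallback is an in-process function. Neither is confirmed for MinerU. `parse_with_fallback` and its parameters are hypothetical names. Running the primary parser in a child process matters here because `subprocess.run` kills the child when the timeout expires, whereas a hung thread cannot be stopped.

```python
import subprocess
import sys
from typing import Callable, Sequence


def parse_with_fallback(
    primary_cmd: Sequence[str],
    fallback: Callable[[], str],
    timeout_s: float = 300.0,
) -> str:
    """Run the primary parser with a deadline; use the fallback if it fails.

    primary_cmd is a placeholder for whatever command runs the slow parser.
    It is assumed to write the parsed text to stdout.
    """
    try:
        result = subprocess.run(
            list(primary_cmd),
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child process is killed when this expires
            check=True,         # a non-zero exit code raises CalledProcessError
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        print(f"primary parser exceeded {timeout_s}s, falling back", file=sys.stderr)
    except (subprocess.CalledProcessError, FileNotFoundError) as exc:
        print(f"primary parser failed ({exc}), falling back", file=sys.stderr)
    return fallback()
```

The timeout would be exposed as a configuration value so that users with large PDFs can raise it instead of being forced onto the fallback.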

Option 3: Skip parsing option

  • Add --skip-parsing flag
  • Allow users to provide pre-parsed markdown content
  • Useful for users who just want slide generation
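
A rough sketch of the flag using `argparse`, assuming the positional input is treated as a markdown file when `--skip-parsing` is set. The function names `build_arg_parser` and `load_markdown`, and the `parse_pdf` callback, are hypothetical and only illustrate the control flow.

```python
import argparse
from pathlib import Path
from typing import Callable


def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Generate slides from a paper")
    parser.add_argument("input", help="path to a PDF, or to markdown with --skip-parsing")
    parser.add_argument(
        "--skip-parsing",
        action="store_true",
        help="treat INPUT as pre-parsed markdown and skip the PDF parsing stage",
    )
    return parser


def load_markdown(args: argparse.Namespace, parse_pdf: Callable[[str], str]) -> str:
    """Return markdown for slide generation, parsing the PDF only when needed."""
    if args.skip_parsing:
        # The user already has markdown, so read it as-is.
        return Path(args.input).read_text(encoding="utf-8")
    return parse_pdf(args.input)
```

Keeping the parsing step behind a callback means the slide generation code does not need to know which path produced the markdown.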

Use Case

Converting arXiv papers (10-50 pages) to slides should complete in <5 minutes, not >20 minutes.

Environment

  • Titan (MacBook Pro M2), macOS
  • Tested with arXiv 2602.11865 (42 pages)
