Open
Description
Feature Request
Add alternative PDF parsing options to avoid MinerU OCR hanging issues.
Problem
- MinerU OCR stage hangs on large PDFs (>10 pages)
- Users are stuck waiting more than 20 minutes with no progress indication
- No fallback option when OCR fails
Proposed Solutions
Option 1: Add alternative parsers
Support multiple PDF parsing backends:
- pdfplumber (Python-based, no OCR)
- PyMuPDF (fast, lightweight)
- pdf2image + OCR (alternative OCR)
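A minimal sketch of what selectable backends could look like, assuming a simple name-to-function registry. The names `PARSERS`, `register`, and `parse_pdf` are hypothetical and not part of the project; the `pdfplumber` and PyMuPDF (`fitz`) calls are those libraries' text-extraction APIs, imported lazily so each dependency stays optional.

```python
from typing import Callable, Dict

# Hypothetical registry: backend name -> function taking a PDF path, returning text.
PARSERS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds a parsing function to the registry under `name`."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        PARSERS[name] = fn
        return fn
    return decorator


@register("pdfplumber")
def parse_with_pdfplumber(path: str) -> str:
    import pdfplumber  # lazy import: only needed if this backend is chosen

    with pdfplumber.open(path) as pdf:
        # extract_text() may return None for pages without a text layer
        return "\n\n".join(page.extract_text() or "" for page in pdf.pages)


@register("pymupdf")
def parse_with_pymupdf(path: str) -> str:
    import fitz  # PyMuPDF; lazy import for the same reason

    with fitz.open(path) as doc:
        return "\n\n".join(page.get_text() for page in doc)


def parse_pdf(path: str, backend: str = "pymupdf") -> str:
    """Dispatch to the requested backend, failing clearly on unknown names."""
    try:
        parser = PARSERS[backend]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; available: {sorted(PARSERS)}"
        ) from None
    return parser(path)
```

Neither backend runs OCR, so scanned PDFs without a text layer would still need an OCR path such as the pdf2image option.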
Option 2: Add timeout + retry
- Add configurable timeout for parsing stage
- Auto-retry with alternative method on failure
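One way the timeout and fallback could fit together, sketched with the standard library only. `ParseTimeout`, `run_with_timeout`, and `parse_with_fallback` are hypothetical names, and the 300-second default is only an illustration of a configurable value. The sketch runs each parser in a daemon thread and gives up waiting after the timeout; it does not kill the hung work, so a real implementation might prefer a subprocess that can be terminated.

```python
import threading
from typing import Callable, Sequence, Tuple


class ParseTimeout(Exception):
    """Raised when a parser does not finish within the allowed time."""


def run_with_timeout(fn: Callable[[str], str], path: str, timeout_s: float) -> str:
    result = {}

    def worker() -> None:
        try:
            result["value"] = fn(path)
        except Exception as exc:  # hand the error back to the calling thread
            result["error"] = exc

    # Daemon thread: a hung parser will not block interpreter shutdown.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        raise ParseTimeout(f"parser exceeded {timeout_s}s")
    if "error" in result:
        raise result["error"]
    return result["value"]


def parse_with_fallback(
    path: str,
    parsers: Sequence[Tuple[str, Callable[[str], str]]],
    timeout_s: float = 300.0,
) -> Tuple[str, str]:
    """Try each (name, parser) in order; return (name, text) of the first success."""
    errors = []
    for name, fn in parsers:
        try:
            return name, run_with_timeout(fn, path, timeout_s)
        except Exception as exc:  # timeout or parser failure: try the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))
```

Returning the name of the backend that succeeded lets the caller log which method was actually used after a fallback.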
Option 3: Skip parsing option
- Add --skip-parsing flag
- Allow users to provide pre-parsed markdown content
- Useful for users who just want slide generation
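A rough sketch of how the flag might be wired up with `argparse`. Only `--skip-parsing` comes from this request; the `--markdown` option, the `slides-cli` program name, and the `build_arg_parser` / `load_content` helpers are assumptions made for illustration.

```python
import argparse
from pathlib import Path
from typing import Callable


def build_arg_parser() -> argparse.ArgumentParser:
    # "slides-cli" is a placeholder program name, not the project's real CLI.
    parser = argparse.ArgumentParser(prog="slides-cli")
    parser.add_argument("input", nargs="?", help="PDF file to parse")
    parser.add_argument(
        "--skip-parsing",
        action="store_true",
        help="skip the PDF parsing stage and use pre-parsed markdown",
    )
    parser.add_argument(
        "--markdown",
        type=Path,
        help="path to pre-parsed markdown (hypothetical companion option)",
    )
    return parser


def load_content(args: argparse.Namespace, parse_pdf: Callable[[str], str]) -> str:
    """Return markdown text, either user-provided or produced by parse_pdf."""
    if args.skip_parsing:
        if args.markdown is None:
            raise SystemExit("--skip-parsing requires --markdown PATH")
        return args.markdown.read_text(encoding="utf-8")
    if args.input is None:
        raise SystemExit("a PDF input is required unless --skip-parsing is set")
    return parse_pdf(args.input)
```

With this shape, slide generation receives the same markdown string regardless of where it came from, so the downstream stages would not need to know parsing was skipped.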
Use Case
Converting arXiv papers (10-50 pages) to slides should complete in <5 minutes, not >20 minutes.
Environment
- Titan (MacBook Pro M2), macOS
- Tested with arXiv 2602.11865 (42 pages)