PDF2EPUB 📚

Convert PDF files to well-structured Markdown and EPUB with intelligent layout detection.

✨ Features

  • 📖 Smart layout detection for books and academic papers
  • 🔍 Advanced text extraction and OCR capabilities
  • 📊 Table detection and formatting
  • 🖼️ Image extraction and optimization
  • 📝 Clean markdown output with preserved structure
  • 📱 EPUB generation with customizable styling
  • 🌍 Multi-language support
  • 🚀 GPU acceleration support (NVIDIA & AMD)

🛠️ Dependencies

  • Python 3.9+
  • PyTorch (with CUDA/ROCm support for GPU acceleration)
  • marker-pdf==0.3.10
  • transformers==4.45.2
  • markdown==3.7

💻 Installation

  1. Install Python dependencies:
pip install -r requirements.txt
  2. Install PyTorch:
  • For NVIDIA GPUs, install with CUDA support:
pip install torch torchvision torchaudio
  • For AMD GPUs, install with ROCm support:
pip3 uninstall torch torchvision torchaudio
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
  3. Verify GPU support:
import torch
print(torch.__version__)  # PyTorch version
print(torch.cuda.is_available())  # True when a usable GPU is found (ROCm builds also report through torch.cuda)
print(torch.version.hip)  # ROCm/HIP version on AMD builds, None on CUDA or CPU builds
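
The two checks above can be combined into a single backend label. The following is a minimal sketch, not part of the project: `describe_backend` is a hypothetical helper, and the fallback to "cpu" when PyTorch is missing is an assumption made so the snippet runs anywhere.

```python
def describe_backend(cuda_available, hip_version):
    """Map the two PyTorch checks to a label (hypothetical helper)."""
    if not cuda_available:
        return "cpu"
    # ROCm builds expose a HIP version string; CUDA builds report None here.
    return "rocm" if hip_version else "cuda"


try:
    import torch

    backend = describe_backend(torch.cuda.is_available(), torch.version.hip)
except ImportError:
    # Assumption: treat a missing PyTorch install as CPU-only.
    backend = "cpu"

print(f"Detected backend: {backend}")
```

If this prints `cpu` on a machine with a GPU, the installed PyTorch build most likely does not match the hardware, and step 2 is worth repeating.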

🚀 Usage

Basic Usage

Convert a single PDF file:

python main.py input.pdf

Convert all PDFs in a directory:

python main.py input_directory/
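
For readers curious how a single entry point can accept either a file or a directory, here is a minimal sketch. It is an illustration only: `find_pdfs` is a hypothetical name, and the non-recursive, case-insensitive matching is an assumption rather than the documented behavior of `main.py`.

```python
import tempfile
from pathlib import Path


def find_pdfs(input_path):
    """Return the PDFs to process: the file itself, or each PDF in a directory."""
    path = Path(input_path)
    if path.is_dir():
        # Assumption: only the top level is scanned, and ".PDF" counts as a PDF.
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    if path.is_file() and path.suffix.lower() == ".pdf":
        return [path]
    return []


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.PDF", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    found = [p.name for p in find_pdfs(tmp)]
    single = [p.name for p in find_pdfs(Path(tmp) / "a.pdf")]

print(found)
print(single)
```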

Advanced Options

python main.py [input_path] [output_path] [options]

Options:
  --batch-multiplier INT    Batch size multiplier for memory/speed tradeoff (default: 2)
  --max-pages INT           Maximum number of pages to process
  --start-page INT          Page number to start from
  --langs STRING            Comma-separated list of languages in document
  --skip-epub               Skip EPUB generation, only create markdown
  --skip-md                 Skip markdown generation, use existing markdown files
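
The option list above maps naturally onto Python's `argparse`. The sketch below shows one way such an interface could be declared. It is not the project's actual `main.py`: the optional positionals, the `None` defaults, and the `parse_langs` helper are assumptions; only the flag names and the default of 2 for `--batch-multiplier` come from the list above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("input_path", nargs="?", help="PDF file or directory of PDFs")
    parser.add_argument("output_path", nargs="?", help="Directory for generated files")
    parser.add_argument("--batch-multiplier", type=int, default=2)
    parser.add_argument("--max-pages", type=int, default=None)
    parser.add_argument("--start-page", type=int, default=None)
    parser.add_argument("--langs", type=str, default=None)
    parser.add_argument("--skip-epub", action="store_true")
    parser.add_argument("--skip-md", action="store_true")
    return parser


def parse_langs(value):
    """Split a comma-separated language string into a clean list."""
    if not value:
        return []
    return [lang.strip() for lang in value.split(",") if lang.strip()]


# Parse a sample command line instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(
    ["book.pdf", "--start-page", "10", "--max-pages", "50", "--langs", "English, German"]
)
print(args.input_path, args.start_page, args.max_pages, parse_langs(args.langs))
```

Note that `argparse` converts dashes in flag names to underscores, so `--batch-multiplier` is read back as `args.batch_multiplier`.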

Examples

Process a specific range of pages:

python main.py book.pdf --start-page 10 --max-pages 50

Process a multi-language document:

python main.py paper.pdf --langs "English,German"

Convert to markdown only:

python main.py thesis.pdf --skip-epub

Output Structure

output_directory/
├── document_name/
│   ├── document_name.md
│   ├── document_name.epub
│   ├── document_name_metadata.json
│   └── images/
│       ├── image1.png
│       ├── image2.jpg
│       └── ...
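
The layout above can be expressed as a small path helper, which is handy when scripting around the converter. This is a sketch under the assumption that the per-document folder and file names are derived from the PDF's file name without its extension; `output_paths` is a hypothetical function, not part of the project.

```python
from pathlib import Path


def output_paths(pdf_path, output_root):
    """Compute where each artifact for one PDF would be expected (hypothetical)."""
    stem = Path(pdf_path).stem  # "books/thesis.pdf" -> "thesis"
    doc_dir = Path(output_root) / stem
    return {
        "markdown": doc_dir / f"{stem}.md",
        "epub": doc_dir / f"{stem}.epub",
        "metadata": doc_dir / f"{stem}_metadata.json",
        "images": doc_dir / "images",
    }


paths = output_paths("books/thesis.pdf", "output_directory")
for kind, path in paths.items():
    print(f"{kind}: {path.as_posix()}")
```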

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a new branch for your feature
  3. Commit your changes
  4. Push to your branch
  5. Create a Pull Request

Please ensure your code follows the existing style and includes appropriate tests.

Development Setup

  1. Clone the repository:
git clone https://github.com/yourusername/pdf2epub.git
cd pdf2epub
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows
  3. Install development dependencies:
pip install -r requirements.txt

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🐛 Known Issues

  • Some embedded images might need manual adjustment
  • Some complex mathematical equations might not be perfectly converted
  • Certain PDF layouts with multiple columns may require manual adjustment
  • Font detection might be imperfect in some cases

🙏 Acknowledgments

This project builds upon several excellent open-source libraries: