MoAshour93/Convert_PDF_Office_Files_to_MarkDown_CSV

📚 MarkItDown Converter App

🔄 A Streamlit-powered application that converts various file formats to Markdown and CSV, with integrated LLM capabilities for Q&A pair generation. It is aimed at building LLM fine-tuning datasets.

License: CC BY-NC-SA 4.0

🌟 Features

  • 📄 Multi-format file conversion support:
    • PDF documents
    • PowerPoint presentations
    • Word documents
    • Excel spreadsheets
    • Images (EXIF + OCR)
    • Audio (EXIF + transcription)
    • HTML files
    • Text-based formats (CSV, JSON, XML)
    • ZIP archives
  • 🤖 LLM integration for enhanced conversion
  • ❓ Automatic Q&A pair generation
  • 🔤 Embedding generation support
  • 📊 CSV export with structured data
  • 🎯 Perfect for creating LLM fine-tuning datasets

🛠️ Prerequisites

  • Python 3.10+
  • Ollama (for LLM and embedding features)
  • Virtual environment capability

🚀 Installation

  1. Clone the repository:
git clone https://github.com/MoAshour93/Convert_PDF_Office_Files_to_MarkDown_CSV.git
cd Convert_PDF_Office_Files_to_MarkDown_CSV
  2. Create and activate a virtual environment:
python -m venv markitdown_env
# On Windows:
markitdown_env\Scripts\activate
# On Unix or macOS:
source markitdown_env/bin/activate
  3. Install the required packages:
pip install -r requirements.txt
  4. Install Ollama (required for LLM features):
  • Visit Ollama.ai
  • Follow the installation instructions for your OS
  • Pull an embedding model (encoder-only), e.g. all-minilm or nomic-embed-text:
ollama pull {model_name}
  • Pull a generative large language model (decoder-only), e.g. llama3.3 or qwq:
ollama pull {model_name}
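Before launching the app, it can help to confirm which models Ollama has actually pulled. The sketch below is not part of this repository: it assumes Ollama is serving its REST API at the default address http://localhost:11434 and reads the model list from the /api/tags endpoint. The helper names model_names and list_local_models are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def model_names(tags_payload: dict) -> list[str]:
    """Extract the model names from the JSON body returned by /api/tags."""
    return sorted(m["name"] for m in tags_payload.get("models", []))


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask a running Ollama server which models are installed locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

With Ollama running, list_local_models() returns the installed model names. If the server is not reachable, urllib raises an OSError subclass that the caller can catch.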

💫 Usage

  1. Start the application:
streamlit run MarkItDown_Conversion_App_v0.4.py
  2. Access the web interface:
  • Open your browser
  • Navigate to http://localhost:8501
  3. Choose your conversion path:
  • File to Markdown: Direct file conversion with optional LLM enhancement
  • Markdown to CSV: Extract or generate Q&A pairs from markdown
  • Direct File to Q&A: Convert files directly to Q&A pairs with embeddings
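To make the Markdown to CSV path more concrete, here is a minimal sketch of the extraction idea using only the standard library. It is not the app's implementation. It assumes, purely for illustration, that Q&A pairs appear in the Markdown as consecutive lines prefixed with "Q:" and "A:", and that the CSV columns are named question and answer. The app's real markers and column names may differ.

```python
import csv
import io
import re

# Assumption: each pair is a "Q: ..." line immediately followed by an "A: ..." line.
QA_PATTERN = re.compile(
    r"^Q:[ \t]*(?P<q>.+?)[ \t]*\nA:[ \t]*(?P<a>.+?)[ \t]*$",
    re.MULTILINE,
)


def extract_qa_pairs(markdown: str) -> list[tuple[str, str]]:
    """Return (question, answer) tuples found in the Markdown text."""
    return [(m["q"], m["a"]) for m in QA_PATTERN.finditer(markdown)]


def qa_pairs_to_csv(pairs: list[tuple[str, str]]) -> str:
    """Serialize the pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["question", "answer"])  # hypothetical column names
    writer.writerows(pairs)
    return buf.getvalue()
```

The csv module quotes any field that contains a comma or a quote character, so answers with punctuation survive a round trip through spreadsheet tools.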

🎓 Core Package

This application is built on top of Microsoft's MarkItDown utility. Basic Python usage:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
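The same convert call extends naturally to a whole folder. The helper below is a sketch rather than code from this repository. The name convert_folder, its default file patterns, and the output naming scheme are all assumptions. It accepts any object with a convert(path) method whose result exposes text_content, which is the interface shown in the MarkItDown example above.

```python
from pathlib import Path


def convert_folder(src_dir, out_dir, converter,
                   patterns=("*.pdf", "*.docx", "*.pptx", "*.xlsx")):
    """Convert every matching file in src_dir and write Markdown files to out_dir."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pattern in patterns:
        for path in sorted(src.glob(pattern)):
            result = converter.convert(str(path))
            # Keep the original extension in the name so report.pdf and
            # report.docx do not overwrite each other.
            target = out / f"{path.name}.md"
            target.write_text(result.text_content, encoding="utf-8")
            written.append(target)
    return written


# Usage (requires `pip install markitdown`):
# from markitdown import MarkItDown
# convert_folder("docs", "markdown_out", MarkItDown())
```

Passing the converter in as an argument keeps the helper independent of how MarkItDown is configured, so the same function works with or without an LLM client.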

With LLM integration:

from markitdown import MarkItDown
from openai import OpenAI

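client = OpenAI()  # reads OPENAI_API_KEY from the environment
# The model must accept image input to describe images; gpt-4o does.
md = MarkItDown(llm_client=client, llm_model="gpt-4o")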
result = md.convert("example.jpg")
print(result.text_content)
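Since this app relies on Ollama for its LLM features, the same pattern can in principle point at a local model instead of the OpenAI service. The sketch below assumes Ollama's OpenAI-compatible endpoint is available under /v1 on the default port and that a vision-capable model has been pulled. The model name llava and both helper names are illustrative assumptions, and this is not necessarily how the app itself wires things up.

```python
def ollama_openai_kwargs(host: str = "http://localhost:11434") -> dict:
    """Build OpenAI client settings that target a local Ollama server."""
    return {
        "base_url": host.rstrip("/") + "/v1",  # Ollama's OpenAI-compatible route
        "api_key": "ollama",  # the client requires a value; Ollama ignores it
    }


def build_local_markitdown(model: str = "llava",
                           host: str = "http://localhost:11434"):
    """Return a MarkItDown instance backed by a local Ollama model."""
    # Imported lazily so the helper above works without these packages installed.
    from markitdown import MarkItDown
    from openai import OpenAI

    client = OpenAI(**ollama_openai_kwargs(host))
    return MarkItDown(llm_client=client, llm_model=model)


# Usage (requires markitdown, openai, and a running Ollama server):
# md = build_local_markitdown()
# print(md.convert("example.jpg").text_content)
```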

📜 License

This project is licensed under CC BY-NC-SA 4.0:

  • For research and non-commercial use only
  • Attribution required
  • Modifications allowed with creator notification
  • Commercial use requires explicit permission

👤 Author

Mohamed Ashour

🤝 Contributing

Feel free to:

  • Open issues
  • Submit Pull Requests
  • Share improvements
  • Report bugs

For major changes, please open an issue first to discuss what you would like to change.

🙏 Acknowledgments

  • Microsoft's MarkItDown team for the core conversion utility
  • Streamlit for the web framework
  • Ollama for LLM integration capabilities
