A Streamlit application that converts a range of file formats to Markdown and CSV, with integrated LLM support for generating Q&A pairs. It is intended for building LLM fine-tuning datasets.
- Multi-format file conversion support:
- PDF documents
- PowerPoint presentations
- Word documents
- Excel spreadsheets
- Images (EXIF + OCR)
- Audio (EXIF + transcription)
- HTML files
- Text-based formats (CSV, JSON, XML)
- ZIP archives
- LLM integration for enhanced conversion
- Automatic Q&A pair generation
- Embedding generation support
- CSV export with structured data
- Perfect for creating LLM fine-tuning datasets
- Python 3.10+
- Ollama (for LLM and embedding features)
- Virtual environment capability
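The requirements above can be checked from Python before installing anything. The following is a minimal sketch, not part of the application: `check_prerequisites` is a hypothetical helper name, and it assumes that Ollama is installed as an `ollama` executable on the `PATH` and that the standard-library `venv` module is what provides virtual environment support.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_version=(3, 10)):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # Python 3.10+ is listed as a requirement.
    if sys.version_info[:2] < tuple(min_version):
        problems.append(
            f"Python {min_version[0]}.{min_version[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )

    # Assumption: Ollama is available as an `ollama` command on PATH.
    if shutil.which("ollama") is None:
        problems.append("ollama executable not found on PATH")

    # Rough check that the stdlib venv module is importable.
    if importlib.util.find_spec("venv") is None:
        problems.append("venv module not available")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```

A missing `ollama` executable only affects the LLM and embedding features, so it can be treated as a warning rather than a hard failure.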
- Clone the repository:
git clone https://github.com/MoAshour93/Convert_PDF_Office_Files_to_MarkDown_CSV.git
cd Convert_PDF_Office_Files_to_MarkDown_CSV
- Create and activate virtual environment:
python -m venv markitdown_env
# On Windows:
markitdown_env\Scripts\activate
# On Unix or MacOS:
source markitdown_env/bin/activate
- Install required packages:
pip install -r requirements.txt
- Install Ollama (required for LLM features):
- Visit Ollama.ai
- Follow installation instructions for your OS
- Install embedding models (encoder-only models) using
ollama pull {model_name}  # e.g. all-minilm, nomic-embed-text
- Install other large language models (decoder-only models) using
ollama pull {model_name}  # e.g. llama3.3, qwq
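If you prefer to script the model downloads, the `ollama pull` commands above can be wrapped in Python. This is a sketch under assumptions: `pull_models` is a hypothetical helper, and the model names in the usage example are taken from the examples in this section. With `dry_run=True` (the default here) it only builds the commands and runs nothing.

```python
import shutil
import subprocess


def pull_models(models, dry_run=True):
    """Build one `ollama pull <name>` command per model, and optionally run them."""
    commands = [["ollama", "pull", name] for name in models]

    if not dry_run:
        # Fail early with a clear message if the CLI is missing.
        if shutil.which("ollama") is None:
            raise RuntimeError("ollama executable not found on PATH")
        for command in commands:
            # check=True raises CalledProcessError if a pull fails.
            subprocess.run(command, check=True)

    return commands


if __name__ == "__main__":
    for command in pull_models(["nomic-embed-text", "llama3.3"]):
        print(" ".join(command))
```

Call `pull_models([...], dry_run=False)` to actually download the models. Large models can take a long time to pull.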
- Start the application:
streamlit run MarkItDown_Conversion_App_v0.4.py
- Access the web interface:
- Open your browser
- Navigate to
http://localhost:8501
- Choose your conversion path:
- File to Markdown: Direct file conversion with optional LLM enhancement
- Markdown to CSV: Extract or generate Q&A pairs from markdown
- Direct File to Q&A: Convert files directly to Q&A pairs with embeddings
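To illustrate the Markdown to CSV path, here is a small sketch of extracting Q&A pairs from Markdown text and serialising them as CSV. It is not the application's actual implementation: the `Q:` / `A:` line format, the `question` / `answer` column names, and both function names are assumptions made for the example.

```python
import csv
import io


def extract_qa_pairs(markdown_text):
    """Collect (question, answer) pairs from lines starting with 'Q:' and 'A:'."""
    pairs = []
    question = None
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Q:"):
            question = stripped[2:].strip()
        elif stripped.startswith("A:") and question is not None:
            pairs.append((question, stripped[2:].strip()))
            question = None  # a question is consumed by its first answer
    return pairs


def pairs_to_csv(pairs):
    """Serialise the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["question", "answer"])
    writer.writerows(pairs)
    return buffer.getvalue()
```

Questions without a following answer are dropped, which keeps the resulting CSV free of half-filled rows.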
This application is built on top of Microsoft's MarkItDown utility. Basic Python usage:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
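The same `convert` call can be applied to a whole folder. The sketch below is an illustration rather than code from the application: `convert_folder` and its default suffix list are hypothetical, and the function accepts any object that exposes `convert(path)` returning a result with a `text_content` attribute, as `MarkItDown` does in the example above.

```python
from pathlib import Path


def convert_folder(converter, source_dir, output_dir,
                   suffixes=(".pdf", ".docx", ".pptx", ".xlsx", ".html")):
    """Convert each matching file in source_dir to a .md file in output_dir."""
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for path in sorted(source_dir.iterdir()):
        # Skip sub-directories and unsupported extensions.
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        result = converter.convert(str(path))
        target = output_dir / (path.stem + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written
```

With the library installed, `convert_folder(MarkItDown(), "docs", "markdown_out")` would write one Markdown file per input. Files that share a stem (for example `a.pdf` and `a.docx`) overwrite each other in this sketch.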
With LLM integration:
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
# Describing images requires a vision-capable model
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
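Since this project lists Ollama as its LLM backend, the `llm_client` can likely be pointed at a local Ollama server through Ollama's OpenAI-compatible endpoint. This is a sketch under assumptions, and may differ from how the application wires it up internally: it assumes Ollama is running on its default port 11434, and the model name `llava` is only an example of a vision-capable model you would need to have pulled. `build_ollama_converter` is a hypothetical helper name.

```python
# Ollama serves an OpenAI-compatible API under /v1 on its default port.
OLLAMA_BASE_URL = "http://localhost:11434/v1"


def build_ollama_converter(model="llava"):
    """Return a MarkItDown instance whose LLM calls go to a local Ollama server."""
    # Imported here so the module loads even without these packages installed.
    from markitdown import MarkItDown
    from openai import OpenAI

    # The OpenAI client requires an API key; Ollama ignores its value.
    client = OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama")
    return MarkItDown(llm_client=client, llm_model=model)


if __name__ == "__main__":
    converter = build_ollama_converter()
    print(converter.convert("example.jpg").text_content)
```

Keeping the imports inside the function is a design choice for the sketch only. In application code they would normally sit at the top of the module.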
This project is licensed under CC BY-NC-SA 4.0:
- For research and non-commercial use only
- Attribution required
- Modifications allowed with creator notification
- Commercial use requires explicit permission
Mohamed Ashour
Connect with me:
- Email: mo_ashour1@outlook.com
- LinkedIn: Mohamed Ashour
- Website: APC Mastery Path
- YouTube: APC Mastery Path
Feel free to:
- Open issues
- Submit Pull Requests
- Share improvements
- Report bugs
For major changes, please open an issue first to discuss what you would like to change.
- Microsoft's MarkItDown team for the core conversion utility
- Streamlit for the web framework
- Ollama for LLM integration capabilities