📚 MarkItDown Converter App

🔄 A Streamlit-powered application that converts various file formats to Markdown and CSV, with integrated LLM capabilities for Q&A pair generation. It is aimed at building LLM fine-tuning datasets.

🌟 Features

📄 Multi-format file conversion support:
- PDF documents
- PowerPoint presentations
- Word documents
- Excel spreadsheets
- Images (EXIF + OCR)
- Audio (EXIF + transcription)
- HTML files
- Text-based formats (CSV, JSON, XML)
- ZIP archives
🤖 LLM integration for enhanced conversion
❓ Automatic Q&A pair generation
🔤 Embedding generation support
📊 CSV export with structured data
🎯 Perfect for creating LLM fine-tuning datasets

🛠️ Prerequisites

- Python 3.10+
- Ollama (for LLM and embedding features)
- Ability to create a Python virtual environment (e.g. the built-in venv module)

🚀 Installation

Clone the repository:

git clone https://github.com/MoAshour93/Convert_PDF_Office_Files_to_MarkDown_CSV.git
cd Convert_PDF_Office_Files_to_MarkDown_CSV

Create and activate virtual environment:

python -m venv markitdown_env
# On Windows:
markitdown_env\Scripts\activate
# On Unix or macOS:
source markitdown_env/bin/activate

Install required packages:

pip install -r requirements.txt

Install Ollama (required for LLM features):

Visit Ollama.ai
Follow installation instructions for your OS
Pull an embedding model (encoder-only) with:

ollama pull {model_name}  # e.g. all-minilm, nomic-embed-text

Pull a large language model (decoder-only) with:

ollama pull {model_name}  # e.g. llama3.3, qwq
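
Once an embedding model has been pulled, a quick way to confirm that Ollama is serving it is to request an embedding over its local REST API. The sketch below is not taken from the app itself. It assumes Ollama is listening on its default address (http://localhost:11434) and uses the /api/embed endpoint; the helper names build_embed_request and embed are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434"


def build_embed_request(model: str, text: str, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/embed endpoint (hypothetical helper)."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(model: str, text: str) -> list[float]:
    """Send the request and return the first embedding vector from the response."""
    with urllib.request.urlopen(build_embed_request(model, text), timeout=30) as resp:
        body = json.load(resp)
    # /api/embed returns one vector per input under the "embeddings" key.
    return body["embeddings"][0]
```

Calling embed("nomic-embed-text", "hello") with Ollama running should return a list of floats; a connection error usually means the Ollama server is not started, and a "model not found" response means the model has not been pulled yet.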

💫 Usage

Start the application:

streamlit run MarkItDown_Conversion_App_v0.4.py

Access the web interface:

Open your browser
Navigate to http://localhost:8501

Choose your conversion path:

- File to Markdown: Direct file conversion with optional LLM enhancement
- Markdown to CSV: Extract or generate Q&A pairs from markdown
- Direct File to Q&A: Convert files directly to Q&A pairs with embeddings
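
To give a feel for what the Markdown to CSV path produces, here is a minimal sketch of extracting Q&A pairs from markdown and writing them as CSV. It is an illustration only, not the app's actual implementation: the "Q:" / "A:" line layout, the column names, and the function names extract_qa_pairs and qa_pairs_to_csv are all assumptions.

```python
import csv
import io
import re

# Assumed layout: a line starting with "Q:" immediately followed by a line starting with "A:".
QA_PATTERN = re.compile(r"^Q:[ \t]*(.+?)[ \t]*\nA:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_qa_pairs(markdown: str) -> list[tuple[str, str]]:
    """Return (question, answer) tuples found in the markdown text."""
    return [(q, a) for q, a in QA_PATTERN.findall(markdown)]


def qa_pairs_to_csv(pairs: list[tuple[str, str]]) -> str:
    """Serialize the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["question", "answer"])  # hypothetical column names
    writer.writerows(pairs)
    return buffer.getvalue()


sample = "Q: What is X?\nA: It is Y.\n\nQ: Second?\nA: Yes, it is.\n"
print(qa_pairs_to_csv(extract_qa_pairs(sample)))
```

The csv module quotes any field containing a comma, so answers with punctuation survive a round trip through spreadsheet tools and dataset loaders.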

🎓 Core Package

This application is built on top of Microsoft's MarkItDown utility. Basic Python usage:

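from markitdown import MarkItDown

# Create a converter with default settings (no LLM)
md = MarkItDown()
# Convert a file to Markdown
result = md.convert("test.xlsx")
# The converted Markdown is available as plain text
print(result.text_content)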
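
The same call extends naturally to a whole folder. The sketch below is a hedged illustration rather than code from this app: convert_folder and markitdown_converter are hypothetical helper names, and the converter is passed in as a plain callable so the folder logic can be tried without MarkItDown installed.

```python
from pathlib import Path
from typing import Callable


def convert_folder(src: Path, dst: Path, convert: Callable[[Path], str]) -> list[Path]:
    """Convert every file directly inside src and write one .md file per input into dst."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(p for p in src.iterdir() if p.is_file()):
        # Keep the original extension in the name so a.pdf and a.docx do not collide.
        out = dst / f"{path.name}.md"
        out.write_text(convert(path), encoding="utf-8")
        written.append(out)
    return written


def markitdown_converter() -> Callable[[Path], str]:
    """Wrap MarkItDown as a converter callable (imported lazily, only when used)."""
    from markitdown import MarkItDown

    md = MarkItDown()
    return lambda path: md.convert(str(path)).text_content
```

With MarkItDown installed, convert_folder(Path("docs"), Path("markdown"), markitdown_converter()) would write one Markdown file per input; the folder names here are placeholders.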

With LLM integration:

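from markitdown import MarkItDown
from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()
# Use a vision-capable model so images can be described
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)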

📜 License

This project is licensed under CC BY-NC-SA 4.0:

For research and non-commercial use only
Attribution required
Modifications allowed with creator notification
Commercial use requires explicit permission

👤 Author

Mohamed Ashour

Connect with me:

📧 Email: mo_ashour1@outlook.com
💼 LinkedIn: Mohamed Ashour
🌐 Website: APC Mastery Path
📽️ YouTube: APC Mastery Path

🤝 Contributing

Feel free to:

Open issues
Submit Pull Requests
Share improvements
Report bugs

For major changes, please open an issue first to discuss what you would like to change.

🙏 Acknowledgments

Microsoft's MarkItDown team for the core conversion utility
Streamlit for the web framework
Ollama for LLM integration capabilities
