A comprehensive Python application for collecting, processing, and querying academic content from multiple sources using RAG (Retrieval-Augmented Generation) with OpenSearch and Google Gemini.
The Multi-Modal Academic Research System is a sophisticated platform that enables researchers, students, and professionals to:
- Collect academic papers from ArXiv, PubMed Central, and Semantic Scholar
- Process PDFs with text extraction and AI-powered diagram analysis
- Index content using hybrid search (keyword + semantic) with OpenSearch
- Query your knowledge base with natural language using Google Gemini
- Track citations automatically with bibliography export (BibTeX, APA)
- Visualize your collection with interactive dashboards
- Multi-Source Collection: Papers, YouTube lectures, and podcasts
- AI-Powered Processing: Gemini Vision for diagram analysis
- Hybrid Search: BM25 + semantic vector search
- Citation Tracking: Automatic extraction and bibliography export
- Interactive UI: Gradio web interface + FastAPI REST API
- Data Visualization: Real-time statistics and analytics
- SQLite Tracking: Complete metadata and collection history
- Free Technologies: Local deployment, no cloud costs
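To give a flavor of the bibliography-export feature, here is a hypothetical sketch of BibTeX formatting; `format_bibtex` and the field names are illustrative assumptions, not the project's actual API:

```python
def format_bibtex(paper: dict) -> str:
    """Render paper metadata as a BibTeX @article entry (illustrative only)."""
    # Citation key: first author's surname plus the year, e.g. "vaswani2017"
    key = f"{paper['authors'][0].split()[-1].lower()}{paper['year']}"
    authors = " and ".join(paper["authors"])
    lines = [
        "@article{" + key + ",",
        f"  title   = {{{paper['title']}}},",
        f"  author  = {{{authors}}},",
        f"  year    = {{{paper['year']}}},",
        f"  journal = {{{paper.get('venue', 'arXiv preprint')}}}",
        "}",
    ]
    return "\n".join(lines)

entry = format_bibtex({
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer"],
    "year": 2017,
})
print(entry)
```

The same metadata dictionary could feed an APA formatter; only the string template changes.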
- Python 3.9 or higher
- Docker (for OpenSearch)
- Google Gemini API key (Get free key)
```bash
# 1. Clone the repository
git clone https://github.com/yourusername/multi-modal-academic-research-system.git
cd multi-modal-academic-research-system

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set up environment variables
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY

# 5. Start OpenSearch
docker run -p 9200:9200 -e "discovery.type=single-node" opensearchproject/opensearch:latest

# 6. Run the application
python main.py
```

The Gradio UI will open at http://localhost:7860
Detailed instructions: see the Installation Guide and Quick Start Guide
Option 1: Interactive Documentation Site (Recommended)
```bash
# Serve documentation with live search and navigation
./serve_docs.sh   # Linux/Mac
serve_docs.bat    # Windows

# Visit http://127.0.0.1:8000
```

Built with the MkDocs Material theme, featuring:
- Full-text search
- Dark/light mode
- Mobile responsive
- Auto-generated navigation
- Built-in analytics
Option 2: Static Documentation
Our documentation includes 40+ comprehensive guides totaling 31,000+ lines:
- Installation Guide - Complete setup instructions
- Quick Start - Get running in 5 minutes
- Configuration Guide - Environment and settings
- System Architecture - High-level design
- Data Flow - How data moves through the system
- Technology Stack - Technologies and rationale
- Data Collectors - ArXiv, YouTube, Podcasts
- Data Processors - PDF and video processing
- Indexing System - OpenSearch hybrid search
- Database - SQLite tracking
- API Server - FastAPI REST endpoints
- Orchestration - LangChain + Gemini
- User Interface - Gradio UI
- Collecting Papers - Step-by-step collection
- Custom Searches - Advanced queries
- Export Citations - Bibliography management
- Visualization - Analytics dashboard
- Extending System - Add new features
- Local Deployment - Development setup
- Docker Setup - Containerization
- OpenSearch - Search engine setup
- Production - Scaling and HA
- REST API - Complete API reference
- Database Schema - SQLite structure
- Troubleshooting - Common issues
- FAQ - Frequently asked questions
Supported Sources:
- ArXiv: Preprint scientific papers
- PubMed Central: Open-access biomedical papers
- Semantic Scholar: Academic search engine
- YouTube: Educational videos with transcripts
- Podcasts: RSS feed-based podcast episodes
Capabilities:
- PDF text extraction with PyMuPDF
- Diagram extraction and AI description using Gemini Vision
- Video transcript analysis
- Multi-modal content understanding
Search Strategy:
- BM25: Traditional keyword matching
- Semantic Search: Vector embeddings (384-dim)
- Field Boosting: title^3, abstract^2
- Combined Ranking: Optimized relevance
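The strategy above can be sketched as a single OpenSearch query body combining a boosted BM25 clause with a k-NN clause over the 384-dimensional embeddings. The `build_hybrid_query` helper, field names, and `embedding` vector field are assumptions for illustration; the real `opensearch_manager.py` may structure this differently.

```python
def build_hybrid_query(text: str, embedding: list[float], k: int = 10) -> dict:
    """Build an OpenSearch query body mixing keyword and vector relevance."""
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {   # BM25 keyword clause with the field boosts above
                        "multi_match": {
                            "query": text,
                            "fields": ["title^3", "abstract^2", "content"],
                        }
                    },
                    {   # semantic clause: approximate nearest neighbours
                        "knn": {
                            "embedding": {"vector": embedding, "k": k}
                        }
                    },
                ]
            }
        },
    }

# In practice the embedding would come from a SentenceTransformers model
# (384 dimensions); a zero vector stands in here.
query = build_hybrid_query("retrieval augmented generation", [0.0] * 384, k=5)
```

Because both clauses sit in a `should` block, documents matching either signal are returned, with scores combined for the final ranking.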
Features:
- Natural language queries via Google Gemini
- Automatic citation extraction
- Source tracking and attribution
- Related query suggestions
- Conversation memory
Dashboards:
- Collection statistics (by type, date, source)
- Search analytics
- Citation usage tracking
- Interactive filtering and export
```
┌──────────────────────────────────────────────────────────────────┐
│                         User Interfaces                          │
│   ┌─────────────────────┐        ┌──────────────────────────┐    │
│   │   Gradio Web UI     │        │  FastAPI Visualization   │    │
│   │   (Port 7860)       │        │  Dashboard (Port 8000)   │    │
│   └──────────┬──────────┘        └────────────┬─────────────┘    │
└──────────────┼────────────────────────────────┼──────────────────┘
               │                                │
               ▼                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                       Orchestration Layer                        │
│   ┌──────────────────────────┐   ┌──────────────────────────┐    │
│   │  Research Orchestrator   │   │     Citation Tracker     │    │
│   │  (LangChain + Gemini)    │   │  (Bibliography Export)   │    │
│   └─────────────┬────────────┘   └──────────────────────────┘    │
└─────────────────┼────────────────────────────────────────────────┘
                  │
      ┌───────────┴───────────┬─────────────────┬──────────────┐
      ▼                       ▼                 ▼              ▼
┌──────────┐            ┌──────────┐      ┌──────────┐   ┌──────────┐
│OpenSearch│            │ Database │      │Collectors│   │Processors│
│  Index   │◄───────────│  SQLite  │◄─────│  Layer   │──►│  Layer   │
│ (Vector  │            │(Tracking)│      │          │   │          │
│  Search) │            │          │      │          │   │          │
└──────────┘            └──────────┘      └────┬─────┘   └──────────┘
                                               │
                         ┌─────────────────────┼─────────────────────┐
                         ▼                     ▼                     ▼
                   ┌──────────┐          ┌──────────┐          ┌──────────┐
                   │  ArXiv   │          │ YouTube  │          │ Podcasts │
                   │   API    │          │   API    │          │   RSS    │
                   └──────────┘          └──────────┘          └──────────┘
```
```python
from multi_modal_rag.data_collectors import AcademicPaperCollector

# Initialize collector
collector = AcademicPaperCollector()

# Collect papers from ArXiv
papers = collector.collect_arxiv_papers("machine learning", max_results=20)

# Papers are automatically saved and tracked
print(f"Collected {len(papers)} papers")
```

```python
from multi_modal_rag.orchestration import ResearchOrchestrator
from multi_modal_rag.indexing import OpenSearchManager

# Initialize components
opensearch = OpenSearchManager()
orchestrator = ResearchOrchestrator("your-gemini-api-key", opensearch)

# Query the system
result = orchestrator.process_query(
    "What is retrieval-augmented generation?",
    "research_assistant"
)

print("Answer:", result['answer'])
print("Citations:", result['citations'])
print("Related Queries:", result['related_queries'])
```

```python
import requests

# Get collection statistics
response = requests.get("http://localhost:8000/api/statistics")
stats = response.json()

print(f"Total papers: {stats['by_type']['paper']}")
print(f"Total videos: {stats['by_type']['video']}")
print(f"Indexed items: {stats['indexed']}")

# Search collections
response = requests.get(
    "http://localhost:8000/api/search",
    params={"q": "transformers", "limit": 10}
)
results = response.json()
```

- Python 3.9+ - Main programming language
- OpenSearch - Search and vector database
- Google Gemini - AI generation and vision analysis
- SQLite - Metadata tracking
- FastAPI - REST API framework
- Gradio - Web UI framework
- LangChain - AI orchestration
- SentenceTransformers - Semantic embeddings
- PyMuPDF - PDF processing
- yt-dlp - YouTube data extraction
- arxiv - ArXiv API client
- Total Code: ~3,000 lines of Python
- Documentation: 40 markdown files, 31,000+ lines
- Modules: 7 core modules
- API Endpoints: 6 REST endpoints
- Supported Sources: 5+ data sources
- Error Handling: Comprehensive error handling throughout
```
multi-modal-academic-research-system/
├── main.py                         # Application entry point
├── start_api_server.py             # FastAPI server launcher
├── requirements.txt                # Python dependencies
├── .env.example                    # Environment template
├── CLAUDE.md                       # Claude Code instructions
│
├── multi_modal_rag/                # Main package
│   ├── data_collectors/            # Data collection modules
│   │   ├── paper_collector.py      # ArXiv, PubMed, Scholar
│   │   ├── youtube_collector.py    # YouTube videos
│   │   └── podcast_collector.py    # Podcast RSS feeds
│   │
│   ├── data_processors/            # Content processing
│   │   ├── pdf_processor.py        # PDF extraction + Gemini Vision
│   │   └── video_processor.py      # Video analysis
│   │
│   ├── indexing/                   # Search infrastructure
│   │   └── opensearch_manager.py   # Hybrid search engine
│   │
│   ├── database/                   # Data tracking
│   │   └── db_manager.py           # SQLite manager
│   │
│   ├── api/                        # REST API
│   │   ├── api_server.py           # FastAPI server
│   │   └── static/                 # Visualization dashboard
│   │       └── visualization.html
│   │
│   ├── orchestration/              # Query pipeline
│   │   ├── research_orchestrator.py  # LangChain integration
│   │   └── citation_tracker.py     # Citation management
│   │
│   ├── ui/                         # User interface
│   │   └── gradio_app.py           # Gradio UI
│   │
│   └── logging_config.py           # Logging setup
│
├── data/                           # Data storage
│   ├── papers/                     # Downloaded PDFs
│   ├── videos/                     # Video metadata
│   ├── podcasts/                   # Podcast data
│   ├── processed/                  # Processed content
│   └── collections.db              # SQLite database
│
├── logs/                           # Application logs
│
└── docs/                           # Comprehensive documentation
    ├── README.md                   # Documentation index
    ├── architecture/               # System design
    ├── modules/                    # Module documentation
    ├── setup/                      # Installation & config
    ├── tutorials/                  # Step-by-step guides
    ├── deployment/                 # Deployment guides
    ├── database/                   # Database reference
    ├── api/                        # API reference
    ├── troubleshooting/            # Problem solving
    └── advanced/                   # Advanced topics
```
Detailed Project Structure →
```bash
# Required
GEMINI_API_KEY=your_api_key_here

# Optional (defaults shown)
OPENSEARCH_HOST=localhost
OPENSEARCH_PORT=9200
```

Quick Start (Docker):

```bash
docker run -p 9200:9200 \
  -e "discovery.type=single-node" \
  opensearchproject/opensearch:latest
```

```bash
python main.py              # Gradio UI on port 7860
python start_api_server.py  # FastAPI on port 8000
```

```bash
docker-compose up -d
```

- Load balancing with Nginx
- Multi-node OpenSearch cluster
- Redis caching layer
- Automated backups
- Indexing Speed: 10-50 documents/second (bulk)
- Query Latency: 1-3 seconds (including LLM)
- Embedding Generation: ~50ms per document
- Database Queries: <10ms
- Storage: ~1MB per paper (PDF + metadata + embeddings)
- API keys stored in `.env` (gitignored)
- Local-only OpenSearch deployment
- CORS configured for localhost
- Input validation on all endpoints
- SQL injection prevention via parameterized queries
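The parameterized-query pattern mentioned above, illustrated with an in-memory SQLite database (the table and columns here are hypothetical, not the project's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO papers (title) VALUES (?)", ("RAG survey",))

# User input is passed as a bound parameter, never interpolated into the SQL
# string, so a malicious value is treated as data rather than executable SQL.
user_input = "RAG'; DROP TABLE papers; --"
rows = conn.execute(
    "SELECT id, title FROM papers WHERE title = ?", (user_input,)
).fetchall()
print(rows)  # [] — no match, and the papers table is untouched
```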
OpenSearch won't connect

```bash
# Check if OpenSearch is running
curl -X GET "localhost:9200"

# Restart OpenSearch
docker restart opensearch
```

Gemini API errors

- Verify API key in `.env`
- Check rate limits
- Ensure internet connection

Import errors

```bash
# Reinstall dependencies
pip install -r requirements.txt --force-reinstall
```

Complete Troubleshooting Guide →
- Quick Start Guide - Get started in 5 minutes
- Collecting Papers Tutorial - First data collection
- UI Guide - Navigate the interface
- Architecture Overview - System design
- Module Documentation - Detailed API reference
- Extending Guide - Add new features
- Hybrid Search Algorithm - Search internals
- Performance Optimization - Speed improvements
- Custom Collectors - Add data sources
We welcome contributions! Here's how to get started:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest

# Format code
black .

# Lint
flake8 .
```

This project is licensed under the MIT License - see the LICENSE file for details.
- OpenSearch - Powerful search and analytics
- LangChain - AI orchestration framework
- Google Gemini - Advanced AI capabilities
- Gradio - Beautiful UI components
- ArXiv - Open-access scientific papers
- Semantic Scholar - Academic search engine
- YouTube - Educational video content
- PubMed Central - Biomedical literature
- Documentation: docs/README.md
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- [x] Multi-source data collection
- [x] Hybrid search with OpenSearch
- [x] Gemini integration
- [x] Citation tracking
- [x] Visualization dashboard
- [ ] Collaborative features (shared collections)
- [ ] Advanced analytics (trends, network graphs)
- [ ] Mobile-responsive UI
- [ ] Batch processing improvements
- [ ] Multi-language support
- [ ] Distributed search cluster
- [ ] Real-time collaboration
- [ ] Plugin architecture
- [ ] Advanced ML features
- [ ] Cloud deployment options
If you find this project useful, please consider giving it a star!

Made with ❤️ for the research community