Multi-Modal Academic Research System

A comprehensive Python application for collecting, processing, and querying academic content from multiple sources using RAG (Retrieval-Augmented Generation) with OpenSearch and Google Gemini.


🎯 Overview

The Multi-Modal Academic Research System is a sophisticated platform that enables researchers, students, and professionals to:

  • Collect academic papers from ArXiv, PubMed Central, and Semantic Scholar
  • Process PDFs with text extraction and AI-powered diagram analysis
  • Index content using hybrid search (keyword + semantic) with OpenSearch
  • Query your knowledge base with natural language using Google Gemini
  • Track citations automatically with bibliography export (BibTeX, APA)
  • Visualize your collection with interactive dashboards

Key Features

  • ✅ Multi-Source Collection: Papers, YouTube lectures, and podcasts
  • ✅ AI-Powered Processing: Gemini Vision for diagram analysis
  • ✅ Hybrid Search: BM25 + semantic vector search
  • ✅ Citation Tracking: Automatic extraction and bibliography export
  • ✅ Interactive UI: Gradio web interface + FastAPI REST API
  • ✅ Data Visualization: Real-time statistics and analytics
  • ✅ SQLite Tracking: Complete metadata and collection history
  • ✅ Free Technologies: Local deployment, no cloud costs

🚀 Quick Start

Prerequisites

  • Python 3.9 or higher
  • Docker (for OpenSearch)
  • Google Gemini API key (Get free key)

Installation

# 1. Clone the repository
git clone https://github.com/yourusername/multi-modal-academic-research-system.git
cd multi-modal-academic-research-system

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set up environment variables
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY

# 5. Start OpenSearch
docker run -p 9200:9200 -e "discovery.type=single-node" opensearchproject/opensearch:latest

# 6. Run the application
python main.py

The Gradio UI will open at http://localhost:7860

📖 Detailed Instructions: See Installation Guide and Quick Start Guide

📚 Documentation

📘 Documentation Options

Option 1: Interactive Documentation Site (Recommended)

# Serve documentation with live search and navigation
./serve_docs.sh  # Linux/Mac
serve_docs.bat   # Windows

# Visit http://127.0.0.1:8000

Built with MkDocs Material theme featuring:

  • πŸ” Full-text search
  • 🎨 Dark/light mode
  • πŸ“± Mobile responsive
  • πŸ”— Auto-generated navigation
  • πŸ“Š Built-in analytics

Option 2: Static Documentation

View Full Documentation →

Our documentation includes 40+ comprehensive guides totaling 31,000+ lines:

  • Getting Started
  • Architecture
  • Core Modules
  • Tutorials
  • Deployment
  • Reference

🎨 Features

1. Multi-Source Data Collection

Supported Sources:

  • ArXiv: Preprint scientific papers
  • PubMed Central: Open-access biomedical papers
  • Semantic Scholar: Academic search engine
  • YouTube: Educational videos with transcripts
  • Podcasts: RSS feed-based podcast episodes

Learn More →
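
To get a feel for what the paper collector wraps, here is a minimal standalone sketch using the arxiv client library from the technology stack; it queries ArXiv directly and is independent of this repository's AcademicPaperCollector:

import arxiv

# Query ArXiv via the official arxiv Python client
client = arxiv.Client()
search = arxiv.Search(
    query="retrieval augmented generation",
    max_results=5,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

for paper in client.results(search):
    # Each result exposes title, authors, abstract, and a PDF link
    print(paper.title, paper.entry_id)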

2. AI-Powered Processing

Capabilities:

  • PDF text extraction with PyMuPDF
  • Diagram extraction and AI description using Gemini Vision
  • Video transcript analysis
  • Multi-modal content understanding

Learn More →
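
The processing pipeline pairs PyMuPDF extraction with a vision-capable Gemini model. The sketch below shows the general pattern; the model name, prompt, and file path are illustrative assumptions rather than what pdf_processor.py necessarily uses:

import io
import os
import fitz  # PyMuPDF
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

doc = fitz.open("data/papers/example.pdf")
text = "".join(page.get_text() for page in doc)  # plain-text extraction

# Pull embedded images and ask Gemini to describe the first one found
images = [img for page in doc for img in page.get_images()]
if images:
    image_bytes = doc.extract_image(images[0][0])["image"]
    diagram = Image.open(io.BytesIO(image_bytes))
    response = model.generate_content(
        ["Describe this diagram from an academic paper.", diagram]
    )
    print(response.text)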

3. Hybrid Search Engine

Search Strategy:

  • BM25: Traditional keyword matching
  • Semantic Search: Vector embeddings (384-dim)
  • Field Boosting: title^3, abstract^2
  • Combined Ranking: Optimized relevance

Learn More →
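
Conceptually, a hybrid query pairs a boosted BM25 clause with a k-NN clause over the 384-dimensional embeddings and lets OpenSearch merge the scores. A rough sketch of such a request (the index name, field names, and embedding model are assumptions, not necessarily what opensearch_manager.py uses):

from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # a common 384-dim model

query_text = "retrieval augmented generation"
query_vector = encoder.encode(query_text).tolist()

body = {
    "query": {
        "bool": {
            "should": [
                # BM25 keyword clause with field boosting (title^3, abstract^2)
                {"multi_match": {
                    "query": query_text,
                    "fields": ["title^3", "abstract^2", "content"],
                }},
                # Semantic clause over the embedding (knn_vector) field
                {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
            ]
        }
    }
}
results = client.search(index="research_assistant", body=body)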

4. Intelligent Query System

Features:

  • Natural language queries via Google Gemini
  • Automatic citation extraction
  • Source tracking and attribution
  • Related query suggestions
  • Conversation memory

Learn More →

5. Data Visualization

Dashboards:

  • Collection statistics (by type, date, source)
  • Search analytics
  • Citation usage tracking
  • Interactive filtering and export

Learn More →

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         User Interfaces                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚   Gradio Web UI     β”‚         β”‚  FastAPI Visualization   β”‚  β”‚
β”‚  β”‚  (Port 7860)        β”‚         β”‚  Dashboard (Port 8000)   β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚                              β”‚
              β–Ό                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Orchestration Layer                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  Research Orchestrator   β”‚   β”‚   Citation Tracker       β”‚   β”‚
β”‚  β”‚  (LangChain + Gemini)    β”‚   β”‚   (Bibliography Export)  β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β–Ό                       β–Ό                 β–Ό              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚OpenSearchβ”‚         β”‚ Database β”‚      β”‚Collectorsβ”‚   β”‚Processorsβ”‚
β”‚  Index  │◄─────────│ SQLite   │◄─────│  Layer   │◄──│  Layer   β”‚
β”‚ (Vector β”‚         β”‚(Tracking)β”‚      β”‚          β”‚   β”‚          β”‚
β”‚ Search) β”‚         β”‚          β”‚      β”‚          β”‚   β”‚          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚
                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                          β–Ό                  β–Ό                  β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚  ArXiv   β”‚      β”‚ YouTube  β”‚      β”‚ Podcasts β”‚
                    β”‚   API    β”‚      β”‚   API    β”‚      β”‚   RSS    β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Detailed Architecture β†’

💻 Usage Examples

Collecting Papers

from multi_modal_rag.data_collectors import AcademicPaperCollector

# Initialize collector
collector = AcademicPaperCollector()

# Collect papers from ArXiv
papers = collector.collect_arxiv_papers("machine learning", max_results=20)

# Papers are automatically saved and tracked
print(f"Collected {len(papers)} papers")

Querying the System

from multi_modal_rag.orchestration import ResearchOrchestrator
from multi_modal_rag.indexing import OpenSearchManager

# Initialize components
opensearch = OpenSearchManager()
orchestrator = ResearchOrchestrator("your-gemini-api-key", opensearch)

# Query the system
result = orchestrator.process_query(
    "What is retrieval-augmented generation?",
    "research_assistant"
)

print("Answer:", result['answer'])
print("Citations:", result['citations'])
print("Related Queries:", result['related_queries'])

Using the REST API

import requests

# Get collection statistics
response = requests.get("http://localhost:8000/api/statistics")
stats = response.json()

print(f"Total papers: {stats['by_type']['paper']}")
print(f"Total videos: {stats['by_type']['video']}")
print(f"Indexed items: {stats['indexed']}")

# Search collections
response = requests.get(
    "http://localhost:8000/api/search",
    params={"q": "transformers", "limit": 10}
)
results = response.json()

More Examples →

🛠️ Technology Stack

Core Technologies

  • Python 3.9+ - Main programming language
  • OpenSearch - Search and vector database
  • Google Gemini - AI generation and vision analysis
  • SQLite - Metadata tracking
  • FastAPI - REST API framework
  • Gradio - Web UI framework

Key Libraries

  • LangChain - AI orchestration
  • SentenceTransformers - Semantic embeddings
  • PyMuPDF - PDF processing
  • yt-dlp - YouTube data extraction
  • arxiv - ArXiv API client

Full Technology Stack →

📊 Project Statistics

  • Total Code: ~3,000 lines of Python
  • Documentation: 40 markdown files, 31,000+ lines
  • Modules: 7 core modules
  • API Endpoints: 6 REST endpoints
  • Supported Sources: 5+ data sources
  • Error Handling: Comprehensive error handling throughout

πŸ—‚οΈ Project Structure

multi-modal-academic-research-system/
β”œβ”€β”€ main.py                          # Application entry point
β”œβ”€β”€ start_api_server.py              # FastAPI server launcher
β”œβ”€β”€ requirements.txt                 # Python dependencies
β”œβ”€β”€ .env.example                     # Environment template
β”œβ”€β”€ CLAUDE.md                        # Claude Code instructions
β”‚
β”œβ”€β”€ multi_modal_rag/                 # Main package
β”‚   β”œβ”€β”€ data_collectors/             # Data collection modules
β”‚   β”‚   β”œβ”€β”€ paper_collector.py       # ArXiv, PubMed, Scholar
β”‚   β”‚   β”œβ”€β”€ youtube_collector.py     # YouTube videos
β”‚   β”‚   └── podcast_collector.py     # Podcast RSS feeds
β”‚   β”‚
β”‚   β”œβ”€β”€ data_processors/             # Content processing
β”‚   β”‚   β”œβ”€β”€ pdf_processor.py         # PDF extraction + Gemini Vision
β”‚   β”‚   └── video_processor.py       # Video analysis
β”‚   β”‚
β”‚   β”œβ”€β”€ indexing/                    # Search infrastructure
β”‚   β”‚   └── opensearch_manager.py    # Hybrid search engine
β”‚   β”‚
β”‚   β”œβ”€β”€ database/                    # Data tracking
β”‚   β”‚   └── db_manager.py            # SQLite manager
β”‚   β”‚
β”‚   β”œβ”€β”€ api/                         # REST API
β”‚   β”‚   β”œβ”€β”€ api_server.py            # FastAPI server
β”‚   β”‚   └── static/                  # Visualization dashboard
β”‚   β”‚       └── visualization.html
β”‚   β”‚
β”‚   β”œβ”€β”€ orchestration/               # Query pipeline
β”‚   β”‚   β”œβ”€β”€ research_orchestrator.py # LangChain integration
β”‚   β”‚   └── citation_tracker.py      # Citation management
β”‚   β”‚
β”‚   β”œβ”€β”€ ui/                          # User interface
β”‚   β”‚   └── gradio_app.py            # Gradio UI
β”‚   β”‚
β”‚   └── logging_config.py            # Logging setup
β”‚
β”œβ”€β”€ data/                            # Data storage
β”‚   β”œβ”€β”€ papers/                      # Downloaded PDFs
β”‚   β”œβ”€β”€ videos/                      # Video metadata
β”‚   β”œβ”€β”€ podcasts/                    # Podcast data
β”‚   β”œβ”€β”€ processed/                   # Processed content
β”‚   └── collections.db               # SQLite database
β”‚
β”œβ”€β”€ logs/                            # Application logs
β”‚
└── docs/                            # Comprehensive documentation
    β”œβ”€β”€ README.md                    # Documentation index
    β”œβ”€β”€ architecture/                # System design
    β”œβ”€β”€ modules/                     # Module documentation
    β”œβ”€β”€ setup/                       # Installation & config
    β”œβ”€β”€ tutorials/                   # Step-by-step guides
    β”œβ”€β”€ deployment/                  # Deployment guides
    β”œβ”€β”€ database/                    # Database reference
    β”œβ”€β”€ api/                         # API reference
    β”œβ”€β”€ troubleshooting/             # Problem solving
    └── advanced/                    # Advanced topics

Detailed Project Structure β†’

🔧 Configuration

Environment Variables

# Required
GEMINI_API_KEY=your_api_key_here

# Optional (defaults shown)
OPENSEARCH_HOST=localhost
OPENSEARCH_PORT=9200
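
A minimal sketch of reading these values at startup, assuming python-dotenv or a similar loader (the exact mechanism in main.py may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]                  # required
OPENSEARCH_HOST = os.getenv("OPENSEARCH_HOST", "localhost")    # optional
OPENSEARCH_PORT = int(os.getenv("OPENSEARCH_PORT", "9200"))    # optional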

OpenSearch Setup

Quick Start (Docker):

docker run -p 9200:9200 \
  -e "discovery.type=single-node" \
  opensearchproject/opensearch:latest

Complete OpenSearch Setup →
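
Once the container is up, you can also verify connectivity from Python. The plain-HTTP form below assumes the security plugin is disabled; newer OpenSearch images may require HTTPS and admin credentials instead:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
print(client.info())  # prints cluster name and version if the node is reachable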

🚢 Deployment Options

Local Development

python main.py  # Gradio UI on port 7860
python start_api_server.py  # FastAPI on port 8000

Docker Deployment

docker-compose up -d

Production Deployment

  • Load balancing with Nginx
  • Multi-node OpenSearch cluster
  • Redis caching layer
  • Automated backups

Deployment Guides →

📈 Performance

  • Indexing Speed: 10-50 documents/second (bulk)
  • Query Latency: 1-3 seconds (including LLM)
  • Embedding Generation: ~50ms per document
  • Database Queries: <10ms
  • Storage: ~1MB per paper (PDF + metadata + embeddings)

Performance Optimization →
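
The bulk indexing figure above assumes documents are sent in batches rather than one request per document. A sketch of batch indexing with the opensearch-py bulk helper (the index and field names are illustrative):

from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
docs = [{"title": f"Paper {i}", "abstract": "..."} for i in range(500)]

# One bulk request per batch instead of 500 individual index calls
actions = ({"_index": "research_assistant", "_source": doc} for doc in docs)
success, errors = helpers.bulk(client, actions)
print(f"Indexed {success} documents")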

🔒 Security

  • API keys stored in .env (gitignored)
  • Local-only OpenSearch deployment
  • CORS configured for localhost
  • Input validation on all endpoints
  • SQL injection prevention via parameterized queries

Security Guide →
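
The SQL injection point above is the standard sqlite3 pattern: user input is bound as a parameter rather than interpolated into the SQL string. A minimal illustration (the table and column names are placeholders, not the actual collections.db schema):

import sqlite3

conn = sqlite3.connect("data/collections.db")
user_input = "transformers'; DROP TABLE papers; --"

# Safe: the ? placeholder is bound by the driver, so the payload stays inert
rows = conn.execute(
    "SELECT title FROM papers WHERE title LIKE ?",
    (f"%{user_input}%",),
).fetchall()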

πŸ› Troubleshooting

Common Issues

OpenSearch won't connect

# Check if OpenSearch is running
curl -X GET "localhost:9200"

# Restart OpenSearch
docker restart opensearch

Gemini API errors

  • Verify API key in .env
  • Check rate limits
  • Ensure internet connection

Import errors

# Reinstall dependencies
pip install -r requirements.txt --force-reinstall

Complete Troubleshooting Guide →

📚 Learning Resources

For Beginners

  1. Quick Start Guide - Get started in 5 minutes
  2. Collecting Papers Tutorial - First data collection
  3. UI Guide - Navigate the interface

For Developers

  1. Architecture Overview - System design
  2. Module Documentation - Detailed API reference
  3. Extending Guide - Add new features

For Advanced Users

  1. Hybrid Search Algorithm - Search internals
  2. Performance Optimization - Speed improvements
  3. Custom Collectors - Add data sources

🤝 Contributing

We welcome contributions! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest

# Format code
black .

# Lint
flake8 .

Contributing Guide →

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

Open Source Projects

  • OpenSearch - Powerful search and analytics
  • LangChain - AI orchestration framework
  • Google Gemini - Advanced AI capabilities
  • Gradio - Beautiful UI components

Data Sources

  • ArXiv - Open-access scientific papers
  • Semantic Scholar - Academic search engine
  • YouTube - Educational video content
  • PubMed Central - Biomedical literature

📞 Support & Contact

🗺️ Roadmap

Version 1.x (Current)

  • ✅ Multi-source data collection
  • ✅ Hybrid search with OpenSearch
  • ✅ Gemini integration
  • ✅ Citation tracking
  • ✅ Visualization dashboard

Version 2.0 (Planned)

  • 🔲 Collaborative features (shared collections)
  • 🔲 Advanced analytics (trends, network graphs)
  • 🔲 Mobile-responsive UI
  • 🔲 Batch processing improvements
  • 🔲 Multi-language support

Version 3.0 (Future)

  • 🔲 Distributed search cluster
  • 🔲 Real-time collaboration
  • 🔲 Plugin architecture
  • 🔲 Advanced ML features
  • 🔲 Cloud deployment options

Full Roadmap →

⭐ Star History

If you find this project useful, please consider giving it a star! ⭐


Made with ❤️ for the research community

📖 Read the Full Documentation →
