A comprehensive AI-powered backend system for identifying and analyzing scientific papers that contain curatable microbiome signatures (curation readiness assessment).
✅ Tested Setup: This project has been successfully built and tested on Ubuntu Linux with Docker. See SETUP_GUIDE.md for verified setup steps.
BioAnalyzer Backend is a specialized system that combines advanced AI analysis with comprehensive PubMed data retrieval to evaluate scientific papers for BugSigDB curation readiness. The system extracts 6 essential fields required for microbial signature curation and provides full text retrieval capabilities.
- 🔬 Paper Analysis: Extract 6 essential BugSigDB fields using AI
- 📥 Full Text Retrieval: Comprehensive PubMed and PMC data retrieval
- 🌐 REST API: Clean, well-documented REST endpoints
- 💻 CLI Tool: User-friendly command-line interface
- 📊 Multiple Formats: JSON, CSV, XML, and table output formats
- ⚡ Batch Processing: Analyze multiple papers simultaneously
- 🔧 Docker Support: Containerized deployment
- 📈 Monitoring: Health checks and performance metrics
```
┌──────────────────────────────────────────────────┐
│               BioAnalyzer Backend                │
├──────────────────────────────────────────────────┤
│  CLI Interface (cli.py)                          │
│  ├── Analysis Commands                           │
│  ├── Retrieval Commands                          │
│  └── System Management                           │
├──────────────────────────────────────────────────┤
│  API Layer (app/api/)                            │
│  ├── FastAPI Application                         │
│  ├── Router Modules                              │
│  └── Request/Response Models                     │
├──────────────────────────────────────────────────┤
│  Service Layer (app/services/)                   │
│  ├── PubMedRetriever                             │
│  ├── PubMedRetrievalService                      │
│  ├── StandalonePubMedRetriever                   │
│  └── BugSigDBAnalyzer                            │
├──────────────────────────────────────────────────┤
│  Model Layer (app/models/)                       │
│  ├── GeminiQA                                    │
│  ├── UnifiedQA                                   │
│  └── Configuration                               │
├──────────────────────────────────────────────────┤
│  Utility Layer (app/utils/)                      │
│  ├── Configuration Management                    │
│  ├── Text Processing                             │
│  └── Performance Logging                         │
└──────────────────────────────────────────────────┘
```
- Input: PMID(s) via CLI or API
- Retrieval: Fetch metadata and full text from PubMed/PMC
- Analysis: AI-powered field extraction using Gemini
- Processing: Format and validate results
- Output: Structured data in multiple formats
- Docker (recommended): version 20.0+ with Docker Compose support
- Python 3.8+ (for local installation)
- NCBI API key (optional, for higher rate limits)
- Google Gemini API key (optional, for AI analysis)
This is the recommended approach as it avoids Python environment conflicts and provides a clean, isolated setup.
```bash
# 1. Navigate to the project directory
cd /path/to/bioanalyzer-backend

# 2. Install CLI commands system-wide
chmod +x install.sh
./install.sh

# 3. Build Docker image
docker compose build

# 4. Start the application
docker compose up -d

# 5. Verify installation
docker compose ps
curl http://localhost:8000/health
```

Expected Output:

```json
{"status":"healthy","timestamp":"2025-10-23T17:52:40.249451+00:00","version":"1.0.0"}
```

Alternatively, for a local (non-Docker) installation:

```bash
# Clone and setup
git clone https://github.com/waldronlab/bioanalyzer-backend.git
cd bioanalyzer-backend
# Create virtual environment (if python3-venv is available)
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r config/requirements.txt
pip install -e .
# Set up environment (optional)
cp .env.example .env
# Edit .env with your API keys
```

After installation, verify the system is working:

```bash
# 1. Check Docker container status
docker compose ps
# 2. Test API health
curl http://localhost:8000/health
# 3. Test CLI commands (add to PATH first)
export PATH="$PATH:$HOME/.local/bin"
BioAnalyzer fields
BioAnalyzer status
# 4. View API documentation
# Open browser: http://localhost:8000/docs
```

System management:

```bash
BioAnalyzer build      # Build Docker containers
BioAnalyzer start # Start the application
BioAnalyzer stop # Stop the application
BioAnalyzer restart # Restart the application
BioAnalyzer status     # Check system status
```

Analysis:

```bash
BioAnalyzer analyze 12345678            # Analyze single paper
BioAnalyzer analyze 12345678,87654321 # Analyze multiple papers
BioAnalyzer analyze --file pmids.txt # Analyze from file
BioAnalyzer fields                      # Show field information
```

Retrieval:

```bash
BioAnalyzer retrieve 12345678                      # Retrieve single paper
BioAnalyzer retrieve 12345678,87654321 # Retrieve multiple papers
BioAnalyzer retrieve --file pmids.txt # Retrieve from file
BioAnalyzer retrieve 12345678 --save # Save individual files
BioAnalyzer retrieve 12345678 --format json # JSON output
BioAnalyzer retrieve 12345678 --output results.csv # Save to file
```

Analysis endpoints:

```
GET /api/v1/analyze/{pmid}   # Analyze paper for BugSigDB fields
GET /api/v1/fields # Get field information
GET /health                  # System health check
```

Retrieval endpoints:

```
GET  /api/v1/retrieve/{pmid}          # Retrieve full paper data
POST /api/v1/retrieve/batch # Batch retrieval
GET  /api/v1/retrieve/search?q=query  # Search papers
```

Once started, access:
- Main Interface: http://localhost:3000
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
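
For programmatic access, here is a minimal Python client sketch; it assumes the server is running locally on the default port and uses only the endpoints listed above (the `requests` library is the one dependency assumed beyond this README):

```python
import requests

BASE_URL = "http://localhost:8000"

# Check that the API is up before sending work to it.
health = requests.get(f"{BASE_URL}/health", timeout=10)
health.raise_for_status()
print(health.json()["status"])  # expected: "healthy"

# Analyze a single paper by PMID via the analysis endpoint.
resp = requests.get(f"{BASE_URL}/api/v1/analyze/12345678", timeout=60)
resp.raise_for_status()
for name, field in resp.json().get("fields", {}).items():
    print(f"{name}: {field.get('status')} (confidence {field.get('confidence')})")
```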
| Variable | Description | Required | Default |
|---|---|---|---|
| `NCBI_API_KEY` | NCBI API key for higher rate limits | No | - |
| `GEMINI_API_KEY` | Google Gemini API key for AI analysis | No | - |
| `EMAIL` | Contact email for API requests | No | bioanalyzer@example.com |
| `USE_FULLTEXT` | Enable full text retrieval | No | true |
| `API_TIMEOUT` | API request timeout (seconds) | No | 30 |
| `NCBI_RATE_LIMIT_DELAY` | Rate limiting delay (seconds) | No | 0.34 |
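
As an illustration of how these variables map into the application, a minimal loader sketch using the defaults from the table (the real configuration logic lives in `app/utils/config.py` and may be structured differently):

```python
import os

# Illustrative only -- mirrors the table above, not the actual loader
# in app/utils/config.py.
NCBI_API_KEY = os.environ.get("NCBI_API_KEY")        # optional
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY")    # optional
EMAIL = os.environ.get("EMAIL", "bioanalyzer@example.com")
USE_FULLTEXT = os.environ.get("USE_FULLTEXT", "true").lower() == "true"
API_TIMEOUT = int(os.environ.get("API_TIMEOUT", "30"))
NCBI_RATE_LIMIT_DELAY = float(os.environ.get("NCBI_RATE_LIMIT_DELAY", "0.34"))
```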
- `config/requirements.txt`: Python dependencies
- `app/utils/config.py`: Application configuration
- `docker-compose.yml`: Docker services configuration
The system analyzes papers for these critical fields:
- 🧬 Host Species: The organism being studied (Human, Mouse, Rat, etc.)
- 📍 Body Site: Sample collection location (Gut, Oral, Skin, etc.)
- 🏥 Condition: Disease/treatment/exposure being studied
- 🔬 Sequencing Type: Molecular method used (16S, metagenomics, etc.)
- 🌳 Taxa Level: Taxonomic level analyzed (phylum, genus, species, etc.)
- 👥 Sample Size: Number of samples or participants
- ✅ PRESENT: Information about the microbiome signature is complete and clear
- ⚠️ PARTIALLY_PRESENT: Some information is available but incomplete
- ❌ ABSENT: Information is missing
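
To make the three statuses concrete, here is a sketch of how a single extracted field could be modeled, matching the `status`/`value`/`confidence` shape of the example API response later in this README (the class names are illustrative, not the project's actual models):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FieldStatus(str, Enum):
    PRESENT = "PRESENT"                      # complete and clear
    PARTIALLY_PRESENT = "PARTIALLY_PRESENT"  # available but incomplete
    ABSENT = "ABSENT"                        # missing from the paper

@dataclass
class FieldResult:
    status: FieldStatus
    value: Optional[str]  # e.g. "Human" for host_species
    confidence: float     # 0.0-1.0, as in the example API response

# A clearly reported host species:
host_species = FieldResult(FieldStatus.PRESENT, "Human", 0.95)
```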
**PubMedRetriever**
- Purpose: Core PubMed data retrieval
- Features: Metadata extraction, PMC full text retrieval
- Dependencies: requests, xml.etree.ElementTree
- Rate Limiting: NCBI-compliant request throttling (see the sketch after this list)
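
As a rough illustration of NCBI-compliant throttling (the 0.34 s delay matches the `NCBI_RATE_LIMIT_DELAY` default above, i.e. about 3 requests/second without an API key), a minimal sketch; this is an assumption about the approach, not the class's actual code:

```python
import time
import requests

NCBI_RATE_LIMIT_DELAY = 0.34  # seconds; ~3 requests/second without an API key
_last_request = 0.0

def throttled_get(url: str, **kwargs) -> requests.Response:
    """GET with a minimum delay between requests to stay under NCBI limits."""
    global _last_request
    wait = NCBI_RATE_LIMIT_DELAY - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()
    return requests.get(url, timeout=30, **kwargs)
```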
**PubMedRetrievalService**
- Purpose: High-level paper retrieval service
- Features: Batch processing, file operations, result formatting
- Dependencies: PubMedRetriever
- Error Handling: Comprehensive error management
**StandalonePubMedRetriever**
- Purpose: Lightweight retrieval without the full service stack
- Features: Independent operation, minimal dependencies
- Use Case: CLI operations, standalone scripts
- Dependencies: requests only
**BugSigDBAnalyzer**
- Purpose: AI-powered field extraction
- Features: Gemini integration, field analysis, confidence scoring
- Dependencies: google-generativeai, app.models
- Output: Structured field data with confidence scores
```
Input (PMID) → Retrieval → Analysis → Processing → Output
      ↓             ↓           ↓           ↓          ↓
   CLI/API   PubMedRetriever GeminiQA   Formatter  JSON/CSV/Table/XML
```
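
The same flow as a runnable Python skeleton; the stage functions here are stubs standing in for `PubMedRetriever`, `BugSigDBAnalyzer`, and the output formatters:

```python
import json
from typing import Any, Dict

def retrieve_paper(pmid: str) -> Dict[str, Any]:
    """Stub for the retrieval stage (PubMedRetriever in the real system)."""
    return {"pmid": pmid, "title": "...", "full_text": "..."}

def extract_fields(paper: Dict[str, Any]) -> Dict[str, Any]:
    """Stub for the analysis stage (BugSigDBAnalyzer / GeminiQA)."""
    return {"host_species": {"status": "ABSENT", "confidence": 0.0}}

def analyze_pmid(pmid: str) -> str:
    """Input -> Retrieval -> Analysis -> Processing -> Output."""
    paper = retrieve_paper(pmid)
    fields = extract_fields(paper)
    return json.dumps({"pmid": paper["pmid"], "fields": fields}, indent=2)

print(analyze_pmid("12345678"))
```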
- Network Errors: Retry with exponential backoff (see the sketch after this list)
- API Errors: Graceful degradation with fallback methods
- Parsing Errors: Error reporting with context
- Missing Data: Clear indication of unavailable information
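
For the network-error strategy above, a minimal retry-with-exponential-backoff sketch (illustrative; the project's actual retry logic may differ):

```python
import time
import requests

def get_with_backoff(url: str, max_retries: int = 3) -> requests.Response:
    """Retry transient network failures, waiting 1s, 2s, 4s between attempts."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
```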
```bash
curl -X GET "http://localhost:8000/api/v1/analyze/12345678"
```

```bash
curl -X GET "http://localhost:8000/api/v1/retrieve/12345678"
```

```bash
curl -X POST "http://localhost:8000/api/v1/retrieve/batch" \
  -H "Content-Type: application/json" \
  -d '{"pmids": ["12345678", "87654321"]}'
```

```json
{
  "pmid": "12345678",
  "title": "Gut microbiome analysis in patients with IBD",
  "abstract": "This study examines...",
  "journal": "Nature Medicine",
  "authors": ["Smith J", "Doe A"],
  "publication_date": "2023",
  "full_text": "Complete paper text...",
  "has_full_text": true,
  "fields": {
    "host_species": {
      "status": "PRESENT",
      "value": "Human",
      "confidence": 0.95
    },
    "body_site": {
      "status": "PRESENT",
      "value": "Gut",
      "confidence": 0.92
    }
  },
  "retrieval_timestamp": "2023-12-01T10:30:00Z"
}
```

```bash
# All tests
pytest
# With coverage
pytest --cov=app
# Specific module
pytest tests/test_retrieval.py
# In Docker
docker exec -it bioanalyzer-api pytest
```

- Unit tests for all service classes
- Integration tests for API endpoints
- CLI command testing
- Error handling validation
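
For flavor, a minimal API test sketch using FastAPI's `TestClient` (the `app.api.app:app` import path is taken from the uvicorn command later in this README):

```python
from fastapi.testclient import TestClient
from app.api.app import app

client = TestClient(app)

def test_health_endpoint():
    """The health check should report a healthy status."""
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"
```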
```
bioanalyzer-backend/
├── app/                                   # Main application code
│   ├── api/                               # API layer
│   │   ├── app.py                         # FastAPI application
│   │   ├── models/                        # Pydantic models
│   │   ├── routers/                       # API routes
│   │   └── utils/                         # API utilities
│   ├── models/                            # AI models and configuration
│   │   ├── gemini_qa.py                   # Gemini AI integration
│   │   ├── unified_qa.py                  # Unified QA system
│   │   └── config.py                      # Model configuration
│   ├── services/                          # Business logic services
│   │   ├── data_retrieval.py              # Core PubMed retrieval
│   │   ├── pubmed_retrieval_service.py    # High-level service
│   │   ├── standalone_pubmed_retriever.py # Standalone retriever
│   │   └── bugsigdb_analyzer.py           # Field analysis
│   └── utils/                             # Utilities and helpers
│       ├── config.py                      # Configuration management
│       ├── text_processing.py             # Text processing utilities
│       └── performance_logger.py          # Performance monitoring
├── config/                                # Configuration files
│   ├── requirements.txt                   # Python dependencies
│   ├── setup.py                           # Package configuration
│   └── pytest.ini                         # Test configuration
├── docs/                                  # Documentation
│   ├── README.md                          # Main documentation
│   ├── DOCKER_DEPLOYMENT.md               # Docker deployment guide
│   └── QUICKSTART.md                      # Quick start guide
├── scripts/                               # Utility scripts
│   ├── log_cleanup.py                     # Log management
│   ├── performance_monitor.py             # Performance monitoring
│   └── log_dashboard.py                   # Log visualization
├── tests/                                 # Test suite
│   ├── test_api.py                        # API tests
│   ├── test_retrieval.py                  # Retrieval tests
│   └── test_cli.py                        # CLI tests
├── cli.py                                 # CLI interface
├── main.py                                # API server entry point
├── docker-compose.yml                     # Docker services
├── Dockerfile                             # Docker image
└── README.md                              # This file
```
```bash
# Build and start development environment
docker compose up -d

# View logs
docker compose logs -f

# Access container
docker exec -it bioanalyzer-api bash
```

```bash
# Build production image
docker build -t bioanalyzer-backend:latest .

# Run production container
docker run -d -p 8000:8000 \
  -e GEMINI_API_KEY=your_key \
  -e NCBI_API_KEY=your_key \
  bioanalyzer-backend:latest
```

```bash
# Start API server
python main.py

# Or with uvicorn
uvicorn app.api.app:app --host 0.0.0.0 --port 8000 --reload
```

```bash
# Direct CLI usage
python cli.py analyze 12345678
python cli.py retrieve 12345678 --save
```

- Caching: Built-in caching for frequently accessed papers (see the sketch after this list)
- Rate Limiting: NCBI-compliant request throttling
- Batch Processing: Efficient multi-paper processing
- Async Support: Non-blocking API operations
- Memory Management: Optimized for large-scale analysis
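
As an illustration of the caching idea, a memoized fetch built on `functools.lru_cache` (a sketch, not the project's actual cache; the E-utilities URL and parameters are standard NCBI ones):

```python
from functools import lru_cache
import requests

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

@lru_cache(maxsize=256)
def fetch_pubmed_xml(pmid: str) -> str:
    """Fetch and memoize the PubMed XML record for a PMID."""
    resp = requests.get(
        EFETCH_URL,
        params={"db": "pubmed", "id": pmid, "retmode": "xml"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text

# Repeated lookups of the same PMID are served from the in-memory cache.
record = fetch_pubmed_xml("12345678")
```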
- Analysis Speed: ~2-5 seconds per paper
- Retrieval Speed: ~1-3 seconds per paper
- Throughput: 10-20 papers per minute
- Memory Usage: ~100-200MB base + 50MB per concurrent request
```bash
# Clone repository
git clone https://github.com/waldronlab/bioanalyzer-backend.git
cd bioanalyzer-backend

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r config/requirements.txt
pip install -e .[dev]

# Set up pre-commit hooks
pre-commit install
```

```bash
# Format code
black .
# Lint code
flake8 .
# Type checking
mypy .
# Run tests
pytest
```

- Service Layer: Add new services in `app/services/`
- API Endpoints: Add routes in `app/api/routers/` (see the router sketch after this list)
- CLI Commands: Extend `cli.py` with new commands
- Models: Add Pydantic models in `app/api/models/`
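
For example, a new router module might look like this sketch (the endpoint and the wiring comment are hypothetical; check how `app/api/app.py` actually registers routers):

```python
# app/api/routers/example.py -- hypothetical new router module
from fastapi import APIRouter

router = APIRouter(prefix="/api/v1/example", tags=["example"])

@router.get("/{pmid}")
async def get_example(pmid: str) -> dict:
    """Illustrative endpoint; call into app/services/ in a real feature."""
    return {"pmid": pmid, "message": "not yet implemented"}

# Assumed wiring in app/api/app.py:
#   from app.api.routers.example import router as example_router
#   app.include_router(example_router)
```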
```bash
# Error: externally-managed-environment
# Solution: Use Docker (recommended) or install python3-venv
sudo apt install python3.12-venv python3-full
python3 -m venv .venv
source .venv/bin/activate
```

```bash
# Error: docker-compose command not found
# Solution: Use newer Docker Compose syntax
docker compose build    # Instead of docker-compose build
docker compose up -d    # Instead of docker-compose up -d
```

```bash
# Error: BioAnalyzer command not found
# Solution: Add to PATH
export PATH="$PATH:$HOME/.local/bin"
# Or restart terminal after running ./install.sh
```

```bash
# Check container status
docker compose ps

# Check logs
docker compose logs

# Restart if needed
docker compose restart
```

```bash
# Warning: GeminiQA not initialized
# This is normal - the system works without API keys
# For full functionality, set environment variables:
export GEMINI_API_KEY="your_gemini_key"
export NCBI_API_KEY="your_ncbi_key"
```

Enable debug logging:

```bash
export LOG_LEVEL=DEBUG
python main.py
```

- 🚀 Quick Start: QUICKSTART.md - Get running in 5 minutes
- 📖 Complete Setup Guide: SETUP_GUIDE.md - Detailed setup steps (tested & verified)
- 🏗️ Architecture Guide: ARCHITECTURE.md
- 🐳 Docker Guide: DOCKER_DEPLOYMENT.md
- 🔧 API Documentation: http://localhost:8000/docs (when running)
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add tests for new functionality
- Update documentation for API changes
- Use type hints for all functions
- Write comprehensive docstrings
This project is licensed under the MIT License - see the LICENSE file for details.
- BugSigDB Team: For the microbial signatures database
- NCBI: For PubMed data access and E-utilities API
- Google: For Gemini AI capabilities
- FastAPI: For the excellent web framework
- Docker: For containerization technology
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Project Wiki
Happy analyzing! 🧬🔬