Automated extraction of structured data from scientific papers with evidence grounding and validation.
Try it now: https://research-paper-analyzer-ack6bpdauvevnlnfbx7gpz.streamlit.app
Note: Demo uses DeepSeek v3.1 free tier. First run may take 30-60 seconds for model initialization.
Research Paper Analyzer transforms scientific PDFs into structured, machine-readable JSON with page-level evidence grounding. Built for researchers, ML engineers, and literature review automation, it extracts methods, results, datasets, and claims while maintaining traceability to source text.
Key differentiator: Evidence-grounded extraction with numeric consistency validation — not just LLM scraping.
PDF Input → Layout Analysis → LLM Extraction → Schema Validation → Evidence Linking → Structured JSON
- Manual paper analysis doesn't scale
- Existing tools extract text but lose structure
- LLM outputs are unreliable without validation
- No traceability from claims to source evidence
- ✅ Structured extraction with enforced schema
- ✅ Evidence grounding — every claim links to page + snippet
- ✅ Numeric consistency checks — catches hallucinated metrics
- ✅ Model-agnostic — works with DeepSeek, Gemma, Claude, GPT
- ✅ Production-validated — 100% success rate on 10 diverse papers
- PDF Parsing: Multi-layout understanding (text, figures, tables, equations)
- Context Building: Semantic chunking for 5 extraction heads (metadata, methods, results, limitations, summary)
- LLM Extraction: Parallel extraction with automatic repair
- Schema Enforcement: Pydantic models + JSON schema validation
- Evidence Attachment: Fuzzy matching (85% threshold) with page references
- Consistency Validation: Range checks, baseline logic, unit verification
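The evidence-attachment step above (fuzzy matching at an 85% threshold) can be sketched with Python's standard `difflib`. This is an illustration only; the actual `evidence_matcher.py` may use different windowing and scoring:

```python
from difflib import SequenceMatcher

def attach_evidence(claim: str, pages: list[str], threshold: float = 0.85):
    """Find the page whose best-matching snippet exceeds the similarity threshold."""
    best = None
    for page_num, text in enumerate(pages, start=1):
        # Slide a claim-sized window across each page, stepping by half a window
        window = len(claim)
        for start in range(0, max(1, len(text) - window + 1), max(1, window // 2)):
            snippet = text[start:start + window]
            score = SequenceMatcher(None, claim.lower(), snippet.lower()).ratio()
            if best is None or score > best[0]:
                best = (score, page_num, snippet)
    if best and best[0] >= threshold:
        return {"page": best[1], "snippet": best[2], "score": round(best[0], 2)}
    return None  # no sufficiently close match -> claim stays ungrounded
```

Claims that fall below the threshold are left ungrounded rather than linked to a weak match, which is what keeps evidence precision meaningful.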
| Metric | Score | Status |
|---|---|---|
| JSON Validity | 100% | ✅ Schema compliance |
| Evidence Precision | 81% | ✅ Grounding quality |
| Field Coverage | 100% | ✅ Complete extraction |
| Numeric Consistency | 100% | ✅ Zero hallucinations |
| Summary Alignment | 58% | 🟡 Context matching |
Benchmarked on 10 real papers (7-29 pages), including "Attention Is All You Need".
- Streamlit Web UI: Interactive upload, extraction, visualization
- CLI Tool: Batch processing with checkpoint/resume
- Python API: Programmatic access for pipelines
┌─────────────────────────────────────────────────────────────┐
│ INPUT LAYER │
│ PDF Upload → PyMuPDF Parser → Text + Layout Extraction │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ PROCESSING LAYER │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Metadata │ │ Methods │ │ Results │ │
│ │ Extractor │ │ Extractor │ │ Extractor │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ↓ ↓ ↓ │
│ ┌────────────────────────────────────────────────┐ │
│ │ LLM Backend (DeepSeek/Gemma) │ │
│ └────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ VALIDATION LAYER │
│ JSON Repair → Schema Validation → Numeric Consistency │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ EVIDENCE LAYER │
│ Fuzzy Matching → Page Linking → Snippet Extraction │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ OUTPUT LAYER │
│ Structured JSON + Evidence + Evaluation Metrics │
└─────────────────────────────────────────────────────────────┘
# Clone repository
git clone https://github.com/BhaveshBytess/research-paper-analyzer.git
cd research-paper-analyzer
# Create virtual environment (Python 3.10+)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set API key (OpenRouter for DeepSeek)
export OPENROUTER_API_KEY="your-key-here"

Web UI (Recommended)
# Local
cd research-paper-analyzer
streamlit run app/app.py
# Or visit the live demo:
# https://research-paper-analyzer-ack6bpdauvevnlnfbx7gpz.streamlit.app

CLI (Single Paper)
python run_now.py /path/to/paper.pdf

CLI (Batch Processing)
python batch_deepseek_inline.py
# Processes 2 papers at a time with auto-resume
# Results saved to batch_eval_results/

Python API
from research_paper_analyzer import extract_paper
result = extract_paper(
    pdf_path="paper.pdf",
    model="deepseek",
    validate=True,
    attach_evidence=True,
)
print(result.json(indent=2))

{
  "title": "string",
  "authors": ["string"],
  "year": 2024,
  "venue": "string | null",
  "arxiv_id": "string | null",
  "methods": [
    {
      "name": "string",
      "category": "CNN | Transformer | GNN | ...",
      "components": ["string"],
      "description": "string"
    }
  ],
  "results": [
    {
      "dataset": "string",
      "metric": "string",
      "value": 0.95,
      "unit": "% | points | null",
      "split": "test | val | train",
      "higher_is_better": true,
      "baseline": "string | null",
      "ours_is": "string | null",
      "confidence": 0.9
    }
  ],
  "tasks": ["string"],
  "datasets": ["string"],
  "limitations": "string | null",
  "ethics": "string | null",
  "summary": "string",
  "evidence": {
    "title": [{"page": 1, "snippet": "..."}],
    "methods": [{"page": 3, "snippet": "..."}],
    "results": [{"page": 7, "snippet": "..."}]
  }
}

- ✅ All numeric results must have a valid `value` (not null)
- ✅ Percentages constrained to [0, 100]
- ✅ Confidence scores constrained to [0, 1]
- ✅ `higher_is_better` logic enforced vs. baseline
- ✅ Evidence keys must match extracted fields
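A stdlib-only sketch of a subset of these consistency rules. The project itself enforces them via Pydantic models; `check_result` and its messages here are illustrative, not the actual `eval_metrics.py` code:

```python
def check_result(r: dict) -> list[str]:
    """Return a list of consistency violations for one extracted result row."""
    errors = []
    value = r.get("value")
    if value is None:
        errors.append("value must not be null")
    elif r.get("unit") == "%" and not (0 <= value <= 100):
        errors.append("percentage out of [0, 100]")
    conf = r.get("confidence")
    if conf is not None and not (0 <= conf <= 1):
        errors.append("confidence out of [0, 1]")
    return errors
```

Range checks like these are what let the pipeline reject hallucinated metrics (e.g. a "104% accuracy") before they reach the output JSON.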
| Metric | Target | Achieved | Notes |
|---|---|---|---|
| JSON Validity | 100% | 100% | All outputs schema-compliant |
| Evidence Precision | ≥70% | 81% | Grounding to source text |
| Field Coverage | 100% | 100% | No missing required fields |
| Numeric Consistency | 100% | 100% | Zero hallucinated metrics |
| Processing Speed | <2 min/paper | ~2 min | On free-tier API |
- Papers: 10 (GNN methods, transformers, graph learning)
- Page range: 7-29 pages
- Venues: ICLR, NIPS, arXiv
- Success rate: 100% (10/10 papers extracted)
- Perfect papers: 2 (all metrics = 1.00)
Landmark paper tested: "Attention Is All You Need" (Vaswani et al.) — successfully extracted all 8 authors, transformer components, and BLEU scores.
research-paper-analyzer/
├── research-paper-analyzer/
│ ├── app.py # Streamlit UI
│ ├── pdf_parser.py # PyMuPDF extraction
│ ├── llm_extractor.py # LLM extraction logic
│ ├── schema.py # Pydantic models
│ ├── evidence_matcher.py # Fuzzy evidence linking
│ └── eval_metrics.py # Consistency validation
├── batch_deepseek_inline.py # Batch evaluation script
├── create_visualizations.py # Metric visualization
├── requirements.txt # Python dependencies
├── README.md # This file
├── batch_eval_results/ # Evaluation results
│ ├── results.csv # Metrics table
│ ├── visualizations/ # 8 analysis charts
│ └── summary/ # Detailed reports
├── samples/ # Test papers + results
└── datastore/ # Cache + intermediate data
# Unit tests (TODO: expand coverage)
pytest tests/
# Integration test on sample paper
python test_consistency.py

- Implement `BaseLLMExtractor` interface in `llm_extractor.py`
- Add model config to `schema.py`
- Update `run_now.py` with new model option
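A hypothetical sketch of what a new backend might look like. The real `BaseLLMExtractor` interface in `llm_extractor.py` may use different method names and signatures; the stub reply keeps the sketch runnable without an API key:

```python
class BaseLLMExtractor:
    """Hypothetical base interface; see llm_extractor.py for the real one."""
    model_name: str = "base"

    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class MyModelExtractor(BaseLLMExtractor):
    model_name = "my-model"  # hypothetical model id

    def complete(self, prompt: str) -> str:
        # Call your provider's API here; a canned reply stands in for it.
        return '{"title": "stub"}'
```

After implementing the subclass, registering the model id in `schema.py` and `run_now.py` (per the steps above) makes it selectable from the CLI.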
See CONTRIBUTING.md for:
- Code style (Black, isort)
- PR checklist
- Issue templates
- Architecture decisions
- ❌ No OCR support — requires digital PDFs (not scanned images)
- ❌ No figure extraction — text-only for now
- ❌ English papers only — no multilingual support yet
⚠️ Free-tier rate limits — 16 req/min on OpenRouter (manageable for batch)
- 🟡 Summary alignment (58%) — threshold tuning needed
- 🟡 Complex table parsing — nested tables occasionally missed
- 🟡 Citation extraction — not yet implemented
- ✅ Numeric consistency — validated at 100% (production-ready)
- ✅ Schema compliance — 100% across all tests
- ✅ Evidence grounding — 81% precision (excellent)
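For the free-tier rate limit noted above, a simple client-side throttle keeps batch runs under 16 requests per minute. This `RateLimiter` is an illustrative sketch, not part of the project's code:

```python
import time

class RateLimiter:
    """Space calls so at most `per_minute` happen in any minute."""
    def __init__(self, per_minute: int = 16):
        self.min_interval = 60.0 / per_minute  # seconds between calls
        self.last_call = None

    def wait(self):
        # Sleep just long enough to honour the spacing, then record the call
        if self.last_call is not None:
            elapsed = time.monotonic() - self.last_call
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()
```

Calling `wait()` before each API request spaces requests ~3.75 s apart at 16 req/min, which is why batch processing remains manageable on the free tier.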
- Core extraction pipeline
- Evidence grounding
- Numeric consistency validation
- Batch evaluation system
- Comprehensive benchmarks
- OCR support (scanned PDFs)
- Figure caption extraction
- Citation graph parsing
- Multi-paper comparison UI
- Active learning for uncertain extractions
- Multilingual support (non-English papers)
- Table structure extraction
- Equation parsing (LaTeX)
- Real-time collaboration (multi-user annotation)
- API service deployment (FastAPI + Docker)
If you use this tool in your research, please cite:
@software{research_paper_analyzer_2024,
author = {Bhavesh Bytess},
title = {Research Paper Analyzer: Evidence-Grounded PDF Extraction},
year = {2024},
url = {https://github.com/BhaveshBytess/research-paper-analyzer}
}

MIT License - see LICENSE for details.
- PyMuPDF for robust PDF parsing
- OpenRouter for LLM API access
- DeepSeek for high-quality extraction
- Streamlit for rapid UI prototyping
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: 10bhavesh7.11@gmail.com
Maintained by: Bhavesh Bytess
Status: Active development, production-validated, seeking contributors
- 🌐 Live Demo — Try it in your browser
- 📊 Batch Evaluation Results
- 📈 Visualizations
- 🧪 Test Papers
- 🚀 Deployment Guide
- 📋 Project Completion Report
- 📝 API Documentation (coming soon)
- 🎯 Contribution Guide
Last Updated: 2025-11-03
Version: 1.1.0
Production Status: ✅ Validated (100% success rate on 10 papers)
