Systematic evaluation framework for assessing Model Context Protocol (MCP) servers that provide access to scientific literature.
This project evaluates multiple literature MCP servers across defined test cases spanning title retrieval, table extraction, conclusion parsing, and content summarization. We use DeepEval's Correctness and Hallucination metrics to evaluate performance on challenging tasks in scientific information retrieval.
- artl-mcp - Simple MCP for retrieving literature using DOI, PMC, PMID
- biomcp - Specialized biomedical knowledge from authoritative sources
- simple-pubmed - PubMed search with field-specific queries
- pubmed-mcp - Advanced PubMed access with filtering
# Clone the repository
git clone https://github.com/contextualizer-ai/mcp_literature_eval.git
cd mcp_literature_eval
# Install dependencies with uv
uv sync

This project uses a specific branch of metacoder for evaluation:
# To update metacoder
uv lock --upgrade-package metacoder
uv sync --reinstall

Use the Jupyter notebooks to run evaluations and generate visualizations. The notebooks provide step-by-step instructions with environment setup, execution commands, and progress monitoring.
Compare MCP performance across different coding agents:
- Run evaluations: notebook/experiment_1_run_evaluations.ipynb
  - Claude Code agent
  - Goose agent
  - Gemini CLI agent
- Analyze results: notebook/experiment_1_cross_agent_analysis.ipynb
  - Pass rate comparisons
  - Statistical tests
  - Publication-quality figures
Compare MCP performance across different LLM models using Goose agent:
- Run evaluations: notebook/experiment_2_run_evaluations.ipynb
  - gpt-4o
  - gpt-5
  - gpt-4o-mini
- Analyze results: notebook/experiment_2_cross_model_analysis.ipynb
  - Model performance comparison
  - Score distributions
  - Cross-experiment analysis
For manual runs or custom configurations:
# Example: Run gpt-4o-mini evaluation
export OPENAI_API_KEY=$(cat ~/openai.key)
export PUBMED_EMAIL=justinreese@lbl.gov
export PUBMED_API_KEY="your_api_key_here"  # use your own NCBI API key
uv run metacoder eval project/literature_mcp_eval_config_goose_gpt4o_mini.yaml \
  -o results/compare_models/goose_gpt4o_mini_$(date +%Y%m%d).yaml

Duration: Each full evaluation takes 2-3 hours (100 evaluations).
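The same invocation can be scripted to run several configurations back to back. Below is a minimal Python sketch, assuming it is run from the repository root with the environment variables exported as above; it reuses the metacoder eval command shown in the example, and the output naming is adjustable:

```python
import subprocess
from datetime import date
from pathlib import Path

# Cross-model configs (see the project/ layout below).
CONFIGS = [
    "project/literature_mcp_eval_config_goose_gpt4o.yaml",
    "project/literature_mcp_eval_config_goose_gpt5.yaml",
    "project/literature_mcp_eval_config_goose_gpt4o_mini.yaml",
]

stamp = date.today().strftime("%Y%m%d")
outdir = Path("results/compare_models")
outdir.mkdir(parents=True, exist_ok=True)

for config in CONFIGS:
    # Derive an output name like goose_gpt4o_mini_YYYYMMDD.yaml, mirroring the example above.
    label = Path(config).stem.replace("literature_mcp_eval_config_", "")
    outfile = outdir / f"{label}_{stamp}.yaml"
    # Same CLI call as the manual example; each run can take 2-3 hours.
    subprocess.run(
        ["uv", "run", "metacoder", "eval", config, "-o", str(outfile)],
        check=True,
    )
```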
Directory Structure:
- results/compare_agents/ - Cross-agent evaluation results (Experiment 1)
- results/compare_models/ - Cross-model evaluation results (Experiment 2)
- results/figures/ - Generated plots from analysis notebooks
For MCP servers: Some MCPs require API keys or contact information:
export PUBMED_EMAIL="your.email@example.com"
export PUBMED_API_KEY="your_api_key_here"For evaluation metrics: DeepEval uses OpenAI for the CorrectnessMetric evaluator. Set the API key:
# Load from file (recommended)
export OPENAI_API_KEY=$(cat ~/openai.key)
# Or set directly
export OPENAI_API_KEY="sk-proj-..."Or configure MCP-specific settings directly in YAML files under servers.<server>.env.
The evaluation includes 25 test cases defined in YAML configuration files (project/literature_mcp_eval_config*.yaml). Each test case specifies:
- Question (input): What to ask the agent (e.g., "What is the title of PMID:28027860?")
- Expected answer (expected_output): The correct answer for semantic comparison
- Test category (group): Type of retrieval task
- Success threshold (threshold): Minimum similarity score (0.9 = 90%)
Example test case:
- name: PMID_28027860_Title
  group: "Metadata"
  input: "What is the title of PMID:28027860?"
  expected_output: "From nocturnal frontal lobe epilepsy to Sleep-Related Hypermotor Epilepsy: A 35-year diagnostic challenge"
  threshold: 0.9

The 25 test cases cover 6 case groups:
- Metadata (8 tests) - Title, DOI, publisher retrieval
- Text extraction (9 tests) - Section content, sentences, headers
- Table / Figure / Figure Legend extraction (4 tests) - Structured data
- Supplementary material (2 tests) - Supplemental file detection and retrieval
- Publication status (1 test) - Retraction detection
- Summarization (2 tests) - Content synthesis and analysis
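Before launching a long run, the configuration shown above can be loaded and tallied by group to confirm all 25 cases are picked up. A minimal sketch, assuming the cases sit under a top-level cases key (the actual key used by metacoder may differ); the per-case fields (name, group, input, expected_output, threshold) are the ones shown in the example above:

```python
from collections import Counter

import yaml  # PyYAML

CONFIG = "project/literature_mcp_eval_config_goose_gpt4o.yaml"

with open(CONFIG) as fh:
    config = yaml.safe_load(fh)

# Assumption: test cases live under a top-level "cases" key; adjust to the actual config schema.
cases = config.get("cases", [])

print(f"{len(cases)} test cases")
for group, n in Counter(case["group"] for case in cases).most_common():
    print(f"  {group}: {n}")
```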
- 4 MCP servers × 25 test cases = 100 evaluations per agent/model
- Each test is scored using semantic similarity (CorrectnessMetric via DeepEval)
- Pass/fail determined by threshold (typically 0.9 for 90% semantic match)
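The correctness check is an LLM-judged comparison between the agent's answer and the expected_output. Below is a minimal standalone sketch of such a check using DeepEval's GEval metric; the metric configuration actually wired up through metacoder may differ, and OPENAI_API_KEY must be set:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# LLM-judged correctness metric: scores range 0-1 and are compared against the threshold.
correctness = GEval(
    name="Correctness",
    criteria="Is the actual output semantically equivalent to the expected output?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.9,
)

case = LLMTestCase(
    input="What is the title of PMID:28027860?",
    # actual_output is whatever the agent returned via the MCP server under test
    actual_output="From nocturnal frontal lobe epilepsy to Sleep-Related Hypermotor Epilepsy: "
    "A 35-year diagnostic challenge",
    expected_output="From nocturnal frontal lobe epilepsy to Sleep-Related Hypermotor Epilepsy: "
    "A 35-year diagnostic challenge",
)

correctness.measure(case)
print(f"score={correctness.score:.2f}  passed={correctness.score >= 0.9}")
```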
Full test case details: See TEST_CASES.md for the complete list of all 25 test questions and expected answers, organized by category.
All analysis is performed in Jupyter notebooks located in notebook/:
Notebook: experiment_1_cross_agent_analysis.ipynb
Generates:
- Overall pass rate comparison across agents (Claude Code, Goose, Gemini)
- MCP-specific performance by agent
- Case group sensitivity heatmaps
- Score distribution violin plots
- Statistical tests (chi-square, Mann-Whitney U)
- Publication-quality figures saved to results/figures/
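The pass-rate comparisons and heatmaps above reduce to simple aggregation once results are flattened into one row per (agent, MCP server, test case). A sketch of that core computation on a hypothetical flattened table; the column names agent, mcp, group, and score are illustrative, and the notebook handles the actual parsing of metacoder's result YAML:

```python
import pandas as pd

THRESHOLD = 0.9

# Hypothetical flattened results: one row per (agent, MCP server, test case).
df = pd.DataFrame(
    [
        {"agent": "claude-code", "mcp": "artl-mcp", "group": "Metadata", "score": 0.95},
        {"agent": "claude-code", "mcp": "biomcp", "group": "Text extraction", "score": 0.40},
        {"agent": "goose", "mcp": "artl-mcp", "group": "Metadata", "score": 0.92},
        {"agent": "gemini-cli", "mcp": "simple-pubmed", "group": "Summarization", "score": 0.10},
    ]
)

df["passed"] = df["score"] >= THRESHOLD

# Overall pass rate per agent, plus the agent x MCP matrix used for the heatmaps.
print(df.groupby("agent")["passed"].mean())
print(df.pivot_table(index="agent", columns="mcp", values="passed", aggfunc="mean"))
```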
Notebook: experiment_2_cross_model_analysis.ipynb
Generates:
- Overall pass rate comparison across models (gpt-4o, gpt-5, gpt-4o-mini)
- Model × MCP performance matrix
- Task type sensitivity by model
- Cross-experiment comparison (agent vs. model effects)
- Statistical significance tests
- Publication-quality figures saved to results/figures/
- Pass rate: % of tests scoring ≥ threshold (0.9)
- Semantic similarity: Cosine similarity between actual and expected outputs
- Case group analysis: Performance breakdown by task type
- Statistical tests: Chi-square and Mann-Whitney U tests for significance
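For reference, the chi-square test compares pass/fail counts between two conditions, while Mann-Whitney U compares the underlying score distributions without assuming normality. A minimal sketch with illustrative numbers (the notebooks run these tests on the real results):

```python
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

# Pass/fail counts for two agents over 100 evaluations each (illustrative numbers).
# Rows = agents, columns = [passed, failed].
contingency = np.array([[47, 53],
                        [15, 85]])
chi2, p_chi2, dof, _expected = chi2_contingency(contingency)
print(f"chi-square: chi2={chi2:.2f}, p={p_chi2:.4g}")

# Mann-Whitney U on the raw similarity scores (illustrative random scores).
rng = np.random.default_rng(0)
scores_a = rng.uniform(0.3, 1.0, size=100)
scores_b = rng.uniform(0.0, 0.8, size=100)
u_stat, p_u = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
print(f"Mann-Whitney U: U={u_stat:.0f}, p={p_u:.4g}")
```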
mcp_literature_eval/
├── README.md                 # Project overview and quick start
├── TEST_CASES.md             # Detailed documentation of all 25 test cases
├── notebook/                 # Jupyter notebooks (primary interface)
│   ├── experiment_1_run_evaluations.ipynb        # Run Experiment 1
│   ├── experiment_1_cross_agent_analysis.ipynb   # Analyze Experiment 1
│   ├── experiment_2_run_evaluations.ipynb        # Run Experiment 2
│   ├── experiment_2_cross_model_analysis.ipynb   # Analyze Experiment 2
│   └── attic/                # Archived notebooks
├── project/                  # Test configurations (YAML)
│   ├── literature_mcp_eval_config_goose_gpt4o.yaml       # Goose + gpt-4o (baseline)
│   ├── literature_mcp_eval_config_goose_gpt5.yaml        # Goose + gpt-5
│   ├── literature_mcp_eval_config_goose_gpt4o_mini.yaml  # Goose + gpt-4o-mini
│   ├── literature_mcp_eval_config_goose_claude.yaml      # Goose + claude-sonnet-4
│   ├── literature_mcp_eval_config_claude.yaml            # Claude Code agent
│   └── literature_mcp_eval_config_gemini.yaml            # Gemini CLI agent
├── results/
│   ├── compare_agents/       # Experiment 1 results
│   ├── compare_models/       # Experiment 2 results
│   └── figures/              # Generated plots (PNG)
├── notes/                    # Experiment documentation
│   ├── EXPERIMENT_1_RESULTS.md
│   └── EXPERIMENT_2_CROSS_MODEL.md
├── src/                      # Source code
└── tests/                    # Unit tests
Question: Does the choice of coding agent affect MCP retrieval performance?
Compares three coding agents using their default models:
- Claude Code (claude-sonnet-4) - Anthropic's official CLI
- Goose (gpt-4o) - Block's open-source coding agent
- Gemini CLI (gemini-1.5-pro-002) - Google's coding agent
Key Finding: Agent choice accounts for a 32 percentage point spread in pass rates (47% vs. 15%).
Documentation: notes/EXPERIMENT_1_RESULTS.md
Question: Does the choice of LLM model affect MCP retrieval performance when using the same agent?
Compares three OpenAI models using Goose agent:
- gpt-4o - Baseline (from Experiment 1)
- gpt-5 - Latest flagship model
- gpt-4o-mini - Smaller, faster model
Objective: Isolate model effects from agent architecture effects.
Documentation: notes/EXPERIMENT_2_CROSS_MODEL.md
- NBK1256 test often hangs - May require individual execution
- PMC117972 hangs with pubmed-mcp - Known agent stability issue
- Full evaluation takes 2-4 hours - Individual test cases take 2-5 minutes
- Start here: Open the Jupyter notebooks in notebook/ - they contain step-by-step instructions
- Test cases: See TEST_CASES.md for detailed documentation of all 25 test questions
- Experiment details: Check notes/EXPERIMENT_1_RESULTS.md and notes/EXPERIMENT_2_CROSS_MODEL.md
- Metacoder framework: metacoder documentation
- Evaluation metrics: DeepEval documentation
This is a research evaluation project. For questions or collaboration:
- Open an issue on GitHub
- Contact the authors (see pyproject.toml)
BSD-3-Clause (see LICENSE)
- Justin Reese (justinreese@lbl.gov)
- Charles Parker (ctparker@lbl.gov)
- Mark Miller (mam@lbl.gov)
- Chris Mungall (CJMungall@lbl.gov)
- Metacoder: https://github.com/ai4curation/metacoder
- DeepEval: https://docs.deepeval.com/
- Model Context Protocol: https://modelcontextprotocol.io/