This is a benchmark for PDF RAG embedding systems. Given a set of PDFs, I measure the retrieval performance of different embedding models, from text to multimodal to multi-vector, as well as combinations of them.
For the text embedding models, Marker was used to parse the PDF pages to text. I also tried other parsers (Contextual.ai and Landing.ai), and their results are included at the end.
PDFs are taken from OHR-Bench. Not all PDFs are used in this benchmark.
All results reported here are NDCG@5. Individual results are in the /results folder, along with Recall@5.
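For readers unfamiliar with the two metrics, here is a minimal sketch of how NDCG@5 and Recall@5 can be computed for a single query. It assumes binary relevance (a page is either relevant or not) and hypothetical page IDs; the scoring code in this repository may differ, for example by using graded relevance.

```python
import math
from typing import Sequence, Set


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """NDCG@k with binary gains (assumption: 1.0 if relevant, else 0.0)."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    # The ideal ranking places every relevant item at the top.
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


# Hypothetical example: the single relevant page is retrieved at rank 2.
ranked = ["p3", "p7", "p1", "p9", "p2"]
relevant = {"p7"}
print(round(ndcg_at_k(ranked, relevant), 4))  # 1 / log2(3)
print(recall_at_k(ranked, relevant))
```

NDCG rewards placing the relevant page near the top, while Recall only checks whether it appears in the top 5 at all, which is why the two numbers can diverge for the same run.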
Models:
- Voyage:
voyage-3.5
- Cohere:
cohere-embed-v4
- Gemini:
gemini-embedding-001
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Voyage | 84.0% | 85.3% | 88.1% | 75.9% | 87.4% | 84.0% | 90.8% | 94.4% |
Cohere | 81.7% | 82.8% | 88.7% | 71.1% | 83.6% | 83.5% | 89.1% | 94.9% |
Gemini | 76.6% | 77.4% | 80.8% | 63.1% | 80.9% | 80.5% | 79.7% | 93.5% |
After reranking with voyage reranker:
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Voyage | 89.5% | 91.6% | 93.2% | 79.8% | 93.4% | 91.3% | 96.5% | 97.8% |
Cohere | 89.3% | 91.9% | 91.7% | 79.4% | 92.2% | 91.7% | 96.8% | 98.1% |
Gemini | 85.8% | 89.0% | 86.8% | 73.7% | 90.4% | 89.3% | 91.1% | 97.2% |
After adding sparse embedder and reranking with voyage reranker:
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Voyage | 91.1% | 93.5% | 95.1% | 81.7% | 94.8% | 92.7% | 97.3% | 98.6% |
Cohere | 90.6% | 93.2% | 92.7% | 81.3% | 94.1% | 92.4% | 96.8% | 98.9% |
Gemini | 89.2% | 92.3% | 91.3% | 79.0% | 93.0% | 91.7% | 93.2% | 98.3% |
Models:
- Voyage:
voyage-multimodal-3
- Cohere:
cohere-embed-v4
- Jina:
jina-embeddings-v4
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Cohere | 85.8% | 87.5% | 89.6% | 76.9% | 87.0% | 87.6% | 93.8% | 96.8% |
Voyage | 83.6% | 86.7% | 88.4% | 74.2% | 85.8% | 84.7% | 85.3% | 96.5% |
Jina | 81.5% | 85.7% | 86.8% | 70.1% | 82.2% | 84.0% | 84.3% | 96.2% |
After reranking with voyage reranker:
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Cohere | 91.5% | 94.5% | 93.8% | 83.6% | 92.8% | 93.1% | 99.0% | 98.6% |
Voyage | 90.3% | 94.0% | 93.8% | 82.2% | 91.2% | 91.5% | 95.9% | 98.9% |
Jina | 88.9% | 93.0% | 93.4% | 78.7% | 89.8% | 92.1% | 92.1% | 98.7% |
After adding sparse embedder and reranking with voyage reranker:
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Cohere | 92.6% | 95.2% | 94.9% | 85.5% | 94.1% | 93.8% | 99.0% | 99.0% |
Voyage | 92.2% | 95.2% | 95.3% | 85.6% | 92.8% | 93.1% | 97.4% | 98.9% |
Jina | 91.4% | 94.4% | 94.8% | 83.7% | 92.3% | 93.4% | 94.0% | 99.1% |
Multi-vector embedders represent each text or image input as multiple vectors rather than a single vector.
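Multi-vector representations are commonly scored with late interaction (MaxSim): each query vector is matched to its most similar document vector, and those maxima are summed. The sketch below illustrates that idea in plain Python with toy 2-dimensional vectors. It is an assumption that the models here are scored exactly this way; the function name and the data are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: sum over query vectors of the best cosine match."""
    if not doc_vecs:
        return 0.0
    query = [_normalize(v) for v in query_vecs]
    doc = [_normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc)
        for q in query
    )


# Toy example: two query vectors scored against two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[2.0, 0.0], [0.0, 3.0]]  # covers both query directions
page_b = [[1.0, 0.0]]              # covers only the first direction
print(maxsim_score(query, page_a), maxsim_score(query, page_b))
```

Because every query vector needs a comparison against every document vector, this kind of scoring stores and compares many more vectors per page than a single dense embedding does.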
Models:
- Jina:
jina-embeddings-v4
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Jina | 87.9% | 91.8% | 91.2% | 78.4% | 89.6% | 89.4% | 96.8% | 96.6% |
After reranking with voyage reranker:
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Jina | 91.3% | 94.5% | 94.1% | 82.4% | 93.7% | 93.1% | 98.4% | 98.2% |
After adding sparse embedder and reranking with voyage reranker:
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Jina | 91.9% | 95.2% | 94.6% | 83.3% | 94.8% | 93.1% | 98.4% | 98.6% |
Models:
- Jina:
jina-embeddings-v4
- Colnomic:
nomic-ai/colnomic-embed-multimodal-3b
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Colnomic | 91.4% | 95.8% | 92.9% | 83.0% | 94.4% | 93.0% | 91.3% | 98.5% |
Jina | 89.9% | 94.6% | 92.1% | 79.6% | 92.1% | 93.6% | 91.2% | 98.5% |
After reranking with voyage reranker:
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Colnomic | 93.5% | 97.1% | 94.4% | 87.2% | 94.8% | 94.7% | 97.2% | 98.8% |
Jina | 93.0% | 97.2% | 94.7% | 85.5% | 94.5% | 94.6% | 96.1% | 99.1% |
After adding sparse embedder and reranking with voyage reranker:
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Colnomic | 93.8% | 97.4% | 95.1% | 87.9% | 95.2% | 94.5% | 98.0% | 98.9% |
Jina | 93.6% | 97.3% | 95.1% | 87.5% | 95.0% | 94.4% | 96.1% | 99.2% |
Models:
- Voyage image + Voyage text + sparse embedder + reranker (A)
- Voyage image + Jina text multi-vector + sparse embedder + reranker (B)
- Voyage image + Jina image multi-vector + sparse embedder + reranker (C)
- Voyage image + Colnomic image multi-vector + sparse embedder + reranker (D)
- Cohere image + Colnomic image multi-vector + sparse embedder + reranker (E)
Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
E | 94.0% | 97.5% | 94.9% | 88.1% | 95.2% | 94.7% | 99.0% | 99.8% |
C | 93.8% | 97.6% | 95.6% | 87.9% | 94.9% | 94.3% | 99.0% | 99.1% |
D | 93.8% | 97.4% | 95.5% | 88.0% | 95.1% | 94.5% | 98.2% | 98.7% |
B | 93.6% | 96.4% | 95.7% | 87.7% | 94.7% | 94.2% | 99.0% | 99.0% |
A | 92.2% | 95.3% | 95.2% | 85.6% | 92.7% | 93.0% | 97.4% | 98.9% |
Models:
- Voyage:
voyage-3.5
Parsers:
- Ground truth data
- Marker
- Contextual.ai
- Landing.ai
Parser | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
---|---|---|---|---|---|---|---|---|
Ground Truth | 86.4% | 87.3% | 90.2% | 80.2% | 87.7% | 86.7% | 91.2% | 95.1% |
Contextual.ai | 84.2% | 85.8% | 88.5% | 77.2% | 86.9% | 83.0% | 89.8% | 94.7% |
Marker | 84.0% | 85.3% | 88.1% | 75.9% | 87.4% | 84.0% | 90.8% | 94.4% |
Landing.ai | 82.9% | 84.8% | 86.5% | 75.3% | 87.0% | 82.0% | 85.3% | 93.3% |
- Rerankers are very effective: for the text embedders, the Voyage reranker raises overall NDCG@5 by 5.5 to 9.2 points.
- Multimodal embedders are very effective for figures, tables, charts, etc. but text with detailed description also goes a long way.
- Combining different types of embedders is effective (dense, sparse, multi-vector) and combining models is also effective. These increase cost significantly but improve results.
- Multi-vector > dense > sparse in terms of individual performance
- Dense ~ sparse > multi-vector in terms of API cost and latency
- Contextual.ai and Landing.ai are good parsers but much more expensive than Marker (~10x-13x). Not worth the cost for most use cases, especially if you just need markdown.
- Voyage, Cohere and Marker are the most reliable and lowest-latency providers.
- Gemini has low rate limits until you've spent at least $250 on their platform.
- Jina API is not very reliable: it returns a lot of 5xx errors. The same goes for Landing.ai. I had to re-run the parsing for some PDFs multiple times, which added significant cost and time.
All the embeddings are precomputed and stored in the data.zip file. It's a big file of about 14.5GB. The size on disk after unzipping is ~42GB.
Download it and unzip it in the root of the repository.
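Since the unpacked data is large, it can help to check free disk space before extracting. The following is a minimal sketch using only the Python standard library; the helper name `extract_archive` is hypothetical and not part of this repository.

```python
import shutil
import zipfile
from pathlib import Path


def extract_archive(zip_path: str, dest_dir: str = ".") -> int:
    """Extract zip_path into dest_dir after checking free space.

    Returns the total uncompressed size in bytes.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Uncompressed size as recorded in the archive's own metadata.
        needed = sum(info.file_size for info in archive.infolist())
        free = shutil.disk_usage(dest).free
        if free < needed:
            raise OSError(f"Need {needed} bytes but only {free} are free in {dest}")
        archive.extractall(dest)
    return needed
```

Calling `extract_archive("data.zip", ".")` from the repository root would then unpack the archive in place, assuming data.zip has already been downloaded there.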
Prerequisites:
uv sync
source .venv/bin/activate
Some examples:
# Run embedder (voyage text using marker parser), sparse embedder (BGE-M3 with marker parser) and reranker (voyage) on all PDFs
python main.py end2end -e marker_voyage_text -s marker_bge_m3 -r voyage '*/*'
# Run embedder (voyage image), sparse embedder (BGE-M3 with marker parser) and reranker (voyage) on finance PDFs
python main.py end2end -e voyage_image -s marker_bge_m3 -r voyage 'finance/*'
# Run multi embedder (jina image), sparse embedder (BGE-M3 with marker parser) and reranker (RRF) on textbook PDFs
python main.py end2end -m jina_image -s marker_bge_m3 -r rrf 'textbook/*'
# Run embedder (cohere image) on all PDFs
python main.py end2end -e cohere_image '*/*'
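The last argument in each command is a glob over PDF keys, quoted so that the shell does not expand it first. As an illustration of how such patterns select PDFs, here is a sketch using Python's `fnmatch`. It assumes keys have the form `<category>/<name>`; the keys listed are made up, and the matching logic inside main.py may differ.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_pdf_keys(keys: Iterable[str], pattern: str) -> List[str]:
    """Return the keys matching a shell-style glob pattern, sorted."""
    return sorted(key for key in keys if fnmatchcase(key, pattern))


# Hypothetical keys, for illustration only.
keys = [
    "finance/report_a",
    "finance/report_b",
    "textbook/algebra",
    "law/contract_1",
]
print(select_pdf_keys(keys, "finance/*"))  # only the finance PDFs
print(select_pdf_keys(keys, "*/*"))        # every PDF
```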
To understand all the options, run:
python main.py end2end -h
This is what you'll see:
Usage: python main.py end2end [OPTIONS] PDF_KEY_GLOB
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────╮
│ * pdf_key_glob TEXT PDF key glob [default: None] [required] │
╰────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────╮
│ --embedders -e [voyage_image|cohere_image| Embedders to use │
│ jina_image|ground_truth_voy [default: None] │
│ age_text|marker_voyage_text │
│ |contextual_voyage_text|lan │
│ ding_voyage_text|marker_coh │
│ ere_text|marker_gemini_text │
│ ] │
│ --multi-embedders -m [jina_image|colnomic_image| Multi embedders to use │
│ marker_jina_text] [default: None] │
│ --sparse-embedders -s [ground_truth_bge_m3|marker Sparse embedders to use │
│ _bge_m3|contextual_bge_m3|l [default: None] │
│ anding_bge_m3] │
│ --sparse-results-weight -w FLOAT Sparse results weight │
│ [default: 0.5] │
│ --top-k INTEGER Top k results [default: 5] │
│ --reranker -r [rrf|voyage|jina_image|jina Reranker to use │
│ _text] [default: rrf] │
│ --help -h Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────╯
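The `rrf` reranker option together with `--sparse-results-weight` suggests that dense and sparse result lists are merged with some form of weighted reciprocal rank fusion. The sketch below shows one plausible version. The weighting scheme (`1 - w` for dense, `w` for sparse), the constant `rrf_k = 60`, and the function name are all assumptions for illustration, not the repository's confirmed implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    dense_ranked: Sequence[str],
    sparse_ranked: Sequence[str],
    sparse_weight: float = 0.5,
    top_k: int = 5,
    rrf_k: int = 60,
) -> List[str]:
    """Fuse two ranked lists with weighted reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for rank, doc_id in enumerate(dense_ranked, start=1):
        scores[doc_id] += (1.0 - sparse_weight) / (rrf_k + rank)
    for rank, doc_id in enumerate(sparse_ranked, start=1):
        scores[doc_id] += sparse_weight / (rrf_k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_k]


# Hypothetical page IDs: "b" ranks well in both lists, so it comes first.
dense = ["a", "b", "c"]
sparse = ["b", "d", "a"]
print(weighted_rrf(dense, sparse))
```

Rank fusion only needs the positions of results, not their raw scores, so it can combine dense, sparse, and multi-vector outputs whose similarity scales are not comparable.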
This project is licensed under the MIT License - see the LICENSE file for details.