
PDF RAG Embed Bench

This is a benchmark for PDF RAG embedding systems. Given a set of PDFs, I measure the retrieval performance of different embedding models: text, multimodal, multi-vector, and combinations of them.

For the text embedding models, marker was used to parse the PDF pages to text. I also tried other parsers, contextual.ai and landing.ai; their results are included at the end.

The PDFs are taken from OHR-Bench, though not all of them are used in this benchmark.

All the results below are NDCG@5. Individual results are in the /results folder, along with Recall@5.
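For reference, NDCG@5 with binary relevance (a common scheme for page-retrieval benchmarks; the exact relevance grading used here may differ) can be computed as:

```python
import math

def ndcg_at_k(retrieved_ids, relevant_ids, k=5):
    """NDCG@k with binary relevance: gain is 1 if a retrieved page is
    relevant, else 0; discount is log2(rank + 1)."""
    gains = [1.0 if doc in relevant_ids else 0.0 for doc in retrieved_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # ideal DCG: all relevant pages ranked first
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

The numbers in the tables below are averages of this per-query score over the query set.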

Dense embedders

Text dense embedders

Models:

  • Voyage: voyage-3.5
  • Cohere: cohere-embed-v4
  • Gemini: gemini-embedding-001

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Voyage | 84.0% | 85.3% | 88.1% | 75.9% | 87.4% | 84.0% | 90.8% | 94.4% |
| Cohere | 81.7% | 82.8% | 88.7% | 71.1% | 83.6% | 83.5% | 89.1% | 94.9% |
| Gemini | 76.6% | 77.4% | 80.8% | 63.1% | 80.9% | 80.5% | 79.7% | 93.5% |

After reranking with the Voyage reranker:

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Voyage | 89.5% | 91.6% | 93.2% | 79.8% | 93.4% | 91.3% | 96.5% | 97.8% |
| Cohere | 89.3% | 91.9% | 91.7% | 79.4% | 92.2% | 91.7% | 96.8% | 98.1% |
| Gemini | 85.8% | 89.0% | 86.8% | 73.7% | 90.4% | 89.3% | 91.1% | 97.2% |

After adding a sparse embedder and reranking with the Voyage reranker:

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Voyage | 91.1% | 93.5% | 95.1% | 81.7% | 94.8% | 92.7% | 97.3% | 98.6% |
| Cohere | 90.6% | 93.2% | 92.7% | 81.3% | 94.1% | 92.4% | 96.8% | 98.9% |
| Gemini | 89.2% | 92.3% | 91.3% | 79.0% | 93.0% | 91.7% | 93.2% | 98.3% |
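Dense and sparse results are mixed with a weight (the CLI's `--sparse-results-weight`, default 0.5). As a rough sketch (the repo's actual normalization may differ), a min-max weighted fusion of per-document scores looks like:

```python
def fuse_scores(dense, sparse, sparse_weight=0.5):
    """Weighted fusion of per-document dense and sparse retrieval scores.
    Each input is a dict of doc id -> raw score; scores are min-max
    normalized per retriever before mixing, so neither scale dominates."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on constant scores
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, s = normalize(dense), normalize(sparse)
    docs = set(d) | set(s)
    return {
        doc: (1 - sparse_weight) * d.get(doc, 0.0) + sparse_weight * s.get(doc, 0.0)
        for doc in docs
    }
```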

Image dense embedders

Models:

  • Voyage: voyage-multimodal-3
  • Cohere: cohere-embed-v4
  • Jina: jina-embeddings-v4

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cohere | 85.8% | 87.5% | 89.6% | 76.9% | 87.0% | 87.6% | 93.8% | 96.8% |
| Voyage | 83.6% | 86.7% | 88.4% | 74.2% | 85.8% | 84.7% | 85.3% | 96.5% |
| Jina | 81.5% | 85.7% | 86.8% | 70.1% | 82.2% | 84.0% | 84.3% | 96.2% |

After reranking with the Voyage reranker:

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cohere | 91.5% | 94.5% | 93.8% | 83.6% | 92.8% | 93.1% | 99.0% | 98.6% |
| Voyage | 90.3% | 94.0% | 93.8% | 82.2% | 91.2% | 91.5% | 95.9% | 98.9% |
| Jina | 88.9% | 93.0% | 93.4% | 78.7% | 89.8% | 92.1% | 92.1% | 98.7% |

After adding a sparse embedder and reranking with the Voyage reranker:

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cohere | 92.6% | 95.2% | 94.9% | 85.5% | 94.1% | 93.8% | 99.0% | 99.0% |
| Voyage | 92.2% | 95.2% | 95.3% | 85.6% | 92.8% | 93.1% | 97.4% | 98.9% |
| Jina | 91.4% | 94.4% | 94.8% | 83.7% | 92.3% | 93.4% | 94.0% | 99.1% |

Multi-vector embedders

Multi-vector embedders encode a text or an image into multiple vectors (e.g. one per token or image patch) instead of a single dense vector.
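The standard way to score multi-vector matches is ColBERT-style late interaction (MaxSim): each query vector takes its best match among the document's vectors, and the maxima are summed. A minimal sketch, assuming unit-normalized vectors so dot products act as cosine similarities (the models above may score somewhat differently):

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score between a query and a document.
    query_vecs: (n_q, dim) array; doc_vecs: (n_d, dim) array,
    both assumed unit-normalized."""
    sims = query_vecs @ doc_vecs.T           # (n_q, n_d) pairwise similarities
    return float(sims.max(axis=1).sum())     # best doc match per query vector
```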

Text multi-vector embedders

Models:

  • Jina: jina-embeddings-v4

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jina | 87.9% | 91.8% | 91.2% | 78.4% | 89.6% | 89.4% | 96.8% | 96.6% |

After reranking with the Voyage reranker:

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jina | 91.3% | 94.5% | 94.1% | 82.4% | 93.7% | 93.1% | 98.4% | 98.2% |

After adding a sparse embedder and reranking with the Voyage reranker:

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jina | 91.9% | 95.2% | 94.6% | 83.3% | 94.8% | 93.1% | 98.4% | 98.6% |

Image multi-vector embedders

Models:

  • Jina: jina-embeddings-v4
  • Colnomic: nomic-ai/colnomic-embed-multimodal-3b

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Colnomic | 91.4% | 95.8% | 92.9% | 83.0% | 94.4% | 93.0% | 91.3% | 98.5% |
| Jina | 89.9% | 94.6% | 92.1% | 79.6% | 92.1% | 93.6% | 91.2% | 98.5% |

After reranking with the Voyage reranker:

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Colnomic | 93.5% | 97.1% | 94.4% | 87.2% | 94.8% | 94.7% | 97.2% | 98.8% |
| Jina | 93.0% | 97.2% | 94.7% | 85.5% | 94.5% | 94.6% | 96.1% | 99.1% |

After adding a sparse embedder and reranking with the Voyage reranker:

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Colnomic | 93.8% | 97.4% | 95.1% | 87.9% | 95.2% | 94.5% | 98.0% | 98.9% |
| Jina | 93.6% | 97.3% | 95.1% | 87.5% | 95.0% | 94.4% | 96.1% | 99.2% |

Combination of embedders

Models:

  • Voyage image + Voyage text + sparse embedder + reranker (A)
  • Voyage image + Jina text multi-vector + sparse embedder + reranker (B)
  • Voyage image + Jina image multi-vector + sparse embedder + reranker (C)
  • Voyage image + Colnomic image multi-vector + sparse embedder + reranker (D)
  • Cohere image + Colnomic image multi-vector + sparse embedder + reranker (E)

| Model | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E | 94.0% | 97.5% | 94.9% | 88.1% | 95.2% | 94.7% | 99.0% | 99.8% |
| C | 93.8% | 97.6% | 95.6% | 87.9% | 94.9% | 94.3% | 99.0% | 99.1% |
| D | 93.8% | 97.4% | 95.5% | 88.0% | 95.1% | 94.5% | 98.2% | 98.7% |
| B | 93.6% | 96.4% | 95.7% | 87.7% | 94.7% | 94.2% | 99.0% | 99.0% |
| A | 92.2% | 95.3% | 95.2% | 85.6% | 92.7% | 93.0% | 97.4% | 98.9% |
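One simple way to combine the ranked lists from multiple embedders is reciprocal rank fusion (RRF), which the CLI also exposes as a reranker option. A minimal sketch (k=60 is the conventional constant; the repo's value may differ):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids.
    A document's score is the sum of 1 / (k + rank) over every list
    it appears in, so agreement between retrievers is rewarded."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```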

Results for other parsers

Models:

  • Voyage: voyage-3.5

Parsers:

| Parser | Overall | Academic | Administration | Finance | Law | Manual | News | Textbook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ground Truth | 86.4% | 87.3% | 90.2% | 80.2% | 87.7% | 86.7% | 91.2% | 95.1% |
| Contextual.ai | 84.2% | 85.8% | 88.5% | 77.2% | 86.9% | 83.0% | 89.8% | 94.7% |
| Marker | 84.0% | 85.3% | 88.1% | 75.9% | 87.4% | 84.0% | 90.8% | 94.4% |
| Landing.ai | 82.9% | 84.8% | 86.5% | 75.3% | 87.0% | 82.0% | 85.3% | 93.3% |

Notes based on results

  • Rerankers are very effective.
  • Multimodal embedders are very effective for figures, tables, charts, etc., but text with detailed descriptions also goes a long way.
  • Combining different types of embedders (dense, sparse, multi-vector) is effective, and so is combining models. Both increase cost significantly but improve results.
  • Multi-vector > dense > sparse in terms of individual performance.
  • Dense ~ sparse > multi-vector in terms of API cost and latency.
  • Contextual.ai and Landing.ai are good parsers but much more expensive than marker (~10x-13x). They're not worth the cost for most use cases, especially if you just need markdown.

Notes on providers

  • Voyage, Cohere and Marker are the most reliable and lowest-latency providers.
  • Gemini has low rate limits until you've spent at least $250 on their platform.
  • The Jina API is not very reliable (lots of 5xx errors), and the same goes for Landing.ai. I had to rerun parsing for some PDFs multiple times, which added significant cost and time.

Running locally

All the embeddings are precomputed and stored in the data.zip file. It's a large file of ~14.5GB; the size on disk after unzipping is ~42GB.

Download it and unzip it in the root of the repository.

Prerequisites:

Setup

```shell
uv sync
source .venv/bin/activate
```

Run

Some examples:

```shell
# Run embedder (voyage text using marker parser), sparse embedder (BGE-M3 with marker parser) and reranker (voyage) on all PDFs
python main.py end2end -e marker_voyage_text -s marker_bge_m3 -r voyage '*/*'

# Run embedder (voyage image), sparse embedder (BGE-M3 with marker parser) and reranker (voyage) on finance PDFs
python main.py end2end -e voyage_image -s marker_bge_m3 -r voyage 'finance/*'

# Run multi embedder (jina image), sparse embedder (BGE-M3 with marker parser) and reranker (RRF) on textbook PDFs
python main.py end2end -m jina_image -s marker_bge_m3 -r rrf 'textbook/*'

# Run embedder (cohere image) on all PDFs
python main.py end2end -e cohere_image '*/*'
```

To understand all the options, run:

```shell
python main.py end2end -h
```

This is what you'll see:

```
Usage: python main.py end2end [OPTIONS] PDF_KEY_GLOB

╭─ Arguments ────────────────────────────────────────────────────────────────────────────────╮
│ *    pdf_key_glob      TEXT  PDF key glob [default: None] [required]                       │
╰────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────╮
│ --embedders              -e      [voyage_image|cohere_image|  Embedders to use             │
│                                  jina_image|ground_truth_voy  [default: None]              │
│                                  age_text|marker_voyage_text                               │
│                                  |contextual_voyage_text|lan                               │
│                                  ding_voyage_text|marker_coh                               │
│                                  ere_text|marker_gemini_text                               │
│                                  ]                                                         │
│ --multi-embedders        -m      [jina_image|colnomic_image|  Multi embedders to use       │
│                                  marker_jina_text]            [default: None]              │
│ --sparse-embedders       -s      [ground_truth_bge_m3|marker  Sparse embedders to use      │
│                                  _bge_m3|contextual_bge_m3|l  [default: None]              │
│                                  anding_bge_m3]                                            │
│ --sparse-results-weight  -w      FLOAT                        Sparse results weight        │
│                                                               [default: 0.5]               │
│ --top-k                          INTEGER                      Top k results [default: 5]   │
│ --reranker               -r      [rrf|voyage|jina_image|jina  Reranker to use              │
│                                  _text]                       [default: rrf]               │
│ --help                   -h                                   Show this message and exit.  │
╰────────────────────────────────────────────────────────────────────────────────────────────╯
```

License

This project is licensed under the MIT License - see the LICENSE file for details.
