ROBUST04 Text Retrieval Ranking System

A multi-method information retrieval system for the ROBUST04 document collection, implementing three distinct ranking approaches to maximize Mean Average Precision (MAP). Developed for the Text Retrieval and Search Engines course ranking competition.

Overview

This system implements a complete retrieval pipeline for the TREC ROBUST04 test collection, which consists of 528,155 newswire documents and 249 queries. The implementation provides three complementary retrieval methods:

BM25 + RM3: Classical probabilistic retrieval with pseudo-relevance feedback
Neural Reranking: Two-stage retrieval using transformer-based cross-encoders
RRF Fusion: Hybrid 4-way fusion of the neural run with three lexical runs (BM25+RM3, BM25+Query2Doc, BM25-Plain), combining neural precision with BM25 recall ⭐ Best performer

The system is designed to run on consumer hardware with 8GB VRAM and produces output in standard TREC format for evaluation.

Results Summary

Run	Method	MAP	P@10	NDCG@20
run_3	4-Way RRF Fusion	0.3309 ⭐	0.5181	0.4926
run_1	BM25 + RM3 (Baseline)	0.3006	0.4683	0.4385
run_2	Neural Reranking (Fast Mode)	0.2723	0.4995	0.4573

Project Structure

robust04_ranking/
├── robust04_ranking_solution.py    # Main implementation
├── README.md                       # This file
├── files/                          # Input files (user-provided)
│   ├── queriesROBUST.txt          # Query file (249 queries)
│   └── qrels_50_Queries           # Relevance judgments (50 queries)
└── output/                         # Generated results
    ├── run_1.res                  # BM25 + RM3 results
    ├── run_2.res                  # Neural reranking results
    └── run_3.res                  # RRF fusion results

Requirements

Hardware

GPU: NVIDIA GPU with 8GB+ VRAM (recommended for neural reranking)
RAM: 16GB minimum, 32GB+ recommended
Storage: 5GB free space (for Pyserini index and models)

Software

Python 3.8+
Java 21 (required by Pyserini/Anserini)
CUDA 11.0+ (for GPU acceleration)

Python Dependencies

pyserini>=0.35.0
torch>=2.0.0
transformers>=4.51.0
sentence-transformers>=2.7.0
FlagEmbedding
tqdm
numpy

Installation

1. Install Java 21

Pyserini requires Java 21. Install via your package manager:

# Ubuntu/Debian
sudo apt install openjdk-21-jdk

# macOS (Homebrew)
brew install openjdk@21

# Windows: Download from https://adoptium.net/

Verify installation:

java -version

2. Install Python Dependencies

# Quote version specifiers so the shell does not treat ">" as a redirect
pip install pyserini torch "transformers>=4.51.0" "sentence-transformers>=2.7.0" tqdm numpy

# Optional: For BGE reranker models
pip install FlagEmbedding

3. Verify GPU Setup (Optional)

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

4. First Run Index Download

On first execution, Pyserini will automatically download the ROBUST04 index (~1.7GB). This only occurs once.

Quick Start

Basic Usage

python robust04_ranking_solution.py --queries files/queriesROBUST.txt --output output/

Full Run with Validation

python robust04_ranking_solution.py \
    --queries files/queriesROBUST.txt \
    --qrels files/qrels_50_Queries \
    --output output/ \
    --tune

(Optional) Multi-Provider Query Expansion (Query2Doc)

To enable the "Novelty" feature using LLM-generated expansions (Method 1b), run the precomputation script first. This supports Gemini, Ollama, OpenRouter, etc.

Configure Environment: Copy .env.example to .env and add your API keys.

cp .env.example .env
# Edit .env with your keys (GEMINI_API_KEY, OPENROUTER_API_KEY, etc.)

Generate Expansions:

# Default (Gemini 1.5 Flash)
python precompute_expansions.py --queries files/queriesROBUST.txt

# Local LLM (Ollama)
python precompute_expansions.py --queries files/queriesROBUST.txt --model ollama/llama3

# OpenRouter (Universal)
python precompute_expansions.py --queries files/queriesROBUST.txt --model openrouter/meta-llama/llama-3.1-70b-instruct

Run Pipeline: The main script automatically detects output/query_expansions.json and uses it.
```
python robust04_ranking_solution.py --queries files/queriesROBUST.txt --output output/
```

Results

After execution, three result files will be generated in the output directory:

run_1.res - BM25 + RM3 results
run_2.res - Neural reranking results
run_3.res - RRF fusion results

Command-Line Interface

Required Arguments

Argument	Description
`--queries PATH`	Path to the query file (e.g., `queriesROBUST.txt`). File should be tab-separated with format: `query_id<TAB>query_text`
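
The tab-separated query format can be read with a few lines of Python. This is a minimal sketch, assuming UTF-8 text and one query per line; `load_queries` is a hypothetical helper name and not necessarily how the main script parses the file.

```python
from pathlib import Path
from typing import Dict


def load_queries(path: str) -> Dict[str, str]:
    """Read `query_id<TAB>query_text` lines into a dict (insertion-ordered)."""
    queries: Dict[str, str] = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text survive.
        qid, _, text = line.partition("\t")
        if not text:
            raise ValueError(f"Malformed query line: {line!r}")
        queries[qid.strip()] = text.strip()
    return queries
```

Failing loudly on a line without a tab is a design choice for this sketch: a silently skipped query would later show up as a missing topic in the run file.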

Optional Arguments

Argument	Default	Description
`--qrels PATH`	None	Path to relevance judgments file for validation. When provided, computes MAP on training queries (first 50).
`--output DIR`	`./output`	Directory to save result files. Created automatically if it does not exist.
`--method METHOD`	`all`	Which retrieval method(s) to execute. See Method Selection below.
`--tune`	False	Enable parameter tuning on training queries before test execution. Adds ~20-30 minutes runtime.
`--reranker MODEL`	`auto`	Specify the neural reranker model. See Reranker Models below.
`--batch-size N`	32	Batch size for neural reranking. Reduce if encountering GPU memory errors.

Method Selection

The --method argument accepts the following values:

Value	Description	Output File
`all`	Execute all three methods (default)	`run_1.res`, `run_2.res`, `run_3.res`
`bm25_rm3`	BM25 with RM3 query expansion only	`run_1.res`
`neural`	Neural reranking only	`run_2.res`
`rrf`	Reciprocal Rank Fusion only	`run_3.res`

Reranker Models

The --reranker argument accepts the following values:

Value	Model	Year	Notes
`auto`	Automatic selection	-	Tries models in order of quality, falls back on failure
`qwen3-0.6b-cls`	Qwen3-Reranker-0.6B	2025	State-of-the-art, recommended
`bge-v2-m3`	BGE-Reranker-v2-M3	2024	Excellent multilingual support
`bge-large`	BGE-Reranker-Large	2023	Good balance of speed and quality
`minilm`	MS-MARCO-MiniLM-L-12	2021	Legacy fallback, fastest

Usage Examples

Run only neural reranking with a specific model:

python robust04_ranking_solution.py \
    --queries files/queriesROBUST.txt \
    --method neural \
    --reranker bge-v2-m3 \
    --batch-size 16

Run BM25+RM3 with parameter tuning:

python robust04_ranking_solution.py \
    --queries files/queriesROBUST.txt \
    --qrels files/qrels_50_Queries \
    --method bm25_rm3 \
    --tune

Run RRF fusion only (no GPU required):

python robust04_ranking_solution.py \
    --queries files/queriesROBUST.txt \
    --method rrf

Full execution with all options:

python robust04_ranking_solution.py \
    --queries files/queriesROBUST.txt \
    --qrels files/qrels_50_Queries \
    --output results/ \
    --method all \
    --tune \
    --reranker auto \
    --batch-size 32

Methods

Method 1: The Lexical & Generative Suite (Runs 1, 1b, 1c)

Classification: Standard + Novel Extensions

Description

We implemented three variations of lexical retrieval to create a diverse pool of candidates for fusion.

1a. Standard Baseline (Run 1): BM25 + RM3

The robust baseline. Uses pseudo-relevance feedback to expand queries with terms found in the top-k retrieved documents.

Role: High Recall (0.77).
Best Params: k1=0.7, b=0.4, fb_terms=50.

1b. AI-Augmented (Run 1b): BM25 + Query2Doc + RM3

The "Novelty" component.

Concept: Addresses the "vocabulary mismatch" problem (e.g., query "bad weather" vs. document "cyclone").
Mechanism: We use a Large Language Model (Gemini/Llama) to generate a "hallucinated" relevant passage for the query. This passage is appended to the query before retrieval.
Role: Context Injection. Finds documents that may have little or no keyword overlap with the original query.

1c. Conservative Baseline (Run 1c): BM25-Plain

No expansion (No RM3, No AI).

Role: Diversity. Both RM3 and Query2Doc can sometimes drift (add noise). This run acts as a "safety anchor" in the fusion process, ensuring we don't lose documents that match the original query keywords perfectly.

Algorithm (Query2Doc Workflow)

Generate: LLM creates pseudo-document D' from Query Q.
Expand: New Query Q* = Q + D'.
Retrieve: BM25 searches using Q*.
Feedback: Apply RM3 on top of the expanded results for maximum recall.
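
The Generate and Expand steps above can be sketched as follows. This is an illustration only: the JSON layout (`{query_id: pseudo_document}`) and the helper names `load_expansions` and `expand_query` are assumptions, and the real script may weight or repeat the original query differently.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def load_expansions(path: str = "output/query_expansions.json") -> Dict[str, str]:
    """Load precomputed pseudo-documents; assumed layout: {query_id: text}."""
    file = Path(path)
    if not file.exists():
        return {}  # no expansions available: fall back to plain queries
    return json.loads(file.read_text(encoding="utf-8"))


def expand_query(query: str, pseudo_doc: Optional[str]) -> str:
    """Build Q* = Q + D'. Returns Q unchanged when no D' exists."""
    if not pseudo_doc:
        return query
    # Collapse whitespace so the expansion is a single-line query string.
    return f"{query} {' '.join(pseudo_doc.split())}"
```

The expanded string would then be passed to the BM25 searcher in place of the original query, with RM3 enabled for the Feedback step.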

Parameters

Parameter	Default	Description
k1	0.7	Term frequency saturation parameter
b	0.4	Document length normalization (Tuned: 0.4 < 0.75 standard)
fb_terms	50	Number of expansion terms (RM3)
fb_docs	5	Number of feedback documents
original_weight	0.5	Weight of original query vs. expansion

Why b=0.4? ROBUST04 documents are long news articles. A lower b weakens document length normalization, so long documents are penalized less than under the textbook b=0.75, which suits the verbosity of newswire text.
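
The effect of b can be checked numerically. The sketch below uses the textbook BM25 term-score formula (Lucene's implementation differs in constant factors) and made-up statistics: the document length, average length, and document frequency are illustrative assumptions, not measurements from ROBUST04.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=0.7, b=0.4):
    """Textbook BM25 contribution of one query term to one document."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly the score depends on relative document length.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# A document three times the (assumed) average length, same term frequency:
long_doc_b040 = bm25_term_score(tf=3, doc_len=1500, avg_doc_len=500,
                                df=1000, n_docs=528155, b=0.4)
long_doc_b075 = bm25_term_score(tf=3, doc_len=1500, avg_doc_len=500,
                                df=1000, n_docs=528155, b=0.75)
print(long_doc_b040 > long_doc_b075)  # lower b penalizes the long document less
```

For a document of exactly average length the normalization factor is 1 for any b, so the two settings only diverge as documents get longer or shorter than average.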

Method 2: Neural Reranking

Classification: Advanced method (beyond class material)

Description

Two-stage retrieval combining efficient first-stage retrieval with precise neural reranking, using a "Fast Mode" (Lead Paragraph) strategy. A cross-encoder processes each query-document pair jointly to capture deep semantic relevance.

Strategy: The "Inverted Pyramid" Optimization

Instead of the computationally expensive "MaxP" strategy (chunking entire documents), we utilize the "Inverted Pyramid" structure of newswire text (ROBUST04). Key information in news articles is almost always in the:

Headline
Lead Paragraph (first few sentences)

By truncating documents to the first 512 tokens (Title + Body), we retain most of the relevance signal (MAP stayed at about 0.27) while running roughly 4x faster than the full-document MaxP approach (27 minutes versus 105).

Algorithm

Stage 1 (Retrieval): BM25 retrieves top-250 candidate documents.
Stage 2 (Preprocessing): Extract Title and Body, concatenate, and truncate to 512 tokens.
Stage 3 (Scoring): Cross-encoder scores each [Query, Passage] pair.
Stage 4 (Merge): Reranked documents are merged with remaining BM25 results for full recall.
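
The four stages can be sketched with the scoring model left pluggable. Everything here is illustrative: `rerank_and_merge`, `get_text`, and `score_pairs` are hypothetical names, and the way tail scores are shifted is one possible choice rather than the project's confirmed behavior. With sentence-transformers, `CrossEncoder.predict` accepts such a list of (query, text) pairs and could serve as `score_pairs`.

```python
from typing import Callable, List, Sequence, Tuple

Hit = Tuple[str, float]  # (doc_id, score)


def rerank_and_merge(
    query: str,
    bm25_hits: Sequence[Hit],
    get_text: Callable[[str], str],
    score_pairs: Callable[[List[Tuple[str, str]]], List[float]],
    depth: int = 250,
) -> List[Hit]:
    """Rerank the top `depth` BM25 hits, then append the untouched tail."""
    head, tail = list(bm25_hits[:depth]), list(bm25_hits[depth:])
    if not head:
        return tail
    # get_text is expected to return the already truncated title + body.
    pairs = [(query, get_text(doc_id)) for doc_id, _ in head]
    scores = score_pairs(pairs)
    reranked = sorted(zip((d for d, _ in head), scores),
                      key=lambda item: item[1], reverse=True)

    # Shift tail scores below the lowest reranked score so a single sorted
    # list keeps every reranked document ahead of the BM25-only tail.
    floor = min(s for _, s in reranked)
    merged = [(d, float(s)) for d, s in reranked]
    merged += [(d, floor - rank) for rank, (d, _) in enumerate(tail, start=1)]
    return merged
```

Keeping the tail in its BM25 order preserves Recall@1000 from the first stage while only the head pays the cross-encoder cost.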

Cross-Encoder Architecture

Unlike bi-encoders that encode queries and documents separately, cross-encoders:

Process the concatenated [query, document] pair through a transformer.
Enable full attention between query and document tokens.
Produce a single relevance score per pair.

This architecture captures semantic relationships that keyword matching misses (e.g., synonyms, paraphrases, implicit relevance).

Document Text Extraction

ROBUST04 documents are stored in SGML format (TREC Disks 4 & 5). We implement a robust parser that:

Cleanses: Removes null bytes and fixes encoding artifacts.
Structures: Separates <HEADLINE> from <TEXT>.
Repairs: Fixes "s p a c e d" character corruption common in this dataset.
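
A simplified version of these three steps is sketched below. It is not the project's parser: the function names are hypothetical, and the spaced-character repair is a rough heuristic (it joins runs of four or more single letters separated by single spaces) that can merge adjacent spaced-out words.

```python
import re
from typing import Tuple

_TAG = re.compile(r"<[^>]+>")
# Four or more single letters separated by single spaces, e.g. "s p a c e d".
_SPACED = re.compile(r"\b(?:[A-Za-z] ){3,}[A-Za-z]\b")


def _field(raw: str, tag: str) -> str:
    """Return the concatenated contents of every <tag>...</tag> block."""
    blocks = re.findall(rf"<{tag}>(.*?)</{tag}>", raw, flags=re.S | re.I)
    return " ".join(_TAG.sub(" ", block) for block in blocks)


def extract_title_body(raw: str) -> Tuple[str, str]:
    """Cleanse, structure, and repair one raw SGML document."""
    raw = raw.replace("\x00", "")  # cleanse: drop null bytes
    cleaned = []
    for text in (_field(raw, "HEADLINE"), _field(raw, "TEXT")):
        # repair: close up spaced-out letters before normalizing whitespace
        text = _SPACED.sub(lambda m: m.group(0).replace(" ", ""), text)
        cleaned.append(" ".join(text.split()))
    return cleaned[0], cleaned[1]
```

The title and body returned here would be concatenated and truncated before being passed to the reranker.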

Available Models

Model	Parameters	Context	Performance
Qwen3-Reranker-0.6B	600M	8192	State-of-the-art (2025)
BGE-Reranker-v2-M3	568M	8192	Excellent (2024)

Memory Considerations

Models load in FP16 by default for efficiency.
Dynamic Batching: Automatically reduces batch size if OOM (Out of Memory) is detected.
Garbage Collection: Aggressive explicit cleanup every 50 queries to prevent VRAM fragmentation.
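
Dynamic batching can be approximated with a retry loop that halves the batch size on an out-of-memory error. This is a sketch under assumptions: `score_with_backoff` and `score_batch` are hypothetical names, and the error is detected by its message because PyTorch raises CUDA OOM as a `RuntimeError` (subclass) containing "out of memory".

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (query, document text)


def score_with_backoff(
    pairs: Sequence[Pair],
    score_batch: Callable[[Sequence[Pair]], List[float]],
    batch_size: int = 32,
) -> List[float]:
    """Score all pairs, halving the batch size whenever a batch hits OOM."""
    scores: List[float] = []
    start = 0
    while start < len(pairs):
        batch = pairs[start:start + batch_size]
        try:
            scores.extend(score_batch(batch))
            start += len(batch)
        except RuntimeError as err:
            # Re-raise anything that is not OOM, or OOM at batch size 1.
            if "out of memory" not in str(err).lower() or batch_size == 1:
                raise
            batch_size = max(1, batch_size // 2)
            # With PyTorch, torch.cuda.empty_cache() would typically go here.
    return scores
```

The reduced batch size is kept for the remaining pairs, which avoids hitting the same error repeatedly on every later batch.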

Method 3: 4-Way Reciprocal Rank Fusion (The "Super-Ensemble")

Classification: Novel Innovation (beyond class material)

Description

Most fusion systems combine just two runs. Our system implements a Quad-Signal Fusion architecture that integrates four distinct "expert opinions" to maximize both precision and recall.

The 4 Experts

Run 1 (BM25+RM3): The high-recall baseline expert.
Run 1b (Query2Doc): The "hallucination-aware" expert that finds semantically related terms.
Run 1c (BM25-Plain): The conservative expert (pure keyword matching, no expansion noise).
Run 2 (Neural Fast): The semantic expert (precision-focused).

Algorithm: Adaptive 4-Way RRF

We use Weighted Reciprocal Rank Fusion with a novel Query-Dependent Weighting strategy.

RRF_score(d) = Σ [weight_r / (k + rank_r(d))]

k = 30: Tuned constant (from validation).
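
The formula translates directly into code. The sketch below assumes each run is a ranked list of document IDs; the function name, the run names, and the tie-breaking rule are illustrative choices, not taken from the project source.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def weighted_rrf(
    runs: Dict[str, Sequence[str]],
    weights: Dict[str, float],
    k: int = 30,
    depth: int = 1000,
) -> List[Tuple[str, float]]:
    """Fuse ranked doc-id lists: score(d) = sum_r weight_r / (k + rank_r(d))."""
    fused: Dict[str, float] = defaultdict(float)
    for name, ranking in runs.items():
        weight = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            fused[doc_id] += weight / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    ordered = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:depth]
```

A document missing from a run simply contributes nothing for that run, so documents found by several runs rise naturally.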

Innovation: Query-Length Adaptive Weighting

We detected that retrieval needs differ by query length. The system automatically classifies queries and adjusts weights:

Query Type	Length	Strategy	W_BM25+RM3	W_Q2D	W_Plain	W_Neural
Short	≤3 words	Favor Lexical	1.5	1.3	1.2	0.7
Medium	4-5 words	Balanced	1.3	1.2	1.0	1.0
Long	>5 words	Favor Semantic	1.0	1.0	0.8	1.5

Why?

Short queries (e.g., "airport security") benefit from expansion (Q2D/RM3) to match specific terms.
Long queries (e.g., "international organized crime... ") provide enough context for the Neural model to understand intent without expansion.
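
The weight selection can be sketched as a lookup keyed on word count. The weights are copied from the table above; the dictionary keys and function names are illustrative, and whitespace splitting is an assumption about how query length is measured.

```python
from typing import Dict

# Per-run weights by query type; run keys are illustrative names.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "short":  {"bm25_rm3": 1.5, "q2d": 1.3, "plain": 1.2, "neural": 0.7},
    "medium": {"bm25_rm3": 1.3, "q2d": 1.2, "plain": 1.0, "neural": 1.0},
    "long":   {"bm25_rm3": 1.0, "q2d": 1.0, "plain": 0.8, "neural": 1.5},
}


def classify_query(query: str) -> str:
    """Classify a query as short (<=3 words), medium (4-5), or long (>5)."""
    n_words = len(query.split())
    if n_words <= 3:
        return "short"
    if n_words <= 5:
        return "medium"
    return "long"


def weights_for(query: str) -> Dict[str, float]:
    """Return the RRF weights for the query's length class."""
    return WEIGHTS[classify_query(query)]
```

For example, `weights_for("airport security")` returns the short-query weights, which down-weight the neural run to 0.7.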

Advantages

Robustness: If one method fails (e.g., Q2D hallucinates), the other three vote it down.
Recall+Precision: Merges the high recall of BM25+RM3 (0.77 Recall@1000) with the high early precision of Neural (0.50 P@10).
No Model to Train: Unlike learning-to-rank, RRF needs no trained model; only k and the per-run weights are set.

Methodological Innovations & Novel Contributions

This project implements three advanced retrieval techniques that significantly extend standard course methodologies.

1. Triple-Signal Hybrid Fusion (The "Ensemble of Experts")

Unlike traditional systems that rely on a single ranking signal, this project implements a Multi-Signal Architecture that fuses three fundamentally different relevance signals:

Lexical Signal (BM25+RM3): Captures exact keyword matches and frequency statistics (High Recall).
Semantic Signal (Neural Reranking): Captures deep semantic meaning and passage understanding using Transformer-based Cross-Encoders (High Precision).
Generative Signal (Query2Doc): Captures "hallucinated" context and missing terms using Large Language Models (Context Injection).

By combining these orthogonal signals via Reciprocal Rank Fusion (RRF), the system achieves a robust consensus that outperforms any single method (~10% improvement over the strong BM25+RM3 baseline).

2. Generative Query Expansion (Query2Doc)

To address the "vocabulary mismatch" problem in short queries (e.g., "airport security"), we implemented the Query2Doc technique (EMNLP 2023).

Mechanism: The system prompts an LLM (Gemini Flash by default) to generate a pseudo-document that answers the user's query.
Effect: This generated passage acts as a "semantic bridge," introducing relevant terms (e.g., "screening," "TSA," "regulations") that were not present in the original 2-word query.
Result: Enables the lexical retrieval components to find documents that are semantically relevant but lack term overlap.

3. Neural Semantic Reranking with Domain Adaptation

We deployed a Cross-Encoder architecture (BGE-Reranker-v2-m3) specifically optimized for passage ranking.

Input: [CLS] Query [SEP] Document [SEP]
Processing: The model attends to every interaction between query and document tokens.
Optimization: To handle the long documents of ROBUST04, we implemented a "Fast Mode" strategy that focuses on the lead paragraph (first 512 tokens), where news articles typically concentrate their key information. This provided a 4x speedup with minimal accuracy loss.

Output Format

Results are written in standard TREC format:

query_id Q0 document_id rank score run_name

Example:

301 Q0 FBIS3-10082 1 25.432100 run_1
301 Q0 LA041590-0140 2 24.891200 run_1
301 Q0 FT943-11066 3 24.523400 run_1
...

Each result file contains up to 1000 documents per query, ordered by decreasing score.
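
Writing this format takes only a few lines. The sketch below assumes results are held as `{query_id: [(doc_id, score), ...]}`; `write_trec_run` is a hypothetical helper name, and the six-decimal score format mirrors the example above.

```python
from typing import Dict, List, Tuple


def write_trec_run(
    results: Dict[str, List[Tuple[str, float]]],
    path: str,
    run_name: str,
    max_docs: int = 1000,
) -> None:
    """Write {query_id: [(doc_id, score), ...]} in six-column TREC format."""
    with open(path, "w", encoding="utf-8") as out:
        for qid, hits in results.items():
            # Order by decreasing score and keep at most max_docs per query.
            ranked = sorted(hits, key=lambda h: h[1], reverse=True)[:max_docs]
            for rank, (doc_id, score) in enumerate(ranked, start=1):
                out.write(f"{qid} Q0 {doc_id} {rank} {score:.6f} {run_name}\n")
```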

Evaluation Results

Actual performance on 199 test queries (evaluated with full ROBUST04 qrels):

Run	Method	MAP	MRR	P@10	Recall@1000	Runtime
3	4-Way RRF Fusion ⭐	0.3309	0.7714	0.5181	0.8116	~30 min
1	BM25 + RM3	0.3006	0.6875	0.4683	0.7735	~12 sec
2	Neural Reranking (Fast Mode)	0.2723	0.6740	0.4995	0.7139	~27 min
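
For readers unfamiliar with the headline metric, MAP can be computed as sketched below. This is a simplified illustration with binary relevance and hypothetical function names; it is not the evaluation code that produced the numbers in the table.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """AP: mean of precision@rank over the ranks where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """MAP over the queries that have at least one relevant document."""
    judged = {q: rel for q, rel in qrels.items() if rel}
    if not judged:
        return 0.0
    total = sum(average_precision(run.get(q, []), rel) for q, rel in judged.items())
    return total / len(judged)
```

Because AP divides by the total number of relevant documents, a run with strong P@10 but lower recall (as with Run 2) can still score a lower MAP than a higher-recall run.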

Key Observations

RRF Fusion is the clear winner (+10% over BM25+RM3, +21% over Neural), indicating that combining diverse signals (Lexical + Semantic + LLM) outperforms any single method on this collection.
Neural Reranking (Run 2) excels at precision (P@10 ≈ 0.50) but suffers from lower recall (0.71), likely due to the limited candidate pool (top-250 reranked).
BM25+RM3 (Run 1) is a robust baseline with high recall (0.77) but lower precision at top ranks.
4-Way Fusion successfully merges:
- High Recall of BM25
- High Precision of Neural
- Knowledge Expansion of Query2Doc
- Diversity of BM25-plain

The Journey: From Baseline to SOTA (MAP 0.3309)

A summary of the development process.

Phase 0: Day 0 - The Starting Point (Baseline)

The initial V1 implementation before optimization. Note the lack of fast mode, query2doc, or dynamic weighting.

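# V1 naive implementation (slow, no fusion optimization)
from pyserini.search.lucene import LuceneSearcher
from sentence_transformers import CrossEncoder

def run_v1_baseline():
    # 1. Simple BM25+RM3 on Pyserini's prebuilt ROBUST04 index
    searcher = LuceneSearcher.from_prebuilt_index('robust04')
    searcher.set_bm25(0.9, 0.4)       # k1, b
    searcher.set_rm3(10, 10, 0.5)     # fb_terms, fb_docs, original_query_weight

    # 2. Basic neural reranking (no truncation optimization)
    # Result: 105-minute runtime!
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

    # 3. Simple RRF (k=60, equal weights) was applied to the two runs;
    #    query-dependent weighting did not exist yet.
    return searcher, cross_encoder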

Phase 1: The "Kitchen Sink" & The MaxP Trap (Fail)

Goal: Implement a "perfect" Neural Reranker.

Action: We implemented a state-of-the-art Cross-Encoder (BGE-Reranker) with MaxP Chunking.

MaxP splits long documents into overlapping 512-token chunks (e.g., 4 chunks per doc).
We scored every chunk to find the best passage.

Result:

Performance: Good P@10 (0.49), but MAP was average (0.27).
Failure: Runtime was approximately 1 hour 45 minutes for just 199 queries.
Lesson: Heavy per-query computation must be balanced against the ability to iterate quickly.
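
The MaxP strategy described in this phase can be sketched as follows. The names are hypothetical, whitespace-separated tokens stand in for model tokens, and the 256-token stride is an assumed overlap, since the original setting is not stated.

```python
from typing import Callable, List, Sequence


def chunk_tokens(
    tokens: Sequence[str], size: int = 512, stride: int = 256
) -> List[List[str]]:
    """Split tokens into overlapping windows of `size`, advancing by `stride`."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return chunks


def maxp_score(
    query: str,
    tokens: Sequence[str],
    score_passage: Callable[[str, str], float],
    size: int = 512,
    stride: int = 256,
) -> float:
    """MaxP: a document's score is the best score among its passages."""
    return max(
        score_passage(query, " ".join(chunk))
        for chunk in chunk_tokens(tokens, size, stride)
    )
```

Each chunk costs one cross-encoder forward pass, so a document split into four chunks costs about four times as much as a single truncated passage, which is consistent with the runtime gap reported here.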

Phase 2: The Efficiency Pivot ("Fast Mode")

Goal: Drastically reduce runtime to allow for experimentation.

Action:

Removed MaxP chunking.
Implemented First-512 Truncation (Title + Lead Paragraph).
This aligns with the "Inverted Pyramid" structure common in newswire text.

Result:

Runtime: Reduced from 105 minutes to 27 minutes (4x improvement).
Trade-off: MAP remained stable (0.27), validating the truncation strategy.

Phase 3: The Data Augmentation Breakthrough (Query2Doc)

Goal: Address the "Vocabulary Mismatch" problem (e.g., Query: "bad weather", Doc: "cyclone").

Action:

Implemented LLM Augmentation using Gemini-Flash.
For every query, the system generates a pseudo-document containing predicted relevant terms.
This pseudo-document is concatenated with the query for retrieval.

Result:

Successfully bridges the semantic gap between query terms and document vocabulary.
Implementation Note: We implemented MD5 hash-based deduplication so that identical queries appearing multiple times in the dataset share the same expansion, which saves API costs and keeps results consistent.
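
The deduplication idea can be sketched as a cache keyed on the MD5 hash of the query. The helper names are hypothetical, and normalizing case and whitespace before hashing is an assumption; the actual script may hash the raw query string.

```python
import hashlib
from typing import Callable, Dict


def query_key(query: str) -> str:
    """Stable cache key: MD5 of the case- and whitespace-normalized query."""
    normalized = " ".join(query.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def expand_once(
    query: str, cache: Dict[str, str], generate: Callable[[str], str]
) -> str:
    """Return the cached expansion, calling the LLM only on a cache miss."""
    key = query_key(query)
    if key not in cache:
        cache[key] = generate(query)  # the only place an API call happens
    return cache[key]
```

Persisting `cache` to disk between runs would also make reruns free of API calls and keep expansions identical across experiments.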

Phase 4: The Fusion "Super-Ensemble" (Success)

Goal: Leverage the complementary strengths of individual methods.

Action: Analyzed individual method performance:

BM25+RM3: High Recall (0.77).
Neural: High Precision (0.50).
Query2Doc: Contextual understanding.

Implemented a 4-Way Weighted Reciprocal Rank Fusion (RRF):

Inputs: (1) BM25+RM3, (2) BM25+Query2Doc, (3) BM25-Plain, (4) Neural.
Innovation: Query-Dependent Weighting.
- Short queries (≤3 words): Higher weight to BM25 (keyword match).
- Long queries (>5 words): Higher weight to Neural (semantic meaning).

Final Performance

Baseline (BM25 + RM3): MAP 0.3006
Neural Only: MAP 0.2723
Fusion (Method 3): MAP 0.3309 (+10% relative improvement)
MRR: 0.7714

Troubleshooting

Java Not Found

Error: Java not found

Solution: Install Java 21 and ensure JAVA_HOME is set correctly.

CUDA Out of Memory

RuntimeError: CUDA out of memory

Solutions:

Reduce batch size: --batch-size 16 or --batch-size 8
Use a smaller model: --reranker minilm
Run on CPU (significantly slower): Set CUDA_VISIBLE_DEVICES=""

Model Download Failures

OSError: Can't load tokenizer for 'Qwen/Qwen3-Reranker-0.6B'

Solutions:

Update transformers: pip install --upgrade "transformers>=4.51.0"
Use fallback model: --reranker bge-v2-m3
Check internet connectivity

Index Download Timeout

ConnectionError: Failed to download robust04 index

Solution: The index can be downloaded manually from:

https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz

Extract to ~/.cache/pyserini/indexes/

Import Errors

ModuleNotFoundError: No module named 'pyserini'

Solution: Install all dependencies:

pip install pyserini torch "transformers>=4.51.0" "sentence-transformers>=2.7.0" tqdm numpy

References

Papers

Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
Lavrenko, V., & Croft, W. B. (2001). Relevance-Based Language Models. SIGIR '01.
Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
Cormack, G. V., Clarke, C. L., & Buettcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR '09.
Qwen Team. (2025). Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.
Wang, L., Yang, N., & Wei, F. (2023). Query2doc: Query Expansion with Large Language Models. EMNLP 2023. (Method 1b)
Xiao, S., et al. (2023). C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv:2309.07597. (BGE Reranker)
Lin, J., et al. (2021). Pyserini: A Python Toolkit for Reproducible Information Retrieval Research. SIGIR '21.

Software

Pyserini - Python toolkit for reproducible IR research
Sentence Transformers - Cross-encoder implementations
FlagEmbedding - BGE model implementations

Dataset

TREC 2004 Robust Track: 528,155 documents from TREC Disks 4 & 5 (newswire)
249 queries with graded relevance judgments

License

This project is developed for academic purposes as part of the Text Retrieval and Search Engines course.

Author

Hershel Thomas & Itay Baror
Text Retrieval Course - Ranking Competition Submission
January 2026
