Information Retrieval Assignment 2

End‑to‑end pipeline for building a query–document training set, fine‑tuning a SPECTER‑2 encoder with Adapter Fusion, and ranking documents for ad‑hoc retrieval tasks.
Everything lives in src/ – plug‑and‑play from raw data to evaluation reports.

1 | Set‑up (Conda)

# create & activate environment (Python 3.10 tested)
conda create -n doc-retrieval python=3.10 -y
conda activate doc-retrieval

# install all Python dependencies
pip install -r requirements.txt

# (optional) expose src/ to Python
export PYTHONPATH="${PWD}/src:${PYTHONPATH}"

GPU users: install the CUDA-enabled PyTorch build that matches your CUDA version before running pip install -r requirements.txt.

2 | Get the data & model weights 📥

Large files are not tracked by git.
Open data/README.md for Google Drive links and SHA‑256 checksums, then:

cd data
# download the three ZIP archives listed in the README
unzip new-data.zip
unzip precomputed_embeddings.zip
unzip new-weights.zip
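Since data/README.md publishes SHA-256 checksums, the archives can be checked before they are unpacked. The following is a minimal sketch, assuming it is run from inside data/; the helper names `sha256_of` and `verify` are hypothetical, and the digests are placeholders that must be copied from data/README.md.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's digest equals the expected value (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Placeholder digests: replace with the values listed in data/README.md.
    expected = {
        "new-data.zip": "<sha256 from data/README.md>",
        "precomputed_embeddings.zip": "<sha256 from data/README.md>",
        "new-weights.zip": "<sha256 from data/README.md>",
    }
    for name, digest in expected.items():
        archive = Path(name)
        if not archive.exists():
            print(f"{name}: missing")
        else:
            print(f"{name}: {'ok' if verify(archive, digest) else 'MISMATCH'}")
```

Reading in chunks keeps memory use flat even for multi-gigabyte archives.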

Your data/ tree should now match:

data/
 ├─ data/1_raw_data/…
 ├─ data/3_final_data/doc_embeddings_*.pkl
 └─ weights/best_finetuned_adhoc_query_adapter_*/
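One way to confirm the layout before running anything is to glob for the three entries shown in the tree. This is only a sketch, assuming it is run from the repository root; `EXPECTED_PATTERNS` and `missing_entries` are hypothetical names and not part of src/.

```python
from pathlib import Path

# Glob patterns copied from the tree above, relative to the top-level data/ folder.
EXPECTED_PATTERNS = [
    "data/1_raw_data",
    "data/3_final_data/doc_embeddings_*.pkl",
    "weights/best_finetuned_adhoc_query_adapter_*",
]


def missing_entries(root):
    """Return the expected patterns that match nothing under root."""
    root = Path(root)
    return [pattern for pattern in EXPECTED_PATTERNS if not any(root.glob(pattern))]


if __name__ == "__main__":
    missing = missing_entries("data")
    if missing:
        print("Missing under data/:", ", ".join(missing))
    else:
        print("data/ layout looks complete")
```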

3 | Run something 🚀

Quick demo (≈30 s)

python src/demo_retrieval.py --query "graph neural networks for molecule property prediction"

Ranks documents with pre‑computed embeddings & the best adapters.
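To illustrate what ranking against pre-computed embeddings involves, here is a toy sketch that scores documents by cosine similarity to a query vector. It is not the code in demo_retrieval.py: the real demo uses the fine-tuned encoder and the pipeline's optimised scoring function, whereas plain cosine similarity, the function names `cosine` and `rank`, and the two-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy stand-ins for pre-computed document embeddings.
    docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
    for doc_id, score in rank([1.0, 0.1], docs):
        print(f"{doc_id}\t{score:.3f}")
```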

Full pipeline (long)

python src/main_pipeline.py

Re-creates everything from scratch: merges the raw data, generates hard negatives, fine-tunes the adapters, computes embeddings, optimises the scoring function, and evaluates the result.
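The README does not say how the pipeline generates hard negatives. One common approach, sketched here purely as an assumption, is to score every candidate document for a query with a first-pass retriever and keep the highest-scoring documents that are not labelled relevant. The function name `hard_negatives` and the toy scores are hypothetical.

```python
def hard_negatives(scores, positives, n=2):
    """Pick the n highest-scoring documents that are not known positives.

    scores:    dict mapping doc_id -> first-pass retrieval score for one query
    positives: set of doc_ids labelled relevant for that query
    """
    candidates = [(doc_id, s) for doc_id, s in scores.items() if doc_id not in positives]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in candidates[:n]]


if __name__ == "__main__":
    first_pass = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
    print(hard_negatives(first_pass, positives={"a"}))
```

Negatives chosen this way look relevant to the first-pass scorer but are not, which makes them more informative for fine-tuning than randomly sampled documents.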

Key flags:

| Flag | Default | Purpose |
| --- | --- | --- |
| `--device` | `cpu` | set `cuda:0` for GPU |
| `--config` | `configs/default.yaml` | override stage parameters |
| `--skip-embedding` | `false` | reuse existing embeddings |
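For readers who want to see how flags like these are typically declared, the sketch below mirrors the table with `argparse`. It is not taken from main_pipeline.py: `build_parser` is a hypothetical name, and treating `--skip-embedding` as a bare switch (rather than a flag that takes `true`/`false`) is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags listed above."""
    parser = argparse.ArgumentParser(description="Sketch of the main_pipeline.py flags")
    parser.add_argument("--device", default="cpu",
                        help="compute device, e.g. cuda:0 for GPU")
    parser.add_argument("--config", default="configs/default.yaml",
                        help="YAML file that overrides stage parameters")
    # Assumption: a bare switch that defaults to False when omitted.
    parser.add_argument("--skip-embedding", action="store_true",
                        help="reuse existing embeddings instead of recomputing them")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.device, args.config, args.skip_embedding)
```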

4 | Outputs

Embeddings: data/data/3_final_data/doc_embeddings_*.pkl
Fine‑tuned adapters: data/weights/*
Evaluation reports: outputs/evaluation/*.json
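The evaluation reports are plain JSON, so they can be collected for inspection with a few lines of standard-library code. This is a sketch only: `load_reports` is a hypothetical helper, and because the README does not describe the report schema, the code simply parses each file without assuming any keys.

```python
import json
from pathlib import Path


def load_reports(report_dir="outputs/evaluation"):
    """Map each report's file stem to its parsed JSON content."""
    reports = {}
    for path in sorted(Path(report_dir).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            reports[path.stem] = json.load(handle)
    return reports


if __name__ == "__main__":
    for name, content in load_reports().items():
        print(name)
        print(json.dumps(content, indent=2))
```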

5 | Troubleshooting

| Symptom | Fix |
| --- | --- |
| `FileNotFoundError: .../1_raw_data/...` | ensure all three ZIPs are unzipped into `data/` |
| `OSError: "cuda" not available` | install CUDA-compatible PyTorch or use `--device cpu` |
| RAM spikes | run `main_pipeline.py` with flags that limit it to part of the pipeline (e.g. `--skip-embedding`) so less is processed per run |
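The CUDA error above can also be avoided by checking availability before choosing a device. The sketch below shows one way to fall back to CPU; `resolve_device` is a hypothetical helper that is not part of src/, and it only assumes that PyTorch, when installed, exposes `torch.cuda.is_available()`.

```python
def resolve_device(requested="cpu"):
    """Return the requested device, falling back to "cpu" when CUDA is unusable."""
    if not requested.startswith("cuda"):
        return requested
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return requested if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(resolve_device("cuda:0"))
```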

6 | Licence & citation

Released under Apache 2.0.
If you use this code, please cite the accompanying paper (citation details are in CITATION.cff).

Happy retrieving! 🎉

KarlLearnsAI/Dataset-Retrieval-System
