This repository contains the code for our submission to the BioASQ Challenge 13, Task B. Our system is a multi-stage pipeline that first retrieves relevant documents and then uses them to generate precise answers.
Our Core Approach:
- Phase A (Retrieval): We use a hybrid retrieval approach. An initial set of candidate documents is fetched using a traditional sparse retriever (BM25). These candidates are then re-ranked using a fine-tuned BERT-based cross-encoder to improve relevance.
- Phase B (Generation): The top-ranked documents from Phase A are fed as context to a generative model to produce the final factoid, list, or summary answers.
Our pipeline processes a question in sequential phases to arrive at the final answer.
1. BM25 Indexing & Search (phaseA-BM25):
- A searchable index of the biomedical literature is created.
- For an incoming question, this module performs a fast, keyword-based search to retrieve a large set of potentially relevant documents (e.g., top 100).
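The keyword-based scoring in step 1 can be sketched in a few lines. The snippet below is a pure-Python illustration of the standard BM25 formula on a toy corpus; the actual pipeline builds a proper index in phaseA-BM25, so treat this as a sketch of the scoring logic, not the repository's implementation:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the standard BM25 formula.
    Illustrative only; phaseA-BM25 builds a real searchable index."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()                      # document frequency of each term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)                 # term frequency within this document
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "BRCA1 mutations increase breast cancer risk",
    "Influenza vaccines reduce hospitalization rates",
    "BRCA1 is a tumor suppressor gene",
]
scores = bm25_scores("BRCA1 breast cancer", docs)
top = max(range(len(docs)), key=scores.__getitem__)  # index of best document
```

In the pipeline this search returns a large candidate set (e.g. the top 100 documents) rather than a single best hit.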
2. Neural Reranking (phaseA-reranker):
- The documents from the BM25 search are passed to a fine-tuned cross-encoder model (e.g., BioBERT).
- This model scores each (question, document) pair for relevance, producing a more accurate ranking.
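The reranking step reduces to scoring every (question, document) pair and sorting. Below is a minimal sketch in which `overlap_score` is a toy stand-in for the fine-tuned cross-encoder; in the real pipeline the scorer would be a BioBERT-style model (for example via the `CrossEncoder.predict` API from sentence-transformers):

```python
def rerank(question, docs, score_fn, top_n=5):
    """Return the top_n documents ordered by score_fn(question, doc).
    score_fn stands in for the fine-tuned cross-encoder."""
    scored = [(score_fn(question, d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_n]]

def overlap_score(question, doc):
    """Toy relevance score (shared-token count), used here only so the
    sketch runs without a neural model."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

candidates = ["a b", "brca1 cancer gene", "brca1 x"]
ranked = rerank("brca1 cancer", candidates, overlap_score, top_n=2)
```

The sort-by-score structure is the same regardless of the scoring model, which is what makes it easy to swap the toy scorer for the cross-encoder.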
3. Answer Generation (phaseB, phaseAp):
- The top N most relevant documents (e.g., top 5) from the reranker are concatenated to form a context.
- The question and the context are passed to a language model to generate the final answer in the required format.
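Step 3 amounts to packing the top-N reranked documents into a single prompt for the generator. The sketch below shows one plausible shape; the template and the `top_n` default are illustrative, not the repository's exact format:

```python
def build_prompt(question, docs, top_n=5):
    """Concatenate the top-N reranked documents into a context block and
    prepend the question, mirroring the Phase B input shape described above."""
    context = "\n".join(docs[:top_n])
    return f"Question: {question}\nContext:\n{context}\nAnswer:"

prompt = build_prompt(
    "Which gene is associated with hereditary breast cancer?",
    ["doc one", "doc two", "doc three"],
    top_n=2,
)

# In the real pipeline the prompt would then be passed to a generative model,
# e.g. via the Hugging Face transformers pipeline (model name is a placeholder):
#   from transformers import pipeline
#   generator = pipeline("text2text-generation", model="...")
#   answer = generator(prompt)[0]["generated_text"]
```

The required output format (factoid, list, or summary) is typically enforced in post-processing rather than in the prompt alone.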
Follow these steps to set up the environment and prepare the necessary data and models.
- Python 3.9+
- A system with sufficient RAM and a modern NVIDIA GPU (for the reranker and generation phases).
git clone https://github.com/bioinformatics-ua/BioASQ13B
cd BioASQ13B

Create and activate a virtual environment, then install the required packages.
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

You must download the official BioASQ datasets and build the BM25 index.
# Download the baseline data (update script if necessary)
python data/baselines/download_baselines.py
# Create the BM25 search index
python phaseA-BM25/create_indexes.py --path [path/to/bioasq/corpus]
# Download our fine-tuned models (if you're hosting them)
# Available on Hugging Face; support is still coming.

The easiest way to run the full pipeline is with the provided shell scripts in the /scripts/Sample directory. Please inspect these scripts and update any hardcoded paths before running them.
This phase trains the reranker and then uses it to process a set of questions.
cd scripts/Sample/phaseA/
# 1. Train the reranker model (if not using a pre-trained one)
bash 1_trainer.sh
# 2. Rerank the documents for a given test file
bash 2_reranker.sh
# 3-6. Convert outputs to the required formats for evaluation/next steps
bash 3_convert.sh
# ... and so on for the other scripts.

This phase takes the reranked documents and generates the final answers.
cd scripts/Sample/phaseB/ # or phaseAp
# 1. Look up abstracts for the top documents
bash 1_abstract_lookup.sh
# 2. Generate initial answers using an LLM or custom model
bash 2_initial_gen.sh
# 3. Post-process into final summaries/answers
bash 3_summaries.sh
# 4. Convert to the official BioASQ submission format
bash 4_convert.sh

A brief overview of the key directories in this project:
├── data/ # Scripts for downloading, processing, and managing data
├── phaseA-BM25/ # BM25 sparse retriever: indexing and searching
├── phaseA-reranker/ # BERT-based cross-encoder: training and inference
├── phaseB/ # Answer generation and summarization logic
├── phaseAp/ # Alternative/experimental generation logic
├── scripts/ # Wrapper scripts to execute the full pipeline
├── requirements.txt # Project dependencies
└── README.md # This file
Distributed under the MIT License. See LICENSE.txt for more information.