This repository contains the code for our submission to the BioASQ Challenge 13, Task B. Our system is a multi-stage pipeline that first retrieves relevant documents and then uses them to generate precise answers.
Our Core Approach:
- Phase A (Retrieval): We use a hybrid retrieval approach. An initial set of candidate documents is fetched using a traditional sparse retriever (BM25). These candidates are then re-ranked using a fine-tuned BERT-based cross-encoder to improve relevance.
- Phase B (Generation): The top-ranked documents from Phase A are fed as context to a generative model to produce the final factoid, list, or summary answers.
Our pipeline processes a question in sequential phases to arrive at the final answer.
1. BM25 Indexing & Search (phaseA-BM25):
- A searchable index of the biomedical literature is created.
- For an incoming question, this module performs a fast, keyword-based search to retrieve a large set of potentially relevant documents (e.g., top 100).
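The keyword-based scoring in step 1 can be sketched in a few lines. The snippet below is a pure-Python illustration of the standard BM25 formula on a toy corpus; the actual pipeline builds a proper index in phaseA-BM25, so treat this as a sketch of the scoring logic, not the repository's implementation:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the standard BM25 formula.
    Illustrative only; phaseA-BM25 builds a real searchable index."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()                      # document frequency of each term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)                 # term frequency within this document
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "BRCA1 mutations increase breast cancer risk",
    "Influenza vaccines reduce hospitalization rates",
    "BRCA1 is a tumor suppressor gene",
]
scores = bm25_scores("BRCA1 breast cancer", docs)
top = max(range(len(docs)), key=scores.__getitem__)  # index of best document
```

In the pipeline this search returns a large candidate set (e.g. the top 100 documents) rather than a single best hit.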
2. Neural Reranking (phaseA-reranker):
- The documents from the BM25 search are passed to a fine-tuned cross-encoder model (e.g., BioBERT).
- This model scores each (question, document) pair for relevance, producing a more accurate ranking.
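The reranking step reduces to scoring every (question, document) pair and sorting. Below is a minimal sketch in which `overlap_score` is a toy stand-in for the fine-tuned cross-encoder; in the real pipeline the scorer would be a BioBERT-style model (for example via the `CrossEncoder.predict` API from sentence-transformers):

```python
def rerank(question, docs, score_fn, top_n=5):
    """Return the top_n documents ordered by score_fn(question, doc).
    score_fn stands in for the fine-tuned cross-encoder."""
    scored = [(score_fn(question, d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_n]]

def overlap_score(question, doc):
    """Toy relevance score (shared-token count), used here only so the
    sketch runs without a neural model."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

candidates = ["a b", "brca1 cancer gene", "brca1 x"]
ranked = rerank("brca1 cancer", candidates, overlap_score, top_n=2)
```

The sort-by-score structure is the same regardless of the scoring model, which is what makes it easy to swap the toy scorer for the cross-encoder.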
3. Answer Generation (phaseB, phaseAp):
- The top N most relevant documents (e.g., top 5) from the reranker are concatenated to form a context.
- The question and the context are passed to a language model to generate the final answer in the required format.
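Step 3 amounts to packing the top-N reranked documents into a single prompt for the generator. The sketch below shows one plausible shape; the template and the `top_n` default are illustrative, not the repository's exact format:

```python
def build_prompt(question, docs, top_n=5):
    """Concatenate the top-N reranked documents into a context block and
    prepend the question, mirroring the Phase B input shape described above."""
    context = "\n".join(docs[:top_n])
    return f"Question: {question}\nContext:\n{context}\nAnswer:"

prompt = build_prompt(
    "Which gene is associated with hereditary breast cancer?",
    ["doc one", "doc two", "doc three"],
    top_n=2,
)

# In the real pipeline the prompt would then be passed to a generative model,
# e.g. via the Hugging Face transformers pipeline (model name is a placeholder):
#   from transformers import pipeline
#   generator = pipeline("text2text-generation", model="...")
#   answer = generator(prompt)[0]["generated_text"]
```

The required output format (factoid, list, or summary) is typically enforced in post-processing rather than in the prompt alone.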
Follow these steps to set up the environment and prepare the necessary data and models.
- Python 3.9+
- A system with sufficient RAM and a modern NVIDIA GPU (for the reranker and generation phases).
git clone https://github.com/bioinformatics-ua/BioASQ13B
cd BioASQ13B

Create and activate a virtual environment, then install the required packages.
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

You must download the official BioASQ datasets and build the BM25 index.
# Download the baseline data (update script if necessary)
python data/baselines/download_baselines.py
# Create the BM25 search index
python phaseA-BM25/create_indexes.py --path [path/to/bioasq/corpus]
# Download our fine-tuned models (if you're hosting them)
# Available on Hugging Face; support is still coming.

The easiest way to run the full pipeline is with the provided shell scripts in the /scripts/Sample directory. Please inspect these scripts and update any hardcoded paths before running them.
This phase trains the reranker and then uses it to process a set of questions.
cd scripts/Sample/phaseA/
# 1. Train the reranker model (if not using a pre-trained one)
bash 1_trainer.sh
# 2. Rerank the documents for a given test file
bash 2_reranker.sh
# 3-6. Convert outputs to the required formats for evaluation/next steps
bash 3_convert.sh
# ... and so on for the other scripts.

This phase takes the reranked documents and generates the final answers.
cd scripts/Sample/phaseB/ # or phaseAp
# 1. Look up abstracts for the top documents
bash 1_abstract_lookup.sh
# 2. Generate initial answers using an LLM or custom model
bash 2_initial_gen.sh
# 3. Post-process into final summaries/answers
bash 3_summaries.sh
# 4. Convert to the official BioASQ submission format
bash 4_convert.sh

A brief overview of the key directories in this project:
├── data/ # Scripts for downloading, processing, and managing data
├── phaseA-BM25/ # BM25 sparse retriever: indexing and searching
├── phaseA-reranker/ # BERT-based cross-encoder: training and inference
├── phaseB/ # Answer generation and summarization logic
├── phaseAp/ # Alternative/experimental generation logic
├── scripts/ # Wrapper scripts to execute the full pipeline
├── requirements.txt # Project dependencies
└── README.md # This file
Distributed under the MIT License. See LICENSE.txt for more information.