BioASQ 13b: A Multi-Stage Pipeline for Biomedical Question Answering

Built with Python, Hugging Face Transformers, PyTorch, and Pyserini (BM25).

This repository contains the code for our submission to the BioASQ Challenge 13, Task B. Our system is a multi-stage pipeline that first retrieves relevant documents and then uses them to generate precise answers.

Our Core Approach:

  1. Phase A (Retrieval): We use a hybrid retrieval approach. An initial set of candidate documents is fetched using a traditional sparse retriever (BM25). These candidates are then re-ranked using a fine-tuned BERT-based cross-encoder to improve relevance.
  2. Phase B (Generation): The top-ranked documents from Phase A are fed as context to a generative model to produce the final factoid, list, or summary answers.

System Architecture

Our pipeline processes a question in sequential phases to arrive at the final answer.

1. BM25 Indexing & Search (phaseA-BM25):

  • A searchable index of the biomedical literature is created.
  • For an incoming question, this module performs a fast, keyword-based search to retrieve a large set of potentially relevant documents (e.g., top 100).
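To make the keyword-based scoring concrete, here is a minimal, self-contained sketch of BM25 ranking over an in-memory dictionary of documents. It is an illustration only, not the code in phaseA-BM25: the function names, the tokenizer, and the `k1`/`b` values are assumptions, and a real run would search an on-disk index rather than score every document.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(query, docs, k=100, k1=0.9, b=0.4):
    """Return up to k (doc_id, score) pairs, best first.

    `docs` maps a document id to its text. k1 and b are the usual BM25
    free parameters; the defaults here are illustrative assumptions.
    """
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / max(n_docs, 1)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation penalises longer-than-average documents.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


docs = {
    "d1": "BRCA1 mutations increase breast cancer risk",
    "d2": "Aspirin reduces fever and inflammation",
    "d3": "Breast cancer screening guidelines",
}
hits = bm25_search("breast cancer BRCA1", docs, k=2)
```

In this toy corpus `d1` ranks above `d3` because it matches all three query terms, and `d2` is dropped because it matches none.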

2. Neural Reranking (phaseA-reranker):

  • The documents from the BM25 search are passed to a fine-tuned cross-encoder model (e.g., BioBERT).
  • This model scores each (question, document) pair for relevance, producing a more accurate ranking.
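The reranking step can be sketched as a function that scores every (question, document) pair and sorts by that score. The names below are hypothetical, and `overlap_score` is only a toy stand-in so the example runs without a model. In the actual pipeline the scoring function would wrap a forward pass of the fine-tuned cross-encoder, which is not reproduced here because the model name and loading code are not given in this README.

```python
def rerank(question, candidates, score_fn, top_n=5):
    """Score each (doc_id, text) candidate against the question and
    return the top_n (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, score_fn(question, text)) for doc_id, text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def overlap_score(question, document):
    """Toy relevance score (NOT a cross-encoder): the fraction of
    question tokens that also appear in the document."""
    q_tokens = set(question.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


candidates = [
    ("d2", "aspirin reduces fever"),
    ("d1", "brca1 mutations and breast cancer"),
]
top = rerank("breast cancer brca1", candidates, overlap_score, top_n=1)
```

Keeping the scorer as a parameter means the BM25 candidates and the sorting logic stay the same whichever model produces the relevance scores.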

3. Answer Generation (phaseB, phaseAp):

  • The top N most relevant documents (e.g., top 5) from the reranker are concatenated to form a context.
  • The question and the context are passed to a language model to generate the final answer in the required format.
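As a rough sketch of how the context might be assembled, the function below concatenates the top N reranked documents and adds a type-specific instruction. The prompt wording, the function name, and the instruction strings are assumptions made for illustration; the prompts used in phaseB and phaseAp may differ.

```python
def build_prompt(question, ranked_docs, question_type="summary", top_n=5):
    """Build a generation prompt from the top_n (doc_id, text) pairs.

    `ranked_docs` must already be sorted best-first, as produced by the
    reranker. The instruction strings are hypothetical examples.
    """
    instructions = {
        "factoid": "Answer with a single short entity or phrase.",
        "list": "Answer with a list of entities, one per line.",
        "summary": "Answer in a short paragraph.",
    }
    if question_type not in instructions:
        raise ValueError(f"Unsupported question type: {question_type}")

    # Number each document so the context boundaries stay visible.
    context = "\n\n".join(
        f"[{i}] {text}"
        for i, (_, text) in enumerate(ranked_docs[:top_n], start=1)
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"{instructions[question_type]}\n"
        "Answer:"
    )


ranked = [("d1", "first abstract"), ("d2", "second abstract"), ("d3", "third abstract")]
prompt = build_prompt("What is X?", ranked, question_type="factoid", top_n=2)
```

The resulting string would then be passed to whichever language model performs the generation step.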

Performance

Results are pending.

Setup and Installation

Follow these steps to set up the environment and prepare the necessary data and models.

1. Prerequisites

  • Python 3.9+
  • A system with sufficient RAM and a modern NVIDIA GPU (for the reranker and generation phases).

2. Clone the Repository

git clone https://github.com/bioinformatics-ua/BioASQ13B
cd BioASQ13B

3. Install Dependencies

Create and activate a virtual environment, then install the required packages.

python -m venv venv
source venv/bin/activate 
pip install -r requirements.txt

4. Download Data & Build Indexes

You must download the official BioASQ datasets and build the BM25 index.

# Download the baseline data (update script if necessary)
python data/baselines/download_baselines.py

# Create the BM25 search index
python phaseA-BM25/create_indexes.py --path [path/to/bioasq/corpus]

# Fine-tuned models: hosted on Hugging Face; automated download support is not available yet
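If the index is built with Pyserini, its JSON collection format expects one JSON object per line with an `id` field and a `contents` field. The sketch below shows one way to write a corpus in that layout. Whether create_indexes.py expects this input is an assumption, and the input keys (`pmid`, `title`, `abstract`) and the function name are hypothetical.

```python
import json
from pathlib import Path


def write_jsonl_collection(articles, out_path):
    """Write articles as JSON Lines with `id` and `contents` fields.

    `articles` is an iterable of dicts with hypothetical keys `pmid`,
    `title`, and `abstract`. Returns the number of records written.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for art in articles:
            # Title and abstract are indexed together as one text field.
            contents = f"{art['title']} {art['abstract']}".strip()
            record = {"id": str(art["pmid"]), "contents": contents}
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count
```

A call such as `write_jsonl_collection(articles, "corpus/docs.jsonl")` would produce a directory that a JSON-collection indexer can read; the path here is only an example.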

Running the Pipeline

The easiest way to run the full pipeline is with the shell scripts in the scripts/Sample/ directory. Inspect these scripts and update any hardcoded paths before running them.

Phase A: Document Retrieval & Reranking

This phase trains the reranker and then uses it to process a set of questions.

cd scripts/Sample/phaseA/

# 1. Train the reranker model (if not using a pre-trained one)
bash 1_trainer.sh

# 2. Rerank the documents for a given test file
bash 2_reranker.sh

# 3-6. Convert outputs to the required formats for evaluation/next steps
bash 3_convert.sh
# ... and so on for the other scripts.

Phase B: Answer Generation

This phase takes the reranked documents and generates the final answers.

cd scripts/Sample/phaseB/  # or phaseAp; run from the repository root

# 1. Look up abstracts for the top documents
bash 1_abstract_lookup.sh

# 2. Generate initial answers using an LLM or custom model
bash 2_initial_gen.sh

# 3. Post-process into final summaries/answers
bash 3_summaries.sh

# 4. Convert to the official BioASQ submission format
bash 4_convert.sh

Directory Structure

A brief overview of the key directories in this project.

├── data/                  # Scripts for downloading, processing, and managing data
├── phaseA-BM25/           # BM25 sparse retriever: indexing and searching
├── phaseA-reranker/       # BERT-based cross-encoder: training and inference
├── phaseB/                # Answer generation and summarization logic
├── phaseAp/               # Alternative/experimental generation logic
├── scripts/               # Wrapper scripts to execute the full pipeline
├── requirements.txt       # Project dependencies
└── README.md              # This file

License

Distributed under the MIT License. See LICENSE.txt for more information.
