Repository with code for the article: ...
Both BioDeepRank and Attn-BioDeepRank are implemented in TensorFlow with Keras and can be easily configured through a YAML file.
Training and inference of the complete system are built around a pipeline, whose entry point is the main.py file.
Disclaimer: All code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY.
This repository implements a generic pipeline that can be used to address the majority of IR tasks.
The pipeline is described by a YAML configuration file, which facilitates the prototyping and testing of complex IR models.
The motivation is to offer an infrastructure that allows multiple tests (fine-tuning) of complex models on multiple datasets.
The first input to the pipeline is always the dataset corpora and the queries (if applicable). This input is fed to a chain of modules, where the output of the previous one is fed as input to the next one.
Each module is dynamically loaded at runtime, which gives a high degree of freedom, since each module can be fully customized and added to the pipeline.
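As an illustration of this mechanism, a module can be resolved by name with Python's importlib. The sketch below is hypothetical (the modules package, build_pipeline, and run names are not the repository's actual API):

import importlib

def build_pipeline(stages):
    # stages mirrors the YAML "pipeline" list, e.g. [("BM25", {"top_k": 2500}), ...]
    pipeline = []
    for class_name, kwargs in stages:
        # hypothetical package layout: modules/bm25.py defines class BM25, etc.
        module = importlib.import_module("modules." + class_name.lower())
        stage_cls = getattr(module, class_name)
        pipeline.append(stage_cls(**kwargs))
    return pipeline

def run(pipeline, data):
    # the output of each stage is fed as input to the next one
    for stage in pipeline:
        data = stage(data)
    return data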
The config folder contains multiple examples.
(TODO)
cache_folder: "/path/to/folder"
corpora:
  name: "bioasq"
  folder: "path/to/folder.tar.gz" # corresponds to the tar.gz file
  files_are_compressed: true # (optional) default is false
queries:
  train_file: "path/to/train_set.json"
  validation_file: "path/to/validation_set.json"
pipeline:
  - BM25:
      top_k: 2500
      tokenizer:
        Regex:
          stem: true
  - DeepRank:
      top_k: 10
      input_network:
        Q: 13 # maximum number of query tokens
        P: 5 # maximum number of snippets per query token
        S: 15 # maximum number of snippet tokens
        embedding_matrix: "auto" # creates an embedding matrix using the fastText library
        measure_network: "MeasureNetwork" # class name of the measure network
        aggregation_network: "AggregationNetwork" # class name of the aggregation network
      hyperparameters:
        optimizer: "adadelta" # (optional) default is AdaDelta
        l2_regularization: 0.0001 # (optional) default is 0.0001
        num_partially_positive_samples: 3
        num_negative_samples: 4
      tokenizer:
        Regex:
          stem: false
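For reference, a configuration in this shape can be read with PyYAML; this is a minimal sketch, assuming the file is named config_example.yaml:

import yaml

with open("config_example.yaml") as f:
    config = yaml.safe_load(f)

print(config["cache_folder"])
for stage in config["pipeline"]:
    # each pipeline stage is a single-key mapping, e.g. {"BM25": {...}}
    (name, params), = stage.items()
    print(name, params.get("top_k"))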
Quick explanation: the cache_folder is a backup folder used to save the output of every repetitive step in the pipeline, for example tokenization, the BM25 index, model weights, etc. Every model generates a unique name based on its properties, so if a file with the same name is found it will be loaded instead of being created again.
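One plausible way to derive such unique names is to hash each component's properties; the following is a hypothetical sketch, not the repository's exact scheme:

import hashlib
import json
import os
import pickle

def cached(cache_folder, name, properties, build_fn):
    # derive a unique file name from the component's properties
    key = hashlib.md5(json.dumps(properties, sort_keys=True).encode()).hexdigest()
    path = os.path.join(cache_folder, name + "_" + key + ".pkl")
    if os.path.exists(path):
        # a file with the same name exists: load it instead of rebuilding
        with open(path, "rb") as f:
            return pickle.load(f)
    artifact = build_fn()
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    return artifact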
Train
python3 main.py config_example.yaml
If the configuration file has a property "k_fold: N", it will run N-fold cross-validation using the queries from the training file. During this process, N trained models are saved.
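The fold logic could be implemented along these lines; a sketch using scikit-learn's KFold (the train_fn callback is a placeholder, not the repository's training code):

from sklearn.model_selection import KFold

def cross_validate(queries, n_folds, train_fn):
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    models = []
    for fold, (train_idx, val_idx) in enumerate(kf.split(queries)):
        train_queries = [queries[i] for i in train_idx]
        val_queries = [queries[i] for i in val_idx]
        # one model is trained and saved per fold
        models.append(train_fn(train_queries, val_queries, fold))
    return models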
Inference over a file
python3 main.py config_example.yaml --queries path/to/file
Inference over a single query
python3 main.py config_example.yaml --query "test query?"
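The three invocations above suggest a command-line interface along these lines; a hypothetical sketch with argparse (the internals of main.py may differ):

import argparse

parser = argparse.ArgumentParser(description="Pipeline entry point")
parser.add_argument("config", help="YAML configuration file")
parser.add_argument("--queries", help="run inference over a file of queries")
parser.add_argument("--query", help="run inference over a single query")
args = parser.parse_args()

if args.query is not None:
    mode = "single-query inference"
elif args.queries is not None:
    mode = "file inference"
else:
    mode = "training"
print("Running", mode, "with", args.config)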
Elasticsearch should also be configured; otherwise, it can be skipped by removing it from the pipeline.
To use the current pipeline, the corpora are expected to be compressed, and each file should be a .json with the following format:
[
{
"id": "<str>",
"title": "<str>",
"abstract": "<str>"
},
...
]
The compressed file can contain multiple JSON files.
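Such a corpus can be streamed without full extraction; a minimal sketch, assuming the tar.gz layout described above:

import json
import tarfile

def iter_documents(corpus_path):
    # yield documents from every .json file inside the tar.gz corpus
    with tarfile.open(corpus_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".json"):
                with tar.extractfile(member) as f:
                    for doc in json.load(f):
                        yield doc["id"], doc["title"], doc["abstract"]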
For training and validation, the data is expected to be in the following format:
[
{
"query_id": "<str>",
"query": "<str>",
"documents": ["<str>","<str>","<str>",...]
},
...
]
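A minimal loader for these files could look like this sketch (no validation or error handling):

import json

def load_queries(path):
    with open(path) as f:
        data = json.load(f)
    # "documents" holds the ids of the relevant documents for each query
    return [(q["query_id"], q["query"], q["documents"]) for q in data]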
The embeddings can be downloaded from here: https://github.com/ncbi-nlp/BioSentVec
These weight files should be copied to the cache_folder directory so that they are automatically loaded by the model when it is instantiated.
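For reference, an embedding matrix can be built from a downloaded fastText binary with the fasttext package; the file name below is a placeholder, and this is a sketch rather than the repository's exact loading code:

import numpy as np
import fasttext

# placeholder file name; use the fastText binary downloaded to the cache_folder
model = fasttext.load_model("/path/to/cache_folder/biowordvec.bin")

def build_embedding_matrix(vocabulary):
    # rows follow the tokenizer's vocabulary order; row 0 is kept as padding
    dim = model.get_dimension()
    matrix = np.zeros((len(vocabulary) + 1, dim), dtype=np.float32)
    for index, token in enumerate(vocabulary, start=1):
        matrix[index] = model.get_word_vector(token)  # subword-based OOV handling
    return matrix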
The trained model weights for each configuration are available at the following links:
config/BioDeepRank_6b.yaml -> link
config/BioDeepRank_7b.yaml -> link
config/Attn-BioDeepRank_6b.yaml -> link
config/Attn-BioDeepRank_7b.yaml -> https://tinyurl.com/rc8ejoe
The embedding matrix was included as part of the model weights, so each resulting file is roughly 6 GB.
Check the "Interaction Models" notebook; it covers various use cases.