This repository provides the PyTorch implementation of BioBERT. You can easily use BioBERT with transformers. This project is supported by the members of DMIS-Lab @ Korea University including Jinhyuk Lee, Wonjin Yoon, Minbyul Jeong, Mujeen Sung, and Gangwoo Kim.
```bash
# Install huggingface transformers
pip install transformers==3.0.0

# Download all datasets including NER/RE/QA
./download.sh
```
Note that you should also install `torch` (see the download instructions) to use `transformers`.
If the download script does not work, you can manually download the datasets here and unzip them in the current directory (`tar -xzvf datasets.tar.gz`).
We provide the following versions of BioBERT in PyTorch (click here to see all). You can use BioBERT in `transformers` by setting `--model_name_or_path` to one of them (see the example below).
- `dmis-lab/biobert-base-cased-v1.2`: trained in the same way as BioBERT-Base v1.1 but includes an LM head, which can be useful for probing
- `dmis-lab/biobert-base-cased-v1.1`: BioBERT-Base v1.1 (+ PubMed 1M)
- `dmis-lab/biobert-large-cased-v1.1`: BioBERT-Large v1.1 (+ PubMed 1M)
- `dmis-lab/biobert-base-cased-v1.1-mnli`: BioBERT-Base v1.1 pre-trained on MNLI
- `dmis-lab/biobert-base-cased-v1.1-squad`: BioBERT-Base v1.1 pre-trained on SQuAD
- `dmis-lab/biobert-base-cased-v1.2`: BioBERT-Base v1.2 (+ PubMed 1M + LM head)
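As an example, any of these checkpoints can be loaded through the standard `transformers` auto classes. This is a minimal sketch, not a script from this repository; the first call downloads the weights from the Hugging Face model hub, so it requires network access.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load BioBERT-Base v1.1 (cased) and its WordPiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

# Encode a biomedical sentence and run a forward pass
inputs = tokenizer("Aspirin inhibits platelet aggregation.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The first output holds the token representations; BioBERT-Base is a
# BERT-base model, so each token vector is 768-dimensional.
last_hidden = outputs[0]
print(last_hidden.shape[-1])
```

Indexing `outputs[0]` (rather than `outputs.last_hidden_state`) keeps the snippet compatible with the transformers 3.0 tuple-style return values used above.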
For other versions of BioBERT or for TensorFlow, please see the README in the original BioBERT repository. You can convert any version of BioBERT into PyTorch with this conversion script.
For instance, to train BioBERT on the NER dataset (NCBI-disease), run:

```bash
# Pre-process NER datasets
cd named-entity-recognition
./preprocess.sh

# Choose dataset and run
export DATA_DIR=../datasets/NER
export ENTITY=NCBI-disease
python run_ner.py \
    --data_dir ${DATA_DIR}/${ENTITY} \
    --labels ${DATA_DIR}/${ENTITY}/labels.txt \
    --model_name_or_path dmis-lab/biobert-base-cased-v1.1 \
    --output_dir output/${ENTITY} \
    --max_seq_length 128 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 32 \
    --save_steps 1000 \
    --seed 1 \
    --do_train \
    --do_eval \
    --do_predict \
    --overwrite_output_dir
```
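After `preprocess.sh`, each NER dataset is stored in CoNLL-style token-per-line format: one `TOKEN LABEL` pair per line, with blank lines separating sentences. A small reader for that format, written here as a hypothetical illustration rather than a script from this repository, could look like:

```python
def read_conll(lines):
    """Parse CoNLL-style 'TOKEN LABEL' lines into (tokens, labels) sentences."""
    sentences, tokens, labels = [], [], []
    for line in lines:
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        token, label = line.split()
        tokens.append(token)
        labels.append(label)
    if tokens:  # flush the last sentence if there is no trailing blank line
        sentences.append((tokens, labels))
    return sentences

# Toy fragment in the style of the NCBI-disease data (B/I/O labels)
sample = ["Identification O", "of O", "colorectal B", "cancer I", "", "A O", "case O"]
for tokens, labels in read_conll(sample):
    print(tokens, labels)
```

The same structure (tokens plus aligned labels) is what `run_ner.py` consumes, with `labels.txt` listing the label vocabulary.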
Please see each directory for different examples. Currently, we provide:
- `embedding/`: BioBERT embedding.
- `named-entity-recognition/`: NER using BioBERT.
- `question-answering/`: QA using BioBERT.
- `relation-extraction/`: RE using BioBERT.
Most examples are modified from the examples in Hugging Face Transformers.
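For instance, the `embedding/` example ultimately reduces to pooling BioBERT's per-token representations into one vector per sentence. A common choice is mean pooling over non-padding tokens; the sketch below uses plain `torch` with toy tensors standing in for BioBERT output, and the function name and shapes are illustrative, not the repository's API.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

# Batch of 2 sentences, seq_len 4, hidden size 8; second sentence unpadded
hidden = torch.ones(2, 4, 8)
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])
print(mean_pool(hidden, mask).shape)  # torch.Size([2, 8])
```

Masking before summing matters: averaging over padding positions would dilute short sentences' embeddings.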
```bibtex
@article{lee2020biobert,
  title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining},
  author={Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
  journal={Bioinformatics},
  volume={36},
  number={4},
  pages={1234--1240},
  year={2020},
  publisher={Oxford University Press}
}
```
Please see the LICENSE file for details. Downloading data indicates your acceptance of our disclaimer.
For help or issues using BioBERT-PyTorch, please create an issue.