Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences

Accepted by NAACL 2024 Main Conference (Oral Presentation)
- Python 3.8 (Ubuntu 20.04)
- PyTorch 1.11.0 & CUDA 11.3
Here are the basic steps to set up the environment.

Step 1: Create a new Conda environment and install Python and PyTorch with CUDA support for the versions specified above.
```bash
conda create -n [ENV_NAME] python=3.8
conda activate [ENV_NAME]
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
```
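Optionally, verify that the installation can see your GPU before continuing (a quick sanity check; the expected output assumes the versions above):

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# expected: 1.11.0 True  (False means PyTorch cannot find a CUDA device)
```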
Step 2: Install the required Python packages for the repository with the following command:

```bash
pip install -r requirements.txt
```
Step 3: Install the NLTK data. Run the Python interpreter and type the following commands:

```python
>>> import nltk
>>> nltk.download("punkt")
```
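If you prefer a non-interactive install, the same resource can also be downloaded directly from the shell:

```bash
python -m nltk.downloader punkt
```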
All the datasets used have been uploaded to the Hugging Face repository Lhtie/Bio-Domain-Transfer. Download them with the following commands:

```bash
git lfs install
git clone https://huggingface.co/datasets/Lhtie/Bio-Domain-Transfer
```
The folder contains the biomedical datasets PathwayCuration, Cancer Genetics, and Infectious Diseases, and the chemical datasets CHEMDNER, BC5CDR, and DrugProt.
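A quick way to confirm the download succeeded is to list the top-level entries of the cloned folder (a minimal sketch; it assumes the repository was cloned into the current directory):

```python
from pathlib import Path

# Path to the cloned dataset repository; adjust if you cloned it elsewhere
data_root = Path("./Bio-Domain-Transfer")
for entry in sorted(data_root.iterdir()):
    print(entry.name)
```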
All the models used (BERT, SapBERT, S-PubMedBert-MS-MARCO-SCIFACT) can be downloaded from their Hugging Face repositories:

```bash
git lfs install
git clone https://huggingface.co/bert-base-uncased
git clone https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext
git clone https://huggingface.co/pritamdeka/S-PubMedBert-MS-MARCO-SCIFACT
```
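After cloning, you can optionally check that a model loads from its local path with the `transformers` library (a minimal sketch, assuming `transformers` was installed via `requirements.txt` and the clone sits in the current directory):

```python
from transformers import AutoModel, AutoTokenizer

# Local path to one of the cloned model repositories; adjust as needed
model_path = "./SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
print(model.config.hidden_size)  # prints the encoder's hidden size, e.g. 768
```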
- `dataConfig` contains the data processing scripts.
  - DataConfig: modify `dataset_dir` in `dataConfig/config.py`: the directory path to the datasets (e.g. `./Bio-Domain-Transfer`).
  - ModelConfig: modify `sapbert_path`, `sentbert_path`, and `bert_path` in `dataConfig/config.py`: the directory paths to the respective models (a sketch of these settings follows this list).
- `configs/para` contains the configuration files for the different experiment scenarios:
  - `few-shot_bert.yaml`: Target Only
  - `oracle_bert.yaml`: Target Only with full training data
  - `transfer_learning.yaml`: Direct Transfer
  - `transfer_learning_eg.yaml`: EG (fill in `DATA.BIOMEDICAL.SIM_METHOD` to switch between `concat` and `sentEnc`)
  - `transfer_learning_disc.yaml`: ED
  - `transfer_learning_eg_disc.yaml`: EG+ED
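For reference, the path settings in `dataConfig/config.py` might end up looking roughly like the sketch below (illustrative values only; the actual variable layout in the file may differ):

```python
# Illustrative example of the path settings in dataConfig/config.py;
# replace the values with the locations of your own clones.
dataset_dir = "./Bio-Domain-Transfer"                 # cloned dataset repository
bert_path = "./bert-base-uncased"                     # cloned BERT model
sapbert_path = "./SapBERT-from-PubMedBERT-fulltext"   # cloned SapBERT model
sentbert_path = "./S-PubMedBert-MS-MARCO-SCIFACT"     # cloned sentence encoder
```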
Train
Run the `train.py` script (multi-processing) with the following command:
```bash
torchrun --nnodes=1 --nproc_per_node=<# gpus> train.py \
    --cfg_file <configuration file>
```
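For example, to train the EG+ED setting on two GPUs (the GPU count here is only illustrative):

```bash
torchrun --nnodes=1 --nproc_per_node=2 train.py \
    --cfg_file configs/para/transfer_learning_eg_disc.yaml
```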
Test
Run the `eval.py` script to evaluate the fine-tuned models:
```bash
python eval.py --cfg_file <configuration file>
```
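For example, to evaluate a model trained with the same configuration as above:

```bash
python eval.py --cfg_file configs/para/transfer_learning_eg_disc.yaml
```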
```bibtex
@inproceedings{liu-etal-2024-named,
    title = "Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences",
    author = "Liu, Hongyi and
      Wang, Qingyun and
      Karisani, Payam and
      Ji, Heng",
    editor = "Duh, Kevin and
      Gomez, Helena and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.1",
    pages = "1--21",
}
```