🎉 A lightweight UMLS-based data augmentation tool for biomedical NLP tasks, including named entity recognition and sentence classification 🎉
- Citation: Kang, T., Perotte, A., Tang, Y., Ta, C., & Weng, C. (2020). UMLS-based data augmentation for natural language processing of clinical research literature. Journal of the American Medical Informatics Association.
- Author: Tian Kang (tk2624@cumc.columbia.edu)
- Affiliation: Department of Biomedical Informatics, Columbia University (Dr. Chunhua Weng's lab)
- Built upon EDA (Easy Data Augmentation)
- Install 'UMLS' and 'QuickUMLS' locally
- Get your UMLS SOAP API key from the UTS ‘My Profile’ area after signing in to the UMLS Terminology Services (UTS)
- Add your API key and QuickUMLS directory to `config.py`
- Customize other variables in `config.py`
- Input: CoNLL-format file
- Usage: `python augment4ner.py [-h] --input INPUT [--output OUTPUT] [--num_aug NUM_AUG] [--alpha ALPHA]`
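
To check an input file before running the NER augmentation, it can help to load it the way a typical CoNLL reader would. The snippet below is a minimal sketch, not the parser used by `augment4ner.py`: it assumes whitespace-separated columns with the token first and the tag last, and blank lines between sentences. The function name `read_conll` and the sample tags (`B-Condition`, `B-Drug`) are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or a document marker) ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: the token is the first column and the tag is the last.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample data; the real tag set depends on your corpus.
sample = """\
Patients O
with O
type B-Condition
2 I-Condition
diabetes I-Condition

Metformin B-Drug
was O
given O
""".splitlines()

sentences = read_conll(sample)
print(len(sentences), "sentences;", "first token of sentence 2:", sentences[1][0])
```

If the file loads into the expected number of sentences and every row yields a token and a tag, it is likely well-formed enough to pass as `--input`.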
- Input: `|`-separated file (`index|label|sentence text`)
- Usage: `python augment4class.py [-h] --input INPUT [--output OUTPUT] [--num_aug NUM_AUG] [--alpha ALPHA]`
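
The `index|label|sentence text` layout can be sanity-checked with a few lines of Python. This is a rough sketch under stated assumptions rather than the loader inside `augment4class.py`: it assumes one example per line and splits only on the first two `|` characters so that a `|` inside the sentence is preserved. The names `Example` and `read_pipe_file` and the sample labels are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Example(NamedTuple):
    index: str
    label: str
    text: str


def read_pipe_file(lines: Iterable[str]) -> List[Example]:
    """Parse lines of the form index|label|sentence text."""
    examples: List[Example] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        # Split on the first two "|" only, so a "|" in the sentence survives.
        parts = line.split("|", 2)
        if len(parts) != 3:
            raise ValueError(f"expected index|label|sentence text, got: {line!r}")
        index, label, text = (part.strip() for part in parts)
        examples.append(Example(index, label, text))
    return examples


# Hypothetical sample rows; labels depend on your classification task.
sample = [
    "1|eligibility|Patients aged 18 years or older with type 2 diabetes.",
    "2|outcome|The primary outcome was change in HbA1c | week 12.",
]

examples = read_pipe_file(sample)
for example in examples:
    print(example.index, example.label, example.text)
```

A `ValueError` here points at a row with fewer than three fields, which is worth fixing before passing the file as `--input`.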
See `examples/example4ner.conll` and `example/example4class.txt`.
- Wei, J., & Zou, K. (2019). EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196.