The end-to-end NER pipeline for information extraction is designed to take user-provided input and extract a list of entities. The figure above describes how the given text is processed through the pipeline. The following is a step-by-step tutorial on running the NER pipeline.
If you have git installed on your computer, open a terminal window and download the repository by running:
cd PATH TO YOUR FOLDER OF CHOICE (e.g. C:/Users/XYZ/)
git clone https://github.com/Aitslab/EasyNER/
Alternatively, you can download the repository from the GitHub page https://github.com/Aitslab/EasyNER to your designated folder as a zip file (click on 'Code' in the top right corner and then on 'Download ZIP') and unpack it.
To run the pipeline, Anaconda or Miniconda must be installed on the computer. Step-by-step installation instructions can be found at: https://docs.anaconda.com/anaconda/install/index.html.
To install the necessary packages for running the environment, open a conda terminal ("Anaconda prompt" in the Windows program window) and navigate to the EasyNER folder you downloaded using the change directory command (cd). For example:
(base) C:\Users\YOURUSERNAME>cd C:/Users/YOURUSERNAME/Documents/git_repos/EasyNER
Then create the environment by writing the following command:
conda env create -f environment.yml
After installation, load the environment in the conda terminal with this:
conda activate easyner_env
The pipeline consists of several modules which are run sequentially. It is also possible to run the modules individually.
For each pipeline run, the config.json file in the repository needs to be modified with the desired settings. This can be done in any text editor. First, the modules that you want to run should be set to "false" in the ignore section. Then, the sections for those modules should be modified as required. It is advisable to save a copy of the modified config file so that you have a permanent record of the run.
{
  "ignore": {
    "cord_loader": true,
    "downloader": true,
    "text_loader": true,
    "pubmed_bulk_loader": false,
    "splitter": true,
    "ner": true,
    "analysis": false,
    "merger": true,
    "add_tags": true,
    "re": true,
    "metrics": true
  },
In a normal pipeline run, the following modules should be set to false in the ignore section, and the rest to true:
- One of the data loaders, depending on the input type (downloader, cord_loader, text_loader or pubmed_bulk_loader).
- splitter
- ner
- analysis
The following sections will provide more detail on each of the modules.
The pipeline has four different modules for data loading, which handle different input types:
- List of PubMed IDs => Downloader module
- PubMed database => PubMed bulk loader module
- CORD-19 metadata.csv file => CORD loader module
- Free text => Text loader module
This downloader variant of the data loader module takes a single .txt file with PubMed IDs (one ID per row) as input and uses an API to retrieve abstracts from the PubMed database. The output consists of a single JSON file with all titles and abstracts for the selected IDs.
As an example of the input, look at the file "Lund-Autophagy-1.txt". The easiest way to create such a file is to perform a search on PubMed and then save the search results using the "PMID" format option:
To run the downloader module, set "downloader" in the ignore section to false (and cord_loader, text_loader and pubmed_bulk_loader to true) and provide the following arguments in the "downloader" section of the config file (a sketch of the section follows the list):
"input_path": path to file with pubmed IDs
"output_path": path to storage location for output
"batch_size": number of article records downloaded in each call to API. Note that, too large batch size may result in invalid download requests.
The PubMed bulk loader variant of the data loader module downloads the annual baseline of the complete abstract collection from the PubMed database and converts it into multiple, pre-batched JSON files. The user can also choose to download nightly update files alongside the annual baseline. As with the other loader modules, the output_path should be provided in the config file. The file structure can be seen here: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
Similar to the other data loader modules, to run the pubmed_bulk_loader script, set "pubmed_bulk_loader" in the ignore section to false (and downloader, cord_loader and text_loader to true) and provide the following arguments (a sketch of the section follows the list):
"output_path": path to save processed files in (in JSON format),
"baseline": The pubmed annual baseline number, which is the year contained in the file names listed on https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/, e.g. 24 in pubmed24n0001.xml.gz,
"subset": if a subset of the baseline is to be downloaded, this should be set to "true", otherwise "false" downloads the entire baseline,
"subset_range":Specify a range if a subset of files is to be downloaded, ex: to download files numbered 0 to 160 (inclusive) add [0,160],
"get_nightly_update_files": set "true" if nightly update files are to be downloaded alongside the annual baseline, otherwise set false. Note that a range must be provided.
"update_file_range": if get_nightly_update_files is set to true, a range must be provided, ex: [1167,1298] to download files 1167 to 1298 (inclusive). This MUST be provided by the user. To see the available range of files, check: https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
"count_articles": "true" if number of articles within each file is to be counted. Set "false" otherwise.
"raw_download_path": temporary folder where files should be downloaded. Defaults to "data/tmp/pubmed/"
The cord_loader variant of the data loader module processes titles and abstracts in the CORD-19 dataset, a large collection of SARS-CoV-2-related articles updated until 2022-06-02. For the CORD loader to work, the CORD-19 dataset, which includes the metadata.csv file processed by the pipeline, first needs to be downloaded manually from the CORD-19 website (direct download link). The file path to the metadata.csv file should then be provided in the config file as input. By default, the module will process all titles and abstracts in the CORD-19 dataset (approximately 700,000 records). If a smaller subset is to be processed, a .txt file with the selected cord UIDs, which can be extracted from the metadata.csv file, needs to be provided. To run the CORD loader script, set "cord_loader" in the ignore section to false (and downloader, pubmed_bulk_loader and text_loader to true) and provide the following arguments (a sketch of the section follows the list):
"input_path": input file path with CORD-19 metadata.csv file
"output_path": path to storage location for output
"subset": true or false - whether a subset of the CORD-19 data is to be extracted.
"subset_file": input file path to a file with cord UIDs if subset option is set to true
The text_loader variant of the data loader module processes files with free text and converts them into a JSON file. Similar to the downloader and cord_loader, the file paths should be provided in the config file. The output JSON file will contain entries with the prefix and a number as IDs and the file name as title. The number is randomly assigned. To run the text_loader script, set "text_loader" in the ignore section to false (and downloader, pubmed_bulk_loader and cord_loader to true) and provide the following arguments (a sketch of the section follows the list):
"input_path": input file path with free text. The folder may contain one or several .txt files.
"output_path": output file (JSON format)
"prefix": Prefix for the free-text files.
This module loads a JSON file (normally the file produced by the data loader module) and splits the text(s) into sentences with the spaCy or NLTK sentence splitter. The output is stored in one or several JSON files. To run the sentence splitter module, set the ignore parameter for splitter in the config file to false. When using the spaCy option, the user needs to choose the model: "en_core_web_sm" or "en_core_web_trf". The number of texts that are processed together and stored in the same JSON output file is specified under "batch_size". Provide the following arguments (a sketch of the section follows the list):
"input_path": input file path of document collection
"output_folder": output folder path where each bach will be saved
"output_file_prefix": user-set prefix for output files
"tokenizer": "spacy" or "nltk"
"model_name": "en_core_web_sm" or "en_core_web_trf" for spaCy, for nltk write ""
"batch_size": number of texts to be processed together and saved in the same JSON file
"pubmed_bulk": make "true" if pubmed_bulk_loader is used, otherwise use "false"
The NER module performs NER on JSON files containing texts split into sentences (normally the output files from the sentence splitter module). The user can either use deep learning models or the spaCy PhraseMatcher with dictionaries for NER. Several BioBERT-based models fine-tuned on the HUNER corpora collections and several dictionaries are available with the pipeline, but the user can also provide their own. To run this module, the ignore argument for ner should be set to false and the following arguments should be specified in the config file:
"input_path": input folder path where all JSON batch files with texts split into sentences are located
"output_folder": output folder path where each batch will be saved
"output_file_prefix": user-set prefix for tagged output files
"model_type": type of model; the user can choose between "biobert_finetuned" (deep learning models) and "spacy_phrasematcher" (dictionary-based NER)
"model_folder": folder where model is located. For huggingface models use the repo name instead. Eg. "aitslab"
"model_name": name of the model file located in the model folder or repository.
"vocab_path": path to dictionary (if this option is used)
"store_tokens":"no",
"labels": if specific lavels are to be provided, e.g. ["[PAD]", "B", "I", "O", "X", "[CLS]", "[SEP]"],
"clear_old_results": overwrite old results
"article_limit": if user decides to only choose a range of articles to run the model on, default [-1,9000]
"entity_type": type of extracted entity, e.g. "gene"
BioBERT-based NER
- Cell-lines: biobert_huner_cell_v1
- Chemical: biobert_huner_chemical_v1
- Disease: biobert_huner_disease_v1
- Gene/protein: biobert_huner_gene_v1
- Species: biobert_huner_species_v1
The BioBERT models above have been fine-tuned using the HUNER corpora and uploaded to the Hugging Face Hub. These and similar models can be loaded from the Hugging Face Hub by setting "model_path" to "aitslab" and "model_name" to the model intended for use in the NER section of the config file. For example:
"model_type": "biobert_finetuned",
"model_path": "aitslab",
"model_name": "biobert_huner_chemical_v1"
The spaCy PhraseMatcher is used to load dictionaries and run NER. COVID-19-related disease and virus dictionaries are provided here. Dictionary-based NER can be run by specifying "model_type" as "spacy_phrasematcher", "model_name" as the spaCy model (e.g. the "en_core_web_sm" model) and specifying the "vocab_path" (path to the dictionary) in the NER section of the config file. For example:
"model_type": "spacy_phrasematcher",
"model_path": "",
"model_name": "en_core_web_sm",
"vocab_path": "dictionaries/sars-cov-2_synonyms_v2.txt"
This module uses the extracted entities to generate a file with ranked entities and frequency plots. First, as in all the other steps above, set the ignore parameter for analysis to false. Then use the following input and output config arguments (a sketch of the section follows the list):
"input_path": input folder path where all batches of NER are located,
"output_path": output folder path where the analysis files will be saved,
"entity_type": type of entity, this will be added as a prefix to the output file and bar graph,
"plot_top_n": plot top n entities. defaults to 50. Note that plotting more than 100 entities can result in a distorted graph
- File with ranked entity list:
The generated output file contains the following columns:
Column | Description |
---|---|
entity | name of the entity |
total_count | total number of occurrences in the entire document set |
articles_spanned | number of articles in which the entity is found |
batches_spanned | number of batches in which the entity is found |
freq_per_article | total_count/articles_spanned |
freq_per_batch | total_count/batches_spanned |
batch_set | batch IDs of the batches in which the entity is found |
batch_count | number of times the entity is found in each batch |
articles_set | article IDs of the articles in which the entity is found |
- Bar graph of frequencies:
The metrics module can be used to get precision, recall and F1 scores between a true (annotated) file and a prediction file, as long as both are in IOB2 format. Note that the raw BioBERT test prediction file is in IOB2 format. To run metrics, set ignore metrics to false in the config file. Then use the following input and output config arguments (a sketch of the section and an example of the resulting report follow the list):
"predictions_file": file containing predictions by the chosen model (in IOB2 format),
"true_file": file containing true (annotated) values (also in IOB2 format),
"output_file": file containing precision, recall and f1 scores,
"pred_sep": seperator for predictions file, default is " ",
"true_sep": seperator for true annotations file, default is " "
precision recall f1-score support
_ 0.67557 0.65274 0.66396 1627
micro avg 0.67557 0.65274 0.66396 1627
macro avg 0.67557 0.65274 0.66396 1627
weighted avg 0.67557 0.65274 0.66396 1627
The merger module combines results from multiple NER module runs into a single file for analysis. First, as in all the other steps above, set ignore merger to false. Then use the following input and output config arguments (a sketch of the section follows the list):
"input_paths": list of input folder path where the files are saved. for example: ["path/to/cell/model/files/", "path/to/chemical/model/files/", "path/to/disease/model/files/"]
"entities": list of entities correcponding to the models. For example: ["cell", "chemical", "disease"]
"output_path": output path where the medged file will be saved
When the configuration is saved, the pipeline can be executed by activating the easyner_env environment, navigating to the EasyNER folder with "cd" as above, and running the main.py file in the conda terminal:
conda activate easyner_env
python main.py