Commit db9f702

Merge pull request #73 from NeotomaDB/14-fine-tune-spacy-ner-model

Update prelabeling scripts

brabbit61 authored Jun 26, 2023
2 parents b9e2415 + 418a5d6 commit db9f702
Showing 15 changed files with 322 additions and 83 deletions.
Binary file added src/entity_extraction/assets/account_creation.png
Binary file added src/entity_extraction/assets/correct_labels.png
Binary file added src/entity_extraction/assets/global_index.png
Binary file added src/entity_extraction/assets/green_tab.png
Binary file added src/entity_extraction/assets/labeling.png
Binary file added src/entity_extraction/assets/labelstudio_tab.png
Binary file added src/entity_extraction/assets/org_nav.png
Binary file added src/entity_extraction/assets/settings.png
19 changes: 8 additions & 11 deletions src/entity_extraction/spacy_entity_extraction.py
```diff
@@ -8,14 +8,13 @@
 # ensure that the parent directory is on the path for relative imports
 sys.path.append(os.path.join(os.path.dirname(__file__), os.pardir, os.pardir))
 
+from src.logs import get_logger
+# logger = logging.getLogger(__name__)
+logger = get_logger(__name__)
 
-def spacy_extract_all(text: str,
-                      ner_model=None,
-                      model_path=os.path.join(
-                          os.pardir,
-                          "models",
-                          "v1",
-                          "transformer")):
+def spacy_extract_all(
+        text: str,
+        ner_model=None):
     """
     Extracts entities from text using a spacy model
@@ -25,8 +24,6 @@ def spacy_extract_all(text: str,
         The text to extract entities from
     ner_model : spacy model
         The spacy model to use for entity extraction
-    model_path : str
-        The path to the spacy model to use for entity extraction
 
     Returns
     -------
@@ -35,8 +32,8 @@ def spacy_extract_all(text: str,
     """
 
     if ner_model == None:
-        spacy.require_cpu()
-        ner_model = spacy.load(model_path)
+        logger.info("Empty model passed, return 0 labels.")
+        return []
 
     entities = []
     doc = ner_model(text)
```
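
With this change, `spacy_extract_all` no longer loads a model from disk on demand: callers must pass an already-loaded spaCy pipeline, and calling it without one logs a message and returns an empty list. A hypothetical caller under the new signature (the import path and model path here are illustrative, not values confirmed by the diff):

```python
import spacy

from src.entity_extraction.spacy_entity_extraction import spacy_extract_all

# load the fine-tuned NER pipeline once and reuse it across calls
ner_model = spacy.load("models/ner-v1-transformer")  # illustrative path

entities = spacy_extract_all(
    "Lake sediment cores were collected in 1998.", ner_model=ner_model
)

# with no model supplied, the function now short-circuits and returns no labels
assert spacy_extract_all("some text") == []
```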
(training pipeline shell script)

```diff
@@ -26,11 +26,7 @@ python3 src/preprocessing/labelling_data_split.py \
     --val_split $VAL_SPLIT \
     --test_split $TEST_SPLIT
 
-python3 src/preprocessing/spacy_preprocess.py \
-    --data_path $DATA_OUTPUT_PATH \
-    --train_split $TRAIN_SPLIT \
-    --val_split $VAL_SPLIT \
-    --test_split $TEST_SPLIT
+python3 src/preprocessing/spacy_preprocess.py --data_path $DATA_OUTPUT_PATH
 
 if [ -z "$MODEL_PATH" ]; then
     # If the model path is null, then start training from scratch
```
129 changes: 129 additions & 0 deletions src/preprocessing/README.md
@@ -0,0 +1,129 @@
# **Preprocessing**

This readme provides an overview of the scripts in the **preprocessing** directory, which contains the Python scripts used for data preprocessing tasks, from labelling to training. Below, you will find information about the purpose of each script and instructions on how to use it.

**Table of Contents**
- [Setup](#setup)
- [Usage](#usage)
- [Labelling Preprocessing](#labelling-preprocessing)
- [Labelling Data Splitting](#labelling-data-splitting)
- [SpaCy Preprocessing](#spacy-preprocessing)

---

## **Setup**
Feel free to explore each script to understand its functionality. These Python scripts are invoked by bash scripts that run the full pipeline, so they do not normally need to be executed independently; however, you can run them on their own for specific preprocessing needs using the commands below.

To use the preprocessing scripts, follow these steps:

1. Ensure that you have the project environment activated and its dependencies installed.

2. Place your input data files in a folder and provide the appropriate paths as input arguments.

3. Specify the other mandatory parameters (as defined below).

4. Run the script to execute the preprocessing tasks on your data.

---

## **Usage**
---

### **Labelling Preprocessing**
To use the `labelling_preprocessing.py` script, execute the command below, substituting the appropriate input arguments:

```bash
python3 labelling_preprocessing.py --model_version <model_version> --output_path <output_path> [--model_path <model_path>] [--data_path <data_path>] [--bib_path <bib_path>] [--sentences_path <sentences_path>] [--char_len <char_len>] [--min_len <min_len>]
```

#### **Description**

This script takes the original article text as input and generates pre-labelled JSON files that can be uploaded to LabelStudio for further annotation or analysis. It performs the following tasks (a simplified sketch follows the list):

1. Splits the article text into smaller chunks based on the specified character length.

2. Creates JSON files for each chunk, containing the required fields for LabelStudio.

3. Assigns a unique identifier to each chunk.

4. Adds metadata from the bibjson file, if provided.

5. Utilizes a specified model version, if provided, to generate labels.
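
A minimal sketch of the chunking and task-creation logic, with simplified section handling; the helper names and the exact `data` fields are illustrative assumptions, not the script's actual implementation:

```python
import uuid

def chunk_text(text: str, char_len: int = 4000, min_len: int = 1500) -> list[str]:
    """Split text into chunks of roughly char_len characters."""
    chunks = [text[i:i + char_len] for i in range(0, len(text), char_len)]
    # a trailing chunk shorter than min_len is folded into the previous one
    if len(chunks) > 1 and len(chunks[-1]) < min_len:
        chunks[-2] += chunks.pop()
    return chunks

def make_tasks(text: str, metadata: dict) -> list[dict]:
    """Build one LabelStudio task per chunk, using the {"data": {...}} task format."""
    tasks = []
    for chunk in chunk_text(text):
        tasks.append({
            "data": {
                "text": chunk,                  # the text shown to annotators
                "chunk_id": str(uuid.uuid4()),  # unique identifier per chunk
                **metadata,                     # e.g. fields from the bibjson file
            }
        })
    return tasks
```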

#### **Options**

- `--model_version=<model_version>`: Specify the model version used to generate labels.

- `--output_path=<output_path>`: Specify the path to the output directory where the generated labels will be stored for uploading to LabelStudio.

- `--model_path=<model_path>` (optional): Specify the path to the model artifacts to use for label generation. If not specified, only chunking is performed.

- `--data_path=<data_path>` (optional): Specify the path to a CSV file containing full text articles.

- `--bib_path=<bib_path>` (optional): Specify the path to the bibjson file containing article metadata.

- `--sentences_path=<sentences_path>` (optional): Specify the path to the sentences_nlp file that contains all sentences as returned by xDD.

- `--char_len=<char_len>` (optional): Specify the desired length (in characters) for each chunk when splitting a section of the article. Default value is 4000.

- `--min_len=<min_len>` (optional): Specify the minimum length (in characters) for a section. If a section is smaller than this value, it will be combined with the next section. Default value is 1500.

Note: Either `--data_path` or both `--bib_path` and `--sentences_path` must be specified to locate the input data to preprocess for labeling.

---

### **Labelling Data Splitting**

To use the `labelling_data_split.py` script, execute the command below, substituting the appropriate input arguments:

```bash
python3 labelling_data_split.py --raw_label_path=<raw_label_path> --output_path=<output_path> [--train_split=<train_split>] [--val_split=<val_split>] [--test_split=<test_split>]
```

#### **Description**
This script takes a labelled dataset in JSONLines format as input and splits it into separate train, validation, and test sets. It performs the following tasks:

1. Reads the labelled text from the specified `raw_label_path` directory.

2. Randomly divides the data into train, validation, and test sets based on the provided split percentages.

3. Writes the divided datasets into separate folders as separate JSON files in the specified `output_path` directory.

The resulting train, validation, and test sets can be used for training and evaluating machine learning models.
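
One way the random split could be implemented — a sketch, not the script's actual logic; the function name and fixed seed are illustrative:

```python
import random

def split_examples(examples: list, train: float = 0.7, val: float = 0.15, seed: int = 42):
    """Shuffle labelled examples and split them into train/val/test lists."""
    assert 0 < train + val <= 1.0, "train + val must leave room for the test set"
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],                 # train
            shuffled[n_train:n_train + n_val],  # validation
            shuffled[n_train + n_val:])         # test (the remainder)
```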

#### **Options**
- `--raw_label_path=<raw_label_path>`: Specify the path to the directory where the raw label files are located.

- `--output_path=<output_path>`: Specify the path to the directory where the output files will be written.

- `--train_split=<train_split>` (optional): Specify the percentage of examples to dedicate to the train set. The default value is 0.7 (70%).

- `--val_split=<val_split>` (optional): Specify the percentage of examples to dedicate to the validation set. The default value is 0.15 (15%).

- `--test_split=<test_split>` (optional): Specify the percentage of examples to dedicate to the test set. The default value is 0.15 (15%).

---

### **SpaCy Preprocessing**

To use the `spacy_preprocess.py` script, execute the command below, substituting the appropriate input arguments:

```bash
python3 spacy_preprocess.py --data_path=<data_path>
```

#### **Description**
This script manages the creation of the custom data artifacts required for training and fine-tuning spaCy models. It performs the following tasks (a sketch of the conversion follows the list):

1. Reads the dataset in JSONLines format from the specified `data_path`.

2. Creates spans of entities from the labelled files.

3. Converts the tagged data into a spaCy-compatible format, such as Doc or DocBin objects.

4. Creates the custom data artifacts that can be used for training or fine-tuning spaCy models.
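
A sketch of the span-to-DocBin conversion under spaCy 3.x; the function name, input tuple layout, and output file are assumptions rather than the script's exact interface:

```python
import spacy
from spacy.tokens import DocBin

def to_docbin(examples: list, output_file: str = "train.spacy") -> None:
    """Convert (text, [(start, end, label), ...]) pairs into a serialized DocBin."""
    nlp = spacy.blank("en")  # tokenizer only; no trained pipeline components needed
    db = DocBin()
    for text, annotations in examples:
        doc = nlp(text)
        spans = []
        for start, end, label in annotations:
            # snap character offsets to token boundaries; returns None on misalignment
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        doc.ents = spans  # raises if spans overlap; filter beforehand if they can
        db.add(doc)
    db.to_disk(output_file)
```

Artifacts serialized this way can then be supplied to `spacy train` as the training or dev corpus.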

#### **Options**
- `--data_path=<data_path>`: Specify the path to the folder containing files in JSONLines format.
