Commit db9f702

Merge pull request #73 from NeotomaDB/14-fine-tune-spacy-ner-model

Update prelabeling scripts

brabbit61 authored Jun 26, 2023
2 parents b9e2415 + 418a5d6 commit db9f702
Showing 15 changed files with 322 additions and 83 deletions.
Binary file added src/entity_extraction/assets/account_creation.png
Binary file added src/entity_extraction/assets/correct_labels.png
Binary file added src/entity_extraction/assets/global_index.png
Binary file added src/entity_extraction/assets/green_tab.png
Binary file added src/entity_extraction/assets/labeling.png
Binary file added src/entity_extraction/assets/labelstudio_tab.png
Binary file added src/entity_extraction/assets/org_nav.png
Binary file added src/entity_extraction/assets/settings.png
19 changes: 8 additions & 11 deletions src/entity_extraction/spacy_entity_extraction.py
```diff
@@ -8,14 +8,13 @@
 # ensure that the parent directory is on the path for relative imports
 sys.path.append(os.path.join(os.path.dirname(__file__), os.pardir, os.pardir))
 
+from src.logs import get_logger
+# logger = logging.getLogger(__name__)
+logger = get_logger(__name__)
 
-def spacy_extract_all(text: str,
-                      ner_model=None,
-                      model_path=os.path.join(
-                          os.pardir,
-                          "models",
-                          "v1",
-                          "transformer")):
+def spacy_extract_all(
+        text: str,
+        ner_model=None):
     """
     Extracts entities from text using a spacy model
@@ -25,8 +24,6 @@ def spacy_extract_all(text: str,
         The text to extract entities from
     ner_model : spacy model
         The spacy model to use for entity extraction
-    model_path : str
-        The path to the spacy model to use for entity extraction
 
     Returns
     -------
@@ -35,8 +32,8 @@ def spacy_extract_all(text: str,
     """
 
     if ner_model == None:
-        spacy.require_cpu()
-        ner_model = spacy.load(model_path)
+        logger.info("Empty model passed, return 0 labels.")
+        return []
 
     entities = []
     doc = ner_model(text)
```
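
With this change, `spacy_extract_all` no longer loads a model from disk on demand: callers must pass an already-loaded spaCy pipeline, and calling it without one logs a message and returns an empty list. A hypothetical caller under the new signature (the import path and model path here are illustrative, not values confirmed by the diff):

```python
import spacy

from src.entity_extraction.spacy_entity_extraction import spacy_extract_all

# load the fine-tuned NER pipeline once and reuse it across calls
ner_model = spacy.load("models/ner-v1-transformer")  # illustrative path

entities = spacy_extract_all(
    "Lake sediment cores were collected in 1998.", ner_model=ner_model
)

# with no model supplied, the function now short-circuits and returns no labels
assert spacy_extract_all("some text") == []
```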
(training pipeline shell script)

```diff
@@ -26,11 +26,7 @@ python3 src/preprocessing/labelling_data_split.py \
     --val_split $VAL_SPLIT \
     --test_split $TEST_SPLIT
 
-python3 src/preprocessing/spacy_preprocess.py \
-    --data_path $DATA_OUTPUT_PATH \
-    --train_split $TRAIN_SPLIT \
-    --val_split $VAL_SPLIT \
-    --test_split $TEST_SPLIT
+python3 src/preprocessing/spacy_preprocess.py --data_path $DATA_OUTPUT_PATH
 
 if [ -z "$MODEL_PATH" ]; then
     # If the model path is null, then start training from scratch
```
129 changes: 129 additions & 0 deletions src/preprocessing/README.md
@@ -0,0 +1,129 @@
# **Preprocessing**

This readme provides an overview of the scripts in the **preprocessing** directory, which contains the Python scripts used for data preprocessing tasks, from labelling to training. Below, you will find information about the purpose of each script and instructions on how to use it.

**Table of Contents**
- [Setup](#setup)
- [Usage](#usage)
- [Labelling Preprocessing](#labelling-preprocessing)
- [Labelling Data Splitting](#labelling-data-splitting)
- [SpaCy Preprocessing](#spacy-preprocessing)

---

## **Setup**
Feel free to explore each script to understand its functionality. These Python scripts are invoked by bash scripts that run the full pipeline, so they do not normally need to be executed independently; however, you can run them on their own for specific preprocessing needs using the commands below.

To use the preprocessing scripts, follow these steps:

1. Ensure that you have the project environment activated and its dependencies installed.

2. Place your input data files in a folder and provide the appropriate paths as input arguments.

3. Specify the other mandatory parameters (as defined below).

4. Run the script to execute the preprocessing tasks on your data.

---

## **Usage**
---

### **Labelling Preprocessing**
To use the `labelling_preprocessing.py` script, execute the command below, substituting the appropriate input arguments:

```bash
python3 labelling_preprocessing.py --model_version <model_version> --output_path <output_path> [--model_path <model_path>] [--data_path <data_path>] [--bib_path <bib_path>] [--sentences_path <sentences_path>] [--char_len <char_len>] [--min_len <min_len>]
```

#### **Description**

This script takes the original article text as input and generates pre-labelled JSON files that can be uploaded to LabelStudio for further annotation or analysis. It performs the following tasks (a simplified sketch follows the list):

1. Splits the article text into smaller chunks based on the specified character length.

2. Creates JSON files for each chunk, containing the required fields for LabelStudio.

3. Assigns a unique identifier to each chunk.

4. Adds metadata from the bibjson file, if provided.

5. Utilizes a specified model version, if provided, to generate labels.
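
A minimal sketch of the chunking and task-creation logic, with simplified section handling; the helper names and the exact `data` fields are illustrative assumptions, not the script's actual implementation:

```python
import uuid

def chunk_text(text: str, char_len: int = 4000, min_len: int = 1500) -> list[str]:
    """Split text into chunks of roughly char_len characters."""
    chunks = [text[i:i + char_len] for i in range(0, len(text), char_len)]
    # a trailing chunk shorter than min_len is folded into the previous one
    if len(chunks) > 1 and len(chunks[-1]) < min_len:
        chunks[-2] += chunks.pop()
    return chunks

def make_tasks(text: str, metadata: dict) -> list[dict]:
    """Build one LabelStudio task per chunk, using the {"data": {...}} task format."""
    tasks = []
    for chunk in chunk_text(text):
        tasks.append({
            "data": {
                "text": chunk,                  # the text shown to annotators
                "chunk_id": str(uuid.uuid4()),  # unique identifier per chunk
                **metadata,                     # e.g. fields from the bibjson file
            }
        })
    return tasks
```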

#### **Options**

- `--model_version=<model_version>`: Specify the model version used to generate labels.

- `--output_path=<output_path>`: Specify the path to the output directory where the generated labels will be stored for uploading to LabelStudio.

- `--model_path=<model_path>` (optional): Specify the path to the model artifacts to use for label generation. If not specified, only chunking is performed.

- `--data_path=<data_path>` (optional): Specify the path to a CSV file containing full text articles.

- `--bib_path=<bib_path>` (optional): Specify the path to the bibjson file containing article metadata.

- `--sentences_path=<sentences_path>` (optional): Specify the path to the sentences_nlp file that contains all sentences as returned by xDD.

- `--char_len=<char_len>` (optional): Specify the desired length (in characters) for each chunk when splitting a section of the article. Default value is 4000.

- `--min_len=<min_len>` (optional): Specify the minimum length (in characters) for a section. If a section is smaller than this value, it will be combined with the next section. Default value is 1500.

Note: Either `--data_path` or both `--bib_path` and `--sentences_path` must be specified to locate the input data to preprocess for labeling.

---

### **Labelling Data Splitting**

To use the `labelling_data_split.py` script, execute the command below, substituting the appropriate input arguments:

```bash
python3 labelling_data_split.py --raw_label_path=<raw_label_path> --output_path=<output_path> [--train_split=<train_split>] [--val_split=<val_split>] [--test_split=<test_split>]
```

#### **Description**
This script takes a labelled dataset in JSONLines format as input and splits it into separate train, validation, and test sets. It performs the following tasks:

1. Reads the labelled text from the specified `raw_label_path` directory.

2. Randomly divides the data into train, validation, and test sets based on the provided split percentages.

3. Writes the divided datasets into separate folders as separate JSON files in the specified `output_path` directory.

The resulting train, validation, and test sets can be used for training and evaluating machine learning models.
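
One way the random split could be implemented — a sketch, not the script's actual logic; the function name and fixed seed are illustrative:

```python
import random

def split_examples(examples: list, train: float = 0.7, val: float = 0.15, seed: int = 42):
    """Shuffle labelled examples and split them into train/val/test lists."""
    assert 0 < train + val <= 1.0, "train + val must leave room for the test set"
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],                 # train
            shuffled[n_train:n_train + n_val],  # validation
            shuffled[n_train + n_val:])         # test (the remainder)
```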

#### **Options**
- `--raw_label_path=<raw_label_path>`: Specify the path to the directory where the raw label files are located.

- `--output_path=<output_path>`: Specify the path to the directory where the output files will be written.

- `--train_split=<train_split>` (optional): Specify the percentage of examples to dedicate to the train set. The default value is 0.7 (70%).

- `--val_split=<val_split>` (optional): Specify the percentage of examples to dedicate to the validation set. The default value is 0.15 (15%).

- `--test_split=<test_split>` (optional): Specify the percentage of examples to dedicate to the test set. The default value is 0.15 (15%).

---

### **SpaCy Preprocessing**

To use the `spacy_preprocess.py` script, execute the command below, substituting the appropriate input arguments:

```bash
python3 spacy_preprocess.py --data_path=<data_path>
```

#### **Description**
This script manages the creation of the custom data artifacts required for training and fine-tuning spaCy models. It performs the following tasks (a sketch of the conversion follows the list):

1. Reads the dataset in JSONLines format from the specified `data_path`.

2. Creates spans of entities from the labelled files.

3. Converts the tagged data into a spaCy-compatible format, such as Doc or DocBin objects.

4. Creates the custom data artifacts that can be used for training or fine-tuning spaCy models.
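
A sketch of the span-to-DocBin conversion under spaCy 3.x; the function name, input tuple layout, and output file are assumptions rather than the script's exact interface:

```python
import spacy
from spacy.tokens import DocBin

def to_docbin(examples: list, output_file: str = "train.spacy") -> None:
    """Convert (text, [(start, end, label), ...]) pairs into a serialized DocBin."""
    nlp = spacy.blank("en")  # tokenizer only; no trained pipeline components needed
    db = DocBin()
    for text, annotations in examples:
        doc = nlp(text)
        spans = []
        for start, end, label in annotations:
            # snap character offsets to token boundaries; returns None on misalignment
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        doc.ents = spans  # raises if spans overlap; filter beforehand if they can
        db.add(doc)
    db.to_disk(output_file)
```

Artifacts serialized this way can then be supplied to `spacy train` as the training or dev corpus.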

#### **Options**
- `--data_path=<data_path>`: Specify the path to the folder containing files in JSONLines format.
