Commit
Commit for tutorial on using oversampled data for training
1 parent c51d0f1, commit 7f6a8bf
Showing 1 changed file with 80 additions and 0 deletions.
80 changes: 80 additions & 0 deletions
medcat/2_train_model/2_supervised_training/meta_annotation_training_with_oversampling.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "f1ac6acb", | ||
"metadata": {}, | ||
"source": [ | ||
"# Oversampling data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "9a431cf0", | ||
"metadata": {}, | ||
"source": [ | ||
"You can generate synthetic data to help mitigate class imbalance. <br> Use this code to generate synthetic data with an LLM: [link](https://gist.github.com/shubham-s-agarwal/401ef8bf6cbbd66fa0c76a8fbfc1f6c4) <br> <b>NOTE</b>: the generated data requires a manual quality check to ensure that only high-quality, relevant data is used for training." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "bd8066b5", | ||
"metadata": {}, | ||
"source": [ | ||
"The gist code does not produce data in the format MedCAT requires, so for now the generated data must be reformatted manually. We will update this module to include code for that conversion." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "d5860552", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Refer to the meta_annotation_training notebook for the initial steps" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "949299e4", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# To train on the original + synthetic data, follow the same steps as in the\n", | ||
"# meta_annotation_training notebook up to initialising the MetaCAT model\n", | ||
"\n", | ||
"# Initialise and train meta_model\n", | ||
"mc = MetaCAT(tokenizer=tokenizer, embeddings=None, config=config)\n", | ||
"\n", | ||
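"# The expected format is a list of samples, each [tokens, index of medical entity, label]:\n", | ||
"# [[['text','of','the','document'], [index of medical entity], \"label\"],\n", | ||
"#  [['text','of','the','document'], [index of medical entity], \"label\"]]\n", | ||
"\n", | ||
"# Placeholder: replace with the manually formatted synthetic samples\n", | ||
"synthetic_data_export = [[],[],[]]\n", | ||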
"\n", | ||
"results = mc.train_from_json(mctrainer_export_path, save_dir_path=save_dir_path, data_oversampled=synthetic_data_export)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.8" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
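
Since the synthetic data currently has to be formatted by hand, a quick shape check before calling `train_from_json` may catch mistakes early. The following is a minimal sketch, not part of MedCAT: `validate_oversampled` is a hypothetical helper, and it assumes only the format described in the notebook comment (each sample is a list of string tokens, a list of integer indices for the medical entity, and a string label). The sample texts and label names are made up for illustration.

```python
from typing import Any, List


def validate_oversampled(samples: List[Any]) -> List[str]:
    """Return a list of problems found; an empty list means the shape looks right.

    Hypothetical helper: checks structure and types only, not whether the
    indices or labels are meaningful for a given model.
    """
    problems = []
    for i, sample in enumerate(samples):
        # Each sample must be a triple: [tokens, indices, label]
        if not isinstance(sample, (list, tuple)) or len(sample) != 3:
            problems.append(f"sample {i}: expected [tokens, indices, label]")
            continue
        tokens, indices, label = sample
        if not (isinstance(tokens, list) and tokens
                and all(isinstance(t, str) for t in tokens)):
            problems.append(f"sample {i}: tokens must be a non-empty list of str")
        if not (isinstance(indices, list) and indices
                and all(isinstance(j, int) for j in indices)):
            problems.append(f"sample {i}: indices must be a non-empty list of int")
        if not isinstance(label, str):
            problems.append(f"sample {i}: label must be a str")
    return problems


# Illustrative samples only; the label names are assumptions
synthetic_data_export = [
    [["patient", "denies", "chest", "pain"], [2, 3], "Negated"],
    [["history", "of", "diabetes"], [2], "Historical"],
]

for problem in validate_oversampled(synthetic_data_export):
    print(problem)
```

Note that the empty placeholder `[[],[],[]]` used in the notebook would be reported by this check, which is a reminder to replace it with real samples before training.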