Skip to content

Commit

Permalink
Commit for tutorial on using oversampled data for training
Browse files Browse the repository at this point in the history
  • Loading branch information
shubham-s-agarwal committed Aug 5, 2024
1 parent c51d0f1 commit 7f6a8bf
Showing 1 changed file with 80 additions and 0 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f1ac6acb",
"metadata": {},
"source": [
"# Oversampling data"
]
},
{
"cell_type": "markdown",
"id": "9a431cf0",
"metadata": {},
"source": [
"You can generate synthetic data to help mitigate class imbalance. <br> Use this code to generate synthetic data using LLM - [link](https://gist.github.com/shubham-s-agarwal/401ef8bf6cbbd66fa0c76a8fbfc1f6c4) <br> <b>NOTE</b>: the generated data will require manual quality check to ensure that high quality and relevant data is used for training. "
]
},
{
"cell_type": "markdown",
"id": "bd8066b5",
"metadata": {},
"source": [
"The data generated from the gist code and the format of the data required by MedCAT are different, requiring manual formatting at the moment. We will update this module to include the code to handle the same."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5860552",
"metadata": {},
"outputs": [],
"source": [
"# Refer to the meta_annotation_training notebook for the initial steps"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "949299e4",
"metadata": {},
"outputs": [],
"source": [
"# To run the training with original + synthetic data\n",
"# Follow all the same steps till initializing the metacat model\n",
"\n",
"# Initialise and train meta_model\n",
"mc = MetaCAT(tokenizer=tokenizer, embeddings=None, config=config)\n",
"\n",
"# the format expected is [[['text','of','the','document'], [index of medical entity], \"label\" ],\n",
"# ['text','of','the','document'], [index of medical entity], \"label\" ]]\n",
"\n",
"synthetic_data_export = [[],[],[]]\n",
"\n",
"results = mc.train_from_json(mctrainer_export_path, save_dir_path=save_dir_path,data_oversampled=synthetic_data_export)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

0 comments on commit 7f6a8bf

Please sign in to comment.