# MediSOAP

This project fine-tunes the Llama2-7B model using LoRA and QLoRA techniques to generate structured SOAP (Subjective, Objective, Assessment, Plan) notes from patient-doctor conversations. The training data consists of transcribed medical dialogues paired with notes in the SOAP format.
## Table of Contents

- Introduction
- Prerequisites
- Installation
- Dataset
- Fine-Tuning Process
- Evaluation
- Usage
- Results
- Contributing
- License
## Introduction

SOAP notes are a standard documentation format that healthcare providers use to record structured entries in a patient's chart. This project automates the generation of SOAP notes from patient-doctor conversations using a fine-tuned Llama2-7B model, leveraging Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) for efficient training.
## Prerequisites

- Python 3.11 or higher
- PyTorch 1.10.0 or higher
- CUDA 10.2 or higher (for GPU support)
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/aman-17/MediSOAP.git
  cd MediSOAP
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
## Dataset

The dataset used for this project is a collection of patient-doctor conversation transcripts formatted into SOAP notes. The dataset must be preprocessed into the required format before training.

To preprocess your custom dataset, follow the format of `train.jsonl`, then:

- Place your raw data files in the `data/` directory.
- Run the preprocessing script:

  ```bash
  python data_preprocessing.py
  ```
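The exact schema of `train.jsonl` is defined by the preprocessing script, but a supervised fine-tuning record typically pairs a conversation with its reference note. A minimal sketch of what one line could look like (the field names `conversation` and `soap_note` are assumptions for illustration, not the project's actual schema):

```python
import json

# Hypothetical example record -- the real field names are defined by
# data_preprocessing.py and train.jsonl, not by this sketch.
record = {
    "conversation": (
        "Doctor: What brings you in today?\n"
        "Patient: I've had a cough for about a week."
    ),
    "soap_note": (
        "S: Patient reports a cough lasting one week.\n"
        "O: Vitals stable; lungs clear on exam.\n"
        "A: Likely viral upper respiratory infection.\n"
        "P: Supportive care; follow up if symptoms persist."
    ),
}

# Each line of a .jsonl file is one independent JSON object.
line = json.dumps(record)
print(line)
```

A `.jsonl` file is simply one such `json.dumps` line per training example, which is why records can be streamed during training without loading the whole file.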
## Fine-Tuning Process

Fine-tuning adapts the pre-trained Llama2-7B and Phi-2 models to this task using the LoRA technique.

- Data preparation: ensure your preprocessed data is in the `data/` directory.
- Training: run the training script:

  ```bash
  python train_phi2.py
  ```
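LoRA's efficiency comes from freezing the base weights and training only a low-rank update ΔW = B·A, so a d×k weight matrix needs r·(d+k) trainable parameters instead of d·k. A back-of-the-envelope sketch (pure Python, no training; the rank r=8 below is a commonly used value, not necessarily what this project's training script configures):

```python
# LoRA replaces a full update of a frozen d x k weight matrix with two
# small trainable factors: B (d x r) and A (r x k), where r << min(d, k).
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted matrix: B has d*r, A has r*k."""
    return d * r + r * k

# Llama2-7B attention projections are 4096 x 4096.
d = k = 4096
r = 8  # assumed rank, for illustration

full = d * k                          # params if the whole matrix were trained
lora = lora_trainable_params(d, k, r)

print(f"full fine-tuning: {full:,} params per matrix")
print(f"LoRA (r={r}): {lora:,} params per matrix ({100 * lora / full:.2f}% of full)")
```

QLoRA applies the same idea on top of a 4-bit quantized base model, shrinking memory for the frozen weights as well, which is what makes 7B-scale fine-tuning feasible on a single consumer GPU.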
## Evaluation

Evaluate the model's performance on a test dataset:

```bash
python evaluate.py --model-path path/to/fine-tuned-model --test-data path/to/test-data
```

Overlap metrics such as BLEU and ROUGE can be used to compare generated notes against the reference notes.
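As a rough illustration of what ROUGE measures, here is a minimal unigram-overlap (ROUGE-1) F1 computation in plain Python; actual evaluation should use a maintained library such as `rouge-score`, and this sketch skips the stemming and tokenization such libraries apply:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference note and a generated note."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "patient reports cough for one week"
candidate = "patient reports a cough lasting one week"
print(f"ROUGE-1 F1: {rouge1_f1(reference, candidate):.3f}")
```

High n-gram overlap does not guarantee clinical correctness, so automatic metrics are best paired with spot-checks of example outputs.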
## Usage

To generate SOAP notes from new patient-doctor conversations, use the inference script:

```bash
python generate.py --model-path path/to/fine-tuned-model --input path/to/conversation.txt
```

The output will be a structured SOAP note based on the input conversation.
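Downstream code often needs the note split into its four sections. A small sketch of how the generated text could be parsed, assuming the model emits `S:`/`O:`/`A:`/`P:`-prefixed lines (an assumption about the output format, not a guarantee of `generate.py`):

```python
import re

def parse_soap(note: str) -> dict:
    """Split a generated note into Subjective/Objective/Assessment/Plan sections."""
    labels = {"S": "Subjective", "O": "Objective", "A": "Assessment", "P": "Plan"}
    sections = {name: "" for name in labels.values()}
    current = None
    for line in note.splitlines():
        # A new section starts with its letter and a colon, e.g. "S: ..."
        m = re.match(r"^([SOAP]):\s*(.*)$", line.strip())
        if m:
            current = labels[m.group(1)]
            sections[current] = m.group(2)
        elif current:
            # Continuation lines belong to the most recent section.
            sections[current] += " " + line.strip()
    return sections

note = "S: Cough for one week.\nO: Lungs clear.\nA: Viral URI.\nP: Supportive care."
parsed = parse_soap(note)
print(parsed["Assessment"])
```

Keeping the four sections separate makes it easy to render the note into a chart template or to evaluate each section independently.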
## Results

Summarize the results obtained from the model's performance on the test dataset, including key metrics and example outputs.
## Contributing

We welcome contributions from the community. To contribute, please follow these steps:

- Fork the repository.
- Create a new branch (`git checkout -b feature-branch`).
- Make your changes.
- Commit your changes (`git commit -m 'Add new feature'`).
- Push to the branch (`git push origin feature-branch`).
- Create a new Pull Request.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
Feel free to update this README with additional details as needed.