Fine-tuning HuggingFace GPT-J-6B with U.K. Supreme Court case documents for domain-specific knowledge
This repo contains a notebook that walks you through fine-tuning a pre-trained large language model with domain-specific knowledge.
The domain-specific dataset we will use to fine-tune the model comes from United Kingdom (U.K.) Supreme Court case documents. We will tune the model on roughly 693 legal documents.
To run this notebook, we assume you are familiar with running a SageMaker Notebook instance or SageMaker Studio Notebook instance. If not, see:
Introduction to Amazon SageMaker Studio - Video
Build ML models using SageMaker Studio Notebooks - Workshop
The statistics below apply if you use all 693 case documents to tune the model.
- Page count: ~17,718
- Word count: 10,015,333
- Characters (no spaces): 49,897,639
The entire dataset is available to be downloaded here
The notebook has been configured to allow you to use only a subset of the entire dataset to fine-tune the model if desired. In the Data Prep section, there is a variable called doc_count. You can set this number to your preference, and the model will be fine-tuned based on that specific number of cases from the dataset. The smaller the value you set for this variable, the faster the model will fine-tune.
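The subsetting described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `uksc_cases` folder name, the `.txt` extension, and the `select_cases` helper are all assumptions for the example.

```python
import os

def select_cases(case_dir, doc_count):
    """Return the first `doc_count` case documents from a local folder.

    Hypothetical sketch of the Data Prep step: the notebook's real paths
    and file layout may differ.
    """
    all_cases = sorted(
        f for f in os.listdir(case_dir) if f.endswith(".txt")
    )
    return all_cases[:doc_count]

# e.g. select_cases("uksc_cases", 100) would pick 100 of the 693 documents.
```

Lowering `doc_count` shrinks the training set, which is what makes the fine-tuning job finish faster.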
Here are the estimated training times based on the total number of case documents in the training dataset. Note that each estimate assumes training for 3 epochs.
- Training time: 1 hour 41 minutes
- Training time: 2 hours 57 minutes
- Training time: 4 hours
Steps you will go through in the notebook to test the base model
- Clone this repo in a SageMaker Studio Jupyter notebook
- Install needed notebook libraries
- Configure the notebook to use SageMaker
- Retrieve base model container
- Deploy the model inference endpoint
- Call inference endpoint to retrieve results from the LLM
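The final step above, calling the endpoint, amounts to sending a JSON payload and decoding a JSON reply. The sketch below shows one plausible shape for that exchange; the `text_inputs` and `generated_texts` keys follow common SageMaker JumpStart text-generation containers but are assumptions here, so check the notebook for the exact schema.

```python
import json

def build_payload(prompt, max_length=100):
    # Assumed request schema for a JumpStart-style text-generation container.
    return json.dumps({"text_inputs": prompt,
                       "parameters": {"max_length": max_length}})

def parse_response(response_body):
    # Assumed response schema: a dict with a list of generated strings.
    result = json.loads(response_body)
    return result["generated_texts"][0]

# The actual call would go through the SageMaker runtime, e.g.:
# runtime = boto3.client("sagemaker-runtime")
# resp = runtime.invoke_endpoint(EndpointName=endpoint_name,
#                                ContentType="application/json",
#                                Body=build_payload("Summarise the facts of ..."))
```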
Steps you will go through in the notebook to test the fine-tuned model
- Download dataset
- Prep the dataset and upload it to S3
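A hedged sketch of the "prep and upload" step: concatenate the selected case documents into a single training file, then push it to S3. The file layout, the one-document-per-block format, and the bucket/prefix names are illustrative assumptions, not the notebook's exact procedure.

```python
import os

def build_training_file(case_dir, filenames, out_path):
    # Concatenate the chosen case documents into one plain-text file
    # (assumed training format; the notebook defines the real one).
    with open(out_path, "w", encoding="utf-8") as out:
        for name in filenames:
            with open(os.path.join(case_dir, name), encoding="utf-8") as f:
                out.write(f.read().strip() + "\n")
    return out_path

# The upload itself would typically use boto3, e.g.:
# boto3.client("s3").upload_file(out_path, bucket, "train/data.txt")
```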
- Retrieve the base model container
- Set hyperparameters for fine-tuning
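Setting hyperparameters typically means populating a dict that is passed to the training job. The key names below mirror common SageMaker JumpStart text-generation settings but are assumptions; the notebook defines the authoritative set. Three epochs matches the training-time estimates quoted earlier.

```python
# Illustrative hyperparameters for the fine-tuning job (assumed key names).
# SageMaker passes hyperparameter values as strings.
hyperparameters = {
    "epochs": "3",          # matches the 3-epoch training-time estimates
    "learning_rate": "5e-5",
    "batch_size": "4",
}
```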
- Start training/tuning job
- Deploy inference endpoint for the fine-tuned model
- Call inference endpoint for the fine-tuned model
- Parse endpoint results
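Parsing the endpoint results usually involves decoding the raw bytes the runtime returns and pulling out the generated text. The `generated_text` key and the list-or-dict handling below are assumptions based on common model containers, not the notebook's guaranteed response shape.

```python
import json

def parse_endpoint_results(raw_bytes):
    # invoke_endpoint returns the body as bytes; decode to JSON first.
    payload = json.loads(raw_bytes.decode("utf-8"))
    # Some containers return a list of dicts, others a single dict.
    if isinstance(payload, list):
        payload = payload[0]
    return payload["generated_text"]
```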
- Be sure to delete all models and endpoints when you are finished, to avoid incurring unneeded spend.
This notebook demonstrates how you can fine-tune an LLM using transfer learning. Even though the model is fine-tuned on actual U.K. Supreme Court case documents, you should not use this notebook for legal advice.
To run the notebook, clone this repo in a SageMaker Notebook instance or SageMaker Studio Notebook.