
This repo walks you through how to use transfer learning to fine-tune an LLM (large language model) using U.K. Supreme Court case law as the domain-specific dataset. The model being fine-tuned is the HuggingFace GPTJ-6B model.

Domain-specific fine-tuning of HuggingFace GPTJ-6B with U.K. Supreme Court case documents

Overview

This repo contains a notebook that will walk you through how to fine-tune a pre-trained large language model with domain-specific knowledge.

The domain-specific dataset used to fine-tune this model consists of United Kingdom (U.K.) Supreme Court case documents. We will tune the model on 693 legal documents.

Prereqs

To run this notebook, we assume you are familiar with running a SageMaker Notebook instance or a SageMaker Studio notebook instance.

SageMaker Studio Resources

Introduction to Amazon SageMaker Studio - Video

Build ML models using SageMaker Studio Notebooks - Workshop

Dataset info

The stats below apply when all 693 case documents are used to tune the model.

  • Page count: ~17,718
  • Word count: 10,015,333
  • Characters (no spaces): 49,897,639

The entire dataset is available for download here

Considerations when fine-tuning the model

The notebook has been configured to allow you to use only a subset of the entire dataset to fine-tune the model if desired. In the Data Prep section, there is a variable called doc_count. You can set this number to your preference, and the model will be fine-tuned based on that specific number of cases from the dataset. The smaller the value you set for this variable, the faster the model will fine-tune.
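As a rough illustration, subsetting might look like the sketch below. Only the `doc_count` variable is named in the notebook's Data Prep section; the directory layout and file handling here are assumptions.

```python
# Minimal sketch of subsetting the case documents before fine-tuning.
# Only `doc_count` comes from the notebook; the data path is an assumption.
from pathlib import Path

doc_count = 250  # fine-tune on the first 250 of the 693 case documents

all_docs = sorted(Path("data/uk_supreme_court_cases").glob("*.txt"))  # assumed layout
selected_docs = all_docs[:doc_count]

print(f"Fine-tuning on {len(selected_docs)} of {len(all_docs)} case documents")
```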

Training/Tuning Time estimates

Here are the estimated training times based on the total number of case documents in the training dataset. Note that the times assume training for 3 epochs.

All training was run on a single ml.p3dn.24xlarge instance.

  • 250 training documents: 1 hour 41 minutes
  • 500 training documents: 2 hours 57 minutes
  • 693 training documents: 4 hours

GPTJ-6B base model

Steps you will go through in the notebook to test the base model (see the sketch after this list):

  1. Clone this repo in a SageMaker Studio Jupyter notebook
  2. Install needed notebook libraries
  3. Configure the notebook to use SageMaker
  4. Retrieve base model container
  5. Deploy the model inference endpoint
  6. Call inference endpoint to retrieve results from the LLM
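Steps 4 to 6 map onto the SageMaker Python SDK roughly as follows. The JumpStart model ID, instance type, and request payload format are assumptions for illustration; the notebook pins the exact container and endpoint configuration it uses.

```python
# Sketch of deploying and invoking the GPTJ-6B base model with the SageMaker SDK.
# model_id, instance type, and payload format are assumptions, not the notebook's exact values.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-textgeneration1-gpt-j-6b")  # assumed JumpStart ID
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # assumed inference instance type
)

# Call the inference endpoint and print the raw response.
response = predictor.predict({"text_inputs": "Summarise the doctrine of precedent."})
print(response)
```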

Fine-tuned model

Steps you will go through in the notebook to test the fine-tuned model (see the sketch after this list):

  1. Download dataset
  2. Prep the dataset and upload it to S3
  3. Retrieve the base model container
  4. Set hyperparameters for fine-tuning
  5. Start training/tuning job
  6. Deploy inference endpoint for the fine-tuned model
  7. Call inference endpoint for the fine-tuned model
  8. Parse endpoint results
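At a high level, steps 2 to 8 look something like the sketch below. The model ID, S3 prefix, channel name, sample prompt, and inference instance type are assumptions; the instance type, count, and epoch setting for training reflect the timing estimates above. Consult the notebook for the exact hyperparameters.

```python
# Rough sketch of the fine-tuning flow with the SageMaker Python SDK.
# Model ID, S3 prefix, and channel name are assumptions for illustration.
import sagemaker
from sagemaker.jumpstart.estimator import JumpStartEstimator

session = sagemaker.Session()

# Step 2: upload the prepared case documents to S3.
train_s3_uri = session.upload_data(path="data/train", key_prefix="gptj-uk-case-law")

# Steps 3-5: configure and launch the training/tuning job.
estimator = JumpStartEstimator(
    model_id="huggingface-textgeneration1-gpt-j-6b",  # assumed JumpStart ID
    instance_type="ml.p3dn.24xlarge",
    instance_count=1,
    hyperparameters={"epochs": "3"},
)
estimator.fit({"training": train_s3_uri})

# Steps 6-8: deploy the fine-tuned model, call it, and inspect the result.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")
result = predictor.predict({"text_inputs": "What was the court's holding on judicial review?"})
print(result)
```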

Final Step

  • Be sure to delete all models and endpoints to avoid incurring unnecessary charges; a minimal cleanup sketch follows.
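
A minimal cleanup sketch, assuming `predictor` is the object returned by the deploy calls above:

```python
# Tear down every endpoint and model you deployed (base and fine-tuned)
# so no further charges accrue. `predictor` comes from the deploy sketches above.
predictor.delete_endpoint()  # deletes the endpoint and its endpoint configuration
predictor.delete_model()     # deletes the backing SageMaker model
```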

Disclaimer

This notebook demonstrates how you can fine-tune an LLM using transfer learning. Even though the model is fine-tuned on actual U.K. Supreme Court case documents, you should not use this notebook for legal advice.

Running notebook

To run the notebook, clone this repo in a SageMaker Notebook instance or a SageMaker Studio notebook.
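
For example, from a notebook cell or terminal (the URL assumes the repository path shown on this page):

```python
# Clone the repo from inside a SageMaker notebook cell (Jupyter shell magic).
!git clone https://github.com/sheldonlsides/fine-tuning-llm-with-domain-knowledge.git
```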

Go to Notebook
