Welcome to the central repository for the ODSC West 2024 Hackathon with NVIDIA!
❔ For more information on the hackathon itself, check out this webpage or this FAQ. ❔
Your goal in this Hackathon is to fine-tune google/gemma-2-2b
using PEFT LoRA on a legal tag-classification task. You'll be using the Law-StackExchange dataset as the base dataset for this task.
You will use NeMo Curator to curate the data and NeMo FW to customize the model, and then evaluate it!
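To make the task concrete, here is a minimal sketch of how a single training example might be framed as a prompt/completion pair for a causal LM. The field names (`title`, `body`, `tags`) and the prompt template are illustrative assumptions, not the required format used in the notebooks.

```python
# Illustrative only: field names and prompt template are assumptions,
# not a required format for the hackathon.
example = {
    "title": "Is a verbal agreement legally binding?",
    "body": "My landlord promised to repaint the apartment but never put it in writing...",
    "tags": ["contract-law", "landlord", "verbal-agreement"],  # multi-label target
}

# One common way to frame tag classification for a causal LM: the question text
# becomes the prompt ("input"), and the comma-separated tags become the
# completion ("output") that the model learns to generate.
record = {
    "input": f"TITLE: {example['title']}\nQUESTION: {example['body']}\nTAGS:",
    "output": ", ".join(example["tags"]),
}
print(record["input"])
print(record["output"])
```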
You are free to:
- Modify training hyperparameters
- Modify or augment the training dataset (e.g., with synthetic data generation (SDG))
- Modify the NeMo Curator curation pipeline (see the sketch after this list for one example)
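As one example of a curation-pipeline modification, the sketch below adds a word-count filter with NeMo Curator. It assumes NeMo Curator's `DocumentDataset`, `Sequential`, `ScoreFilter`, and `WordCountFilter` utilities, plus a hypothetical input file and `text` field; check the curation notebook for the exact modules, paths, and field names used in this repo.

```python
# A minimal sketch (assumed NeMo Curator APIs: DocumentDataset, Sequential,
# ScoreFilter, WordCountFilter). The input path and the "text" field name are
# hypothetical -- adapt them to the fields produced by the curation notebook.
from nemo_curator import ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter

dataset = DocumentDataset.read_json("data/law_stackexchange.jsonl")

# Example modification: drop very short or very long question bodies before training.
pipeline = Sequential([
    ScoreFilter(WordCountFilter(min_words=10, max_words=2000), text_field="text"),
])

curated = pipeline(dataset)
print(f"Documents kept after filtering: {len(curated.df)}")
```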
Your (or your team's) scores will be based on multi-label F1 scores, determined by comparing your generated predictions on the submission dataset against the held-out labels.
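For a local sanity check before submitting, you can compute a multi-label F1 score yourself. The sketch below uses scikit-learn and assumes micro-averaging; the exact averaging used for official scoring is not specified here.

```python
# Local sanity check for multi-label F1 (assumes micro-averaging; the official
# scoring may use a different average).
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

true_tags = [["contract-law", "landlord"], ["copyright"], ["criminal-law", "evidence"]]
pred_tags = [["contract-law"], ["copyright", "fair-use"], ["criminal-law", "evidence"]]

# Binarize both label sets over the set of tags seen in the ground truth.
mlb = MultiLabelBinarizer()
y_true = mlb.fit_transform(true_tags)
y_pred = mlb.transform(pred_tags)  # predicted tags not in the ground truth are ignored

print(f1_score(y_true, y_pred, average="micro"))
```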
In case of ties, we will use the videos you submitted to gauge your understanding of the data, NeMo Curator, and the NeMo Framework:
- Understanding of the data and usage of NeMo Curator
  - Deep understanding of the data processing pipeline and use of the most relevant data processing steps.
- Understanding of fine-tuning and usage of the NeMo Framework
  - Excellent grasp of fine-tuning techniques and use of various hyperparameters for optimal model accuracy and customization.
The repository will guide you through a boilerplate example of NeMo Curator curation pipelines and NeMo FW customization, model loading, and inference.
There are a total of three Jupyter Notebooks to work through:
- The first notebook will take you through downloading, processing, and then curating the target dataset
- The second notebook will download the model and convert it to a NeMo FW compatible format
- The third notebook will go through how to fine-tune the model using PEFT LoRA, and then how to generate submission responses (a rough post-processing sketch follows below)
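As a rough illustration of that last step, here is a sketch of turning raw model generations into a predictions file. The output field names (`question_id`, `predicted_tags`) and the comma-separated tag format are assumptions, so follow the exact submission format described in the notebook and the form.

```python
import json

# Hypothetical raw generations: one comma-separated tag string per test question.
# In practice these would come from the fine-tuned model's inference step.
generations = {
    "q1": "contract-law, landlord, verbal-agreement",
    "q2": "copyright, fair-use",
}

with open("predictions.jsonl", "w") as f:
    for question_id, text in generations.items():
        # De-duplicate while preserving order, and drop empty fragments.
        tags = list(dict.fromkeys(t.strip() for t in text.split(",") if t.strip()))
        # Field names here are assumptions -- match the format required by the submission form.
        f.write(json.dumps({"question_id": question_id, "predicted_tags": tags}) + "\n")
```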
You must submit the following (according to this form) in a SINGLE Google Drive:
- Your predicted tag submission .JSONL file
- Your LoRA adapters
- Your notebooks (with outputs)
- A 3-minute video explaining your process (a code walkthrough is not required).
Have fun! 🎉