This repo contains a notebook for fine-tuning an LLM with DPO (Direct Preference Optimization). To reduce the number of trainable parameters, we also apply LoRA (Low-Rank Adaptation of Large Language Models). The model is trained on the Orca-Direct-Preference-Optimization dataset, which consists of user preferences between pairs of LLM answers to a given prompt.
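A minimal sketch of how such a DPO + LoRA run can be wired up with the Hugging Face `peft` and `trl` libraries. The dataset id, column mapping, LoRA rank, and all hyperparameters below are illustrative assumptions rather than values taken from the notebook, and the trainer call follows recent `trl` releases (`DPOConfig`, `processing_class`), whose API has shifted across versions:

```python
# Illustrative DPO + LoRA fine-tuning sketch; all hyperparameters are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA: freeze the base weights and train small low-rank adapter matrices.
peft_config = LoraConfig(
    r=16,                                           # adapter rank (assumed)
    lora_alpha=32,                                  # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

# DPOTrainer expects "prompt", "chosen", and "rejected" columns; the dataset id
# and column names here are assumptions standing in for the Orca preference data.
dataset = load_dataset("Intel/orca_dpo_pairs", split="train")
dataset = dataset.rename_column("question", "prompt")

args = DPOConfig(
    output_dir="phi2-dpo-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    beta=0.1,  # weight of the implicit KL penalty toward the frozen reference model
)

# With peft_config given, the trainer adds LoRA adapters and uses the frozen
# base model as the DPO reference, so no separate ref_model copy is needed.
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```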
For the microsoft/phi-2 model this yields: trainable params: 4,792,320 || all params: 2,784,476,160 || trainable%: 0.1721
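These counts match the output format of `peft`'s `print_trainable_parameters()`. A hedged sketch of how to inspect them; the `LoraConfig` values here are assumptions and may not reproduce these exact numbers, which depend on the rank and target modules chosen in the notebook:

```python
# Sketch: inspect trainable vs. total parameters after wrapping with LoRA.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
peft_model = get_peft_model(
    base,
    LoraConfig(
        r=16,                                           # assumed rank
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj"],  # assumed target layers
        task_type="CAUSAL_LM",
    ),
)
# Prints a line of the form:
# trainable params: ... || all params: ... || trainable%: ...
peft_model.print_trainable_parameters()
```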