```bash
# create python venv, e.g.,
python -m venv .venv
source .venv/bin/activate
# clone git repository
git clone https://github.com/shizheng-rlfresh/llm-opt.git
# go to the directory, and pip install required dependencies
cd llm-opt
pip install -r requirements.txt
```
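A quick post-install sanity check (just a suggestion, not part of the repo): confirm the main packages import and that a GPU is visible before running the notebooks.

```python
# quick sanity check of the environment (suggestion, not part of the repo)
import torch, transformers, peft, bitsandbytes

print(torch.__version__, torch.cuda.is_available())
```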
Optimizer | Notebooks | Optimizer Implementation | Source |
---|---|---|---|
Trust Region Newton-CG | gpt2-LoRA-TRCG | optim/trcg.py | [1], [2], [3] |
AdaHessian | gpt2-LoRA-AdaHessian | torch-optimizer | AAAI, 2021 |
OASIS | 🛠️ | 🛠️ | ICLR, 2022 |
AI-SARAH | 🛠️ | 🛠️ | TMLR, 2023 |
Sophia | 🛠️ | SophiaG | ICLR, 2024 |
🤔 what else? | | | |
- A simple implementation of `Trust Region Newton-CG`, aka `TRCG`; see `optim/trcg.py` for details. `TRCG` is Hessian-free: only Hessian-vector products are needed, and no explicit Hessian! (The subproblem it solves is sketched right after this list.) As LoRA brings down the size of models, ✨ let's give it a shot! 💪
- `Trust Region Newton-CG` is probably the most underrated 😖 optimizer in the machine learning field. It is one of the best optimizers for solving nonconvex problems, e.g., deep neural networks.
- Coupled with `preconditioning`, `Trust Region Newton-CG` could yield even more promising convergence properties.
- A BIG UNKNOWN - its general convergence in the `stochastic` setting is yet to be proved ... that means, mini-batch training is not theoretically covered in the general case.
- A BIG BUT - I loved it, and I can show many successful use cases using just naive `TRCG` with DNNs, e.g., CNN, GNN, etc.
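For context, this is the standard textbook formulation (not copied from `optim/trcg.py`): at each iteration, TRCG approximately solves the trust-region subproblem below with conjugate gradient, which touches the Hessian only through products $Hv$.

$$
\min_{s}\; m(s) = g^\top s + \tfrac{1}{2}\, s^\top H s
\quad \text{subject to} \quad \|s\| \le \Delta,
$$

where $g$ is the (mini-batch) gradient, $H$ the Hessian at the current iterate, and $\Delta$ the trust-region radius.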
```python
# training loop
for minibatch in train_dataloader:
    # ...
    outputs = model(**minibatch)
    # explicitly collect the LoRA params
    lora_param = [w for w in model.parameters() if w.requires_grad]
    # compute gradient explicitly using torch.autograd.grad
    # critical piece is `create_graph=True`
    V = torch.autograd.grad(outputs.loss, lora_param, create_graph=True)
    # ...
    # apply TRCG to take the step
    optimizer.step(minibatch, outputs.loss.item(), V)
    # ...
```
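Why `create_graph=True` matters: it keeps the graph of `V` alive so that Hessian-vector products can be formed by differentiating once more. A minimal sketch of that idea (my illustration, not the repo's internal code):

```python
import torch

def hessian_vector_product(V, params, vec):
    # <V, vec> is a scalar that still depends on params through the retained graph
    dot = sum((g * v).sum() for g, v in zip(V, vec))
    # differentiating the inner product w.r.t. params yields H @ vec
    return torch.autograd.grad(dot, params, retain_graph=True)
```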
- Benchmark results of `TRCG` vs. `AdamW`, using the LoRA config below (a sketch of attaching it follows after this list):

```python
lora_config = LoraConfig(
    r=4,
    lora_alpha=32,  # based on paper - https://arxiv.org/abs/2106.09685
    task_type=TaskType.CAUSAL_LM,
    lora_dropout=0.05,
)
```

  - modest level of hyper-parameter search for `AdamW`
  - `TRCG` does not need any hyper-parameter tuning! 💪
  - the `# of gradient and/or Hessian-vector products` shown in the right figure tells that `TRCG` is much more expensive than `AdamW`. (Well, that isn't too bad given `TRCG` is a 2nd-order method. It doesn't need tuning, right?)
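For completeness, a hedged sketch of how such a config is typically attached to GPT-2 with `peft` (the exact notebook code in gpt2-LoRA-TRCG may differ):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
lora_config = LoraConfig(
    r=4,
    lora_alpha=32,  # based on paper - https://arxiv.org/abs/2106.09685
    task_type=TaskType.CAUSAL_LM,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters require grad
```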
```python
# this ensures trust region radius won't shrink to zero
optimizer.radius *= 2.0
```
- In practice, for a stochastic (highly) nonconvex setting, some tricks need to be applied to `TRCG`, such as the line above. It ensures `TRCG` performs meaningful steps instead of getting stuck (a side effect of being stochastic). BUT, that is it. Nothing more! 😼 (A sketch of where this can sit in the training loop follows.)
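A minimal sketch, assuming the radius bump is applied once per epoch (where exactly to place it is a practical choice, not something prescribed by the method; `num_epochs` is a placeholder):

```python
for epoch in range(num_epochs):
    for minibatch in train_dataloader:
        outputs = model(**minibatch)
        lora_param = [w for w in model.parameters() if w.requires_grad]
        V = torch.autograd.grad(outputs.loss, lora_param, create_graph=True)
        optimizer.step(minibatch, outputs.loss.item(), V)
    # keep the trust region from collapsing to zero in the stochastic setting
    optimizer.radius *= 2.0
```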
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# we first tried maintaining the computation graph on a 4-bit model
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="float16",
)
model = AutoModelForCausalLM.from_pretrained(
    "openai-community/gpt2",
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=config,
)
# with the following line, we upcast the LoRA params to float32
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)
```
- Non-float32 params make the second backward pass unstable, and this is an issue for most optimizers that work in fp32 precision. (A quick dtype check is sketched below.)
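A quick sanity check (my suggestion, not part of the repo): after `prepare_model_for_kbit_training`, verify that every trainable param is float32.

```python
import torch

# flag any trainable (LoRA) param that is not float32
for name, p in model.named_parameters():
    if p.requires_grad and p.dtype != torch.float32:
        print(f"non-fp32 trainable param: {name} ({p.dtype})")
```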
As we continue to expand the set of algorithms, we aim to provide easy and simple implementations and running examples of adaptive algorithms, which might go beyond just `Hessian-free` methods. For `first-order` methods, we qualify an algorithm as `adaptive` if the tuning effort for critical hyperparameters is nearly zero, e.g., no tuning required on learning rates.
E.g., `Adam` is not really a fully adaptive algorithm ... simply because a global learning rate has to be supplied. It is well known that one has to search for a good learning rate plus some nice scheduler; a concrete illustration follows below.
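To make the point concrete, here is a typical `AdamW` setup, reusing `lora_param` from the training loop above; the learning rate, weight decay, and schedule lengths are placeholders you would have to search over:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# the global lr (and its schedule) is exactly the hyperparameter that must be tuned
optimizer = torch.optim.AdamW(lora_param, lr=2e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)
```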
`Second-order` methods, in particular `Hessian-free` methods, are known to exploit the loss landscape with the assistance of second-order information. With recent developments in memory/cost-efficient fine-tuning mechanisms for LLMs, e.g., LoRA and quantization, it becomes possible to ask: "maybe we can try to use Hessian-free methods to fine-tune some LLMs?"
2nd-order optimizers | Needs learning-rate tuning? | Convergence theory in general stochastic nonconvex setting? |
---|---|---|
BFGS | Yes | Yes |
TRCG | No | No |
Subsample Newton | Yes | Yes |
```
# main packages
pytorch==2.2.2
transformers==4.39.3
bitsandbytes==0.43.0
peft==0.10.0

# system
os==linux      # tested on RHEL, should work for most Linux distros
python>=3.10
cuda==12.x     # 11.8 or above should all work fine
GPU: T4, V100, etc. w/ 16GB  # should be no older than these guys
```