EoRA is now seamlessly integrated into [GPTQModel](https://github.com/ModelCloud/GPTQModel); the steps below walk through running EoRA with GPTQModel. First, install GPTQModel from source:
```bash
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel

# pip: compile and install
# You can install optional modules like auto_round, ipex, vllm, sglang, and bitblas.
# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas,ipex,auto_round]
pip install -v . --no-build-isolation
```
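To confirm the install succeeded, a quick sanity check from Python (assuming the package exposes the usual `__version__` attribute):

```python
# minimal install check; __version__ is assumed to follow the standard packaging convention
import gptqmodel
print(gptqmodel.__version__)
```

Next, quantize the base model with GPTQ using a small calibration set: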
```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-3B"
quant_path = "Llama-3.2-3B-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)

model.save(quant_path)
```
Then generate an EoRA adapter for the quantized model:

```python
from gptqmodel.adapter.adapter import Lora
from gptqmodel import GPTQModel, QuantizeConfig

eora = Lora(
    # for eora generation, path is the adapter save path; for load, it is the loading path
    path=f"{quant_path}/eora_rank16",
    rank=16,
)

# provide a previously gptq quantized model path
GPTQModel.adapter.generate(
    adapter=eora,
    model_id_or_path=model_id,
    quantized_model_id_or_path=quant_path,
    calibration_dataset=calibration_dataset,
    calibration_dataset_concat_size=0,
    auto_gc=False,
)
```
```python
# post-eora inference
model = GPTQModel.load(
    model_id_or_path=quant_path,
    adapter=eora,
)
```
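With the adapter attached, the model can be used like any other GPTQModel model. A minimal generation example, assuming GPTQModel's usual `generate`/`tokenizer` convenience API (prompt text is arbitrary):

```python
# quick smoke test of the EoRA-compensated model
tokens = model.generate("Capital of France is")[0]
print(model.tokenizer.decode(tokens))
```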
To evaluate the quantized model on its own:

```bash
python GPTQModel/examples/eora/evaluation.py --quantized_model {quant_path}
```

To evaluate the quantized model together with the EoRA adapter generated above:

```bash
python GPTQModel/examples/eora/evaluation.py --quantized_model {quant_path} \
    --eora_save_path {quant_path}/eora_rank16 \
    --eora_rank 16
```
You can find full reproduction instructions in the EoRA directory.
Shih-Yang Liu*, Maksim Khadkevich, Nai Chit Fung, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen
(*Work done during an internship at NVIDIA Research)
EoRA projects the compression error into the eigenspace of the input activations and performs a low-rank approximation of the projected error to compensate the compressed model.
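As a rough illustration of the idea (not the GPTQModel implementation; matrix shapes, variable names, and the exact eigenvalue scaling are assumptions based on the description above), the error between the original and compressed weights can be projected into the activation eigenspace and then truncated by SVD:

```python
import numpy as np

def eora_sketch(W, W_q, X, rank=16):
    """Approximate the compression error W - W_q with a rank-`rank` product B @ A.

    W, W_q: (out_features, in_features) original and compressed weights
    X:      (in_features, n_samples) calibration activations
    """
    # eigen-decomposition of the activation auto-correlation matrix
    eigvals, Q = np.linalg.eigh(X @ X.T)
    eigvals = np.clip(eigvals, 1e-8, None)   # guard against tiny/negative eigenvalues
    S = Q * np.sqrt(eigvals)                 # Q @ diag(sqrt(eigvals))
    S_inv = (Q / np.sqrt(eigvals)).T         # diag(1/sqrt(eigvals)) @ Q.T

    # project the compression error into the scaled eigenspace, then truncate by SVD
    dW_proj = (W - W_q) @ S
    U, sigma, Vt = np.linalg.svd(dW_proj, full_matrices=False)

    B = U[:, :rank] * sigma[:rank]           # (out_features, rank)
    A = Vt[:rank] @ S_inv                    # (rank, in_features), mapped back
    return B, A                              # W_q @ x + B @ (A @ x) ≈ W @ x
```

In practice, GPTQModel performs this kind of per-layer compensation internally when `GPTQModel.adapter.generate` is called; the sketch is only meant to make the one-sentence description above concrete.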
For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.
- [24.02.2025] 🔥🔥 EoRA has been integrated into GPTQModel!!
- [13.06.2025] 🔥🔥 Released the code for reproducing the paper's results!!
Shih-Yang Liu: shihyangl@nvidia.com or sliuau@connect.ust.hk
If you find EoRA useful, please consider giving the repo a star and citing:
```bibtex
@article{liu2024eora,
  title={EoRA: Training-free compensation for compressed LLM with eigenspace low-rank approximation},
  author={Liu, Shih-Yang and Khadkevich, Maksim and Fung, Nai Chit and Sakr, Charbel and Yang, Chao-Han Huck and Wang, Chien-Yi and Muralidharan, Saurav and Yin, Hongxu and Cheng, Kwang-Ting and Kautz, Jan and others},
  journal={arXiv preprint arXiv:2410.21271},
  year={2024}
}
```
Copyright © 2025, NVIDIA Corporation. All rights reserved.
This work is made available under the NVIDIA Source Code License-NC.