
Performance issues on AWQ and Lora #611

Open
2 of 4 tasks
dumbPy opened this issue Sep 18, 2024 · 0 comments

System Info

docker image: ghcr.io/predibase/lorax:07addea, because the main image isn't working on the latest drivers
device: NVIDIA A100 80GB
models in use: meta-llama/Meta-Llama-3.1-8B-Instruct and hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
LoRAs fine-tuned with LLaMA-Factory

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

When testing with a batch of 6 concurrent requests (a rough client-side sketch follows this list),

  1. Base model meta-llama/Meta-Llama-3.1-8B-Instruct takes 17-20 ms/token, i.e. ~55 tokens/sec
  2. AWQ quantized model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 takes 23-27 ms/token, i.e. ~48 tokens/sec
  3. With 3 LoRAs (2 requests per LoRA), the base model above takes 38-46 ms/token, i.e. ~25 tokens/sec
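A minimal sketch of the client side of this test, assuming the LoRAX container is already running and mapped to localhost:8080 and that the TGI-style /generate endpoint is used. The prompt and adapter IDs are placeholders, and the number printed is an aggregate throughput across the 6 streams, not the per-request ms/token that the server logs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

LORAX_URL = "http://127.0.0.1:8080/generate"  # assumed port mapping
PROMPT = "Explain the tradeoffs of AWQ quantization in one paragraph."
MAX_NEW_TOKENS = 256

# Three LoRAs, two requests each; use [None] * 6 to benchmark the base model only.
ADAPTERS = ["lora-a", "lora-b", "lora-c"] * 2  # hypothetical adapter IDs


def generate(adapter_id):
    parameters = {"max_new_tokens": MAX_NEW_TOKENS, "details": True}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    resp = requests.post(
        LORAX_URL,
        json={"inputs": PROMPT, "parameters": parameters},
        timeout=600,
    )
    resp.raise_for_status()
    # The TGI-style response reports the generated token count under "details".
    return resp.json()["details"]["generated_tokens"]


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=6) as pool:
    token_counts = list(pool.map(generate, ADAPTERS))
elapsed = time.perf_counter() - start

total = sum(token_counts)
print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.1f} tokens/sec aggregate")
```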

Expected behavior

  1. The AWQ quantized model is slower than the base model:
    I was expecting it to be at least as fast as, if not faster than, the base model, but it is actually slower. The memory footprint is smaller, though: the base model took 20.2 GB to load while the AWQ model took 12.2 GB (the remaining GPU memory was then mostly reserved). For the same models (base, AWQ quantized), the throughput on sglang (with CUDA graphs, radix cache and the marlin_awq kernel) is 78 and 150 tokens/sec respectively, compared to 55 and 48 tokens/sec here.

  2. LoRAs are almost twice as slow as the base model:
    I was expecting them to be slower than the base model, but getting 30 tokens/sec (when the base model does 60 tokens/sec) on an NVIDIA A100 for Llama-3.1-8B-Instruct + LoRA was surprising. sglang also added LoRA support recently, but it doesn't support any optimizations yet and gives an abysmal 10 tokens/sec with LoRAs.
