
Performance issues on AWQ and Lora #611

Open
2 of 4 tasks
dumbPy opened this issue Sep 18, 2024 · 0 comments

System Info

docker image: ghcr.io/predibase/lorax:07addea, because the main image isn't working on the latest drivers
device: NVIDIA A100 80GB
models in use: meta-llama/Meta-Llama-3.1-8B-Instruct and hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
LoRAs fine-tuned with LLaMA-Factory

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

When testing with a batch of 6 concurrent requests (a rough client-side sketch follows this list),

  1. Base model meta-llama/Meta-Llama-3.1-8B-Instruct takes 17-20 ms/token, i.e. ~55 tokens/sec
  2. AWQ quantized model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 takes 23-27 ms/token, i.e. ~48 tokens/sec
  3. With 3 LoRAs (2 requests per LoRA), the base model above takes 38-46 ms/token, i.e. ~25 tokens/sec
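A minimal sketch of the client side of this test, assuming the LoRAX container is already running and mapped to localhost:8080 and that the TGI-style /generate endpoint is used. The prompt and adapter IDs are placeholders, and the number printed is an aggregate throughput across the 6 streams, not the per-request ms/token that the server logs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

LORAX_URL = "http://127.0.0.1:8080/generate"  # assumed port mapping
PROMPT = "Explain the tradeoffs of AWQ quantization in one paragraph."
MAX_NEW_TOKENS = 256

# Three LoRAs, two requests each; use [None] * 6 to benchmark the base model only.
ADAPTERS = ["lora-a", "lora-b", "lora-c"] * 2  # hypothetical adapter IDs


def generate(adapter_id):
    parameters = {"max_new_tokens": MAX_NEW_TOKENS, "details": True}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    resp = requests.post(
        LORAX_URL,
        json={"inputs": PROMPT, "parameters": parameters},
        timeout=600,
    )
    resp.raise_for_status()
    # The TGI-style response reports the generated token count under "details".
    return resp.json()["details"]["generated_tokens"]


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=6) as pool:
    token_counts = list(pool.map(generate, ADAPTERS))
elapsed = time.perf_counter() - start

total = sum(token_counts)
print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.1f} tokens/sec aggregate")
```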

Expected behavior

  1. The AWQ quantized model is slower than the base model:
    I was expecting it to be at least as fast as, if not faster than, the base model, but it is actually slower. The memory footprint is smaller, though: the base model took 20.2 GB to load while the AWQ model took 12.2 GB (the remaining GPU memory was then mostly reserved). For the same models (base, AWQ quantized), the throughput on sglang (with CUDA graphs, radix cache and the marlin_awq kernel) is 78 and 150 tokens/sec respectively, compared to 55 and 48 tokens/sec here.

  2. LoRAs are almost twice as slow as the base model:
    I was expecting them to be slower than the base model, but getting 30 tokens/sec (when the base model does 60 tokens/sec) on an NVIDIA A100 for Llama-3.1-8B-Instruct + LoRA was surprising. sglang also added LoRA support recently, but it doesn't support any optimizations yet and gives an abysmal 10 tokens/sec with LoRAs.
