ExLlama2 vs HuggingFace AutoGPTQ #19
-
Not sure what dynamic batching and paged attention cover exactly; it sounds like there's some overlap there. But anyway, if you just run an LLM naively with batched inference, to support up to 50 concurrent users you need a batch size of 50. If you also want to support up to 2048 tokens of context, the cache has to be 2048 positions wide. For Llama-13B that means you'll need about 42 GB of VRAM for the cache alone.

ExLlamaV2 does allow dynamic batching as well. You can pass a list of caches with batch size 1 and still run them all as a batch. This way you're not wasting time doing inference on padding tokens, you can add a sequence to the batch in the middle of another sequence, and so on. The feature is still in the works, though, and currently it's implemented the "dumb" way, with batched linear layers and split attention, so it's not as efficient as it could be. Also, there's no inference server yet to manage it all.

So while you can absolutely do the sort of thing you're after with ExLlamaV2, it would be a very bare-bones approach compared to TGI. The raw performance may be better, but it's not going to be orders of magnitude better in any case, and you won't be able to scale it the same way you could TGI, just by throwing more GPUs at it. More likely, the performance will be about the same anyway once you reach a certain batch size.

I would even question using quantization for this in the first place, since it introduces an element of complexity to the system which could become arbitrarily expensive down the line, whereas the extra cost of using a full-precision model is predictable.

Short version, I guess, is: start with TGI. It's pretty much designed for exactly what you're doing. There's also vLLM, which is Apache 2.0 licensed if that suits your requirements better.
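As a rough sanity check on the ~42 GB figure above, here is the back-of-the-envelope arithmetic. The layer count and hidden size are the standard Llama-13B values; the one byte per cache element assumes an 8-bit cache, so a plain FP16 cache would need roughly twice as much. This is just a sizing sketch, not tied to any particular ExLlamaV2 or TGI API.

```python
# Back-of-the-envelope KV cache sizing for naive batched inference.
# Assumed Llama-13B shape: 40 layers, hidden size 5120.
# bytes_per_element = 1 assumes an 8-bit cache; use 2 for FP16.

def kv_cache_bytes(batch_size, seq_len, n_layers=40,
                   hidden_size=5120, bytes_per_element=1):
    # Two tensors (K and V) per layer, each of shape
    # [batch_size, seq_len, hidden_size].
    return 2 * n_layers * batch_size * seq_len * hidden_size * bytes_per_element

size = kv_cache_bytes(batch_size=50, seq_len=2048)
print(f"{size / 1e9:.1f} GB")  # -> 41.9 GB, i.e. the ~42 GB mentioned above
```

The point is that at this batch size and context length the cache, not the model weights, is what dominates VRAM.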
-
How does ExLlamaV2 (or V1) compare to HuggingFace AutoGPTQ (https://huggingface.co/blog/gptq-integration) in terms of speed and capacity to handle many users chatting with it at the same time? I am running a GPU server with 16 GB of VRAM but could upgrade if needed. My chat site will get several thousand visitors a day, so there could be 25-50 concurrent chats. I am using Llama-2-13B-chat-GPTQ (https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ) and it is very fast, but I haven't officially launched my site yet, so there are no concurrent users, just me testing it.
Transformers AutoGPTQ models can run through their standard Text Generation Inference server (https://huggingface.co/blog/gptq-integration#running-gptq-models-through-text-generation-inference), which means they get "dynamic batching, paged attention and flash attention". Those sound good, but I have no idea whether they help for what I need.
Is Transformers AutoGPTQ better suited to what I need?