ExLlama2 vs HuggingFace AutoGPTQ #19
-
Not sure what dynamic batching and paged attention cover exactly; it sounds like there's some overlap there. But anyway, if you just run an LLM naively with batched inference, to support up to 50 concurrent users you need a batch size of 50. If you also want to support up to 2048 tokens of context, the cache has to be 2048 positions wide. For Llama-13B that means you'll need about 42 GB of VRAM for the cache alone.

ExLlamaV2 does allow dynamic batching as well. You can pass a list of caches with batch size 1 and still run them all as a batch. This way you're not wasting time doing inference on padding tokens, you can add a sequence to the batch in the middle of another sequence, and so on. The feature is still in the works, though, and currently it's implemented the "dumb" way, with batched linear layers and split attention, so it's not as efficient as it could be. Also, there's no inference server yet to manage it all.

So while you can absolutely do the sort of thing you're after with ExLlamaV2, it would be a very bare-bones approach compared to TGI. The raw performance may be better, but it's not going to be orders of magnitude better in any case, and you won't be able to scale it the same way you could TGI, just by throwing more GPUs at it. More likely, the performance will be about the same anyway once you reach a certain batch size.

I would even question using quantization for this in the first place, since it introduces an element of complexity to the system which could become arbitrarily expensive down the line, whereas the extra cost of using a full-precision model is predictable.

Short version, I guess, is: start with TGI. It's pretty much designed for exactly what you're doing. There's also vLLM, which is Apache 2.0 licensed if that suits your requirements better.
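As a rough sanity check on the ~42 GB figure above, here is the back-of-the-envelope arithmetic. The layer count and hidden size are the standard Llama-13B values; the one byte per cache element assumes an 8-bit cache, so a plain FP16 cache would need roughly twice as much. This is just a sizing sketch, not tied to any particular ExLlamaV2 or TGI API.

```python
# Back-of-the-envelope KV cache sizing for naive batched inference.
# Assumed Llama-13B shape: 40 layers, hidden size 5120.
# bytes_per_element = 1 assumes an 8-bit cache; use 2 for FP16.

def kv_cache_bytes(batch_size, seq_len, n_layers=40,
                   hidden_size=5120, bytes_per_element=1):
    # Two tensors (K and V) per layer, each of shape
    # [batch_size, seq_len, hidden_size].
    return 2 * n_layers * batch_size * seq_len * hidden_size * bytes_per_element

size = kv_cache_bytes(batch_size=50, seq_len=2048)
print(f"{size / 1e9:.1f} GB")  # -> 41.9 GB, i.e. the ~42 GB mentioned above
```

The point is that at this batch size and context length the cache, not the model weights, is what dominates VRAM.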
-
How does ExLlamaV2 (or V1) compare to HuggingFace AutoGPTQ (https://huggingface.co/blog/gptq-integration) in terms of speed and capacity to handle many users chatting with it at the same time? I am running a GPU server with 16 GB of VRAM but could upgrade if needed. My chat site will get several thousand visitors a day, so there could be 25-50 concurrent chats. I am using Llama-2-13B-chat-GPTQ (https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ) and it is very fast, but I haven't officially launched my site yet, so there are no concurrent users, just me testing it.
Transformers AutoGPTQ models can run through their standard Text Generation Inference server (https://huggingface.co/blog/gptq-integration#running-gptq-models-through-text-generation-inference), which means they get "dynamic batching, paged attention and flash attention". Those sound good, but I have no idea whether they help for what I need.
Is Transformers AutoGPTQ better suited to what I need?