Hello everyone,
I’m currently exploring the feasibility and value of using Triton for our inference workloads and would love to hear about your use cases. Specifically, I’m interested in how Triton performs in terms of GPU memory usage and inference capacity. In other words, I’d like three numbers: how many models (of what size) you can execute concurrently on your GPU (with how much memory).
Currently, I am using A30 GPUs with 24 GB of memory on Alibaba Cloud servers. My model (wav2lip, a CNN) is around 200 MB. Alibaba Cloud provides a GPU-virtualization solution for running multiple models concurrently, but it can only run 5 models simultaneously on an A30 with onnxruntime, which results in high inference costs. Since 5 copies of a 200 MB model amount to only about 1 GB of weights on a 24 GB card, I suspect the limit is not GPU memory itself. I hope to run as many models as possible, so I’m now considering Triton as a potential alternative to reduce inference costs.
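Once I have a Triton test deployment up, I plan to measure capacity myself with Triton’s perf_analyzer tool, along these lines (the model name and endpoint are placeholders for my setup, not a verified command):

```
# Hypothetical sweep: drive the (placeholder) wav2lip_onnx model with
# client concurrency 1..16 over gRPC and record throughput and latency.
perf_analyzer -m wav2lip_onnx \
    -u localhost:8001 -i grpc \
    --concurrency-range 1:16
```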
Could you please share your Triton use cases, especially how many models of what size you can run simultaneously on a GPU with how much memory? And do you have any tricks for increasing the number of models Triton can serve in parallel? For example, I guess switching from onnxruntime to TensorRT, using FP16, or tuning some configuration parameters might help; a sketch of what I have in mind follows.
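Concretely, this is roughly the kind of config.pbtxt I imagine trying, based on my reading of the Triton model-configuration docs. The model name, instance count, and batching values below are guesses for my wav2lip case, not a tested configuration:

```
# config.pbtxt -- hypothetical sketch, all values are placeholders
name: "wav2lip_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8

# Run several copies of the model on the same GPU.
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Let Triton batch individual requests together for throughput.
dynamic_batching {
  max_queue_delay_microseconds: 1000
}

# Ask the ONNX Runtime backend to use the TensorRT execution
# provider with FP16, which is one of the tricks I'm guessing at.
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
      }
    ]
  }
}
```

Does anyone know whether settings like these actually raise the number of models (or model instances) a single A30 can serve?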
Thank you!